About The Great Library of Chatham High Street
Update: Matthew Povey, 23rd November 2025.
I skipped writing updates over the last few months, but a lot has changed here. Around the summer, I introduced a catalogue to the site. To build it, I used LLMs to extract metadata from every episode transcript and assemble a database of the events, figures, books and guests mentioned in the podcast. This project has always been partly about building something interesting for myself and fans of the podcast, and partly an ongoing experiment in what it is possible for an individual to build using AI. When I started the project, it was just about cost-effective to generate the transcripts using OpenAI Whisper running on several machines in my house (two old laptops and a workstation). It eventually became practical to add Q&A tools that combine LLMs with a vector database (RAG) to answer questions about episodes. But until recently, processing every episode (40m+ tokens in total) to extract metadata, or to summarize across the multiple episodes in a series, was expensive. It's a testament to how quickly AI is progressing that the cost has dropped far enough to do much more with the data.
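For the technically curious, the extraction step looks roughly like the sketch below: feed a transcript to an LLM and ask for structured JSON back. The model name, prompt and schema here are illustrative assumptions, not the exact pipeline behind the catalogue.

```python
# Minimal sketch of LLM metadata extraction from a transcript.
# Model name, prompt and schema are illustrative assumptions.
import json
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()

PROMPT = (
    "Extract metadata from this podcast transcript. Return JSON with keys "
    "'events', 'figures', 'books' and 'guests', each a list of strings.\n\n"
)

def extract_metadata(transcript: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable, cheap model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PROMPT + transcript}],
    )
    return json.loads(response.choices[0].message.content)
```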
Part of the reason this has been fruitful for me is that this general approach of using AI (and AI technologies like vector databases) to extract information from unstructured data is applicable across a wide range of domains. All large organizations have vast depths of unstructured data in which information is trapped. These tools can unlock it and make it useful.
But perhaps more important to me than what users see on the front end are the tools I was able to build with AI on the back end. Pulling metadata from podcast transcripts is not a deterministic exercise. Winston Churchill might be identified in numerous different ways (Churcill, Winston Churchill, Sir Winston Churchill, Winston Leonard Spencer Churchill), and a database is not much use when an entity like that exists in many different forms. But AI code generation created an opportunity.
Data cleansing is the process of improving the quality of data. It means removing junk and consolidating entities into what is called master data. In the past I would have used a Jupyter notebook to do a lot of this, and myriad commercial tools exist to assist with the process. But with AI code generation, I was able to build domain-specific data consolidation and cleansing tools in a weekend and continually improve them as I went along. Initially they allowed me to manually select multiple entities and consolidate them into one. At this point, I have a combination of that, vectorized representations of the columns in the database across which I can run automated similarity searches, and LLM-based tooling that can make educated guesses about what an entity is and further improve the data by adding additional metadata. This has let me create a database that is by no means perfect, but it provides a detailed and powerful catalogue of every episode, figure, event, guest, book and author in the archive, all tagged with additional metadata such as the era(s) to which they are relevant.
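The automated similarity search amounts to something like the sketch below: embed the entity names and flag anything that sits suspiciously close together. ChromaDB's default embedding function and the distance threshold are illustrative assumptions, not the exact tooling I use.

```python
# Minimal sketch of finding candidate duplicate entities with vector similarity.
# ChromaDB's default embedding function and the threshold are assumptions.
import chromadb

names = [
    "Winston Churchill", "Sir Winston Churchill",
    "Winston Leonard Spencer Churchill", "Napoleon Bonaparte",
]

client = chromadb.Client()
figures = client.create_collection("figures")
figures.add(documents=names, ids=[str(i) for i in range(len(names))])

for name in names:
    result = figures.query(query_texts=[name], n_results=3)
    for other, distance in zip(result["documents"][0], result["distances"][0]):
        if other != name and distance < 0.5:  # assumed threshold
            print(f"Possible duplicate: {name!r} ~ {other!r} ({distance:.2f})")
```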
The general approach is widely applicable, and the tools available to hobbyists and organizations are improving rapidly. It is interesting to apply this approach to a history podcast because so much historical information is locked up in archives that, in the past, could only be made useful by the application of vast amounts of human labour. These tools offer the possibility of unlocking those (and many other) archives. The study of history is going to make huge strides thanks to them.
I'm still enjoying building improvements and enhancements to this site. Expect more in the coming months. For now, enjoy the new look and feel; hopefully it makes it easier to find the episode you wanted to listen to again, or a new one about a subject that has piqued your interest. Have fun.
Update: Matthew Povey, 12th January 2025.
I built the original Rest Is History Interrogator in 2023, partly as an experiment to see how effective code generation could be in building hobby projects. It proved very effective, but the original version ended up being a bit of a mess and hard to maintain. I built simple AI functionality into that version but didn't release it because the API costs at the time were too high to justify. Since then, the cost of AI inference has fallen dramatically, to the point where it's reasonable to make the functionality available. It's still a fairly naive RAG implementation, but it works well enough.
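For anyone wondering what a naive RAG implementation looks like in practice, the sketch below captures the shape of it: pull the most relevant transcript chunks from a vector store and hand them to an LLM alongside the question. The collection name, model and prompt wording are illustrative assumptions rather than the site's actual code.

```python
# Minimal sketch of a naive RAG answer over transcript chunks.
# Collection name, model and prompt wording are illustrative assumptions.
import chromadb
from openai import OpenAI

chroma = chromadb.PersistentClient(path="./index")
chunks = chroma.get_collection("episode_chunks")
llm = OpenAI()  # any OpenAI-compatible endpoint (e.g. a hosted free tier)

def answer(question: str) -> str:
    hits = chunks.query(query_texts=[question], n_results=5)
    context = "\n\n".join(hits["documents"][0])
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[
            {"role": "system", "content": "Answer using only the transcript excerpts provided."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What did they say about Cromwell in Ireland?"))
```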
This version is a mostly complete rewrite of the original and a demonstration of how much easier it is to build applications with the tools that are now available. Code generation is now sufficiently effective that large parts of application functionality can be built without writing any code at all. Indeed, there are parts of the application code that I've barely even looked at.
We are very nearly at the point where any individual can build tools that make their lives or jobs easier, which is a very exciting prospect.
Major Changes - 12th January 2025
- Improved the semantic search by moving from a flat file vector store to ChromaDB
- Finally added the ability to ask questions with AI. I use the free tier of SambaNova Cloud to get high-speed inference
- Re-ran transcriptions using a newer version of OpenAI Whisper (much faster as I now have better hardware)
- Moved from whoosh to ElasticSearch for the free-text search (see the sketch after this list)
- Added a simple chat interface
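The free-text side of the search boils down to something like the sketch below: index each episode's transcript and run match queries against it. The index name, document fields and local URL are illustrative assumptions.

```python
# Minimal sketch of free-text search with the official ElasticSearch client.
# Index name, document fields and the local URL are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="episodes", document={
    "title": "Cromwell in Ireland",
    "text": "Tom and Dominic discuss Cromwell's campaign in Ireland...",
})
es.indices.refresh(index="episodes")

hits = es.search(index="episodes", query={"match": {"text": "Cromwell"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```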
Still to come (maybe, when I have time)
- Re-process the transcripts to clean up proper nouns that Whisper gets wrong
- Diarized transcripts to properly distinguish between Dom, Tom and guests
- A more comprehensive database of episodes and a way to search it
About page from the March 2023 version:
I'm Matthew. Like many others, I have mostly replaced the radio with podcasts as the primary way I listen to audio for entertainment, professional information and news. There are a few podcasts, including The Rest is History, of which I have listened to many, if not all, episodes.
Some, like EconTalk, have transcripts available to read and search, but most do not. This makes it difficult to discover whether someone said what you think they did, or told a story the way you remember it. Building a searchable database of transcripts for any podcast I might fancy has long been a back-burner project which, like most of my back-burner projects, I didn't have any real expectation of doing. It's not that I couldn't do it, but that the knowledge I would need to acquire to do it wasn't all that useful to me outside of the specific project, and so the investment of time was difficult to justify.
That changed when large language models (LLMs) started to become effective at helping to build - or just build - code. The project became more manageable still when OpenAI released their Whisper project, which offered the ability to do high-quality transcriptions of any audio in many languages on consumer-grade computer hardware. Then Georgi Gerganov worked minor miracles to port that to C++, dramatically increasing the performance and hence reducing the compute and time required to transcribe. His whisper.cpp project meant I could get 400 episodes of The Rest is History transcribed using 3 computers (an old i7 server, an i5 Mac Mini and an M1 MacBook Air) in my house over about 72 hours. As this was winter, I even got to keep my office warm by doing it...
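The transcription runs themselves were simple to script. A batch job along these lines is all it takes; the binary path, model file and flags below are assumptions based on a standard whisper.cpp build, not my exact scripts.

```python
# Minimal sketch of batch transcription with whisper.cpp.
# Paths, model choice and flags are assumptions; the binary comes from
# building https://github.com/ggerganov/whisper.cpp locally.
import subprocess
from pathlib import Path

WHISPER_BIN = "./main"                # whisper.cpp CLI binary (assumed path)
MODEL = "models/ggml-medium.en.bin"   # assumed model file

for wav in sorted(Path("episodes").glob("*.wav")):
    out = wav.with_suffix("")         # whisper.cpp appends .txt itself
    subprocess.run(
        [WHISPER_BIN, "-m", MODEL, "-f", str(wav), "-otxt", "-of", str(out)],
        check=True,
    )
    print(f"Transcribed {wav.name}")
```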
The combination of whisper.cpp, GitHub Copilot and the original ChatGPT got the first version of the application built much faster than I could have done it myself, and without having to spend money on transcription services from Google GCP or Amazon AWS. With the launch of GPT-4 it's even easier to get code written that works well enough to be useful. Anyone with experience building web-based applications professionally will notice, in the quality of the HTML, all the signs that I don't do that.
I've now added some further functionality to complement the simple search. I've used the OpenAI embeddings service to create a better (or at any rate different) version of the search which allows for more free-form ("semantic", in the industry jargon) searches. The original would find nothing for a search of "the wrong shoes affecting global politics" or "Cromwell's behaviour in Ireland", but the new version does. While this is an improvement, it's really just the groundwork to allow an LLM to search the podcast and answer questions about it. That already works rather well, but I need to figure out how to avoid it costing me a fortune.
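The embeddings-based search is conceptually simple: turn each transcript chunk into a vector, turn the query into a vector, and rank by similarity. The sketch below shows the idea; the model name and the in-memory storage are illustrative assumptions (this early version kept its vectors in a flat file).

```python
# Minimal sketch of embedding-based ("semantic") search over transcript chunks.
# Model name and storage are assumptions made for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = [
    "Tom and Dominic discuss Cromwell's campaign in Ireland...",
    "The episode on footwear and the fall of empires...",
]
chunk_vectors = embed(chunks)  # in practice, computed once and saved to disk

def search(query: str, top_k: int = 3) -> list[str]:
    q = embed([query])[0]
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]

print(search("the wrong shoes affecting global politics"))
```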
This whole thing has been a bit of fun and an experiment in learning how much a below-average developer can get done using the tools that are now available to us. I hope you find it useful.