Careless Whisper (Usage Can Blow Out Your API Limits)
Fun new experiments with cost savings on OpenAI.
TLDR: I launched an app
Last week I released a podcast summarizer app. Built using Python, Streamlit, and OpenAI, the app does the following:
It asks the user to input a podcast episode page, YouTube link, or any generic audio URL.
It asks for the user’s OpenAI API key.
It downloads the audio file, splits it into chunks small enough to send to OpenAI’s Whisper speech-to-text API, and aggregates the transcribed chunks back into one full transcript.
It sends the transcript — either as one big string or in chunks, depending on the length of the transcript — to OpenAI’s ChatGPT API and receives a summary of the transcript.
It then displays that summary to the user.
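The chunking step above exists because Whisper's API rejects files larger than 25 MB per request. A minimal sketch of how the split could be planned (this is an illustration, not the app's actual code, and it assumes a roughly constant bitrate so that bytes map linearly onto seconds):

```python
import math

WHISPER_MAX_BYTES = 25 * 1024 * 1024  # Whisper API's per-request file size limit


def chunk_plan(file_size_bytes: int, duration_s: float,
               limit_bytes: int = WHISPER_MAX_BYTES) -> list[tuple[float, float]]:
    """Return (start, end) second offsets for chunks that each fit under the limit.

    Assumes roughly constant bitrate, so bytes map linearly to seconds.
    """
    n_chunks = max(1, math.ceil(file_size_bytes / limit_bytes))
    chunk_len = duration_s / n_chunks
    return [(i * chunk_len, min((i + 1) * chunk_len, duration_s))
            for i in range(n_chunks)]
```

For a 60 MB, one-hour MP3 this yields three 20-minute chunks, each of which can be cut with an audio library and sent to Whisper separately.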
Pretty basic, right? The main value proposition here is that it cuts out the bulk of the manual busywork previously required to get from Point A (“here’s a podcast episode I want to summarize”) to Point B (“here’s the summary”).
In an effort to figure out how to reduce transcription costs, which constitute the bulk of the expense of generating these summaries, I’ve just now launched a new tool alongside the summarizer app which analyzes the effect of audio modifications on transcript accuracy: https://podcast.streamlit.app/whisper-diagnostics. Please check it out and let me know what you think!
Read on for the gory technical details
I built the podcast summarizer app for fun, and also because I think it’s decently useful for some real-world use cases. But one glaring obstacle to getting any real users (other than myself) to use it is the fact that they have to enter their own OpenAI API key to do so. (This is necessary, of course, so I’m not on the hook financially for generating every random user’s podcast summaries.) This is a subpar user experience for several reasons:
Most people don’t have an OpenAI API key, so at minimum they’ll have to spend a few minutes signing up for an account, setting up a billing method, and generating a key.
Even once they’ve created a key, it’s, shall we say, less than ideal to enter it in a form on some random web site created by a person they’ve never met before. (Hi! You can trust me. But still, as a general matter you probably shouldn’t.)
Obviously, it would be better to provide the app’s functionality without requiring this painful step. So I briefly looked into setting up a simple Stripe payment flow. That probably wouldn’t be that much less painful than finding your OpenAI API key, but at the very least entering a credit card number on a web site in exchange for obtaining a service is a UX pattern most people are familiar with. In any case, there was no easy and secure way to implement Stripe on Streamlit and I was too lazy to figure out Flask instead, so I was back to square one.
This got me thinking: if I can’t charge users using Stripe, and it’s too annoying to expect them to have their API key handy, could I just…offer this service for free?1 So I started digging into the costs.
In theory, there are three main cost categories for my podcast summarization app, or any other site with similar functionality:
The web server powering the back end for the web site
The OpenAI Whisper transcription costs
The OpenAI ChatGPT summarization costs
#1 is actually free with Streamlit. (Thanks, Snowflake!) But #2 and #3 are where it gets really interesting.
Let’s start with #3: summarization. OpenAI’s ChatGPT pricing has become somewhat convoluted these days, mainly because there are two main chat models available (v3.5 and v4), each of which has two possible context window sizes (4K and 16K, and 8K and 32K, respectively). These are all priced differently and — to make things even more complex — input tokens (i.e. the queries you send to ChatGPT) are priced differently than output tokens (the responses you receive from ChatGPT).
A quick aside on terms here. ChatGPT 4 is OpenAI’s newest, most powerful large language model (LLM), the successor to 3.5. Context windows refer to the total amount of “memory” ChatGPT has: that is, how many words you write to, and receive from, ChatGPT before it starts forgetting the things you both wrote at the very start of the conversation. And you can think of tokens as a slightly shorter version of a word: an average word takes about 1.33 tokens.
Now, what my app does is take (potentially long and wordy) podcast episodes, transcribe them, and turn them into short, punchy summaries. What this means is that the number of input tokens (i.e. the transcript) sent to the ChatGPT endpoint will be much, much larger than the number of output tokens (i.e. the summary) received from it. So for the sake of simplicity, let’s ignore the output token pricing and focus on input tokens only.2
Depending on which model and context window size you use, the price ranges anywhere from $0.0015 per 1,000 tokens to $0.06 per 1,000 tokens — a 40x range! So right off the bat this creates a very large cost-savings incentive to use the 4K context window on the 3.5 model if at all possible, as this has the cheapest input token rate.
But what does this mean in practice? Well, the average American English speaker talks at about 150 words per minute (although, obviously, this varies widely by person and the context of the conversation — more on this later). This comes out to 9,000 words, or 12,000 tokens, per hour of continuous speech. So if we apply the ChatGPT input token prices to this estimate of one hour of audio, we’d expect to pay anywhere from $0.018 to $0.72 to summarize the transcript of a one-hour podcast episode.
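The arithmetic above can be packaged as a small helper. This is a hypothetical sketch, using the ~1.33 tokens-per-word rule of thumb (i.e. 4 tokens per 3 words):

```python
def summarization_cost_per_hour(wpm: float, price_per_1k_tokens: float,
                                tokens_per_word: float = 4 / 3) -> float:
    """Estimated ChatGPT input-token cost for one hour of continuous speech."""
    words_per_hour = wpm * 60
    tokens_per_hour = words_per_hour * tokens_per_word
    return tokens_per_hour / 1000 * price_per_1k_tokens


# 150 wpm at the cheapest input rate ($0.0015/1K) vs. the priciest ($0.06/1K):
cheap = summarization_cost_per_hour(150, 0.0015)   # ≈ $0.018
pricey = summarization_cost_per_hour(150, 0.06)    # ≈ $0.72
```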
Let’s go with the $0.018 estimate: about $0.02 to summarize a one-hour podcast episode transcript. That’s pretty cheap. So how about the cost of obtaining the transcript in the first place?
This brings us to cost category #2 from above. Compared to ChatGPT’s pricing, OpenAI’s pricing for its speech-to-text (transcription) model, Whisper, is very straightforward: it’s $0.006 per minute of audio. So one hour of audio would cost $0.36 to transcribe. Right away you can see the gigantic gap: transcription costs about 20x summarization for the same content. To put it another way: if you’re looking for cost savings in a podcast summarizer app, it’s best to focus your energy on the transcription side, rather than summarization.
Remember, the Whisper API is priced by the minute, not by word, megabyte, or anything else. So anything you can do to arbitrage this rate — that is, to stuff as much information as possible into as short an audio file as possible — saves money.
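Because Whisper bills strictly per minute of audio, the effect of speeding audio up is easy to model. A sketch (the $0.006/minute rate is OpenAI's published price; the helper itself is hypothetical):

```python
WHISPER_PRICE_PER_MINUTE = 0.006  # USD, OpenAI's published Whisper rate


def transcription_cost(duration_minutes: float, speed: float = 1.0) -> float:
    """Cost to transcribe audio after speeding it up by a factor of `speed`.

    Whisper bills per minute of submitted audio, so a 2x-speed file
    costs half as much to transcribe as the original.
    """
    return (duration_minutes / speed) * WHISPER_PRICE_PER_MINUTE


hour_at_1x = transcription_cost(60)       # ≈ $0.36
hour_at_2x = transcription_cost(60, 2.0)  # ≈ $0.18
```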
How can we do this? So far, I’ve focused on two main theories3:
Speeding up the audio
Removing silences from the audio
To test these methodologies, I have developed the tool I mentioned above, which analyzes the effect of audio modifications on transcript accuracy.
This tool uses the Levenshtein ratio as a quality metric to compare a transcript from modified audio to the baseline transcript from the unmodified audio file. A Levenshtein ratio of 1 indicates that two texts are identical: the lower the ratio, the bigger the difference between the texts. So if the transcript from a highly sped-up audio file has a Levenshtein ratio close to 1 when compared against the standard-speed transcript, this would imply that we could achieve similar accuracy at potentially much lower cost.
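For reference, here is one way to compute a Levenshtein ratio from scratch. The normalization used here (one minus the edit distance divided by the longer text's length) is a common convention; libraries such as python-Levenshtein use a slightly different formula, so treat this as illustrative rather than as the tool's exact metric:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic edit distance: minimum insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two texts are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))
```

For example, "kitten" and "sitting" are 3 edits apart, giving a ratio of about 0.57; two identical transcripts score exactly 1.0.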
What I have found is that, on certain audio files, the Levenshtein ratio between the original transcript and one taken from a 2x speed audio file is in the 0.98 range — a pretty unbelievable level of accuracy when you consider the halved transcription cost. In most cases, however, such as with one extremely fast-talking YouTube video I tested, accuracy drops considerably at higher audio speeds.
Feel free to run your own analyses too. (Unfortunately, given the number of API calls these experiments cost, you’ll have to use your own OpenAI API key.) By default, your results will automatically be added to the database (ahem, Google Sheet) of test results, which is publicly viewable for analysis at https://podcast.streamlit.app/whisper-tests.
Assumptions, caveats, and areas of further research
There are a number of assumptions and caveats to keep in mind. First, I’m assuming the untouched audio file is the gold standard. That is, because I’m using the Levenshtein ratio to compare transcripts generated from the modified (sped-up, silence-removed, etc.) versions of the audio file against the transcript from the original, untouched one, this presumes that the original transcript is the most accurate one. In initial testing, however, this assumption does not appear to hold in all cases, so I’d be curious to see other researchers dig deeper into this area (for example, by using manual human transcriptions as the gold standard).
Whisper transcriptions are also not deterministic: transcribing the exact same audio multiple times will often result in variations. This, of course, underscores the fragility of using these auto-generated transcripts as a gold standard and also adds noise to any comparative dataset.
There is a lot of potential for further research here. Just a few examples:
The Whisper API has a temperature parameter that might help make these transcripts more deterministic.
FFmpeg, which I’m using to speed up the audio and remove silences, has settings that can be tweaked to potentially produce higher-quality and/or shorter audio files.
The cost savings of transcribing a sped-up audio file may be at least partially offset by the additional compute processing required to manipulate these audio files in the first place. (Offhand, I’d assume those costs are de minimis, but I could be wrong.)
The open dataset being generated by this tool — which is visualized in an interactive scatterplot as well — contains both estimated word count per minute (WPM) and overall audio duration in seconds.4 Either, or both, of these metrics could be a useful way to bucket and measure accuracy: my intuition, for example, is that audio with lower WPMs would (on average) result in relatively higher Levenshtein ratios at high speeds (e.g. 2x, 3x, etc.) than those with higher WPMs would. Similarly, it’s possible that longer audio files generally have higher Levenshtein ratios to begin with, as there’s proportionally less noise the more data you have to work with.
And of course, other transcription and summarization models — for example, offerings from AWS, Azure, and GCP — are worth exploring as well.
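As a concrete starting point for the FFmpeg tweaks mentioned above, here is a hypothetical helper that builds a speed-up command. One real FFmpeg detail it encodes: older builds cap each atempo filter instance at a factor of 2.0, so larger speed-ups are expressed as a chain of filters (FFmpeg's silenceremove filter covers the silence-trimming side, and is not shown here):

```python
def ffmpeg_speedup_cmd(src: str, dst: str, speed: float) -> list[str]:
    """Build an ffmpeg command that speeds up an audio file by `speed`.

    Older ffmpeg builds only accept atempo factors in [0.5, 2.0], so
    larger factors are chained (e.g. 4x becomes atempo=2,atempo=2).
    """
    factors = []
    remaining = speed
    while remaining > 2.0:
        factors.append(2.0)
        remaining /= 2.0
    factors.append(remaining)
    filt = ",".join(f"atempo={f:g}" for f in factors)
    return ["ffmpeg", "-i", src, "-filter:a", filt, dst]
```

For instance, a 3x speed-up produces the filter string atempo=2,atempo=1.5, which ffmpeg applies as two sequential tempo changes.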
Lastly, since the end goal here is to produce accurate summaries, not simply the intermediate transcript, a more complete analysis would then summarize each of these transcripts and come up with a way to identify the best one. Defining “best” could itself be left up to ChatGPT: one could feed it the original, untouched transcript alongside multiple summaries (each generated from various audio speeds at very low temperatures) and ask it which one was the best. Or, more expensively, human ratings could be used. If you find any of these areas interesting to explore, reply to this newsletter and I’m happy to brainstorm!
Another thing we’ll ignore here for the sake of simplicity — although we absolutely shouldn’t if it ever came time to produce a more robust analysis — is the additional cost of recursive summarization, which is needed whenever the episode transcript is too long to send as one piece within a given context window.
For example, a one-hour podcast may have approximately 9,000 words — that is, 12,000 tokens. With ChatGPT 3.5’s 16K context window, the entire transcript could be sent as one text string, leaving about 4,000 tokens (or 3,000 words) of space for the summary in response. But the 16K window’s input tokens cost twice as much as the 4K window’s, so I’d much prefer to use the latter.
Of course, a 12,000-token transcript is far too large to be sent to an endpoint with a 4K context window, so the transcript would have to be broken up into multiple chunks, each of which would be sent to ChatGPT for summarization separately. Then, these individual chunk summaries would be sent as a new combined query, resulting in a final summary compiled from these intermediate ones. (In some cases, even this combined query may be too long, requiring another level of chunked summarization — hence the term “recursive.”)
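The recursive scheme described above can be sketched as follows. This is illustrative only: it uses character counts as a crude stand-in for tokens, and takes the summarizer as an injected callable (in the real app, this would wrap a ChatGPT API call):

```python
def recursive_summarize(text: str, summarize, max_chars: int = 8000) -> str:
    """Summarize text that may exceed the model's context window.

    `summarize` is any callable mapping a string to a shorter string.
    If the text fits in one request, summarize it directly; otherwise
    summarize each chunk, join the partial summaries, and recurse on
    the combined result until it fits.
    """
    if len(text) <= max_chars:
        return summarize(text)
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    partials = " ".join(summarize(c) for c in chunks)
    return recursive_summarize(partials, summarize, max_chars)
```

Note that each level of recursion costs extra input tokens (every chunk and every combined pass is billed), which is exactly the overhead being traded against the smaller window's cheaper per-token rate.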
In short, using a smaller context window can often mean sending more tokens overall to ChatGPT, because of these intermediate summarization steps. This increases cost as the context window shrinks, although the increase should generally be more than offset by the lower API pricing for smaller context windows. Relatedly, recursively summarizing a transcript is lossy: information is lost at each step (akin to playing the children’s game of telephone), so smaller context windows may also decrease the quality of the final summary. Finally, smaller context windows mean more sequential API calls, so the app takes longer from the moment the user submits a request to the moment a summary appears.
Of course, the cost savings of using an older ChatGPT model and a smaller context window are so enormous that these deteriorated cost, quality, and time factors are likely worth suffering anyway. But it’s important not to forget they exist.
In theory, prompting the Whisper API to leave filler words out of the transcript could also save a little on cost, but A) this would only impact summarization costs, which are already much lower than transcription, and B) OpenAI appears to remove filler words by default already.
Turns out most podcasts have a WPM rate closer to 200 than 150.