Whisper Audio to Text: Transcribing Long Audio Files with Python

As a software engineer, I often need to convert audio to text. OpenAI's Whisper model is great for this, but the OpenAI API that serves it rejects uploads larger than 25 MB, which rules out most long recordings. Here's how I worked around that limit using Python.

The Challenge with Whisper Audio to Text Conversion

OpenAI's Whisper model is efficient for audio to text conversion, but the API only accepts files up to 25 MB. This becomes an issue when transcribing long audio files like podcasts or interviews, which often exceed that size.

The Solution: Chunking the Audio

To overcome Whisper's file size limit, we can break the audio file into smaller chunks, transcribe each chunk, and then combine the results. Here's how to do it with Python.
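Before bringing in any audio library, the splitting step can be illustrated with plain arithmetic. This is only a minimal sketch: chunk_bounds is a hypothetical helper of mine (it is not part of pydub or the OpenAI library) that returns the start and end offsets, in milliseconds, of each chunk.

```python
def chunk_bounds(total_ms, chunk_ms):
    """Return (start_ms, end_ms) pairs that cover [0, total_ms) in order."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    # The last chunk is clipped to the end of the audio, so it may be shorter.
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


# A 5-minute file cut into 2-minute chunks gives two full chunks and one short one.
print(chunk_bounds(300_000, 120_000))
# [(0, 120000), (120000, 240000), (240000, 300000)]
```

The final chunk is simply whatever audio is left over, which is why it can be shorter than the others.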

Python Code for Whisper Audio to Text Conversion

First, let's import the necessary libraries:

import openai
import os
from pydub import AudioSegment
from pydub.utils import make_chunks

Next, we'll create a function to transcribe individual chunks:

def transcribe_chunk(chunk, chunk_number):
    # Export the chunk to a temporary WAV file that can be uploaded
    chunk_name = f"temp_chunk_{chunk_number}.wav"
    chunk.export(chunk_name, format="wav")

    try:
        with open(chunk_name, "rb") as audio_file:
            transcript = openai.audio.transcriptions.create(model="whisper-1", file=audio_file)
    finally:
        # Remove the temporary file even if the API call fails
        os.remove(chunk_name)

    return transcript.text

This function takes an audio chunk, saves it as a temporary WAV file, uses Whisper for transcription, and then removes the temporary file.
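One possible refinement of the temporary-file handling is a small context manager that hands out a unique path and always cleans it up. This is a sketch using only the standard library; temp_audio_path is a hypothetical name I made up, not something pydub or the OpenAI library provides.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_audio_path(suffix=".wav"):
    """Yield a unique temporary file path and delete the file afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the caller writes the file itself
    try:
        yield path
    finally:
        # Runs on normal exit and when the body raises
        if os.path.exists(path):
            os.remove(path)


# Usage: the file exists inside the block and is gone after it
with temp_audio_path() as path:
    with open(path, "wb") as f:
        f.write(b"placeholder audio bytes")
print(os.path.exists(path))
```

Inside transcribe_chunk, the export and the upload would go in the body of the with block, using the yielded path in place of chunk_name. Unique names also avoid collisions if two transcriptions run in the same directory.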

Now, let's create the main function for transcribing long audio:

def transcribe_long_audio(file_path, chunk_length_ms=120000):  # 2-minute chunks
    audio = AudioSegment.from_mp3(file_path)
    chunks = make_chunks(audio, chunk_length_ms)
 
    full_transcript = ""
    for i, chunk in enumerate(chunks):
        print(f"Transcribing chunk {i+1} of {len(chunks)}...")
        chunk_transcript = transcribe_chunk(chunk, i)
        full_transcript += chunk_transcript + " "
 
    return full_transcript.strip()

This function breaks the audio into 2-minute chunks, transcribes each chunk, and combines the results.
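The combining step can also be pulled out into its own helper, which makes it easier to clean up stray whitespace at the chunk boundaries. The following is a minimal sketch; join_transcripts is a hypothetical helper, and it assumes that collapsing every run of whitespace (including newlines) into a single space is acceptable for your transcript.

```python
def join_transcripts(parts):
    """Join chunk transcripts with single spaces, skipping empty chunks."""
    # " ".join(part.split()) collapses all whitespace runs inside a chunk
    cleaned = (" ".join(part.split()) for part in parts)
    return " ".join(text for text in cleaned if text)


print(join_transcripts(["Hello  world. ", "", " Next chunk."]))
# Hello world. Next chunk.
```

Silent chunks that come back as empty strings are dropped, so they do not leave double spaces in the final text.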

Here's how to use the function:

# Usage
mp3_file_path = "./file.mp3"
transcription = transcribe_long_audio(mp3_file_path)
print(transcription)
 
# Save the transcription to a file
with open("transcription.txt", "w", encoding="utf-8") as f:
    f.write(transcription)

Key Points for Whisper Audio to Text Conversion

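API Key: Set your OpenAI API key in the OPENAI_API_KEY environment variable. The openai library reads it automatically, so the key never has to appear in the script.
File Formats: This script is for MP3 files. For other formats, replace AudioSegment.from_mp3() with AudioSegment.from_file() and pass the matching format argument. Note that pydub relies on ffmpeg to decode MP3 files.
Chunk Size: The script uses 2-minute chunks. Adjust if needed, but keep each exported WAV chunk under the 25 MB limit. Two minutes of 16-bit stereo audio at 44.1 kHz is already about 21 MB.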
Cost: Each chunk requires an API call. Monitor your OpenAI usage for long transcriptions.
Accuracy: Chunking may occasionally split words. Post-processing might be necessary for perfect accuracy.
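
To make the chunk size and cost points concrete, here is a rough back-of-the-envelope sketch. Both function names are hypothetical, the default audio parameters (44.1 kHz, stereo, 16-bit) are assumptions about the source file, and I treat the 25 MB limit conservatively as 25,000,000 bytes.

```python
import math

API_LIMIT_BYTES = 25_000_000  # assumption: conservative reading of the 25 MB limit


def wav_chunk_bytes(chunk_ms, frame_rate=44_100, channels=2, sample_width=2):
    """Approximate size of uncompressed PCM audio, ignoring the small WAV header."""
    return chunk_ms * frame_rate * channels * sample_width // 1000


def api_calls_needed(duration_ms, chunk_ms):
    """Number of chunks, and therefore API calls, for a recording."""
    return math.ceil(duration_ms / chunk_ms)


size = wav_chunk_bytes(120_000)
print(size, size < API_LIMIT_BYTES)               # 21168000 True
print(api_calls_needed(60 * 60 * 1000, 120_000))  # 30
```

Under these assumptions a 2-minute chunk fits, while a 3-minute chunk (about 31.8 MB) would not, and a one-hour recording takes 30 API calls.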