Using Gemini to Transcribe Audio and Video into Subtitles | pyVideoTrans, Open Source Video Translation Tool | pyvideotrans.com | github.com/jianchang512/pyvideotrans

Gemini is a powerful multimodal AI model that can handle text, images, audio, and video. It is free to use on the web with almost no restrictions, although you may need a VPN if it is not available in your region.

Gemini is well-suited for speech-to-text tasks. It supports many languages, including some lesser-known ones, and the recognition results are quite good.

If you want Gemini to directly generate SRT subtitle files, you need to use specific prompts. Below is a prompt that you can copy and use to have Gemini transcribe and output SRT subtitles.

Speech-to-Text Prompt

You are a professional subtitle transcription assistant. Your task is to transcribe the file I provide into text and format the transcription results as an SRT subtitle file that conforms to the SubRip (SRT) format. The specific requirements are as follows:

## Each subtitle block must be strictly output in the following structure:

[Line Number]
[Timecode Line]
[Text Line]
[Empty Line]

**Explanation of the structure**
- [Line Number] is the sequence number of the subtitle block, starting from 1, such as 1, 2, etc.
- [Timecode Line] is the timestamp, in the format HH:MM:SS,FFF --> HH:MM:SS,FFF, indicating the start and end time of the subtitle (FFF is the three-digit millisecond value, from 000 to 999). If you cannot determine the time accurately, estimate it from the audio content, ensuring that the start and end times are logically consistent.
- [Text Line] is the transcribed text content.
- [Empty Line] is the separator between subtitle blocks, ensuring that there is an empty line after each subtitle block.

## Restrictions
During output, you must strictly adhere to the above format, do not omit any parts, and do not add any extra text or comments.
The duration of each subtitle block should be kept between 3 and 15 seconds where possible, with blocks split according to speech rate and meaning.

Now, please transcribe according to the file I provide and output the subtitle content in the above format.
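Since the model does not always follow the format exactly, it can help to check the output before saving it as an .srt file. The following is a minimal sketch of such a check, not part of pyVideoTrans: it verifies the block structure, the sequence numbers, and the timecode format described in the prompt above. The names `check_srt`, `to_ms`, and `TIMECODE` are hypothetical, chosen for illustration.

```python
import re

# Timecode line: HH:MM:SS,FFF --> HH:MM:SS,FFF (FFF = milliseconds, 000-999)
TIMECODE = re.compile(
    r"^(\d{2}):([0-5]\d):([0-5]\d),(\d{3}) --> (\d{2}):([0-5]\d):([0-5]\d),(\d{3})$"
)


def to_ms(h, m, s, ms):
    """Convert timecode fields (given as strings) to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def check_srt(text):
    """Return a list of problems found in SRT text; an empty list means it passed."""
    problems = []
    # Subtitle blocks are separated by one or more empty lines.
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    for expected, block in enumerate(blocks, start=1):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            problems.append(f"block {expected}: expected number, timecode and text lines")
            continue
        if lines[0].strip() != str(expected):
            problems.append(f"block {expected}: sequence number is {lines[0].strip()!r}")
        match = TIMECODE.match(lines[1].strip())
        if not match:
            problems.append(f"block {expected}: bad timecode {lines[1].strip()!r}")
            continue
        fields = match.groups()
        start, end = to_ms(*fields[:4]), to_ms(*fields[4:])
        if end <= start:
            problems.append(f"block {expected}: end time is not after start time")
    return problems
```

Paste the model's reply into `check_srt` and inspect the returned list; any entry points at the block that needs fixing or regenerating. The check only covers format, so it cannot tell whether the timestamps actually match the audio.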

How to use

You may need your own VPN to access Gemini.

Open Google AI Studio and log in: https://aistudio.google.com/app
Select a model on the right. Gemini 2.0 Flash is sufficient, but a "Thinking" model will yield better results.

Enter the prompt and upload the file, as shown below

The result after transcription is as follows, which looks pretty good

Extension

If you need translated subtitles, you can also ask Gemini in the prompt to translate them into the target language, or to output bilingual subtitles with the original and the translation side by side.

Shortcomings

Gemini's biggest shortcoming is that its timestamps are not very accurate. Newer versions may improve on this.

For now, the only workaround is to segment the audio with VAD (voice activity detection) before transcription, transcribe the segments one by one, and then reassemble the results into an SRT file. Doing this by hand is too inefficient.
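The reassembly step can be sketched as follows. This is an illustration under stated assumptions, not pyVideoTrans's actual implementation: the VAD step is assumed to have already produced a list of (start_ms, end_ms) pairs, and `transcribe` is a hypothetical callback that returns the text for one segment (for example by sending that clip to Gemini). The timecodes come from the segment boundaries rather than from the model's estimates.

```python
def ms_to_timecode(ms):
    """Format milliseconds as an SRT timecode, HH:MM:SS,FFF."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def assemble_srt(segments, transcribe):
    """Build SRT text from VAD segments.

    segments: list of (start_ms, end_ms) pairs from a VAD step (assumed given).
    transcribe: hypothetical callback taking (start_ms, end_ms) and returning
        the transcribed text for that slice of audio.
    """
    blocks = []
    for start_ms, end_ms in sorted(segments):
        text = transcribe(start_ms, end_ms).strip()
        if not text:
            continue  # skip segments where no speech was recognized
        index = len(blocks) + 1  # sequence numbers start at 1 and stay contiguous
        blocks.append(
            f"{index}\n{ms_to_timecode(start_ms)} --> {ms_to_timecode(end_ms)}\n{text}\n"
        )
    # Each block already ends with a newline; joining with "\n" adds the empty separator line.
    return "\n".join(blocks)
```

For example, `assemble_srt([(0, 2500), (4000, 9000)], my_transcribe)` calls the placeholder `my_transcribe` once per segment and returns the text of a two-block SRT file, provided both segments yield non-empty text.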

A more practical option is the Audio and Video to Subtitles function in the free tool pyVideoTrans: select Gemini AI and the whole process is done automatically. You only need to select the audio or video files to be transcribed.

Download address: https://pyvideotrans.com
