Auto Subtitle Generator

Free AI captions in your browser: video to SRT & WebVTT, word-level timestamps, 99 languages, line-length and CPS conditioning. Private, offline, no upload.

About Auto Subtitle Generator

Manually transcribing a 30-minute video into subtitles takes a trained captioner about 90 minutes. YouTube's auto-captions get there but require an unlisted upload and English-only refinement, Rev charges $1.50/minute, and Otter caps its free tier at 300 minutes/month. This tool runs OpenAI's Whisper (the same multilingual model used by professional transcription services) entirely in your browser via WebAssembly: your video never leaves your device, and there is no quota or subscription. It extracts audio via ffmpeg.wasm, feeds it to Whisper for 99-language speech-to-text with millisecond timestamps, then formats the result as standard SRT (universal player support) or WebVTT (HTML5/YouTube native). That privacy matters for confidential footage, interviews under NDA, or legal/medical content.

How does it work?

The tool extracts audio from your video, then uses OpenAI's Whisper AI model (running locally in your browser via WebAssembly) to transcribe speech into text with timestamps. Finally, it formats the transcription into industry-standard SRT or VTT subtitle files.

What video formats are supported?

MP4, WebM, MOV, MKV, AVI, M4V, WMV, FLV, 3GP, OGV and MPEG/MPG are supported — ffmpeg.wasm demuxes them all to extract the audio track. The maximum file size is 200MB, which keeps the decoded audio comfortably inside the browser's WebAssembly memory limit (long 4K files can blow past it otherwise).

Which AI model should I choose?

Tiny is fastest and works well for clear speech. Base offers a good balance of speed and accuracy. Small is most accurate but slower and requires more memory. Start with Tiny for testing.

What's the difference between SRT and VTT?

SRT (SubRip) is the most widely supported format, compatible with most video players and platforms. VTT (WebVTT) is the web standard for HTML5 video and the format YouTube prefers; the spec also allows cue settings and ::cue styling, though this tool emits plain, unstyled cues you can style later in your player or CSS. There is also a plain TXT option that exports just the transcript with no timecodes. All three share the same wrapped text; only SRT and VTT carry timing.
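The practical difference between the three exports is easiest to see side by side. The sketch below is illustrative only and is not the tool's actual code: the function names and the `(start, end, text)` cue tuples are assumptions made for this example. It shows the comma-versus-period millisecond separator, the numeric cue index SRT requires, and the `WEBVTT` header line that every VTT file starts with.

```python
def format_timestamp(seconds: float, sep: str) -> str:
    """Render seconds as HH:MM:SS<sep>mmm (',' for SRT, '.' for WebVTT)."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def to_srt(cues) -> str:
    """SRT: numbered cues, comma before the milliseconds."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


def to_vtt(cues) -> str:
    """WebVTT: 'WEBVTT' header, period before the milliseconds, no index needed."""
    blocks = ["WEBVTT"]
    for start, end, text in cues:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        blocks.append(f"{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


def to_txt(cues) -> str:
    """Plain transcript: same text, no timecodes."""
    return "\n".join(text for _, _, text in cues) + "\n"


if __name__ == "__main__":
    sample = [(0.0, 1.5, "Hello"), (3661.042, 3662.0, "World")]
    print(to_srt(sample))
    print(to_vtt(sample))
```

Both timed formats are built from the same cue list, which is why converting between them later is a mechanical step.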

Is my video uploaded to a server?

No. All processing happens locally in your browser using WebAssembly. Your video never leaves your device, ensuring complete privacy.

How accurate is Whisper compared to human transcription?

Whisper Small reaches roughly 95-97% word accuracy on clean English audio, comparable to a budget human transcriber. Tiny drops to about 85-90%, which is fine for rough drafts but will need editing. Accuracy drops sharply with heavy accents, multiple overlapping speakers, background music or noise, technical jargon, and quiet or distant microphones. For broadcast quality (99%+), use Whisper as a first pass and then human-edit, which still saves about 70% of the time versus typing from scratch.

Why is it so slow on long videos?

Whisper processes audio at roughly 0.3-3x real-time speed depending on your CPU and chosen model. A 10-minute video might take 3-8 minutes with Tiny on a modern laptop, or 15-30 minutes with Small. This tool does not use GPU acceleration yet (browser WebGPU support, particularly in Safari, is still maturing). For 30+ minute videos, expect to leave the tab open for a while. The model downloads once and is cached, so subsequent runs skip that step.
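The wait time follows from the real-time factor by simple division. A minimal sketch, where the function name is hypothetical and the factors 3.0 and 0.5 are illustrative values rather than measurements of this tool:

```python
def processing_minutes(video_minutes: float, realtime_factor: float) -> float:
    """Estimated transcription time.

    realtime_factor is how many seconds of audio are processed per second
    of wall-clock time (2.0 means twice as fast as playback).
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return video_minutes / realtime_factor


if __name__ == "__main__":
    # Illustrative factors only; real speed depends on CPU and model size.
    for factor in (3.0, 0.5):
        print(f"10-minute video at {factor}x: {processing_minutes(10, factor):.1f} min")
```

To calibrate for your own machine, time a one-minute clip once and reuse the measured factor for longer files.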

Can it handle multiple speakers or speaker diarization?

Whisper itself doesn't do speaker diarization (labeling 'Speaker 1' vs 'Speaker 2'). It transcribes speech sequentially without identifying who's talking. For meetings, podcasts, or interviews requiring speaker labels, you'd need a separate diarization step, for example with pyannote, or a service with built-in speaker labels such as AWS Transcribe. The SRT/VTT output here is a continuous stream of timestamped lines, which suits single-presenter content like lectures, tutorials, vlogs, and narrated documentaries.

How well does it handle non-English languages?

Whisper supports 99 languages with varying accuracy. Top-tier (95%+ on Small): English, Spanish, French, German, Italian, Portuguese, Japanese. Good (85-92%): Chinese, Korean, Russian, Arabic, Hindi, Vietnamese. Set 'Language' to your specific language for best results — 'Auto Detect' adds a probabilistic first pass that occasionally misclassifies (especially with very short clips or code-switching). For mixed-language content, run separate passes per language section.

Will the subtitles sync correctly when I burn them into my video?

Yes. Both SRT and VTT use absolute timestamps measured from the start of your audio: HH:MM:SS,mmm in SRT and HH:MM:SS.mmm (period instead of comma) in VTT. Drop the SRT into HandBrake, DaVinci Resolve, Premiere, FFmpeg, or any video player and the cues keep their millisecond timing. To hard burn-in (open captions) with FFmpeg use the subtitles filter: ffmpeg -i in.mp4 -vf "subtitles=subs.srt" out.mp4. To mux as soft closed captions into an MP4 instead, use -c:s mov_text (-c:s webvtt for WebM/HLS). One pitfall: timestamps are wall-clock seconds, so on non-drop-frame 29.97/59.94 timelines the SMPTE timecode falls behind the SRT clock by about 3.6 seconds per hour (drop-frame timecode exists precisely to cancel that drift). Conform your NLE's project frame rate and timecode mode to the source before relying on frame-exact cue starts.
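The size of the timecode drift can be checked with a little arithmetic. This sketch assumes a timeline whose timecode counts a nominal 30 (or 60) frames per timecode second without dropping any frame numbers, while the media actually plays at 30000/1001 (or 60000/1001) frames per second. The function name is hypothetical.

```python
from fractions import Fraction


def ndf_timecode_lag_seconds(wall_clock_seconds: int, nominal_fps: int = 30) -> Fraction:
    """How far non-drop-frame timecode trails wall-clock time.

    Assumption: frames are numbered as if there were nominal_fps per second,
    but the real rate is nominal_fps * 1000 / 1001 (29.97 or 59.94).
    """
    actual_fps = Fraction(nominal_fps * 1000, 1001)
    frames_played = wall_clock_seconds * actual_fps
    timecode_seconds = frames_played / nominal_fps
    return wall_clock_seconds - timecode_seconds


if __name__ == "__main__":
    lag = ndf_timecode_lag_seconds(3600)
    print(f"Lag after one hour: {float(lag):.2f} s")  # prints 3.60 s
```

The lag works out to exactly 1/1001 of the elapsed time, so a cue stamped at one hour of wall-clock time lands about 3.6 seconds away from the frame labeled 01:00:00:00 on such a timeline.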

How do I keep captions broadcast-compliant (chars-per-line / CPS)?

Raw Whisper dumps a whole sentence into one cue, which QC will bounce. BBC, Netflix and EBU guidelines cap each line at roughly 37-42 characters, allow at most two lines, and keep reading speed under about 17-20 characters per second (CPS). Set 'Max Characters Per Line' (default 42) and the tool greedily word-wraps every long Whisper chunk into a compliant one- or two-line cue on word boundaries, with no mid-word breaks. It also clamps each cue's end to the real media duration so no subtitle runs past EOF, which strict validators and some players reject. For CEA-608/708 (the closed captions embedded in broadcast streams) you still need a dedicated caption encoder (CCExtractor goes the other way and extracts them), but SRT/VTT is the interchange format nearly every caption pipeline ingests.
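Greedy word-wrapping and the CPS check can be sketched in a few lines. This is an illustration rather than the tool's actual implementation: the function names are hypothetical, and splitting a chunk's duration across cues in proportion to character count is an assumption made for this example, since the page does not say how timing is distributed.

```python
def wrap_words(text: str, max_chars: int = 42) -> list[str]:
    """Greedy wrap on word boundaries; a single over-long word is kept whole."""
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def split_into_cues(start: float, end: float, text: str,
                    max_chars: int = 42, max_lines: int = 2):
    """Turn one long chunk into cues of at most max_lines lines each.

    Assumption: each cue gets a share of the chunk's duration proportional
    to its character count.
    """
    lines = wrap_words(text, max_chars)
    groups = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
    total_chars = sum(len(line) for line in lines) or 1
    cues, cursor = [], start
    for group in groups:
        share = sum(len(line) for line in group) / total_chars
        cue_end = cursor + (end - start) * share
        cues.append((cursor, cue_end, "\n".join(group)))
        cursor = cue_end
    return cues


def chars_per_second(start: float, end: float, text: str) -> float:
    """Reading speed of one cue, counting a line break as one space."""
    duration = end - start
    if duration <= 0:
        return float("inf")
    return len(text.replace("\n", " ")) / duration


if __name__ == "__main__":
    chunk = ("The quick brown fox jumps over the lazy dog and keeps running "
             "through the quiet forest until it finally reaches the river bank")
    for cue_start, cue_end, cue_text in split_into_cues(0.0, 8.0, chunk):
        cps = chars_per_second(cue_start, cue_end, cue_text)
        print(f"{cue_start:.2f}-{cue_end:.2f} ({cps:.1f} CPS)\n{cue_text}\n")
```

A cue whose CPS exceeds the guideline cannot be fixed by wrapping alone; it needs a longer display time or a trimmed transcript.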

How do I re-sync subtitles after trimming the video?

Use the 'Start Offset' field. After your editor trims, say, 5 seconds off the head of the timeline, set the offset to -5 and regenerate (or +3 if you added a 3-second intro card). Every timestamp shifts by that amount and is clamped at 0 so nothing goes negative, and the tail is clamped to the media duration. This is the bulk-shift you'd otherwise do in Aegisub or Subtitle Edit, done in-tool before you even export — no round-trip to a separate subtitle editor.
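The bulk shift with clamping described above amounts to the following. This is a minimal sketch with a hypothetical function name and `(start, end, text)` cue tuples. Dropping cues that collapse to zero length after clamping is an assumption of this example, not documented behavior of the tool.

```python
def shift_cues(cues, offset_seconds: float, media_duration: float):
    """Shift every cue by offset_seconds, clamping to [0, media_duration]."""
    shifted = []
    for start, end, text in cues:
        new_start = min(max(start + offset_seconds, 0.0), media_duration)
        new_end = min(max(end + offset_seconds, 0.0), media_duration)
        # Assumption: a cue squeezed to zero length is no longer displayable.
        if new_end > new_start:
            shifted.append((new_start, new_end, text))
    return shifted


if __name__ == "__main__":
    original = [(2.0, 4.0, "intro"), (6.0, 9.0, "hello"), (58.0, 62.0, "bye")]
    # 5 seconds trimmed off the head; the trimmed video is 55 seconds long.
    print(shift_cues(original, -5.0, 55.0))
    # prints [(1.0, 4.0, 'hello'), (53.0, 55.0, 'bye')]
```

Here the "intro" cue falls entirely inside the trimmed head and disappears, while the final cue is cut short at the new end of the media.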