AI Speech to Text

Free Whisper AI speech to text in your browser. Transcribe audio & video to SRT, VTT, JSON. Runs on-device with WebGPU/WASM, no upload, unlimited.

AI transcription powered by OpenAI Whisper. All processing happens in your browser - your audio never leaves your device.

About AI Speech to Text

This AI-powered transcription tool uses OpenAI's Whisper model to convert speech to text with high accuracy. Unlike cloud-based services, all processing happens directly in your browser using WebGPU/WebAssembly - your audio files are never uploaded to any server, ensuring complete privacy.

How does browser-based transcription work?

The tool uses Transformers.js to run OpenAI's Whisper model directly in your browser. When you first transcribe, the AI model is downloaded and cached in your browser. All audio processing and transcription happens locally on your device using your CPU/GPU, without sending any data to external servers.

Which model size should I choose?

There are three model options:
- Tiny (~40MB): Fastest to load and process. Good for clear audio with minimal background noise.
- Base (~75MB): Balanced option with better accuracy than Tiny.
- Small (~250MB): Best accuracy, especially for challenging audio with accents or background noise. Recommended for important transcriptions.

Larger models provide better accuracy but require more download time and processing power.
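
As a rough illustration of the download-time trade-off, the sketch below estimates how long each model takes to fetch on first use. The sizes are the approximate figures from the list above; the helper name and the 20 Mbit/s connection speed are assumptions made for the example, and real downloads also depend on latency and server throughput.

```typescript
// Approximate model download sizes in megabytes, taken from the list above.
const MODEL_SIZE_MB = { tiny: 40, base: 75, small: 250 } as const;
type ModelSize = keyof typeof MODEL_SIZE_MB;

// Hypothetical helper: estimated first-download time in seconds.
// Assumes 1 MB = 8 megabits and a steady connection speed.
function estimateDownloadSeconds(model: ModelSize, mbitPerSecond: number): number {
  if (mbitPerSecond <= 0) throw new RangeError("mbitPerSecond must be positive");
  return (MODEL_SIZE_MB[model] * 8) / mbitPerSecond;
}

for (const model of Object.keys(MODEL_SIZE_MB) as ModelSize[]) {
  const seconds = estimateDownloadSeconds(model, 20);
  console.log(`${model}: ~${seconds.toFixed(0)} s at 20 Mbit/s`);
}
```

Under these assumptions Tiny takes about 16 seconds, Base about 30, and Small about 100. The download happens once per model, since the browser caches it afterwards.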

What languages are supported?

Whisper supports 99 languages, including English, Spanish, French, German, Chinese, Japanese, Korean, Arabic, Hindi, Portuguese, Russian, Vietnamese, and many more. You can either select the language manually for better accuracy, or let the AI auto-detect the language.

What audio and video formats are supported?

All common audio formats are supported: MP3, WAV, M4A, AAC, FLAC, OGG, OPUS, and WEBA (WebM audio). Video files are also supported - the audio track is automatically extracted from MP4, WebM, MKV, AVI, MOV, and other video formats.

How accurate is the transcription?

Whisper provides state-of-the-art accuracy for automatic speech recognition. Results are best for:
- Clear recordings with minimal background noise
- Native speakers with standard accents
- Single-speaker audio

Accuracy may vary for:
- Heavy accents or dialects
- Multiple overlapping speakers
- Poor audio quality or heavy background noise
- Technical jargon or uncommon words

Can I get timestamps and subtitles?

Yes! Enable 'Include timestamps' to get timestamped segments perfect for creating subtitles. You can download the transcript as an SRT file ready for video editing. Enable 'Word-level timestamps' for even more precise timing of individual words.
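
To show how timestamped segments map onto subtitle cues, here is a minimal sketch that turns a list of segments into SRT text. The Segment shape (times in seconds) and the function names are assumptions for illustration; the tool's own exporter is not shown in the source and may differ.

```typescript
// Segment shape assumed for illustration: start and end are in seconds.
interface Segment { start: number; end: number; text: string; }

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma before milliseconds).
function srtTimestamp(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build an SRT document: 1-based cue number, time range, text, blank line between cues.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTimestamp(seg.start)} --> ${srtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}

console.log(toSrt([
  { start: 0, end: 2.5, text: "Hello" },
  { start: 2.5, end: 4, text: "world" },
]));
```

WebVTT uses the same cue layout with a "WEBVTT" header line and a period instead of a comma before the milliseconds.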

Why is processing slow on my device?

Transcription speed depends on your hardware. Browsers with WebGPU support (Chrome 113+) run the model on the GPU and are significantly faster than the WebAssembly (CPU) fallback. To improve performance:
- Use Chrome or Edge for WebGPU acceleration
- Close other tabs and applications
- Use the Tiny model for faster processing
- Use a desktop or laptop rather than a mobile device
- Keep audio files under about 10 minutes

Is my audio data private?

Yes. Unlike cloud transcription services, your audio never leaves your device. All AI processing happens locally in your browser using WebGPU or WebAssembly. No audio is uploaded, stored, or processed on any server. When you close the page, your audio and transcript are cleared from memory; only the downloaded model remains in the browser cache.

What's the maximum file size and duration?

Maximum file size is 100MB. For optimal performance, we recommend audio files under 10 minutes. Longer files can be processed but may take significantly more time and memory. If you have longer recordings, consider splitting them into smaller segments.
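
The limits above can be sketched as a size check plus a plan for splitting a long recording into pieces of at most 10 minutes. The helper names are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption; the source does not say how the tool counts megabytes.

```typescript
// Limits from the text above. Counting 1 MB as 1024 * 1024 bytes is an assumption.
const MAX_FILE_BYTES = 100 * 1024 * 1024;
const RECOMMENDED_MAX_SECONDS = 10 * 60;

interface ChunkPlan { start: number; end: number; }

// True when the file fits under the 100MB limit.
function isWithinSizeLimit(fileBytes: number): boolean {
  return fileBytes <= MAX_FILE_BYTES;
}

// Split a recording into consecutive chunks no longer than maxSeconds.
function planChunks(
  durationSeconds: number,
  maxSeconds: number = RECOMMENDED_MAX_SECONDS,
): ChunkPlan[] {
  if (durationSeconds <= 0 || maxSeconds <= 0) return [];
  const chunks: ChunkPlan[] = [];
  for (let start = 0; start < durationSeconds; start += maxSeconds) {
    chunks.push({ start, end: Math.min(start + maxSeconds, durationSeconds) });
  }
  return chunks;
}

// A 25-minute recording becomes three chunks: 0-600 s, 600-1200 s, 1200-1500 s.
console.log(planChunks(25 * 60));
```

Fixed boundaries can cut through a word, so when splitting by hand it is usually better to move each cut to a nearby pause.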

Exactly which model and weights does this run?

It runs OpenAI's Whisper via Transformers.js using the open ONNX-community weights: onnx-community/whisper-tiny, whisper-base, and whisper-small. On WebGPU the model runs in fp32 for best accuracy; on WebAssembly (CPU) it runs q8 (8-bit quantized) so it loads and runs on lower-powered devices. The q8 build trades a small amount of accuracy for speed and memory, which is why a larger model size helps on noisy or accented audio.
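
The backend choice described above can be sketched as a small pure function. Transformers.js (v3) accepts `device` and `dtype` options when a pipeline is created, but how this tool wires them up is not shown in the source, so the helper name and config shape here are assumptions.

```typescript
type ModelSize = "tiny" | "base" | "small";

interface BackendConfig {
  model: string;               // Hugging Face model id
  device: "webgpu" | "wasm";   // where inference runs
  dtype: "fp32" | "q8";        // weight precision
}

// Hypothetical helper: fp32 on WebGPU, 8-bit quantized on the WebAssembly fallback.
function pickBackend(size: ModelSize, hasWebGpu: boolean): BackendConfig {
  const model = `onnx-community/whisper-${size}`;
  return hasWebGpu
    ? { model, device: "webgpu", dtype: "fp32" }
    : { model, device: "wasm", dtype: "q8" };
}

// In a browser, WebGPU availability could be probed with `"gpu" in navigator`.
// Here the flag is passed in explicitly so the function stays testable.
console.log(pickBackend("small", false));
```

The resulting model id, device, and dtype would then be handed to the Transformers.js automatic-speech-recognition pipeline.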

Can I edit the transcript before exporting?

Yes. The transcript box and each timestamped segment are fully editable. Correct names, jargon, and punctuation directly, and every export — TXT, SRT, VTT, JSON, Markdown, and CSV, plus the segment download — reflects your edits instead of the raw model output. Editing the transcript text updates the full-text exports; editing a segment updates that subtitle cue and re-syncs the full transcript.
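
One way to picture the segment re-sync is the sketch below: editing a segment replaces that cue's text, keeps its timing, and rebuilds the full transcript from all segments. The data shapes and the choice to join segments with a single space are assumptions for illustration, not the tool's confirmed behavior.

```typescript
interface Segment { start: number; end: number; text: string; }
interface Transcript { text: string; segments: Segment[]; }

// Replace one segment's text and rebuild the full transcript from all segments.
// Joining with a single space is an assumption made for this example.
function editSegment(t: Transcript, index: number, newText: string): Transcript {
  if (index < 0 || index >= t.segments.length) {
    throw new RangeError("segment index out of range");
  }
  const segments = t.segments.map((seg, i) =>
    i === index ? { start: seg.start, end: seg.end, text: newText.trim() } : seg);
  const text = segments.map((seg) => seg.text.trim()).join(" ");
  return { text, segments };
}

const before: Transcript = {
  text: "Welcome to acme. Let's begin.",
  segments: [
    { start: 0, end: 1.8, text: "Welcome to acme." },
    { start: 1.8, end: 3.0, text: "Let's begin." },
  ],
};
const after = editSegment(before, 0, "Welcome to ACME.");
console.log(after.text);
```

The function returns a new object rather than mutating its input, so the raw model output stays available if an edit needs to be undone.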

What export formats and segment schema are available?

Six formats: TXT (plain text), SRT and WebVTT (timestamped subtitle cues), Markdown (text plus a timestamped segment list), CSV (index, start_seconds, end_seconds, text), and JSON. The JSON schema is { language, text, segments: [{ start, end, text }], words: [{ start, end, text }] | null, exported_at, tool }. Word-level timings populate the JSON 'words' array and are kept separate from the sentence-level segment list so SRT/VTT stay readable.
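
The JSON schema and CSV columns above can be written out as types plus two small exporters. This is a sketch under stated assumptions: the 1-based CSV index, the ISO 8601 format for exported_at, the value of the tool field, and the function names are all illustrative and not confirmed by the source.

```typescript
interface Segment { start: number; end: number; text: string; }
type Word = Segment; // same fields, one entry per word

interface TranscriptExport {
  language: string;
  text: string;
  segments: Segment[];
  words: Word[] | null;   // null when word-level timestamps are off
  exported_at: string;    // ISO 8601 here (assumption)
  tool: string;
}

// Quote a CSV field when it contains a comma, quote, or line break.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Columns follow the text above; starting the index at 1 is an assumption.
function toCsv(segments: Segment[]): string {
  const header = "index,start_seconds,end_seconds,text";
  const rows = segments.map((s, i) =>
    [String(i + 1), String(s.start), String(s.end), csvField(s.text)].join(","));
  return [header].concat(rows).join("\n");
}

function toJson(
  language: string,
  segments: Segment[],
  words: Word[] | null,
  exportedAt: Date,
): string {
  const payload: TranscriptExport = {
    language,
    text: segments.map((s) => s.text.trim()).join(" "),
    segments,
    words,
    exported_at: exportedAt.toISOString(),
    tool: "AI Speech to Text", // placeholder value (assumption)
  };
  return JSON.stringify(payload, null, 2);
}

const demo = [{ start: 0, end: 1.5, text: 'He said "hi", ok' }];
console.log(toCsv(demo));
console.log(toJson("en", demo, null, new Date()));
```

Quoting matters for CSV because transcript text routinely contains commas and quotation marks.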

Does it work offline, and is the model cached?

The first transcription downloads the chosen Whisper model once; the browser caches it (HTTP cache / Cache Storage). After that, transcription works without re-downloading the model and continues to run entirely on-device. Nothing — not your audio, not your transcript — is ever uploaded to a server; all inference happens locally in your browser via WebGPU or WebAssembly.

How accurate is it — can I publish the output directly?

Treat the output as a fast first draft, not a finished product. Word error rate varies with model size, background noise, accents, overlapping speakers, and technical jargon, and the WASM build is quantized (q8). Always proofread and verify before publishing subtitles or deliverables — which is exactly why the transcript and segments are editable and the corrected version is what gets exported.
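
For readers who want to put a number on word error rate, here is a minimal sketch that compares the raw model output against a corrected transcript. It uses the standard definition (substitutions + deletions + insertions, divided by the number of reference words) via word-level edit distance. The tokenizer is deliberately simple and is an assumption; published benchmarks apply more careful text normalization.

```typescript
// Lowercase, strip basic punctuation, and split on whitespace (simplified normalization).
function toWords(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[.,!?;:"()]/g, "")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// WER = word-level edit distance / number of reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = toWords(reference);
  const hyp = toWords(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : Infinity;

  // prev[j] holds the distance between the reference words seen so far
  // and the first j hypothesis words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // reference word missing from the hypothesis
        curr[j - 1] + 1,    // extra word in the hypothesis
        prev[j - 1] + cost, // match or substitution
      ));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Corrected transcript as reference, raw model output as hypothesis.
console.log(wordErrorRate("the quick brown fox", "the quack brown"));
```

In the usage line, one substituted word and one missing word against a four-word reference give a rate of 0.5.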