AI Vocal Remover

On-device U-Net AI vocal remover: split any song into vocal & instrumental WAV stems. No upload, karaoke & acapella ready, true-peak clip-safety check.

This tool uses a deep neural network (~68MB) to separate vocals from music. The model runs entirely in your browser - no uploads required. The model downloads automatically when you start processing.

About the AI Vocal Remover

This AI vocal remover separates a stereo song into two stems, vocals (acapella) and instrumental (karaoke), using a deep neural network that runs entirely inside your browser. The same family of source-separation models powers commercial products like LALAL.AI, Moises, Vocalremover.org, and AudioShake; the open-source baselines this tool descends from are Deezer's Spleeter (Hennequin et al., 2019) and Facebook AI Research's Demucs (Défossez et al., 2019). No audio is uploaded: the model executes locally on your CPU, or on your GPU via WebGL or, where supported, WebGPU. Once the ~68 MB model file has been downloaded and cached, the tool runs offline.

AI separation is a real upgrade over the old phase-cancellation trick. Phase cancellation inverts one channel and adds it to the other, which works only when the vocal sits exactly in the centre of a stereo mix. It cancels the centre, taking the vocal with it, but it also cancels every other centred element (kick drum, bass, snare). Most modern recordings also add reverb to the vocal, double-track it, or pan it slightly off-centre, so the trick fails. Modern source-separation networks instead learn the spectral signature of vocals from thousands of paired examples and can lift singing out of a mix even when it has reverb, doubles, harmonies, autotune, or panning.
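The limits of phase cancellation are easy to see on a toy signal. The sketch below is illustrative only: the three "instruments" are made-up sine waves, and `phase_cancel` is a hypothetical helper name, not part of this tool.

```python
import math

def phase_cancel(left, right):
    """Classic karaoke trick: invert one channel and add it to the other (L - R)."""
    return [l - r for l, r in zip(left, right)]

# Toy 8-sample sources (all values invented for illustration).
n = 8
vocal = [math.sin(2 * math.pi * 1 * i / n) for i in range(n)]         # centred
kick = [0.5 * math.cos(2 * math.pi * 2 * i / n) for i in range(n)]    # centred
guitar = [0.3 * math.sin(2 * math.pi * 3 * i / n) for i in range(n)]  # hard left

left = [v + k + g for v, k, g in zip(vocal, kick, guitar)]
right = [v + k for v, k in zip(vocal, kick)]

# The centred vocal cancels, but so does the centred kick: only the guitar survives.
side = phase_cancel(left, right)
residual = max(abs(s - g) for s, g in zip(side, guitar))

# Pan the vocal slightly off-centre (110 % left, 90 % right) and it leaks through.
left_panned = [1.1 * v + k + g for v, k, g in zip(vocal, kick, guitar)]
right_panned = [0.9 * v + k for v, k in zip(vocal, kick)]
side_panned = phase_cancel(left_panned, right_panned)
leak = max(abs(s - g) for s, g in zip(side_panned, guitar))

print(f"residual after cancelling a centred vocal: {residual:.2e}")
print(f"vocal leak with slight panning: {leak:.2f} of full scale")
```

With the vocal dead centre the output is the guitar alone, kick included in the loss. With a 10 % pan offset, a fifth of the vocal amplitude remains in the result.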

Useful applications: making karaoke / minus-one tracks, isolating an acapella for remixing, sampling vocals for music production, transcribing lyrics that are buried under a busy mix, dialogue cleanup in podcasts and video, language-learning by hearing a song's lyrics in isolation, and academic study of vocal performance. The tool accepts audio (MP3, WAV, FLAC, OGG, M4A, AAC, OPUS) and video (MP4, MKV, MOV, WebM, AVI) — for video the audio track is automatically extracted via the Web Audio API. The neural network runs at 44.1 kHz, so 48/96 kHz sources are resampled to 44.1 kHz for inference and the stems are exported at 44.1 kHz; you choose the WAV bit depth (16-bit, 24-bit, or 32-bit float). Convert to MP3 afterwards in any editor if you need smaller files.

On copyright: the tool is free, but the audio you process is not. Separating a copyrighted song does not give you the right to release the resulting vocal or instrumental commercially, distribute it, sell it, or upload it to a service. Use it for songs you wrote yourself, songs you have explicit permission to remix, or for genuine fair-use scenarios (transcription, education, research, parody as defined in your jurisdiction). Brazilian copyright law, EU Directive 2019/790, the UK CDPA, and US copyright law all apply to AI-extracted stems exactly as they apply to the original recording.

Privacy is by design. Your audio is decoded by the browser, the AI inference runs locally on your device's compute resources, and the resulting stems are encoded back to WAV in your browser. The page itself uses TensorFlow.js with WebGPU when available; weights download once over HTTPS and are cached. We don't see, store, log, or share your audio.

How the separation works

Source separation is the inverse problem of mixing: given a mixture x = vocals + instrumental, recover the two component signals. The classical 1990s approach was independent component analysis (ICA), which works only when the sources are statistically independent and the mixing is fixed and linear — neither assumption holds for music. Modern deep-learning systems learn the separation directly from data: they observe thousands of paired (mixture, vocal, instrumental) examples and learn to map a mixture spectrogram to per-source spectrograms.

The standard pipeline begins with a Short-Time Fourier Transform (STFT) of the input. Typical settings are an FFT size of 4096 samples and a hop size of 1024 samples (75 % overlap), giving a complex spectrogram with one column every ~23 ms at 44.1 kHz. The magnitude spectrogram is fed through a U-Net — a fully convolutional encoder–decoder with skip connections — that outputs two soft frequency masks: one for vocals, one for instrumental. Each mask is multiplied with the input spectrogram and inverse-STFT'd to recover a time-domain signal. The original phase is reused; the vocals get the same phase as the mixture at each frequency, which is a small approximation but sounds good in practice.
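The STFT figures quoted above follow from simple arithmetic. This is a small sketch using the "typical settings" from the text; whether this tool uses exactly these values is not stated, and `frame_count` assumes no padding at the edges, which real STFT implementations often add.

```python
SAMPLE_RATE = 44_100  # Hz
FFT_SIZE = 4096       # samples per analysis window
HOP_SIZE = 1024       # samples between consecutive windows

hop_ms = 1000 * HOP_SIZE / SAMPLE_RATE  # time between spectrogram columns
overlap = 1 - HOP_SIZE / FFT_SIZE       # fraction shared by adjacent windows
freq_bins = FFT_SIZE // 2 + 1           # non-redundant bins for a real-valued signal
bin_hz = SAMPLE_RATE / FFT_SIZE         # frequency spacing between bins

def frame_count(num_samples):
    """Number of full windows that fit in the signal (no edge padding assumed)."""
    if num_samples < FFT_SIZE:
        return 0
    return 1 + (num_samples - FFT_SIZE) // HOP_SIZE

frames_3min = frame_count(180 * SAMPLE_RATE)

print(f"hop = {hop_ms:.1f} ms, overlap = {overlap:.0%}")
print(f"{freq_bins} bins of {bin_hz:.2f} Hz each")
print(f"3-minute song: {frames_3min} frames per channel")
```

So the network sees each channel of a 3-minute song as a grid of roughly 2049 by 7748 time-frequency cells, and the mask has one value per cell.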

Spleeter (Hennequin, Khlif, Voituret & Moussallam; released by Deezer in 2019) was a milestone open-source release: a U-Net trained on 25 000 songs giving 2-stem (vocal/instrumental), 4-stem (vocal/drums/bass/other), and 5-stem (adds piano) separation. The 2-stem model is small enough for browser inference. Demucs (Défossez et al., 2019; Hybrid Demucs 2021) raised the bar by working in the time domain with a Wave-U-Net-inspired architecture and later combining waveform and spectrogram branches; it set the state of the art on the MUSDB18 benchmark. Hybrid Transformer Demucs (HTDemucs, 2023) added a Transformer block in the bottleneck. The MDX series (Music Demixing Challenge, 2021–2023) at ISMIR has been the public benchmark.

The accuracy metric used in source-separation papers is SDR (Signal-to-Distortion Ratio) in decibels — higher is better. Spleeter reports ~6.6 dB vocal SDR on MUSDB18; Demucs v3 reports ~9.0 dB; HTDemucs and the MDX-23 winners cluster around 9.5–10 dB. For context, audible quality starts to feel 'commercial-grade' at SDR > 7 dB on clean studio recordings. Live recordings, very dense mixes, heavy autotune, and unusual genres (classical opera, throat singing, some metal subgenres) score noticeably lower than the benchmark average.
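For intuition about the decibel figures, here is a minimal sketch of the basic SDR formula, the ratio of reference energy to error energy. Published MUSDB18 scores are computed with the BSS Eval toolkit, which uses a more elaborate, windowed definition, so this simplified version will not reproduce benchmark numbers exactly. The function name and test signals are illustrative.

```python
import math

def sdr_db(reference, estimate):
    """Simplified SDR: 10 * log10(energy of reference / energy of error)."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf
    return 10 * math.log10(signal_energy / error_energy)

# Toy reference stem and an estimate that is 10 % too quiet everywhere.
reference = [math.sin(2 * math.pi * i / 32) for i in range(64)]
estimate = [0.9 * r for r in reference]

# The error is 0.1 * reference, so its energy is 1 % of the signal: 20 dB.
print(f"SDR = {sdr_db(reference, estimate):.1f} dB")

# Error energy as a fraction of signal energy at the 7 dB mark quoted above.
error_fraction_at_7db = 10 ** (-7 / 10)
print(f"7 dB SDR means error energy is about {error_fraction_at_7db:.0%} of the signal")
```

Because the scale is logarithmic, each additional 3 dB of SDR roughly halves the error energy.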

In this browser tool, the audio buffer is split into overlapping 4-second chunks, each chunk is run through the U-Net, and the chunk outputs are crossfaded together so seams aren't audible. WebGPU acceleration (Chrome 113+, Edge 113+) is several times faster than WebGL: on a modern desktop a 3-minute song separates in 30–60 seconds with WebGPU and 2–3 minutes with WebGL. CPU-only fallback is much slower (10–15 minutes) but always works. The U-Net runs at 44.1 kHz, so the stems are exported as 44.1 kHz stereo WAV (16-bit, 24-bit, or 32-bit float, your choice); 48/96 kHz masters are resampled to 44.1 kHz for inference. Pick 24-bit or 32-bit float to keep full headroom on hot stems.
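Chunked processing with crossfades can be sketched as a weighted overlap-add. This is not the tool's actual code: the fade shape, the overlap length, and the names `process_in_chunks` and `model` are assumptions for illustration, and the demo uses tiny chunk sizes rather than the 176 400 samples that 4 seconds at 44.1 kHz would contain.

```python
def process_in_chunks(samples, chunk_len, overlap, model):
    """Run `model` on overlapping chunks and crossfade the outputs together."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be non-negative and smaller than chunk_len")
    hop = chunk_len - overlap
    out = [0.0] * len(samples)
    weight = [0.0] * len(samples)
    start = 0
    while start < len(samples):
        chunk = samples[start:start + chunk_len]
        result = model(chunk)
        n = len(chunk)
        for i in range(n):
            # Linear ramp up over the first `overlap` samples, down over the last.
            fade_in = (i + 1) / (overlap + 1) if i < overlap else 1.0
            fade_out = (n - i) / (overlap + 1) if i >= n - overlap else 1.0
            w = min(fade_in, fade_out)
            out[start + i] += w * result[i]
            weight[start + i] += w
        if start + chunk_len >= len(samples):
            break
        start += hop
    # Dividing by the summed weights makes the crossfade gain exactly 1 everywhere.
    return [o / w for o, w in zip(out, weight)]

# Sanity check: an identity "model" must give back the input unchanged.
signal = [float(i % 7) for i in range(50)]
rebuilt = process_in_chunks(signal, chunk_len=16, overlap=4, model=lambda c: list(c))
max_error = max(abs(a - b) for a, b in zip(signal, rebuilt))
print(f"max reconstruction error: {max_error:.2e}")
```

Normalising by the accumulated weight, rather than relying on the two ramps summing to one, keeps the gain correct at the first and last chunk, where only one ramp is present.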

Accuracy and what to expect

Quality varies markedly by source material. For modern professionally-mixed pop, rock, R&B, hip-hop, and electronic — clean lead vocal, separated mix bus, clear stereo imaging — you can expect a clean instrumental with at most a faint vocal residue ('ghosting') in quiet passages. Vocal stems will sound like a high-quality acapella with maybe a touch of room reverb. This is the operating envelope where AI separators shine and where Spleeter / Demucs / HTDemucs benchmark scores were measured.

Quality drops on live recordings (audience bleed, room reverb leaks vocal energy into the instrumental stem), heavy autotune (formant-shifted vocals confuse the network), genres with strong overlap between voice and instrument timbre (a-cappella backing vocals, choir, throat singing), very old or low-fidelity recordings (mono, AM-radio bandwidth, vinyl crackle), and tracks where instruments mimic the human voice frequency range (saxophone, distorted lead guitar, spoken-word samples). Bossa nova and MPB recordings often work well because the vocal is mixed prominently and clearly; samba and pagode with heavy percussion and many backing voices are harder.

Failure modes you will hear: vocal bleed in the instrumental during sibilants ('s' / 't' sounds, which span a wide frequency range), drum hits mistakenly classified as vocal transients, phase-y or watery artifacts on long sustained notes, and reduced stereo width on the instrumental because the network sometimes folds slight panning information into the vocal mask. None of these are bugs in the tool; they are inherent limits of two-stem separation. If you need cleaner results on a hard track, paid commercial services (LALAL.AI, Moises, AudioShake) use larger ensembles of bigger models and can do somewhat better, but they too have these failure modes.

Separation works best on professionally-mixed studio recordings; live and lo-fi recordings have audible bleed.
Heavy autotune, vocoder, talkbox, or formant-shifted voices may be partially classified as instrumental.
Backing vocals and choirs are often left in the vocal stem; complete vocal removal in dense harmonies is unreliable.
Sibilants ('s', 'sh', 't') often leave a faint hiss in the instrumental track.
Sustained notes and long reverb tails may have slight phase artifacts after separation.
Maximum file size is 100 MB; very long audio (over 30 minutes) is rejected to prevent browser memory issues.
Stems are 44.1 kHz WAV (the model's inference rate); 48/96 kHz sources are resampled. Choose 24-bit or 32-bit float for headroom; convert to MP3/AAC yourself for smaller files.
Copyright applies to extracted stems exactly as it applies to the source — check rights before publishing or commercial use.
Browser requirements: Chrome / Edge for WebGPU acceleration; Firefox / Safari fall back to slower WebGL or CPU.

Glossary

Source separation: The signal-processing problem of recovering individual source signals (vocals, drums, bass, ...) from a recorded mixture. The inverse of mixing.
Stem: An individual source track within a mix. Two-stem separation splits into vocals + instrumental; four-stem splits into vocals + drums + bass + other.
U-Net: A fully convolutional encoder–decoder neural-network architecture (Ronneberger et al., 2015) with skip connections from the encoder to the decoder. Originally designed for biomedical image segmentation, now standard for source separation in the spectrogram domain.
Frequency mask: A 2D matrix the same shape as a spectrogram, with values typically in [0, 1], that says how much of each frequency at each time belongs to a given source. Multiplying the mixture spectrogram by the mask isolates that source.
Time-frequency domain: Representing audio as a 2D matrix where one axis is time and the other is frequency, produced by a Short-Time Fourier Transform. The natural representation for spectral source-separation methods.
Spleeter: Open-source 2-, 4-, and 5-stem source separator released by Deezer in 2019. One of the first widely usable open-source stem separators with pre-trained models, and a common baseline; its 2-stem model is small enough for browser inference.
Demucs / HTDemucs: Facebook AI Research's open-source separator, originally a Wave-U-Net-inspired time-domain model, then hybrid waveform + spectrogram (Hybrid Demucs), then with a Transformer block (Hybrid Transformer Demucs / HTDemucs).
SDR (Signal-to-Distortion Ratio): Standard objective quality metric for source separation, in dB. Higher means a cleaner stem. Pop/rock SDR > 7 dB sounds commercial-grade; > 9 dB is benchmark-leading.
MUSDB18: Public dataset of 150 multitrack songs (100 train, 50 test) used as the standard benchmark for source separation. Each song is split into vocal, drums, bass, and other stems.

Frequently Asked Questions

How does the AI remove vocals?

It runs a U-Net deep neural network in your browser. The audio is converted to a spectrogram via STFT, the network outputs a frequency mask predicting which time-frequency cells contain vocal energy, the mixture is multiplied by the mask, and the result is inverse-STFT'd back to a time-domain WAV. The architecture descends from Spleeter / Demucs and is trained on MUSDB18-style paired data.

How long does separation take?

On a modern desktop with WebGPU (Chrome / Edge 113+), a 3-minute song separates in roughly 30–60 seconds. With WebGL it is 2–3× slower. CPU fallback is 10–15 minutes for a 3-minute song. Mobile devices are slower than desktops; longer files are processed in chunks with a progress bar.

What sample rate and bit depth are the stems?

The U-Net runs at 44.1 kHz, so stems are exported as 44.1 kHz stereo WAV; 48 kHz / 96 kHz masters are resampled to 44.1 kHz for inference, and the output does not keep the source rate. You choose the bit depth: 16-bit for small files, 24-bit for studio headroom, or 32-bit float for zero clipping. If you need smaller files, convert the WAV to MP3 or AAC afterwards in any editor.
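To see what each bit depth costs in file size, the sketch below estimates the size of one exported stem. It assumes a plain 44-byte WAV header; some encoders write a longer header for 24-bit or float data, so treat the results as close estimates. `wav_bytes` is a hypothetical helper name.

```python
def wav_bytes(seconds, bit_depth, sample_rate=44_100, channels=2, header_bytes=44):
    """Approximate size of an uncompressed WAV file (44-byte header assumed)."""
    frames = int(seconds * sample_rate)
    return header_bytes + frames * channels * (bit_depth // 8)

# One stem of a 3-minute song at each export option.
for depth in (16, 24, 32):
    size = wav_bytes(180, depth)
    print(f"{depth}-bit: {size / 1_000_000:.1f} MB")
```

Each separation produces two stems, so a 3-minute song exported as 32-bit float takes roughly 127 MB in total.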

Is this AI separator better than phase cancellation?

Yes, dramatically. Phase cancellation only works on perfectly centred vocals in a clean stereo mix and also cancels other centred sources (bass, kick drum, snare). The AI looks at the actual spectral content of voice versus instruments and works on stereo, mono, panned, doubled, harmonised, and reverberant vocals — all of which break phase cancellation.

What model is used? Spleeter? Demucs?

The browser model is in the same family as Spleeter (Deezer, 2019) and Demucs / Hybrid Demucs (Facebook AI, 2019–2023). Like Spleeter, it is a U-Net operating on STFT spectrograms, trained on MUSDB18-style paired data. We picked a model that is small enough (~68 MB) to download and run in a browser via TensorFlow.js, with WebGPU acceleration when available.

Why does the instrumental still have a faint vocal?

Soft separation always leaves residue: the network has to choose, frame by frame, how much energy in each frequency bin belongs to vocals. Sibilants, breaths, and very soft sustained notes often share frequency bands with cymbals, hi-hats, and other percussion, so the network can't tell them apart cleanly. Larger paid models can do somewhat better, but they never reach zero residue.

Is my audio uploaded to your server?

No. All processing — decoding, STFT, neural-network inference, inverse STFT, WAV encoding — runs locally in your browser via TensorFlow.js. The only network traffic is the one-time download of the model weights (~68 MB, cached). Your audio bytes never leave your device.

Can I use the extracted stems commercially?

Only if you have rights to the underlying song. Extracting an instrumental from a copyrighted recording does not transfer any copyright — releasing the result commercially is the same as releasing the original recording without a licence. For royalty-free use you need a song you wrote, a song you have a licence for, or a Creative Commons / public-domain song.

Why does the model sometimes output a quiet vocal even in instrumental-only mode?

Because it estimates the vocal mask first and subtracts; if the network is uncertain about a region, both the 'vocal' and 'instrumental' outputs can contain a soft remainder. This is by design: the two stems always sum back to the original mixture. For absolute silence you'd need to gate the residue or use a more aggressive post-processing step.
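The mask-then-subtract behaviour can be illustrated on a handful of complex STFT bins. The bin values and mask values below are invented for the example; in the real pipeline the mask comes from the U-Net and there is one value per time-frequency cell.

```python
import cmath

# Four made-up complex STFT bins from the mixture.
mixture = [complex(0.8, 0.6), complex(-0.3, 0.4), complex(0.0, -1.0), complex(0.5, 0.0)]

# Soft vocal mask in [0, 1]: confident vocal, uncertain, mostly instrument, no vocal.
vocal_mask = [0.9, 0.5, 0.1, 0.0]

# Scaling a complex bin by a real mask changes its magnitude but keeps its phase.
vocal = [m * x for m, x in zip(vocal_mask, mixture)]

# The instrumental is whatever the vocal estimate leaves behind.
instrumental = [x - v for x, v in zip(mixture, vocal)]

for x, v, i in zip(mixture, vocal, instrumental):
    print(f"|mix| = {abs(x):.2f}  |vocal| = {abs(v):.2f}  |instrumental| = {abs(i):.2f}")

# In the uncertain bin (mask 0.5) both stems receive half of the magnitude.
uncertain_vocal = abs(vocal[1])
uncertain_instrumental = abs(instrumental[1])
same_phase = abs(cmath.phase(vocal[0]) - cmath.phase(mixture[0]))
```

Wherever the mask sits strictly between 0 and 1, both outputs are non-zero, which is the faint remainder described above.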

What's the maximum file size and length?

Maximum 100 MB and 30 minutes per file. The hard cap exists to prevent browser memory crashes — even with chunked processing, very long audio can exhaust the WebGPU heap. For longer recordings, split with any audio editor first and process each segment.

Why is my vocal stem clipping, and how do I export it safely?

Soft-mask separation derives one stem by subtraction from the mixture, and the result can overshoot 0 dBFS, especially at the inter-sample (true-peak) level. A 16-bit WAV hard-clips that overshoot, which audibly distorts the stem. The tool measures each stem's sample peak and 4x-oversampled true peak (dBTP) after separation and shows a green 'Safe' / red 'Clip risk' badge. Export 24-bit or 32-bit float to keep the overshoot losslessly, or tick 'Normalize to -1 dBTP' to scale the stem to a safe ceiling before download.
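The gap between sample peak and true peak shows up clearly on a sine at a quarter of the sample rate whose samples all miss the crest. The sketch below is an approximation: it oversamples 4x with a generic Hann-windowed sinc interpolator rather than the specific filter a standards-based meter would use, and all function names are hypothetical.

```python
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def oversample(samples, factor=4, half_width=16):
    """Hann-windowed sinc interpolation at `factor` points per input sample."""
    n = len(samples)
    out = []
    for k in range(n * factor):
        t = k / factor
        centre = k // factor
        acc = 0.0
        lo = max(0, centre - half_width + 1)
        hi = min(n, centre + half_width + 1)
        for j in range(lo, hi):
            d = t - j
            window = 0.5 * (1.0 + math.cos(math.pi * d / half_width))
            acc += samples[j] * sinc(d) * window
        out.append(acc)
    return out

def peak(samples):
    return max(abs(s) for s in samples)

def to_db(linear):
    return 20.0 * math.log10(linear) if linear > 0 else -math.inf

# Full-scale sine at fs/4 with a 45-degree offset: every sample lands at +/-0.707.
stem = [math.sin(math.pi / 2 * i + math.pi / 4) for i in range(64)]

sample_peak_db = to_db(peak(stem))           # what a plain peak meter reports
true_peak_db = to_db(peak(oversample(stem)))  # estimate of the inter-sample peak

# Normalize so the estimated true peak sits at -1 dBTP.
ceiling = 10 ** (-1.0 / 20)
gain = ceiling / peak(oversample(stem))
normalized = [gain * s for s in stem]
normalized_true_peak_db = to_db(peak(oversample(normalized)))

print(f"sample peak: {sample_peak_db:.2f} dBFS")
print(f"true peak:   {true_peak_db:.2f} dBTP")
print(f"after normalizing: {normalized_true_peak_db:.2f} dBTP")
```

Here the samples read about 3 dB below the waveform's real crest, which is why a stem can look safe on a sample-peak meter and still clip once it is converted or resampled.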

Can I separate drums or bass too (4-stem or 5-stem)?

This tool currently runs a 2-stem model (vocal + instrumental) for size and speed reasons. The Spleeter and Demucs models also offer 4-stem and 5-stem versions if you run them locally with a Python install. We may add a 4-stem option in a future release.

The tool is slow or crashing. What do I do?

Close other browser tabs, prefer Chrome or Edge for WebGPU acceleration, ensure your browser is up to date, try a shorter file first to confirm the pipeline works, and process on a desktop rather than mobile if possible. WebGPU users on integrated GPUs may need to enable hardware acceleration in browser settings.

References & academic sources

Hennequin, R., Khlif, A., Voituret, F., & Moussallam, M. (2020). Spleeter: A Fast and Efficient Music Source Separation Tool with Pre-trained Models. Journal of Open Source Software (Deezer Research).
Défossez, A., Usunier, N., Bottou, L., & Bach, F. (2019). Music Source Separation in the Waveform Domain (Demucs). Facebook AI Research.
Rouard, S., Massa, F., & Défossez, A. (2023). Hybrid Transformers for Music Source Separation (HTDemucs). Meta AI / IEEE ICASSP.
Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., & Bittner, R. (2017). MUSDB18: a corpus for music separation. Zenodo / SiSEC.
Mitsufuji, Y., Fabbro, G., Uhlich, S., et al. (2023). Music Demixing Challenge (MDX). ISMIR / Sony AI.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.

Last reviewed: 2026-05-08 · Reviewed by WuTools Audio Engineering Team