Vocal Isolation

The first step in creating a karaoke video is to isolate the vocals from the instruments in a song. Unless you can get official stems, the way to do this is with vocal reduction/isolation utilities, which often include multiple AI models you can try. The most effective options for isolating vocals from instruments currently include Ultimate Vocal Remover (UVR), MDX 23c 8K, Mel-RoFormer and Demucs. Each has strengths and weaknesses. As of June 18, 2024, the consensus is that Mel-RoFormer is the most versatile and effective at vocal isolation.

Sourcing the original audio

To create an instrumental track, you first need the original audio, if you don’t already have a digital copy of the song. For best results, use lossless audio, such as FLAC or WAV. If lossless audio is not available, use the highest quality you can get.

If you need to pull the audio from YouTube, you’ll get the best quality by searching for “Song Artist - Song Name (Topic)”. These are the official releases from the artists, though YouTube still re-encodes them to a fairly low bitrate (~128 kbps). For example, see this video for a Marcy Playground release and note that the channel name displays “Marcy Playground - Topic”: https://youtu.be/ytK71-PoVG8
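As a rough sketch of pulling the audio from a Topic video, the snippet below builds a command for the yt-dlp command-line tool. Using yt-dlp is an assumption of this example (the guide does not name a downloader), and the function names and output template are illustrative only. The command asks for the best available audio stream and extracts it without converting to another format, since re-encoding cannot add quality back.

```python
import shutil
import subprocess


def build_audio_download_command(url, output_template="%(title)s.%(ext)s"):
    """Return a yt-dlp command that fetches only the best audio stream.

    -f bestaudio : pick the highest-quality audio-only format on offer
    -x           : extract audio; with no --audio-format given, the
                   original codec is kept instead of being re-encoded
    -o           : output file name template
    """
    return ["yt-dlp", "-f", "bestaudio", "-x", "-o", output_template, url]


def download_audio(url):
    """Run the command (assumes yt-dlp is installed and on PATH)."""
    if shutil.which("yt-dlp") is None:
        raise RuntimeError("yt-dlp is not installed or not on PATH")
    subprocess.run(build_audio_download_command(url), check=True)
```

Calling `download_audio("https://youtu.be/ytK71-PoVG8")` would save the audio from the example video above into the current directory.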

Creating stems

In most cases, we recommend using one of the following two utilities for vocal removal:

  • Ultimate Vocal Remover (UVR): If you have a PC or laptop with a good video card or CPU, you can use this utility to run separation for free on your own system. Separation takes several seconds per track on a GPU, or several minutes per track on a CPU.

  • x-minus.pro / UVR Online: This is the recommended online tool because it keeps up very well with newly released models. Access to all of its functionality requires a monthly fee, but the fee is very low.

Here’s a comparison of these two and some other similar utilities. This is by no means a comprehensive list of options, but it represents what we believe to be the best available in terms of convenience and quality.

Feature                  audio-separator  Google Colab [1]  MVSEP                      The Tüül  UVR    x-minus

How to run               Local [2]        Web               Web                        Web       Local  Web
Ease of Use Rank [3]     6 [4]            5                 3                          1         4      2
Cost                     Free             Free [5]          $0.00 - $0.30 / track [6]  Free      Free   $0.00 - $0.01 / minute [7]
Model selection          Med              Low [8]           High                       None      High   High
Speed of model updates   Med              Fast [9]          Med [10]                   Slow      Med    Fast
Parameter customization  Low              Low [11]          High                       None      High   Med
Notes
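The table quotes MVSEP per track and x-minus per minute, so the two paid options are easier to compare once they are converted to the same unit. The sketch below does that arithmetic using only the upper ends of the ranges in the table. The function name is illustrative, and the real price may depend on conditions in the table's footnotes, which this sketch ignores.

```python
def max_cost_per_track_usd(duration_minutes, xminus_cents_per_minute=1, mvsep_cents_per_track=30):
    """Upper-bound cost in US dollars of separating one track.

    Prices are kept in whole cents and converted at the end to avoid
    accumulating floating-point error.
    """
    return {
        "x-minus": duration_minutes * xminus_cents_per_minute / 100,
        "MVSEP": mvsep_cents_per_track / 100,
    }


# A four-minute song costs at most $0.04 on x-minus and $0.30 on MVSEP.
print(max_cost_per_track_usd(4))
```

At these upper-bound rates, the two only meet at a 30-minute track, so for ordinary song lengths the x-minus ceiling is the lower one.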

Vocal Isolation Models

Most of the available tools share many of the same separation models. Here’s an overview of some of the best ones. The guide below to the relative strengths and weaknesses of each vocal isolation method was written by Peareoke. New models are being added constantly, each with its own merits. If you’re not sure which is currently the most effective, just ask and someone will let you know.

Mel-RoFormer and BS-RoFormer

These are currently the best models for making karaoke instrumentals. They were created by cross-training with earlier AI models (including the ones listed below) using new datasets.

If you want to include background vocals from the song in your karaoke video, try using Mel-RoFormer (Kar) or UVR BVE2. Mel-RoFormer (Kar) keeps the backing vocals in the instrumental track when separating. BVE2 tries to isolate just the background vocals, so you will have to use an audio editor to merge them back into the instrumental.
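For readers who would rather script the merge step than open an audio editor, here is a minimal sketch using only the Python standard library. It assumes both stems are 16-bit PCM WAV files with the same sample rate and channel count, which is typical when both come from the same source file but is not guaranteed. The function name and the `backing_gain` parameter are illustrative, not part of any tool mentioned in this guide.

```python
import array
import sys
import wave


def mix_wav_files(instrumental_path, backing_path, output_path, backing_gain=1.0):
    """Sum two 16-bit PCM WAV files sample by sample and write the result."""
    with wave.open(instrumental_path, "rb") as inst, wave.open(backing_path, "rb") as back:
        layout_a = (inst.getnchannels(), inst.getsampwidth(), inst.getframerate())
        layout_b = (back.getnchannels(), back.getsampwidth(), back.getframerate())
        if layout_a != layout_b:
            raise ValueError("stems differ in channels, sample width, or sample rate")
        if inst.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        params = inst.getparams()
        a = array.array("h", inst.readframes(inst.getnframes()))
        b = array.array("h", back.readframes(back.getnframes()))

    # WAV samples are little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        a.byteswap()
        b.byteswap()

    # Pad the shorter stem with silence so both have the same length.
    length = max(len(a), len(b))
    a.extend([0] * (length - len(a)))
    b.extend([0] * (length - len(b)))

    # Add the samples and clip to the 16-bit range to avoid wrap-around.
    mixed = array.array(
        "h",
        (max(-32768, min(32767, int(round(x + backing_gain * y)))) for x, y in zip(a, b)),
    )
    if sys.byteorder == "big":
        mixed.byteswap()

    with wave.open(output_path, "wb") as out:
        out.setparams(params)
        out.writeframes(mixed.tobytes())
```

Lowering `backing_gain` (for example to 0.8) makes the backing vocals quieter in the result. The pure-Python loop is slow on full-length songs, so an audio editor or a numeric library may be more practical for regular use.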

UVR

UVR was the first model to gain widespread use. The vast majority of karaoke tracks on YouTube made more than a year ago were created using UVR5 or a previous iteration. It is a very good starting place for folks intimidated by the huge variety of models, as it almost always produces a usable result. While it’s not currently the most effective, that may change in the future, as it is updated frequently.

MDX

MDX is another great option for vocal isolation. Like UVR, it separates a track into an instrumental stem and a vocal stem. MDX models also iterate frequently. As of June 4, 2024, MDX 23c 8K is the most effective MDX model.

Demucs

Demucs, created by Meta (the parent company of Facebook), can separate a track into as many as six stems: bass, drums, guitar, vocals, piano and other. On x-minus, Demucs offers only the four-stem option, without guitar and piano. Unfortunately, the individual stems do not isolate as cleanly as those from other AI models at this time. It can still be useful in some cases for adding sound back into the mix that is missing from an isolation made with another method.
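To make the idea of working with individual stems concrete, here is a small sketch that combines already-decoded stems in memory. It assumes each stem is a list of float samples in the range -1.0 to 1.0 at the same sample rate. The function name, the `exclude` and `gains` parameters, and the stem keys are illustrative and do not come from Demucs itself.

```python
def combine_stems(stems, exclude=("vocals",), gains=None):
    """Sum every stem except the excluded ones into a single track.

    stems : dict mapping a stem name to a list of float samples
    gains : optional dict mapping a stem name to a volume multiplier
    """
    gains = gains or {}
    selected = {name: s for name, s in stems.items() if name not in exclude}
    if not selected:
        raise ValueError("no stems left to combine")

    # Shorter stems are treated as silent past their end.
    length = max(len(samples) for samples in selected.values())
    mix = [0.0] * length
    for name, samples in selected.items():
        gain = gains.get(name, 1.0)
        for i, value in enumerate(samples):
            mix[i] += gain * value

    # Clip to the valid range so loud passages do not overflow.
    return [max(-1.0, min(1.0, v)) for v in mix]
```

The same function covers the add-back case: pass an instrumental from another model together with a single Demucs stem, and use `gains` to blend the stem in quietly, for example `combine_stems({"instrumental": inst, "guitar": guitar}, exclude=(), gains={"guitar": 0.5})`.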