Once you’ve prepared your audio, you’ll put it inside the program’s directory structure. In my case, it was /dataset_raw/elvis/. Then you’ll have to run a few commands in this order to start training the model. “svc pre-resample” converts your audio to mono 44.1 kHz files. Following that, “svc pre-config” downloads a few configuration files and puts them in the correct directory. “svc pre-hubert” downloads and runs a pre-trained speech model. It contains guidelines so that you get a predictable output when creating your own model in the last step.
This last step is “svc train -t”. It starts the training and opens up a browser window with TensorBoard. With TensorBoard, you can keep track of the progress of your model. Once you are satisfied with the results, you can stop the training. The progress is measured in steps. In the configuration files, you can change how often you want to write the model to disk. For Elvis, I wanted to have a copy after every 100 steps and was ultimately satisfied at 211k steps.
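The command order described above can be sketched as a small Python script. This is only an illustration under stated assumptions: it assumes the `svc` command is installed and on the PATH, and that the script is run from the project directory that holds the dataset folder. The `PIPELINE` list and the `run_pipeline` helper are hypothetical names invented for this sketch, not part of so-vits-svc; only the four commands themselves come from the walkthrough.

```python
import shutil
import subprocess

# Command order taken from the walkthrough above; no extra flags are assumed.
PIPELINE = [
    ["svc", "pre-resample"],  # convert the audio to mono 44.1 kHz files
    ["svc", "pre-config"],    # download the configuration files
    ["svc", "pre-hubert"],    # download and run the speech model step
    ["svc", "train", "-t"],   # start training and open TensorBoard
]


def run_pipeline(commands=PIPELINE, dry_run=True):
    """Return the planned command strings; execute them only if dry_run is False.

    Hypothetical helper for illustration. Execution stops at the first
    command that exits with a non-zero status.
    """
    planned = [" ".join(cmd) for cmd in commands]
    if dry_run:
        return planned
    if shutil.which(commands[0][0]) is None:
        raise FileNotFoundError("the 'svc' command was not found on PATH")
    for cmd in commands:
        # check=True raises CalledProcessError if a step fails,
        # so later steps never run on incomplete output.
        subprocess.run(cmd, check=True)
    return planned


if __name__ == "__main__":
    # Dry run by default: print the commands without executing anything.
    for line in run_pipeline():
        print(line)
```

Calling `run_pipeline(dry_run=False)` would actually execute the steps. Training runs until it is stopped by hand, so the final command does not return on its own.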
After van Voorst ran 211,000 steps of training, the Elvis AI voice model was ready for action. Next, van Voorst shared the model with others online. There I Ruined It creator Dustin Ballard downloaded the Elvis vocal model—people frequently share them through Discord communities of like-minded voice-cloning hobbyists—and his part of the work began.
To craft the song, Ballard opened a conventional music workstation app, such as Pro Tools, and imported an instrumental backing track for the Elvis hit Don’t Be Cruel, played by human musicians. He then sang the lyrics of Baby Got Back to the tune of Don’t Be Cruel, recording his performance, and did the same for any backing vocals in the song. Finally, he ran his recorded vocals through van Voorst’s Elvis AI model using so-vits-svc, making them sound as if Elvis were singing them instead.
To make the song sound authentic and as close to the original record as possible, van Voorst said, it’s best to avoid modern techniques like pitch correction or time stretching. “Phrasing and timing the vocal during recording is the best way to make sure it sounds natural,” he said, pointing out some telltale signs in the Baby Got Back AI song. “I hear some remnants of a time stretching feature being used on the word ‘sprung’ and a little bit of pitch correction, but otherwise it sounds very natural.”