This is HN, so I'm surprised that no one in the comments section has run this locally. :)
Following the instructions in their repo (and moving the checkpoints/ and resources/ folder into the "nested" openvoice subfolder), I managed to get the Gradio demo running. Simple enough.
It appears to be quicker than XTTS2 on my machine (RTX 3090), and uses approximately 1.5GB of VRAM. The Gradio demo is limited to 200 characters, perhaps due to resource-usage concerns, but it seems to run at around 8x realtime (8 seconds of speech for about 1 second of processing time).
EDIT: patched the Gradio demo for longer text; it's way faster than that. One minute of speech only took ~4 seconds to render. Default voice sample, reading this very comment: https://voca.ro/18JIHDs4vI1v
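For anyone curious how I'm computing those speedups, the realtime factor is just audio duration over wall-clock time (numbers from my runs above):

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Speed relative to realtime playback: >1 means faster than realtime."""
    return audio_seconds / wall_seconds

# Gradio demo with the 200-char cap: ~8 s of speech in ~1 s of processing
print(realtime_factor(8, 1))    # 8.0
# After patching for longer text: ~60 s of speech in ~4 s
print(realtime_factor(60, 4))   # 15.0
```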
I had to write out acronyms -- XTTS2 to "ex tee tee ess two", for example.
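A tiny preprocessor automates that step; the phonetic spellings below are my own workaround, not anything OpenVoice ships:

```python
import re

# Hand-written pronounceable spellings -- my own workaround, not part of OpenVoice
ACRONYMS = {
    "XTTS2": "ex tee tee ess two",
    "TTS": "tee tee ess",
    "GPU": "gee pee you",
}

def expand_acronyms(text: str) -> str:
    """Replace known acronyms with pronounceable spellings before synthesis."""
    # Longest keys first, so "XTTS2" matches before the embedded "TTS"
    pattern = re.compile(
        "|".join(sorted(map(re.escape, ACRONYMS), key=len, reverse=True))
    )
    return pattern.sub(lambda m: ACRONYMS[m.group(0)], text)

print(expand_acronyms("XTTS2 is a TTS engine"))
# ex tee tee ess two is a tee tee ess engine
```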
The voice clarity is better than XTTS2, too, but the speech can sound a bit stilted and, well, robotic/TTS-esque compared to it. The cloning consistency is definitely a step above XTTS2 in my experience -- XTTS2 would sometimes have random pitch shifts or plosives/babble in the middle of speech.
I am trying to run it locally but it doesn't quite work for me.
I was able to run the demos all right, but when trying to use another reference speaker (in demo_part1), the result doesn't sound at all like the source (it's just a random male voice).
I'm also trying to produce French output, using a reference audio file in French for the base speaker, and a text in French. This triggers an error in api.py line 75 that the source language is not accepted.
Indeed, in api.py line 45 the only two source languages allowed are English and Chinese; simply adding French to language_marks in api.py line 43 avoids the error, but produces a weird, unintelligible result with a very heavy English accent and pronunciation.
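For anyone following along, the gate works conceptually like this (a simplified sketch of the kind of check involved -- names and structure are illustrative, not the actual OpenVoice source):

```python
# Simplified sketch of the language gate described above -- illustrative
# names, not the real api.py code.
language_marks = {
    "English": "EN",
    "Chinese": "ZH",
    # "French": "FR",  # adding a key silences the error, but the
    #                  # English/Chinese phoneme front end still mangles French
}

def check_language(language: str) -> str:
    """Reject any source language the front end has no support for."""
    if language not in language_marks:
        raise ValueError(f"Source language {language!r} is not accepted")
    return language_marks[language]
```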
I guess one would need to generate source_se again, and probably mess with config.json and checkpoint.pth as well, but I could not find instructions on how to do this...?
Edit -- tried again on https://app.myshell.ai/ The result sounds French alright, but still nothing like the original reference. It would be absolutely impossible to confuse one with the other, even for someone who didn't know the person very well.
I played with it some more and I have to agree. For actual voice _cloning_, XTTS2 sounds much, much closer to the original speaker. But the resulting output is also much more unpredictable and sometimes downright glitchy compared to OpenVoice. XTTS2 also tries to "act out" the implied emotion/tone/pitch/cadence in the input text, for better or worse.
But my use case is just to have a nice-sounding local TTS engine, and current text-to-phoneme conversion quirks aside, OpenVoice seems promising. It's fast, too.
> but when trying to use another reference speaker (in demo_part1), the result doesn’t sound at all like the source
I’ve noticed the same thing and I wonder if there is maybe some undocumented information about what makes a good voice sample for cloning, perhaps in terms of what you might call “phonemic inventory”. The reference sample seems really dense.
> Indeed, in api.py line 45 the only two source languages allowed are English and Chinese
If you look at the code, beyond what the model itself does, it relies on the surrounding infrastructure converting the input text to the International Phonetic Alphabet (IPA) as part of the process, and that is only implemented for English and Mandarin (though cleaners.py has broken references to routines for Japanese and Korean).
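The structure is roughly a per-language dispatch to a grapheme-to-IPA routine, so any language without one fails regardless of what the model could do. This is an illustrative sketch with hypothetical function names, not the real cleaners.py:

```python
# Illustrative sketch of why only two source languages work: the text
# front end dispatches to a per-language grapheme-to-IPA routine, and
# only two such routines exist. Names are hypothetical.

def english_to_ipa(text: str) -> str:
    ...  # the real code runs an English grapheme-to-phoneme pipeline

def mandarin_to_ipa(text: str) -> str:
    ...  # the real code goes through pinyin-to-IPA rules

CLEANERS = {
    "EN": english_to_ipa,
    "ZH": mandarin_to_ipa,
    # Japanese/Korean are referenced in cleaners.py, but the routines
    # they point at are missing.
}

def to_ipa(text: str, lang: str) -> str:
    """Convert text to IPA, failing loudly for unsupported languages."""
    try:
        return CLEANERS[lang](text)
    except KeyError:
        raise ValueError(f"No IPA front end for {lang!r}") from None
```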
Give https://github.com/aedocw/epub2tts a look, the latest update enables use of MS Edge cloud-based TTS so you don't need a local GPU and the quality is excellent.
I want to try chaining XTTS2 with something like RVCProject. The idea is to generate the speech in one step, then clone a voice in the audio domain in a second step.
I have got to build or buy a new computer capable of playing with all this cool shit. I built my last "gaming" PC in 2016, so its hardware isn't really ideal for AI shenanigans, and my Macbook for work is an increasingly crusty 2019 model, so that's out too.
Yeah, I could rent time on a server, but that's not as cool as just having a box in my house that I could use to play with local models. Feels like I'm missing a wave of fun stuff to experiment with, but hardware is expensive!
> its hardware isn't really ideal for AI shenanigans
FWIW, I was in the same boat as you and decided to start cheap: old gaming machines can handle AI shenanigans just fine with the right GPU. I use a 2017 workstation (Zen1) and an Nvidia P40 from around the same era, which can be had for <$200 on eBay/Amazon. The P40 has 24GB of VRAM, which is more than enough for a good chunk of quantized LLMs or diffusion models, and is in the same performance ballpark as the free Colab tensor hardware.
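Back-of-envelope for why 24GB goes a long way with quantized models (a rough weights-only estimate; it ignores KV cache, activations, and runtime overhead):

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weights-only memory footprint in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 13B model at 4-bit quantization: ~6.5 GB of weights
print(round(quantized_weight_gb(13, 4), 1))   # 6.5
# Even a 30B model at 4-bit (~15 GB) fits in the P40's 24 GB
print(round(quantized_weight_gb(30, 4), 1))   # 15.0
```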
If you're just dipping your toes without committing, I'd recommend that route. The P40 is a data center card and expects higher airflow than desktop GPUs, so you'll probably have to buy a "blower kit" or 3D-print a fan shroud, and make sure it fits inside your case. That's another $30-$50. The bigger the fan, the quieter it can run. If you already have a high-end gaming PC/workstation from 2016, you can dive into local AI for $250 all-in.
Edit: didn't realize how cheap P40s now are! I bought mine a while back.
A Mac Studio or MacBook Pro if you want to run the larger models. Otherwise, a gaming PC with an RTX 4090, or a used RTX 3090 if you want something cheaper. A used dual-3090 setup can also be a good deal, but that's more in the build-it-yourself category than off the shelf.
I went the 4090 route myself recently, and I feel like all should be warned - memory is a major bottleneck. For a lot of tasks, folks may get more mileage out of multiple 3090s if they can get them set up to run parallel.
Still waiting on being able to afford the next 4090 + eGPU case et al. There are a lot of things this rig struggles with, running OOM even on inference with some of the more recent SD models.
Sorry if this is a silly question - I was never a Mac user, but I quick googled Mac Studio and it seems it's just the computer. Can I plug it to any monitor / use any keyboard and mouse, or do I need to use everything from Apple with it?
You can, but with some caveats. Not all screen resolutions work well with macOS, though with BetterDisplay it will still usually work. If you want Touch ID, it's better to get the Magic Keyboard with Touch ID.
Any monitor and keyboard will work, however Apple keyboards have a couple extra keys not present on Windows keyboards so require some key remapping to allow access to all typical shortcut key combinations.
I'm in exactly the same boat. Yeah ofc you can run LMs on cloud servers but my dream project would be to construct a new gaming PC (mine is too old) and serve a LM on it, then serve an AI agent app which I can talk to from anywhere.
Has anyone had luck buying used GPUs, or is that something I should avoid?
I bought some used GPUs during the last mining thing. They all worked fine except for some oddball Dell models that the seller was obviously trying to fix a problem on (and they took them back without question, even paying return shipping).
And old mining GPUs are A-OK, generally: despite warnings from the peanut gallery for over a decade that mining ruins video cards, this has never really been the case. Profitable miners have always tended to treat these things very carefully, undervolting (and often underclocking) them and keeping an eye on them so they could run as cool and inexpensively as possible. Killing cards is bad for profits, so miners aimed to keep them alive.
GPUs that were used for gaming are also OK, usually. They'll have fewer hours of hard[er] work on them, but will have more thermal cycles as gaming tends to be much more intermittent than continuous mining is.
The usual caveats apply as when buying anything else (used, "new", or whatever) from randos on teh Interwebz. (And fans eventually die, and so do thermal interfaces (pads and thermal compound), but those are all easily replaceable by anyone with a small toolkit and half a brain worth of wit.)