Does anyone know which local models are doing the “opposite”: Identify a voice w...

Teleoflexuous · on March 29, 2024

Whisper doesn't, but WhisperX <https://github.com/m-bain/whisperX/> does. I am using it right now and it's perfectly serviceable.

For reference, I'm transcribing research-related podcasts, meaning speech doesn't overlap much; overlapping speech would be a problem for WhisperX, from what I understand. There are also a lot of accents, which strain Whisper (though it still does well), but surely help WhisperX. It did have issues figuring out the number of speakers on its own, but that wasn't a problem for my use case.

joshspankit · on March 29, 2024

WhisperX does diarization, but I don’t see any mention of it fulfilling my ask which makes me think I didn’t communicate it well.

Here’s an example for clarity:

1. AI is trained on the voice of a podcast host. As a side effect it now (presumably) has all the information it needs to replicate the voice

2. All the past podcasts can be processed with the AI comparing the detected voice against the known voice, which leads to highly accurate labelling of that person

3. Probably a nice side bonus: if two people with different registers are speaking over each other the AI could separate them out. “That’s clearly person A and the other one is clearly person C”
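
To make step 2 concrete, here is a minimal sketch of what "comparing the detected voice against the known voice" could look like, assuming some model has already turned each audio segment into a fixed-length speaker embedding. The function names, the toy 3-dimensional vectors, and the 0.75 threshold are all made up for illustration; real embeddings are much higher-dimensional and the threshold would need tuning.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def label_segments(segments, known_voice, name, threshold=0.75):
    """Label each (start, end, embedding) segment as `name` or "unknown".

    `threshold` is an assumed cut-off, not a value from any real system.
    """
    labelled = []
    for start, end, embedding in segments:
        score = cosine_similarity(embedding, known_voice)
        speaker = name if score >= threshold else "unknown"
        labelled.append((start, end, speaker, round(score, 3)))
    return labelled

# Toy embeddings standing in for the output of a speaker-embedding model.
host = [0.9, 0.1, 0.2]
segments = [
    (0.0, 4.2, [0.88, 0.12, 0.25]),  # close to the host's voice
    (4.2, 9.0, [0.1, 0.95, 0.3]),    # someone else
]
for row in label_segments(segments, host, "host"):
    print(row)
```

This only covers labelling; separating two people talking over each other (step 3) is a different and harder problem that this sketch does not attempt.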

c0brac0bra · on March 29, 2024

You can check out PicoVoice Eagle (paid product): https://picovoice.ai/docs/eagle/

You pass N number of PCM frames through their trainer and once you reach a certain percentage you can extract an embedding you can save.

Then you can identify audio against the set of identified speakers and it will return percentage matches for each.
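
Roughly, the enroll-then-identify flow looks like the toy sketch below. To be clear, this is not the Eagle API: `ToyEnroller`, `match_scores`, and the idea of averaging per-frame feature vectors are hypothetical stand-ins, only meant to show the shape of "feed frames until you hit 100%, save a profile, then score new audio against every saved profile".

```python
import math

class ToyEnroller:
    """Hypothetical stand-in for an enrollment step; not the Eagle API."""

    def __init__(self, frames_needed):
        self.frames_needed = frames_needed
        self.frames = []

    def feed(self, frame_features):
        """Add one frame's feature vector and return progress in percent."""
        self.frames.append(frame_features)
        return min(100.0, 100.0 * len(self.frames) / self.frames_needed)

    def export_profile(self):
        """Average the collected frames into a single profile vector."""
        if len(self.frames) < self.frames_needed:
            raise ValueError("enrollment is not complete yet")
        dims = len(self.frames[0])
        return [sum(f[i] for f in self.frames) / len(self.frames) for i in range(dims)]

def match_scores(features, profiles):
    """Return a 0..1 score per enrolled speaker (cosine similarity, clamped at 0)."""
    scores = {}
    for name, profile in profiles.items():
        dot = sum(x * y for x, y in zip(features, profile))
        norm = math.sqrt(sum(x * x for x in features)) * math.sqrt(sum(y * y for y in profile))
        scores[name] = max(0.0, dot / norm) if norm else 0.0
    return scores

# Enroll one speaker from two toy frames, then score new audio against two profiles.
enroller = ToyEnroller(frames_needed=2)
enroller.feed([1.0, 0.0])
progress = enroller.feed([0.8, 0.2])
profiles = {"alice": enroller.export_profile(), "bob": [0.0, 1.0]}
scores = match_scores([0.9, 0.1], profiles)
print(progress, scores)
```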

Drakim · on March 29, 2024

On my wishlist would be a local model that can generate new voices based on descriptions such as "rough detective-like hard boiled man" or "old fatherly grampa"

mattferderer · on March 29, 2024

You might be interested in this cool app that Microsoft made that I don't think I've seen anyone talk about anywhere called Speech Studio. https://speech.microsoft.com/

I don't recall their voices being the most descriptive, but they had a lot. They also let you lay out a bunch of text and have different voices speak each line, just like a movie script.

satvikpendem · on March 29, 2024

Whisper can do diarization but not sure it will "remember" the voices well enough. You might simply have to stitch all the recordings together, run it through Whisper to get the diarized transcript, then process that how you want.
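
If you go the stitching route, the fiddly part is mapping timestamps in the stitched file back to the original recordings. A minimal sketch, independent of whichever tool produces the diarized segments; the durations and function names here are made up for illustration:

```python
from bisect import bisect_right

def build_offsets(durations):
    """Return the start time of each recording inside the stitched file."""
    offsets, total = [], 0.0
    for duration in durations:
        offsets.append(total)
        total += duration
    return offsets

def locate(timestamp, offsets):
    """Map a stitched-file timestamp to (recording index, local time)."""
    if timestamp < 0:
        raise ValueError("timestamp must be non-negative")
    index = bisect_right(offsets, timestamp) - 1
    return index, timestamp - offsets[index]

# Three hypothetical episodes of 30, 40 and 25 minutes, in seconds.
durations = [1800.0, 2400.0, 1500.0]
offsets = build_offsets(durations)
print(locate(1900.0, offsets))
```

A segment that starts at 1900 seconds in the stitched audio lands 100 seconds into the second recording, so speaker labels assigned on the stitched file can be carried back to each episode.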

beardedwizard · on March 29, 2024

Whisper does not support diarization. There are a number of projects that try to add it.

c0brac0bra · on March 29, 2024

Picovoice says they do this but it's a paid product. It supposedly runs on the device but you still need a key and have to pay per minute.