Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Does anyone know which local models are doing the “opposite”: Identify a voice well enough to do speaker diarization across multiple recordings?


Whisper doesn't, but WhisperX <https://github.com/m-bain/whisperX/> does. I am using it right now and it's perfectly serviceable.

For reference, I'm transcribing research-related podcasts, meaning speech doesn't overlap a lot, which would be a problem for WhisperX from what I understand. There's also a lot of accents, which are straining on Whisper (though it's also doing well), but surely help WhisperX. It did have issues with figuring out the number of speakers on it's own, but that wasn't a problem for my use case.


WhisperX does diarization, but I don’t see any mention of it fulfilling my ask which makes me think I didn’t communicate it well.

Here’s an example for clarity:

1. AI is trained on the voice of a podcast host. As a side effect it now (presumably) has all the information it needs to replicate the voice

2. All the past podcasts can be processed with the AI comparing the detected voice against the known voice which leads to highly-accurate labelling of that person

3. Probably a nice side bonus: if two people with different registers are speaking over each other the AI could separate them out. “That’s clearly person A and the other one is clearly person C”


You can check out PicoVoice Eagle (paid product): https://picovoice.ai/docs/eagle/

You pass N number of PCM frames through their trainer and once you reach a certain percentage you can extract an embedding you can save.

Then you can identify audio against the set of identified speakers and it will return percentage matches for each.


On my wishlist would be a local model that can generate new voices based on descriptions such as "rough detective-like hard boiled man" or "old fatherly grampa"


You might be interested in this cool app that Microsoft made that I don't think I've seen anyone talk about anywhere called Speech Studio. https://speech.microsoft.com/

I don't recall their voices being the most descriptive but they had a lot. They also let layout a bunch of text & have different voices speak each line just like a movie script.


Whisper can do diarization but not sure it will "remember" the voices well enough. You might simply have to stitch all the recordings together, run it through Whisper to get the diarized transcript, then process that how you want.


Whisper does not support diarization. There are a number of projects that try to add it.


Picovoice says they do this but it's a paid product. It supposedly runs on the device but you still need a key and have to pay per minute.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: