For reference, I'm transcribing research-related podcasts, meaning speech doesn't overlap a lot, which would be a problem for WhisperX from what I understand. There are also a lot of accents, which put a strain on Whisper (though it still does well), but surely help WhisperX. It did have issues figuring out the number of speakers on its own, but that wasn't a problem for my use case.
WhisperX does diarization, but I don’t see any mention of it fulfilling my ask, which makes me think I didn’t communicate it well.
Here’s an example for clarity:
1. AI is trained on the voice of a podcast host. As a side effect, it now (presumably) has all the information it needs to replicate the voice.
2. All the past podcasts can be processed with the AI comparing each detected voice against the known voice, which leads to highly accurate labelling of that person.
3. Probably a nice side bonus: if two people with different registers are speaking over each other, the AI could separate them out. “That’s clearly person A and the other one is clearly person C.”
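To make step 2 concrete, here is a minimal sketch of the matching part, assuming you already have a fixed-length voice embedding for each detected segment and one reference embedding per known speaker (how those are produced is out of scope here). The names `cosine_similarity`, `label_segments`, and the `0.75` threshold are hypothetical choices for illustration, not part of WhisperX or any other tool.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_segments(segments, enrolled, threshold=0.75):
    """Label each segment with the closest enrolled speaker, or "unknown".

    segments: list of (start, end, embedding) tuples (assumed input shape).
    enrolled: dict mapping speaker name -> reference embedding.
    threshold: minimum similarity to accept a match (arbitrary, needs tuning).
    """
    labelled = []
    for start, end, embedding in segments:
        best_name, best_score = "unknown", threshold
        for name, reference in enrolled.items():
            score = cosine_similarity(embedding, reference)
            if score >= best_score:
                best_name, best_score = name, score
        labelled.append((start, end, best_name))
    return labelled


# Toy vectors standing in for real voice embeddings.
enrolled = {"host": [1.0, 0.0, 0.0]}
segments = [
    (0.0, 2.0, [0.9, 0.1, 0.0]),  # close to the host reference
    (2.0, 4.0, [0.0, 1.0, 0.0]),  # unrelated voice
]
print(label_segments(segments, enrolled))
```

Anything below the threshold stays "unknown" rather than being forced onto the nearest known voice, which is probably what you want for guests who were never enrolled.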
On my wishlist would be a local model that can generate new voices based on descriptions such as "rough, hard-boiled, detective-like man" or "old fatherly grampa".
You might be interested in Speech Studio, a cool app Microsoft made that I don't think I've seen anyone talk about anywhere: https://speech.microsoft.com/
I don't recall their voices being the most descriptive, but they had a lot. They also let you lay out a bunch of text and have different voices speak each line, just like a movie script.
Whisper (via WhisperX) can do diarization, but I'm not sure it will "remember" the voices well enough across recordings. You might simply have to stitch all the recordings together, run the combined audio through it to get the diarized transcript, then process that however you want.
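For the "process that however you want" part, a rough sketch of one post-processing step: collapsing consecutive segments from the same speaker into single turns. The segment shape (dicts with `speaker`, `start`, `end`, `text`) is an assumption about what the diarized output looks like, and `merge_turns` is a made-up helper name, so adapt it to whatever your tool actually emits.

```python
def merge_turns(segments):
    """Merge consecutive segments spoken by the same speaker into one turn.

    segments: list of dicts with "speaker", "start", "end", "text" keys,
    assumed to be sorted by start time (hypothetical input shape).
    """
    turns = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            # Speaker changed (or first segment): start a new turn.
            turns.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": text,
            })
    return turns


# Made-up segments for illustration only.
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.5, "text": " Welcome back."},
    {"speaker": "SPEAKER_00", "start": 1.5, "end": 3.0, "text": " Today we talk speech."},
    {"speaker": "SPEAKER_01", "start": 3.0, "end": 4.0, "text": " Thanks for having me."},
]
for turn in merge_turns(segments):
    print(f'[{turn["start"]:.1f}-{turn["end"]:.1f}] {turn["speaker"]}: {turn["text"]}')
```

If you stitched several episodes into one file, you would also need to keep the offsets where each episode starts so the timestamps can be mapped back afterwards.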