Can anyone recommend a workflow/tool(s) for syncing a plaintext diarized transcript to audio to obtain high-quality subtitles?

The MLP wiki has high-quality diarized transcripts that look like this:

Pinkie Pie: I'm awake! I'm awake! What time is it?! Did we sleep through the test?! [snores]
[beeping stops]
Rarity: No, but school starts in thirty minutes!
Sunset Shimmer: [sighs] How's everybody feeling about our test?
Fluttershy: Even after our all-night study session, I still don't know the difference between vaporization and sublimation.

Ideally, I'd like a tool I can feed this to that will spit out synced subs. The exact per-character diarization isn't actually important: I'll certainly strip out the character names (and probably the [SDH things]) to avoid alignment problems, and they won't be in the final subtitles. What I do want is for the boundaries between speaker utterances to be respected.
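For what it's worth, the stripping step itself is easy to script. Here's a minimal sketch: the speaker-prefix and bracketed-cue patterns are assumptions based on the sample above, so adjust the regexes if the wiki transcripts use other conventions.

```python
import re

def clean_line(line: str) -> str:
    """Strip a leading 'Speaker Name:' prefix and any [bracketed] SDH cues."""
    line = re.sub(r"^[^:\[\]]+:\s*", "", line)  # drop "Pinkie Pie: " etc.
    line = re.sub(r"\[[^\]]*\]", "", line)      # drop [snores], [beeping stops], ...
    return re.sub(r"\s{2,}", " ", line).strip() # collapse leftover whitespace

transcript = [
    "Pinkie Pie: I'm awake! I'm awake! What time is it?! Did we sleep through the test?! [snores]",
    "[beeping stops]",
    "Rarity: No, but school starts in thirty minutes!",
]
# Lines that were pure sound cues come out empty and get filtered.
cleaned = [c for c in (clean_line(l) for l in transcript) if c]
print(cleaned)
```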

MFA seems like it could work, but I'm unsure how best to preprocess the transcript/audio to get good results. I tried aligning with and without the built-in segment command, as well as bumping the beam and retry-beam values, with less-than-stellar results.

I'm also aware of some commercial services that offer this functionality (Descript and YouTube), but I'm looking for a solution I can run locally.

Any pointers would be greatly appreciated!


Sort of a separate question, but is there a tool that allows precise line-splitting when working from word-timestamped transcripts (e.g. the JSON output of Whisper)? It seems like it should be fairly straightforward, but Subtitle Edit doesn't appear to do it, and I had trouble finding a tool that can. It would be a really nice feature, since splitting lines is probably the most tedious task when dealing with automatic transcripts.
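In case nothing turns up, the core of that splitting logic is small enough to sketch. This assumes Whisper-style word entries ({"word", "start", "end"}, with words carrying their leading space); the function name, cue schema, and 42-character limit are my own choices, not anything standard.

```python
def split_into_cues(words, max_chars=42):
    """Greedily pack word-timestamped entries into cues of at most
    max_chars characters, breaking only at word boundaries."""
    cues, current = [], []
    for w in words:
        candidate = "".join(x["word"] for x in current + [w]).strip()
        if current and len(candidate) > max_chars:
            cues.append({
                "start": current[0]["start"],
                "end": current[-1]["end"],
                "text": "".join(x["word"] for x in current).strip(),
            })
            current = [w]
        else:
            current.append(w)
    if current:
        cues.append({
            "start": current[0]["start"],
            "end": current[-1]["end"],
            "text": "".join(x["word"] for x in current).strip(),
        })
    return cues

# Toy input mimicking Whisper's word-level JSON output.
words = [
    {"word": " No,",      "start": 4.1, "end": 4.3},
    {"word": " but",      "start": 4.3, "end": 4.5},
    {"word": " school",   "start": 4.5, "end": 4.8},
    {"word": " starts",   "start": 4.8, "end": 5.1},
    {"word": " in",       "start": 5.1, "end": 5.2},
    {"word": " thirty",   "start": 5.2, "end": 5.5},
    {"word": " minutes!", "start": 5.5, "end": 5.9},
]
print(split_into_cues(words, max_chars=20))
```

A real version would also want to prefer breaks at punctuation and at gaps between word timestamps (which is where utterance boundaries tend to fall), but the greedy pass above already beats splitting by hand.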