Can anyone recommend a workflow/tool(s) for syncing a plaintext diarized transcript to audio to obtain high-quality subtitles?
The MLP wiki has high-quality diarized transcripts that look like this:
Pinkie Pie: I'm awake! I'm awake! What time is it?! Did we sleep through the test?! [snores]
[beeping stops]
Rarity: No, but school starts in thirty minutes!
Sunset Shimmer: [sighs] How's everybody feeling about our test?
Fluttershy: Even after our all-night study session, I still don't know the difference between vaporization and sublimation.
Ideally, I'd like to have a tool that I can feed this to which will spit out some synced subs. The exact per-character diarization isn't actually important, since I'll certainly strip out the character names (and probably the [SDH things]) to avoid alignment problems, and they won't be in the final subtitles anyway; rather, I want to make sure that the boundaries between speaker utterances are respected.
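For reference, my rough plan for that preprocessing step is something like the sketch below; the regexes are just my guesses at the wiki's `Speaker: line [cue]` formatting, not battle-tested:

```python
import re

def clean_transcript(raw: str) -> list[str]:
    """Strip speaker labels and [SDH] cues; one utterance per line."""
    utterances = []
    for line in raw.splitlines():
        # Drop a leading "Character Name:" label (possibly multi-word)
        line = re.sub(r"^[A-Z][\w' .-]*:\s*", '', line)
        # Drop bracketed cues like [snores] or [beeping stops]
        line = re.sub(r'\[[^\]]*\]', '', line)
        line = ' '.join(line.split())  # collapse leftover whitespace
        if line:
            utterances.append(line)
    return utterances

with open('transcript.txt') as f:
    for utterance in clean_transcript(f.read()):
        print(utterance)
```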
MFA seems like it could work, but I'm unsure how best to preprocess the transcript/audio to get good results. I tried aligning with and without the built-in `segment` command, as well as bumping the `beam` and `retry_beam` values, with less-than-stellar results.
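In case it's useful context: assuming I do get a decent alignment out of MFA, my plan for turning its TextGrid output into subs is roughly the sketch below. It uses the `textgrid` package from PyPI, and it assumes the word tier is named "words" and that utterance boundaries show up as empty-mark silence intervals; both of those are assumptions on my part, so treat it accordingly:

```python
import textgrid  # pip install textgrid

def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

tg = textgrid.TextGrid.fromFile('episode.TextGrid')
words = tg.getFirst('words')  # assumed name of MFA's word tier

# Merge consecutive word intervals into cues, breaking wherever an
# empty-mark (silence) interval appears between words.
cues, current, start, end = [], [], None, None
for iv in words:
    if iv.mark.strip():
        if start is None:
            start = iv.minTime
        current.append(iv.mark)
        end = iv.maxTime
    elif current:
        cues.append((start, end, ' '.join(current)))
        current, start = [], None
if current:
    cues.append((start, end, ' '.join(current)))

with open('episode.srt', 'w') as f:
    for i, (st, en, text) in enumerate(cues, 1):
        f.write(f"{i}\n{srt_time(st)} --> {srt_time(en)}\n{text}\n\n")
```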
I'm also aware of some commercial services that offer this functionality (Descript and YouTube), but I'm looking for a solution I can run locally.
Any pointers would be greatly appreciated!
Sort of a separate question, but is there a tool that allows for precise line-splitting when using word-timestamped transcripts (e.g. the JSON output of Whisper)? It seems like it should be fairly straightforward, but SubtitleEdit doesn't seem to support it, and I had trouble finding any tool that does. It would be a really nice feature, since splitting lines is probably the most tedious part of dealing with automatic transcripts.
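To make what I mean concrete, here's the rough logic I have in mind, as a sketch. It assumes openai-whisper's JSON output with word timestamps enabled (each segment carrying a "words" list of {word, start, end} dicts) and greedily packs words into lines up to a character cap:

```python
import json

MAX_CHARS = 42  # common per-line length cap for subtitles

def split_words(words, max_chars=MAX_CHARS):
    """Greedily pack word timestamps into cues of at most max_chars."""
    cues, text, start, end = [], '', None, None
    for w in words:
        token = w['word']  # whisper tokens keep their leading space
        if text and len(text) + len(token) > max_chars:
            cues.append((start, end, text.strip()))
            text, start = '', None
        if start is None:
            start = w['start']
        text += token
        end = w['end']
    if text:
        cues.append((start, end, text.strip()))
    return cues

with open('episode.json') as f:
    result = json.load(f)

all_words = [w for seg in result['segments'] for w in seg.get('words', [])]
for start, end, line in split_words(all_words):
    print(f"{start:8.2f} --> {end:8.2f}  {line}")
```

Obviously the ideal would be something smarter (breaking at punctuation or clause boundaries rather than a hard character count), which is exactly why I'd love to find an existing tool for this.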