07:04 | <bakkot> | openai announced a new Whisper model, but still no streaming support :( all the services which offer real-time transcription, including those which just wrap whisper with some hacks, are basically garbage relative to whisper. google has a new model (Chirp) this year, but it doesn't work with real-time transcription either. whisper gets near-perfect transcriptions for content like our meetings, everyone else misses one word in five. but using whisper without any of the streaming hacks (which lower quality a lot) means transcriptions will necessarily be 30 seconds behind (+ time to transcribe and network latency, so in practice more like 40 seconds). I don't think automatic transcription is going to be viable until something in this landscape changes. (cc littledan) I might set up a separate 40-second-latency transcription notes doc at the next meeting to help with fixing things the human transcriptionists miss. anyone happen to have played with any other promising real-time transcription services recently? |
11:04 | <littledan> | Thanks for the report, bakkot . Let's check up on this again at the end of next year. |
11:04 | <littledan> | sounds like we need to repeat the budget request for transcriptionists |
11:04 | <littledan> | Are you saying we get good accuracy with a 40-second delay? |
11:04 | <ryzokuken> | it would be accurate, yeah IIUC |
11:05 | <ryzokuken> | the 40 second delay is whisper's only shortcoming |
11:06 | <ryzokuken> | actually, I haven't tried it myself. Wonder how well it does with various accents |
11:07 | <ryzokuken> | they have an example with a pretty thick accent though, fun |
11:07 | <ryzokuken> | https://openai.com/research/whisper |
15:11 | <bakkot> | right. Whisper is very accurate in my tests, but it fundamentally operates on 30-second chunks of audio and takes a little while to run (say 10 seconds per chunk), so trying to stream it to the notes doc would mean that every 30 seconds we get a high-quality transcript of the portion of the meeting starting 40 seconds ago and running through 10 seconds ago. I haven't actually set that up but I expect it to work. |
15:12 | <bakkot> | unfortunately 40 seconds of lag is a lot of lag |
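A minimal sketch of the chunked pipeline bakkot describes above — record 30 seconds, transcribe locally, post the text roughly 40 seconds behind live. The chat doesn't name any libraries, so the openai-whisper package and sounddevice here are assumptions, and append_to_notes_doc is a hypothetical placeholder for the notes-doc integration:

```python
# Hedged sketch of the chunked streaming pipeline described above.
# Assumptions (not from the chat): the openai-whisper package for local
# inference, sounddevice for audio capture, and a hypothetical
# append_to_notes_doc() standing in for the notes-doc integration.
from concurrent.futures import ThreadPoolExecutor

import sounddevice as sd
import whisper

SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz mono float32 audio
CHUNK_SECONDS = 30     # Whisper fundamentally operates on 30-second windows

model = whisper.load_model("medium")
pool = ThreadPoolExecutor(max_workers=1)  # overlap transcription with capture

def transcribe_and_post(audio):
    # Transcription takes ~10 s per chunk on top of the 30 s of capture,
    # so this text lands in the doc ~40 s after the audio it covers began.
    result = model.transcribe(audio)
    append_to_notes_doc(result["text"])  # hypothetical notes integration

while True:
    # Block for 30 s while the next chunk is recorded.
    audio = sd.rec(CHUNK_SECONDS * SAMPLE_RATE,
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    pool.submit(transcribe_and_post, audio.flatten())
```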
15:25 | <Michael Ficarra> | That would be so helpful! |
15:45 | <Michael Ficarra> | bakkot: What will this cost per meeting? Even if it's only like $10, we should lump in that funding with the transcription costs. |
15:47 | <bakkot> | for actual Whisper it'll be free; it runs locally |
15:49 | <bakkot> | I could maybe cut a few seconds of lag off by using the API which would cost ~$6/meeting (somehow the API manages to be substantially faster than running locally), but the difference between 35 seconds and 40 seconds probably isn't worth worrying about |
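The hosted variant bakkot mentions would swap the local transcribe call for OpenAI's Whisper API. A rough sketch, assuming the official openai Python SDK and soundfile for writing the chunk out, since the API takes a file upload (at the API's advertised $0.006/minute, ~$6 would cover roughly 1,000 minutes of audio, consistent with a multi-session meeting):

```python
# Hedged sketch of the API-based variant: same chunking as above, but
# transcription goes through OpenAI's hosted Whisper instead of a local model.
# Assumes the official openai Python SDK and soundfile for writing WAVs.
import soundfile as sf
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_chunk(audio, sample_rate=16_000):
    # The API takes a file upload, so persist the in-memory chunk first.
    sf.write("chunk.wav", audio, sample_rate)
    with open("chunk.wav", "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f)
    return transcript.text
```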
18:11 | <shu> | if the model is in fact pretty much perfect in terms of accuracy, why not record the meeting and postprocess for transcription? then delete the recording afterwards |
18:12 | <Michael Ficarra> | shu: some people like to edit the notes immediately after speaking |
18:12 | <shu> | but is that because the accuracy is in doubt? |
18:13 | <Michael Ficarra> | I think some people make minor rephrasings, remove stumbles, etc |
18:13 | <shu> | fair enough |
18:22 | <Michael Ficarra> | personally, if the transcription is very accurate, I would be fine waiting until the end of the day (or week) to do my reviews |
18:23 | <Michael Ficarra> | but having the two docs sounds like a great compromise |
18:25 | <Michael Ficarra> | the worst part about reviewing notes for me is when the notes are either incomprehensible (our previous automatic transcription) or missing entire sentences (human transcription) and I can't remember what was said |
18:25 | <Michael Ficarra> | having a more accurate document to refer to would be so helpful for that |
18:38 | <bakkot> | the computer-generated transcripts are also missing paragraph breaks and speaker assignments, and you really want to do those in real time |
19:02 | <Ashley Claymore> | Another thing we try to edit in live is when people post code snippets into TCQ or Matrix, as a verbatim transcription of only the audio, without the code, can be almost meaningless |
19:40 | <Michael Ficarra> | oh yeah, speaker attribution is actually pretty tricky to do after the fact |
20:29 | <Ashley Claymore> | Simply get everyone to commit to saying their acronym at the start of each sentence |
20:29 | <Ashley Claymore> | 😅 |