07:04
<bakkot>

openai announced a new Whisper model, but still no streaming support :(

all the services which offer real-time transcription, including those which just wrap whisper with some hacks, are basically garbage relative to whisper. google has a new model (Chirp) this year, but it doesn't work with real-time transcription either.

whisper gets near-perfect transcriptions for content like our meetings, everyone else misses one word in five. but using whisper without any of the streaming hacks (which lower quality a lot) means transcriptions will necessarily be 30 seconds behind (+ time to transcribe and network latency, so in practice more like 40 seconds).

I don't think automatic transcription is going to be viable until something in this landscape changes. (cc littledan)

I might set up a separate 40-second-latency transcription notes doc at the next meeting to help with fixing things the human transcriptionists miss.

anyone happen to have played with any other promising real-time transcription services recently?

11:03
<littledan>

saminahusain:
11:04
<littledan>
Thanks for the report, bakkot. Let's check up on this again at the end of next year.
11:04
<littledan>
sounds like we need to repeat the budget request for transcriptionists
11:04
<littledan>
Are you saying we get good accuracy with a 40-second delay?
11:04
<ryzokuken>
it would be accurate, yeah IIUC
11:05
<ryzokuken>
the 40 second delay is whisper's only shortcoming
11:06
<ryzokuken>
actually, I haven't tried it myself. Wonder how well it does with various accents
11:07
<ryzokuken>
they have an example with a pretty thick accent though, fun
11:07
<ryzokuken>
https://openai.com/research/whisper
15:11
<bakkot>
right. Whisper is very accurate in my tests, but it fundamentally operates on 30-second chunks of audio and takes a little while to run (say 10 seconds per chunk), so trying to stream it to the notes doc would mean that every 30 seconds we get a high-quality transcript of the portion of the meeting starting 40 seconds ago and running through 10 seconds ago. I haven't actually set that up but I expect it to work.
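the sketch in my head is roughly this (untested; the capture library, model size, and the glue code are all guesses on my part):

```python
# Untested sketch of the chunked pipeline described above: capture meeting
# audio in 30-second chunks, run local Whisper on each chunk, and append the
# text to a running draft doc. sounddevice for capture and the "medium" model
# are assumptions, not a vetted setup.
import queue
import threading

import numpy as np
import sounddevice as sd  # assumed audio-capture library
import whisper            # pip install openai-whisper

CHUNK_SECONDS = 30
SAMPLE_RATE = 16_000      # Whisper works on 16 kHz mono audio

model = whisper.load_model("medium")
chunks: "queue.Queue[np.ndarray]" = queue.Queue()

def record() -> None:
    # Record in a separate thread so transcription time doesn't add gaps.
    while True:
        audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()
        chunks.put(audio.flatten())

threading.Thread(target=record, daemon=True).start()

with open("transcript-draft.txt", "a") as notes:
    while True:
        chunk = chunks.get()              # blocks until the next 30 s is ready
        result = model.transcribe(chunk)  # say ~10 s per chunk, per the above
        notes.write(result["text"].strip() + "\n")
        notes.flush()
```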
15:12
<bakkot>
unfortunately 40 seconds of lag is a lot of lag
15:25
<Michael Ficarra>

I might set up a separate 40-second-latency transcription notes doc at the next meeting to help with fixing things the human transcriptionists miss.

That would be so helpful!

15:45
<Michael Ficarra>
bakkot: What will this cost per meeting? Even if it's only like $10, we should lump in that funding with the transcription costs.
15:47
<bakkot>
for actual Whisper it'll be free; it runs locally
15:49
<bakkot>
I could maybe cut a few seconds of lag off by using the API, which would cost ~$6/meeting (somehow the API manages to be substantially faster than running locally), but the difference between 35 seconds and 40 seconds probably isn't worth worrying about
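the API call itself is the easy part, something like this (sketch using the openai python client; I'd still need to do the chunking and doc-appending myself):

```python
# Sketch of the hosted alternative: send each recorded chunk to OpenAI's
# transcription endpoint instead of running the model locally.
# Assumes each chunk has already been written out as its own audio file.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_chunk(path: str) -> str:
    """Transcribe one ~30-second audio file and return the text."""
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text
```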
18:11
<shu>
if the model is in fact pretty much perfect in terms of accuracy, why not record the meeting and postprocess for transcription? then delete the recording afterwards
18:12
<Michael Ficarra>
shu: some people like to edit the notes immediately after speaking
18:12
<shu>
but is that because the accuracy is in doubt?
18:13
<Michael Ficarra>
I think some people make minor rephrasings, remove stumbles, etc
18:13
<shu>
fair enough
18:22
<Michael Ficarra>
personally, if the transcription is very accurate, I would be fine waiting until the end of the day (or week) to do my reviews
18:23
<Michael Ficarra>
but having the two docs sounds like a great compromise
18:25
<Michael Ficarra>
the worst part about reviewing notes for me is when the notes are either incomprehensible (our previous automatic transcription) or missing entire sentences (human transcription) and I can't remember what was said
18:25
<Michael Ficarra>
having a more accurate document to refer to would be so helpful for that
18:38
<bakkot>
the computer-generated transcripts are also missing paragraph breaks and speaker assignments, and you really want to do those in real time
19:02
<Ashley Claymore>
Another thing we try to edit live is when people post code snippets into TCQ or Matrix, as the verbatim transcription of only the audio without the code can be almost meaningless
19:40
<Michael Ficarra>
oh yeah, speaker attribution is actually pretty tricky to do after the fact
20:29
<Ashley Claymore>
Simply get everyone to commit to saying their acronym at the start of each sentence 
20:29
<Ashley Claymore>
😅