So, speaker labels. The part of a transcript that says who said what, turning one undifferentiated wall of text into something that reads like a script. Easy enough when you’re allowed to ship the audio off to a big cloud model. The interesting question, and the one we actually cared about, is how well it holds up when the audio is never allowed to leave the machine. Let me walk you through what we found…
What it’s actually doing under the hood
Three things happen, in order:
- Spotting the speech. First it finds the bits where someone is genuinely talking and ignores the silence (this one is called voice activity detection, if you’re collecting the jargon).
- Taking a voiceprint. For each chunk of speech it works out a numeric fingerprint of how the voice sounds (a speaker embedding, if you’re still collecting).
- Grouping them up. Then it groups the chunks whose fingerprints look alike. Similar fingerprint, same speaker. Different fingerprint, different speaker.
Out the other end you get the transcript split into turns, each tagged with a speaker. None of those three steps needs a server somewhere. Spotting speech is cheap, the voiceprint models are small (tens of megabytes), and the grouping is just maths over a transcript-sized list. A normal laptop does the lot, on audio that stays on the disk. That is what we run.
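To make the grouping step concrete, here is a minimal sketch of one way it could work: greedy clustering by cosine similarity. It is an illustration, not a description of what our code does. The function names, the 0.75 threshold and the tiny two-number fingerprints are all assumptions for the example (real voiceprints have hundreds of dimensions); only the cap of eight comes from the default mentioned in this post.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def group_voiceprints(embeddings, threshold=0.75, max_speakers=8):
    """Give each chunk a speaker number (1-based), in time order.

    threshold is a hypothetical value chosen for this sketch.
    """
    centroids = []  # running mean fingerprint per speaker
    counts = []     # chunks seen per speaker
    labels = []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        at_cap = len(centroids) >= max_speakers
        if best is not None and (sims[best] >= threshold or at_cap):
            # Close enough (or no room for a new speaker): join the nearest one
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], emb)]
            counts[best] += 1
            labels.append(best + 1)
        else:
            # Nothing similar so far: start a new speaker
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids))
    return labels

chunks = [[1, 0], [0.9, 0.1], [0, 1], [1, 0.05], [0.1, 0.95]]
print(group_voiceprints(chunks))
```

Note what the cap does in this sketch: once it is reached, a new voice gets folded into whichever existing speaker it sounds most like, which is one reason a very crowded meeting can come out with merged labels.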
The labels are numbers, not names
One thing to be clear about up front: it gives you “Speaker 1”, “Speaker 2” and so on, not “Sam from accounts”. It is separating the voices, not recognising who they belong to. It has no idea who anyone is, and honestly that is the right default for the meetings this is built for.
There is one nice touch worth a mention. When you record a call, your voice comes in on the microphone and the other side comes in through your speakers, and we keep those two apart. So you will see labels like “Speaker 1 (you)” for your side and “Speaker 2 (remote)” for the people on the call, with the numbers kept low within each side.
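A rough sketch of how labels like those could be put together, assuming each segment already carries the channel it came from and a cluster number from the grouping step. The names "mic" and "system" and the numbering rule (first appearance in time order) are assumptions made up for the example, not a statement of how we do it.

```python
def label_speakers(segments):
    """segments: (source, cluster_id) pairs in time order.

    source is "mic" (your side) or "system" (the far side); both names
    are hypothetical, as is numbering by first appearance.
    """
    side_name = {"mic": "you", "system": "remote"}
    numbers = {}
    labels = []
    for source, cluster_id in segments:
        # The same cluster id on different channels is a different speaker
        key = (source, cluster_id)
        if key not in numbers:
            numbers[key] = len(numbers) + 1
        labels.append(f"Speaker {numbers[key]} ({side_name[source]})")
    return labels

call = [("mic", 0), ("system", 0), ("system", 1), ("mic", 0)]
for label in label_speakers(call):
    print(label)
```

Keying on the channel as well as the cluster means a voice on your side can never be merged with one on the far side, however alike they sound.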
It works out how many speakers there are on its own, up to a sensible cap (eight by default). You do not have to tell it “there are four people in this meeting”, though there are meetings where you will wish you could.
What it gets right
For a normal meeting, a handful of people in a room or on a call, with reasonable audio, it does the job:
- It holds a speaker’s label steady across the whole thing. “Speaker 2” at minute 3 is still “Speaker 2” at minute 47.
- It separates the voices cleanly when they genuinely sound different.
- It copes with someone dropping off and rejoining: they get their old label back.
The thing that matters here is not whether one single half-second got the perfect label. It is whether the labels stay consistent across the meeting. A write-up built on a transcript where the same person keeps the same label reads correctly even when the labels are anonymous. A transcript where one person bounces between three labels is a pain no matter how good the words are.
Where it slips
I am not going to pretend it is flawless, because no diarisation is, cloud or otherwise. The hard cases are much the same everywhere:
- Lots of short overlapping noises. “Yeah.” “Right.” “Mm-hmm.” Too short to fingerprint properly.
- Two genuinely similar voices. Same pitch, same accent, siblings.
- A racket of background noise with other voices in it. A cafe with a telly on.
- Heavily compressed phone audio, a speakerphone two rooms away.
- One person who shifts their voice a lot. Sometimes it reads two registers of the same voice as two different people.
Sending the audio to the cloud does not make these go away. The remote models are bigger, but they trip on the same things. So the realistic expectation, anywhere, is that most segments land right and a few need a human eye, which brings me to the actual point.
The label is not the record, the recording is
Here is the thing the marketing usually skips. In a real meeting record, the name on a segment matters less than your being able to go and check what was actually said.
If a write-up mentions something “Speaker 2” said, the worth of that line is in your being able to get back to the moment it came from. So we keep a link between the write-up and the transcript: highlight a line in the document and it points you to the part of the transcript behind it, and the other way round too.
Now let me be straight about how that link works, because it is easy to oversell. It matches on the actual words the two sentences share, the specific terms and names and numbers, so it is a strong best guess at where a line came from, not a guaranteed exact mapping. In practice that is exactly what you want when a label looks off: jump to that segment, play the recording back (it is right there on your machine) and check it with your own ears. The label is how you navigate. The recording and the transcript are the record.
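For a feel of what matching on shared words looks like, here is a minimal sketch. It scores each transcript segment by how much of its vocabulary overlaps with the highlighted line (Jaccard similarity on the non-filler words) and returns the best one. The function names, the stopword list and the scoring formula are assumptions for illustration, not our actual matcher.

```python
import re

# A tiny, hypothetical stopword list; a real one would be longer
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "on",
             "we", "it", "is", "that", "for", "by", "will", "be"}

def key_terms(text):
    """Lowercased words and numbers, minus the filler."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def best_match(line, segments):
    """Return (index, score) of the segment sharing the most terms.

    index is None when no segment shares any term with the line.
    """
    target = key_terms(line)
    best_index, best_score = None, 0.0
    for i, segment in enumerate(segments):
        terms = key_terms(segment)
        union = target | terms
        score = len(target & terms) / len(union) if union else 0.0
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

transcript = [
    "I think we ship the beta on the 14th of March",
    "Budget for Q3 is 40 thousand, Priya confirmed it",
    "Right, yeah",
]
index, score = best_match("Priya confirmed the Q3 budget of 40 thousand.", transcript)
print(index, round(score, 2))
```

The score is a ranking, not a proof. Two segments that mention the same names and numbers can score alike, which is exactly why the answer is a best guess you then check against the recording.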
What to actually test
If you are sizing up any offline tool’s speaker labelling, do not take my word for it. Run a real recording of your own through it and check three things:
- Does one person keep one label from start to finish?
- Do the short interjections get labelled, or swallowed into the previous speaker’s turn?
- When a label looks wrong, can you get to that exact moment and hear it for yourself?
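If you want to put a number on the first check, a small script does it. This sketch assumes you have listened to a sample of segments and noted who was really speaking next to the label the tool gave. The function names and the example people are made up.

```python
from collections import defaultdict

def labels_per_person(checked_segments):
    """checked_segments: (person, label) pairs you verified by ear."""
    seen = defaultdict(set)
    for person, label in checked_segments:
        seen[person].add(label)
    return {person: sorted(labels) for person, labels in seen.items()}

def split_people(checked_segments):
    """People whose speech was spread across more than one label."""
    return {person: labels
            for person, labels in labels_per_person(checked_segments).items()
            if len(labels) > 1}

checked = [
    ("Ana", "Speaker 1"),
    ("Ben", "Speaker 2"),
    ("Ana", "Speaker 1"),
    ("Ben", "Speaker 3"),  # Ben picked up a second label later on
]
print(split_people(checked))
```

An empty result means everyone you checked kept one label. Anything else tells you who to listen for when you go back to the recording.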
The security notes cover the wider data handling. If you would rather just hear how it does on your own audio, download the trial and run a meeting you already have through it. The transcript tells you on the first read.