Speaker identification when nothing leaves the device

Speaker identification, more precisely called diarisation, is the part of meeting transcription that says “who said what”. A transcript without speaker labels is one long block of text where everyone’s words run together. A transcript with speaker labels is a script. The difference matters most for the meeting types that are also the most sensitive: client interviews, witness statements, board discussions, clinical consultations, multi-party negotiations.

This piece is for technical buyers, IT staff and anyone evaluating a meeting tool’s diarisation. The framing is “when nothing leaves the device” because most diarisation marketing assumes the audio gets sent to a cloud service that can run a large model. The interesting question is how diarisation works when that option is off the table.

What diarisation is doing

Three steps run in any reasonable diarisation pipeline.

Voice activity detection. Cut out silence. Find the bits of audio where someone is actually speaking.
Speaker embedding. For each speech segment, generate a numeric fingerprint that represents the speaker’s voice as the model heard it.
Clustering. Group segments by similarity. Segments with similar fingerprints get the same label. Segments with different fingerprints get different labels.
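The clustering step can be sketched in a few lines. This is a minimal illustration, not the method any particular product uses: it assumes the embeddings already exist as plain lists of numbers, uses a greedy "nearest centroid or new speaker" rule, and the 0.8 similarity threshold is an arbitrary value chosen for the toy data. Production pipelines more often use agglomerative or spectral clustering over embeddings with hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_segments(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar speaker, or open a new one."""
    centroids = []  # running mean embedding per speaker
    counts = []     # number of segments merged into each centroid
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # Nothing is similar enough: treat this as a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Fold the segment into the matching speaker's running mean.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
        labels.append(f"Speaker {best + 1}")
    return labels

# Toy three-dimensional "fingerprints" for four speech segments (assumed data).
toy = [
    [0.90, 0.10, 0.00],
    [0.10, 0.90, 0.10],
    [0.85, 0.15, 0.05],
    [0.05, 0.95, 0.00],
]
labels = cluster_segments(toy)
print(labels)  # ['Speaker 1', 'Speaker 2', 'Speaker 1', 'Speaker 2']
```

Note that the labels are anonymous by construction: the algorithm only knows that segments one and three sound alike, not who is speaking.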

The output is a sequence of timestamped segments with a speaker tag attached to each. The transcript engine then aligns its words to those segments and the words inherit the tag.
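One plausible way to do that alignment is to give each word the speaker of the segment it overlaps most in time. The sketch below assumes the transcript engine supplies word-level timestamps; the `Segment` and `Word` types and the `tag_words` helper are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str

@dataclass
class Word:
    text: str
    start: float
    end: float
    speaker: Optional[str] = None

def overlap(word, segment):
    """Seconds during which the word and the segment are both active."""
    return max(0.0, min(word.end, segment.end) - max(word.start, segment.start))

def tag_words(words, segments):
    """Give each word the speaker of the segment it overlaps most."""
    for word in words:
        best = max(segments, key=lambda seg: overlap(word, seg), default=None)
        if best is not None and overlap(word, best) > 0:
            word.speaker = best.speaker
    return words

segments = [Segment(0.0, 2.0, "Speaker 1"), Segment(2.0, 4.5, "Speaker 2")]
words = [
    Word("Shall", 0.2, 0.5),
    Word("we", 0.6, 0.8),
    Word("start?", 1.8, 2.1),   # straddles the boundary, mostly in segment 1
    Word("Yes.", 2.3, 2.6),
]
tagged = tag_words(words, segments)
for w in tagged:
    print(w.speaker, w.text)
```

A word that overlaps no segment keeps `speaker=None`, which is a useful signal that voice activity detection and the transcript disagree about that stretch of audio.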

None of these steps strictly need a remote server. Voice activity detection is small. Speaker embedding models are a few tens of megabytes. Clustering is a normal computer-science problem that takes milliseconds on a transcript-sized input. A laptop CPU can run all three in close to real time, on audio that never leaves the disk.
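To give a sense of how small the first step can be, here is a deliberately simple energy-based voice activity detector in pure Python. It is a stand-in, not a recommendation: real detectors are usually small neural models that cope with noise far better, and the 30 ms frame length and 0.02 threshold are assumed values.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_speech(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_seconds, end_seconds) spans whose level exceeds threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    regions = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        t = i * frame_len / sample_rate
        active = rms(frame) >= threshold
        if active and start is None:
            start = t                    # speech begins
        elif not active and start is not None:
            regions.append((start, t))   # speech ends
            start = None
    if start is not None:
        # Audio ended while someone was still speaking.
        regions.append((start, n_frames * frame_len / sample_rate))
    return regions

# Synthetic audio at 1 kHz: 0.2 s silence, 0.3 s of signal, 0.2 s silence.
silence = [0.0] * 200
tone = [0.5 if i % 2 == 0 else -0.5 for i in range(300)]
regions = detect_speech(silence + tone + silence, sample_rate=1000, frame_ms=100)
print(regions)  # [(0.2, 0.5)]
```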

Whistle Enterprise does this. The transcription model and the speaker labelling both run on the user’s computer, against audio that came from the user’s microphone or a file the user supplied. The diarisation result is written into the same local workspace as the rest of the meeting artefacts.

What diarisation that runs locally gets right

For a standard meeting setup (two to six speakers in a single room or on a single call, audio quality reasonable, language consistent throughout) diarisation that runs locally works well. It will:

Separate the speakers cleanly when their voice profiles are different (different gender, different accent, different pitch register).
Hold a stable speaker label across the meeting, so the person labelled “Speaker 2” at minute 3 is still “Speaker 2” at minute 47.
Identify when one speaker stops and another starts, even when there is no pause between them.
Keep going if a speaker leaves and rejoins; the same person who spoke at minute 10 will get the same label at minute 50.

The quality that matters is not the absolute accuracy of any one segment but the consistency across the meeting. A document built on a transcript with consistent speaker labels reads correctly even if the labels are anonymous (“Speaker 1” not “Sam Smith”). A transcript where the same person bounces between three labels is hard to use no matter what other quality the transcription has.
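Consistency is easy to measure on a sample you have checked by hand. The sketch below assumes you have listened to a handful of segments and noted who was really speaking; the names and the `labels_per_person` helper are hypothetical. Any person who maps to more than one label is a consistency failure.

```python
from collections import defaultdict

def labels_per_person(checked):
    """checked: (real_name, assigned_label) pairs for hand-checked segments."""
    seen = defaultdict(set)
    for name, label in checked:
        seen[name].add(label)
    return {name: sorted(labels) for name, labels in seen.items()}

checked = [
    ("Ana", "Speaker 1"),
    ("Ben", "Speaker 2"),
    ("Ana", "Speaker 1"),
    ("Ben", "Speaker 3"),   # Ben has drifted onto a second label
    ("Ben", "Speaker 2"),
]
report = labels_per_person(checked)
fragmented = [name for name, labels in report.items() if len(labels) > 1]
print(report)      # {'Ana': ['Speaker 1'], 'Ben': ['Speaker 2', 'Speaker 3']}
print(fragmented)  # ['Ben']
```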

Where it sometimes slips

There are meetings that any diarisation, local or otherwise, struggles with.

Many short overlapping interjections. “Yeah.” “Right.” “Mm-hmm.” Speech that is shorter than the embedding window is hard to fingerprint.
Two speakers with very similar voices. Twins, family members, two voices of the same gender and pitch range with similar accents.
Heavy background noise that contains other voices. A coffee shop with a TV on. A trade show floor.
Telephony audio with strong compression. A speakerphone two rooms away over a low-bitrate line.
A single speaker who shifts register dramatically. A doctor who switches between a clinical voice and a casual voice with the same patient. Sometimes diarisation labels the two registers as different speakers; sometimes it does not.

These failure modes do not get fixed by sending the audio to a cloud service. The cloud diarisation models are usually larger but the failure modes are similar. The realistic expectation for any diarisation, local or remote, is that 90-something percent of segments end up labelled correctly and the remaining few are correctable by hand against a recording you can play back.
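If you want to put a number on "labelled correctly" for your own recordings, one rough approach is to hand-check a sample and compute the share of speech time that carries the right label. The sketch below is an assumption-laden shortcut, not a standard metric: it maps each anonymous label to the real speaker it covers most, which is slightly optimistic because two labels may map to the same person. The names and durations are invented.

```python
from collections import defaultdict

def share_labelled_correctly(checked):
    """checked: (duration_seconds, assigned_label, real_name) triples."""
    # Time each anonymous label spends on each real speaker.
    time_by_label = defaultdict(lambda: defaultdict(float))
    for duration, label, name in checked:
        time_by_label[label][name] += duration
    # Map each label to the speaker it covers most.
    mapping = {label: max(names, key=names.get)
               for label, names in time_by_label.items()}
    total = sum(duration for duration, _, _ in checked)
    correct = sum(duration for duration, label, name in checked
                  if mapping[label] == name)
    return correct / total, mapping

checked = [
    (30, "Speaker 1", "Ana"),
    (20, "Speaker 2", "Ben"),
    (5,  "Speaker 1", "Ben"),   # Ben's interjection absorbed into Ana's label
    (45, "Speaker 2", "Ben"),
]
score, mapping = share_labelled_correctly(checked)
print(round(score, 2))  # 0.95
print(mapping)          # {'Speaker 1': 'Ana', 'Speaker 2': 'Ben'}
```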

Why traceability matters more than label accuracy

A small thing the marketing of meeting tools generally misses: in a real meeting record, the label name on a segment is less important than your ability to check what was actually said.

If a document Whistle Enterprise generates says “Speaker 2 raised the question of indemnity at the £5m point”, the value of that line depends on you being able to find the moment in the transcript and the recording where that was said. The speaker label is then a navigation tool, not a fact in itself.

Whistle Enterprise keeps that link explicit. Highlight any sentence in the generated document and the application shows the exact passage in the transcript it came from. Highlight a passage in the transcript and you see what was written about it. The mapping is bi-directional. If a speaker label is wrong on a particular segment, you can find that segment quickly, hear the recording, and resolve it.
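A bi-directional mapping of this kind can be modelled as two indexes kept in step. The sketch below is purely illustrative and is not a description of how Whistle Enterprise stores its links; the `TraceIndex` class and the identifiers are hypothetical.

```python
from collections import defaultdict

class TraceIndex:
    """Links document sentences to transcript segments, in both directions."""

    def __init__(self):
        self._sources = defaultdict(set)     # sentence id -> segment ids
        self._sentences = defaultdict(set)   # segment id -> sentence ids

    def link(self, sentence_id, segment_id):
        # Record the link in both indexes so neither lookup needs a scan.
        self._sources[sentence_id].add(segment_id)
        self._sentences[segment_id].add(sentence_id)

    def sources_for(self, sentence_id):
        """Transcript segments a document sentence was drawn from."""
        return sorted(self._sources.get(sentence_id, ()))

    def sentences_for(self, segment_id):
        """Document sentences written about a transcript segment."""
        return sorted(self._sentences.get(segment_id, ()))

index = TraceIndex()
index.link("doc-s4", "seg-17")
index.link("doc-s4", "seg-18")
index.link("doc-s9", "seg-18")

print(index.sources_for("doc-s4"))    # ['seg-17', 'seg-18']
print(index.sentences_for("seg-18"))  # ['doc-s4', 'doc-s9']
```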

This is the property that makes diarisation useful in regulated work even when it is not perfect. The label is not the record. The recording and the transcript are the record. The label is how you navigate to the right bit.

What to ask of a candidate tool

If you are evaluating an offline meeting tool’s diarisation, three things are worth checking on a real recording from your own work.

Does the same person hold the same label across the whole meeting?
Are short interjections labelled at all, or are they dropped into the previous speaker’s segment?
When a label is wrong, can you fix it quickly, and does the fix flow through to the generated document?
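The third question is largely about data design. A fix flows through automatically when document sentences refer to transcript segments by identifier and look the speaker up at display time, rather than storing a copy of the label text. The sketch below shows that idea under assumed names (`Segment`, `render`, `relabel`); it does not describe any specific product.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    seg_id: str
    speaker: str
    text: str

def render(template, seg_id, segments):
    """Fill in the speaker at display time from the segment record."""
    return template.format(speaker=segments[seg_id].speaker)

def relabel(segments, seg_id, new_speaker):
    """Correct a wrong label in one place: the segment itself."""
    segments[seg_id].speaker = new_speaker

segments = {
    "seg-7": Segment("seg-7", "Speaker 3", "What about indemnity?"),
}
# The document line stores a template and a segment id, not the label text.
doc_line = ("{speaker} raised the question of indemnity.", "seg-7")

before = render(doc_line[0], doc_line[1], segments)
relabel(segments, "seg-7", "Speaker 2")
after = render(doc_line[0], doc_line[1], segments)

print(before)  # Speaker 3 raised the question of indemnity.
print(after)   # Speaker 2 raised the question of indemnity.
```

A tool that copies label text into the document instead will show the old label until the document is regenerated, which is worth testing for during a trial.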

The security notes cover the wider data-handling story. If you would rather just hear how diarisation behaves on your own audio, download the trial and run a recording you already have through it. The transcript will tell you on the first read.
