Voice diarization: how we know who speaks in couples sessions without getting it wrong
Psychology


In couples therapy, misattributing who said what makes any report useless. The technical challenge of identifying two voices on a video call — and why it matters.

Felix Gonzalez · Founder, CauceOS · 3 min read

Imagine receiving a report from a couples therapy session where every direct quote is attributed to the wrong person. "Luis said: 'I feel heard when he gives me space.'" "Valentina said: 'I don't know how to tell him it hurts when he comes home late.'"

The report would be useless. Worse than useless — it could distort the clinical impression the therapist builds about the couple's dynamic.

This is the diarization problem. And in sessions with multiple participants, especially couples sessions, it is one of the most important technical challenges we have had to solve.

What diarization is

Diarization is the process of identifying who is speaking at each moment of a conversation. The name comes from "diary": the system's output is a speaker diary, a record of who spoke when. Historically, diarization was used to separate speaking turns in transcripts of meetings and interviews.

In a work meeting with ten people in a room, diarization is difficult but has room for error. If the system confuses two participants in a paragraph of meeting notes, the overall context usually saves the meaning.

In a couples therapy session, there is no room for error. Every sentence has a clinical author. "I feel abandoned" carries a radically different weight depending on who says it. The pursuer-distancer pattern Gottman describes requires knowing exactly which person adopts which conversational role at which moment.

The technical challenge of video sessions

In a physical room, diarization uses data from microphones at different positions — sound arrives with slight time offsets that allow triangulating from which direction each voice comes.
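The geometry behind that triangulation is simple. Here is a minimal sketch of the far-field approximation for a two-microphone pair; the function name and the example values are illustrative, not part of any CauceOS implementation:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def arrival_angle(delay_s: float, mic_spacing_m: float) -> float:
    """Estimate the direction of a sound source from the time offset
    between two microphones (far-field approximation).

    delay_s: time difference of arrival between the mics (seconds)
    mic_spacing_m: distance between the microphones (meters)
    Returns the angle in degrees relative to the mic axis.
    """
    # delay = spacing * cos(theta) / c  =>  theta = acos(delay * c / spacing)
    ratio = max(-1.0, min(1.0, delay_s * SPEED_OF_SOUND / mic_spacing_m))
    return math.degrees(math.acos(ratio))
```

A voice arriving 0.25 ms earlier at one mic of a 20 cm pair comes from roughly 64–65 degrees off the mic axis. With several such pairs, the room system can localize each speaker, which is exactly the signal that disappears on a video call.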

In a video call, the problem is different. Audio from all participants arrives mixed through the video platform. Each person has their own microphone, but the signal is combined before transmission. The echo cancellation systems used by video platforms to suppress feedback between microphone and speaker modify the acoustic signal in ways that complicate separation.

There are two signals that help diarization in video: voice biometrics (each person has unique acoustic characteristics — pitch, timbre, rhythm) and video activity (which camera is active at which moment). CauceOS uses both.
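Combining the two signals can be sketched as a weighted vote. This is an illustrative toy, not CauceOS's actual scoring: the inputs (per-participant acoustic similarity scores, the platform's active-speaker flag) and the weight are assumptions:

```python
def attribute_segment(acoustic_scores: dict, active_speaker: str,
                      video_weight: float = 0.3) -> str:
    """Pick the most likely speaker for a segment by blending acoustic
    similarity with the video platform's active-speaker signal.

    acoustic_scores: participant -> similarity to their voice profile (0..1)
    active_speaker: participant whose camera/mic the platform flags as active
    video_weight: how much the video signal counts (illustrative value)
    """
    combined = {}
    for participant, score in acoustic_scores.items():
        bonus = video_weight if participant == active_speaker else 0.0
        combined[participant] = (1 - video_weight) * score + bonus
    return max(combined, key=combined.get)
```

When the acoustic scores are close, the video signal breaks the tie; when they clearly disagree with the video signal, acoustics can still win.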

How we build voice profiles per participant

When the bot joins a session, it does not start diarizing immediately with full confidence. In the first few minutes, it builds a voice profile per participant.

Each time a participant speaks, the system extracts acoustic features from their voice — fundamental frequency, formants, rhythmic patterns — and associates them with that participant. As the session progresses, the profile becomes more precise.
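One common way to represent such a profile is as a running centroid of feature vectors, compared against new segments with cosine similarity. The sketch below assumes feature extraction (pitch, formants, rhythm) happens upstream and only shows how the profile sharpens as more speech arrives; it is a simplified illustration, not the production approach:

```python
import math

class VoiceProfile:
    """Running centroid of acoustic feature vectors for one participant."""

    def __init__(self, dim: int):
        self.centroid = [0.0] * dim
        self.count = 0

    def update(self, features: list) -> None:
        """Fold a new segment's features into the running mean."""
        self.count += 1
        self.centroid = [c + (f - c) / self.count
                         for c, f in zip(self.centroid, features)]

    def similarity(self, features: list) -> float:
        """Cosine similarity between the profile and a new segment."""
        dot = sum(c * f for c, f in zip(self.centroid, features))
        norm = (math.sqrt(sum(c * c for c in self.centroid))
                * math.sqrt(sum(f * f for f in features)))
        return dot / norm if norm else 0.0
```

Early in the session the centroid is noisy, so attributions carry low confidence; after a few minutes of speech it stabilizes and the same comparison becomes reliable.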

In couples sessions, there is an additional complication: interruptions. When two people speak at the same time — which happens frequently in couples sessions, especially in moments of tension — the system has to make a decision about who to attribute the segment to. Our approach is conservative: when the system has low confidence in the attribution, it marks the segment as "overlap" rather than attributing it incorrectly.
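That conservative rule can be made concrete with two thresholds: a minimum absolute confidence and a minimum margin over the runner-up. The threshold values below are illustrative assumptions, not CauceOS's actual parameters:

```python
OVERLAP = "overlap"

def conservative_attribution(scores: dict,
                             min_confidence: float = 0.6,
                             min_margin: float = 0.15) -> str:
    """Attribute a segment only when one voice profile clearly wins.

    scores: participant -> similarity to their voice profile (0..1)
    Returns a participant name, or OVERLAP when attribution is unsafe.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Refuse to guess: a wrong attribution is worse than a marked overlap.
    if best_score < min_confidence or best_score - runner_up < min_margin:
        return OVERLAP
    return best_name
```

The design choice is asymmetric on purpose: an "overlap" label costs the therapist a moment of context, while a confident misattribution can distort the clinical record.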

The bilingual case

Bilingual sessions add another layer of complexity. In a session where the therapist speaks in Spanish and clients respond in English, the language shift can be an additional signal that helps diarization. But it can also confuse the system if one participant naturally code-switches in their responses.

Our solution is to treat language as one more feature of the voice profile — useful when consistent, ignored when it is not. The system prioritizes acoustic features over detected language.
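"Useful when consistent, ignored when it is not" can be expressed as a consistency gate: the detected language only contributes a small bonus if the participant's language history is dominated by one language. Again, the thresholds and bonus size are assumptions for illustration:

```python
def language_bonus(profile_langs: list, segment_lang: str,
                   consistency_threshold: float = 0.8,
                   bonus: float = 0.1) -> float:
    """Score adjustment from detected language, gated on consistency.

    profile_langs: languages detected in this participant's past segments
    segment_lang: language detected in the current segment
    Returns a small positive/negative adjustment, or 0.0 for
    code-switchers, whose language history is too mixed to trust.
    """
    if not profile_langs:
        return 0.0
    dominant = max(set(profile_langs), key=profile_langs.count)
    consistency = profile_langs.count(dominant) / len(profile_langs)
    if consistency < consistency_threshold:
        return 0.0  # code-switcher: language tells us nothing reliable
    return bonus if segment_lang == dominant else -bonus
```

The small bonus keeps acoustic features in charge: language can nudge a close call but never override a clear acoustic match.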

Why we invested so much in this

When we talk to couples therapists about what they need from an assistance tool, the most common answer is not "better suggestions" or "richer reports." It is: "don't confuse who said what."

It is the baseline condition. If the system cannot correctly attribute speech, everything else — alerts, reports, pattern analysis — is built on sand.

We solved the diarization problem first because without it, nothing else makes clinical sense.


Do you work with couples sessions and want to understand how CauceOS handles diarization in your specific context? Write to us at hola@cauceos.com.
