
Voice diarization: how we know who speaks in couples sessions without getting it wrong
In couples therapy, misattributing who said what makes any report useless. The technical challenge of identifying two voices on a video call — and why it matters.
Imagine receiving a report from a couples therapy session where every direct quote is attributed to the wrong person. "Luis said: 'I feel heard when he gives me space.'" "Valentina said: 'I don't know how to tell him it hurts when he comes home late.'"
The report would be useless. Worse than useless — it could distort the clinical impression the therapist builds about the couple's dynamic.
This is the diarization problem. And in sessions with multiple participants, especially couples sessions, it is one of the most important technical challenges we have had to solve.
What diarization is
Diarization is the process of identifying who is speaking at each moment of a conversation. The name comes from "diarize," to record events in a diary (from the Latin "diarium"): the output is a running log of who spoke when. It has long been used to separate speaking turns in transcripts of meetings and interviews.
In a work meeting with ten people in a room, diarization is difficult, but some error is tolerable. If the system confuses two participants in one paragraph of the meeting notes, the overall context usually preserves the meaning.
In a couples therapy session, there is no room for error. Every sentence has a clinical author. "I feel abandoned" carries a radically different weight depending on who says it. The pursuer-distancer pattern Gottman describes requires knowing exactly which person adopts which conversational role at which moment.
The technical challenge of video sessions
In a physical room, diarization can use microphones placed at different positions: sound reaches each one with a slight time offset, which lets the system estimate the direction each voice comes from.
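To make the time-offset idea concrete, here is a minimal sketch in pure Python that estimates the delay between two microphones by cross-correlation. The function name `estimate_delay` and the toy signals are assumptions made for illustration; production systems generally rely on more robust estimators over real microphone arrays.

```python
def estimate_delay(mic_a, mic_b, max_lag):
    """Return the lag (in samples) at which mic_b best matches mic_a.

    A positive result means the sound reached mic_b later than mic_a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate mic_a with mic_b shifted by `lag` samples.
        score = 0.0
        for i, a in enumerate(mic_a):
            j = i + lag
            if 0 <= j < len(mic_b):
                score += a * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example (hypothetical data): the same click reaches mic_b
# three samples after it reaches mic_a.
click = [0.0, 1.0, 0.5, -0.3, 0.0]
mic_a = click + [0.0] * 5
mic_b = [0.0] * 3 + click + [0.0] * 2
delay = estimate_delay(mic_a, mic_b, max_lag=4)
```

Dividing the delay by the sample rate gives the offset in seconds. Combined with the distance between the microphones and the speed of sound, that offset constrains the direction the voice came from.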
In a video call, the problem is different. Audio from all participants arrives mixed through the video platform. Each person has their own microphone, but the signal is combined before transmission. The echo cancellation systems used by video platforms to suppress feedback between microphone and speaker modify the acoustic signal in ways that complicate separation.
There are two signals that help diarization in video: voice biometrics (each person has unique acoustic characteristics — pitch, timbre, rhythm) and video activity (which camera is active at which moment). CauceOS uses both.
How we build voice profiles per participant
When the bot joins a session, it does not start diarizing immediately with full confidence. In the first few minutes, it builds a voice profile per participant.
Each time a participant speaks, the system extracts acoustic features from their voice — fundamental frequency, formants, rhythmic patterns — and associates them with that participant. As the session progresses, the profile becomes more precise.
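One simple way to picture a profile that sharpens over time is a running average of feature vectors, compared against new segments by cosine similarity. The sketch below illustrates that idea and does not describe CauceOS internals. The `VoiceProfile` class and the three-number feature vectors are hypothetical, and the features are assumed to be already normalized to comparable scales.

```python
import math


class VoiceProfile:
    """Running average of the feature vectors seen for one participant."""

    def __init__(self, dim):
        self.mean = [0.0] * dim
        self.count = 0

    def update(self, features):
        # Incremental mean: each new segment moves the profile less
        # as more segments accumulate.
        self.count += 1
        self.mean = [m + (x - m) / self.count
                     for m, x in zip(self.mean, features)]

    def similarity(self, features):
        # Cosine similarity between the profile and a new segment.
        dot = sum(m * x for m, x in zip(self.mean, features))
        norm = (math.sqrt(sum(m * m for m in self.mean))
                * math.sqrt(sum(x * x for x in features)))
        return dot / norm if norm else 0.0


# Hypothetical normalized features: [pitch, formant, rhythm]
profile = VoiceProfile(dim=3)
profile.update([1.0, 0.0, 0.0])
profile.update([0.8, 0.2, 0.0])
same_voice = profile.similarity([0.9, 0.1, 0.0])
other_voice = profile.similarity([0.0, 0.0, 1.0])
```

Here `same_voice` comes out close to 1 and `other_voice` is 0, so a new segment can be matched to the profile it most resembles.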
In couples sessions, there is an additional complication: interruptions. When two people speak at the same time — which happens frequently in couples sessions, especially in moments of tension — the system has to make a decision about who to attribute the segment to. Our approach is conservative: when the system has low confidence in the attribution, it marks the segment as "overlap" rather than attributing it incorrectly.
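The conservative rule can be sketched as a thresholded decision: attribute a segment only when one participant clearly wins, and label it "overlap" otherwise. The function name, the use of a score margin, and both threshold values below are assumptions chosen for illustration, not tuned or documented settings.

```python
def attribute_segment(scores, min_score=0.75, min_margin=0.15):
    """Pick a speaker only when the evidence is clear; else return "overlap".

    scores: dict mapping participant name -> similarity in [0, 1].
    Both thresholds are illustrative values, not tuned ones.
    """
    if not scores:
        return "overlap"
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Refuse to attribute when the best match is weak or when two
    # participants match almost equally well.
    if best < min_score or best - runner_up < min_margin:
        return "overlap"
    return best_name


clear = attribute_segment({"Luis": 0.92, "Valentina": 0.40})
contested = attribute_segment({"Luis": 0.81, "Valentina": 0.78})
weak = attribute_segment({"Luis": 0.50, "Valentina": 0.20})
```

In this sketch only the first segment is attributed (to "Luis"). The other two are marked "overlap", trading coverage for fewer wrong attributions.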
The bilingual case
Bilingual sessions add another layer of complexity. In a session where the therapist speaks in Spanish and clients respond in English, the language shift can be an additional signal that helps diarization. But it can also confuse the system if one participant naturally code-switches in their responses.
Our solution is to treat language as one more feature of the voice profile — useful when consistent, ignored when it is not. The system prioritizes acoustic features over detected language.
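A minimal sketch of "useful when consistent, ignored when not": the language cue gets a small, capped weight, and only for participants whose language history has been stable. All names, the 0.2 weight, and the 0.9 consistency threshold are hypothetical values picked for the example.

```python
from collections import Counter


def language_consistency(history):
    """Share of a participant's past segments in their most frequent language."""
    if not history:
        return 0.0
    return Counter(history).most_common(1)[0][1] / len(history)


def combined_score(acoustic, history, detected,
                   max_language_weight=0.2, min_consistency=0.9):
    """Blend acoustic similarity with a capped, optional language cue."""
    if language_consistency(history) < min_consistency:
        # Participant code-switches (or has no history yet):
        # ignore language and rely on acoustics alone.
        return acoustic
    usual = Counter(history).most_common(1)[0][0]
    match = 1.0 if detected == usual else 0.0
    w = max_language_weight  # small cap keeps acoustic features dominant
    return (1 - w) * acoustic + w * match


therapist_history = ["es"] * 10               # always Spanish
client_history = ["en", "es", "en", "es"]     # code-switches

boosted = combined_score(0.8, therapist_history, "es")
penalized = combined_score(0.8, therapist_history, "en")
unchanged = combined_score(0.8, client_history, "es")
```

Because the weight is capped well below 0.5, a language mismatch can shift a score but cannot override a strong acoustic match.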
Why we invested so much in this
When we talk to couples therapists about what they need from an assistance tool, the most common answer is not "better suggestions" or "richer reports." It is: "don't confuse who said what."
It is the baseline condition. If the system cannot correctly attribute speech, everything else — alerts, reports, pattern analysis — is built on sand.
We solved the diarization problem first because without it, nothing else makes clinical sense.
Do you work with couples sessions and want to understand how CauceOS handles diarization in your specific context? Write to us at hola@cauceos.com.