Research

How accurate is cross-language translation in real sessions? An internal study with 50 bilingual sessions

We measured translation accuracy across 50 bilingual ES↔EN sessions with human evaluation. 92% on neutral phrases, 84% on clinical terms, 78% on regional idioms. What we learned and what still needs work.

Felix Gonzalez · Founder, CauceOS · May 28, 2026 · 4 min read

Before launching CauceOS's bilingual mode as a product feature, we wanted to genuinely understand how well it works in the specific context we built it for: professional 1-on-1 sessions where two people speak different languages about clinical or HR topics.

Standard translation benchmarks measure accuracy on general text — news, books, casual conversation. None of them measure what we needed to measure: how translation performs when someone says "transference" in the context of a psychoanalytic session, or when a recruiter talks about "culture fit" in English and that needs to reach a Latin American candidate in Spanish with its meaning intact.

So we ran the study internally.

Methodology

We selected 50 bilingual sessions from our private beta, all with the ES↔EN pair, where the professional gave explicit consent for the session to be used for anonymized research purposes. Participants included psychologists, couples therapists, HR professionals, and coaches.

We hired four human evaluators: two native Spanish speakers with clinical psychology training, and two native English speakers with experience in clinical and HR contexts. Each evaluator reviewed translation segments independently, without seeing the others' evaluations.

For evaluation, we used two metrics:

BLEU score (Bilingual Evaluation Understudy): a standard automatic metric that measures how much the n-grams (short word sequences) of the produced translation overlap with a reference translation written by a human expert. BLEU ranges from 0 to 1, where 1 means the output matches the reference exactly.

Human evaluation on a 1-5 scale: evaluators rated each segment on a scale where 1 = incorrect translation that changes meaning, 3 = understandable translation but with loss of nuance, 5 = accurate translation that preserves both meaning and tone.
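To make the BLEU metric concrete, here is a minimal sketch of a sentence-level score against a single reference. It is an illustration only, not the implementation used in the study: the whitespace tokenization, the lack of smoothing, and the function names `ngrams` and `sentence_bleu` are all assumptions made for this example. Established BLEU libraries handle tokenization and multiple references more carefully.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, on a 0-1 scale.

    Illustrative sketch: clipped n-gram precision for n = 1..max_n,
    combined by geometric mean and multiplied by the brevity penalty.
    No smoothing, so any n-gram order with zero overlap gives 0.
    """
    cand = candidate.lower().split()  # assumption: naive whitespace tokens
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "hoy me siento muy cansado"
print(sentence_bleu("hoy me siento muy cansado", reference))            # 1.0
print(round(sentence_bleu("me siento muy cansado hoy", reference), 2))  # 0.71
```

The second call shows why BLEU alone is not enough for this kind of content: moving one word lowers the score to about 0.71 even though the meaning is unchanged, which is one reason to pair it with human ratings.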

Results

Neutral phrases: 92% accuracy

General content phrases — greetings, opening questions, conversational transitions, time references — showed the best performance. Average BLEU: 0.84. Average human evaluation: 4.6 / 5.

In this segment, errors were minimal and primarily consisted of register differences (informal vs. formal) that did not affect meaning.

Clinical terms: 84% accuracy

This is where things got more interesting. The system does well with standard terms from English-language clinical vocabulary that have direct Spanish equivalents: "depression" → "depresión," "anxiety" → "ansiedad," "cognitive behavioral therapy" → "terapia cognitivo-conductual."

Errors appeared in three categories:

Culturally nuanced terms. "Resilience" in English carries bounce-back connotations that "resiliencia" in Spanish preserves, but "resistance" as a psychoanalytic concept ("resistencia" in Freud) does not carry the same weight across both traditions. The system handled these variably.

Therapeutic modality eponyms. "Gottman Four Horsemen" → "Los Cuatro Jinetes de Gottman" worked well. But "softened start-up" — a Gottman technical term — produced inconsistent translations ("inicio suavizado," "arranque gentil," "apertura suavizada") depending on sentence context.

English terms already adopted in Spanish clinical usage. In many Hispanic contexts, terms like "burnout," "coaching," and "mindfulness" are used directly in English. The system sometimes translated them when it should not have.
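One way to surface the inconsistency problem described above is to group the renderings observed for each source term and report the terms that came out more than one way. The sketch below is an illustration under stated assumptions: the function name `find_inconsistent_terms` and the input format (pairs of source term and observed rendering) are hypothetical, and the sample list only reuses the examples quoted in this post, not study data.

```python
from collections import defaultdict


def find_inconsistent_terms(observations):
    """Return source terms that were rendered in more than one way.

    `observations` is an iterable of (source_term, rendering) pairs.
    Comparison is case-insensitive; renderings are returned sorted.
    """
    renderings = defaultdict(set)
    for source_term, rendering in observations:
        renderings[source_term.lower()].add(rendering.lower())
    return {
        term: sorted(found)
        for term, found in renderings.items()
        if len(found) > 1
    }


# Illustrative input built from the examples quoted in the text.
observed = [
    ("softened start-up", "inicio suavizado"),
    ("softened start-up", "arranque gentil"),
    ("softened start-up", "apertura suavizada"),
    ("Gottman Four Horsemen", "Los Cuatro Jinetes de Gottman"),
    ("anxiety", "ansiedad"),
    ("anxiety", "ansiedad"),
]

inconsistent = find_inconsistent_terms(observed)
print(inconsistent)
# {'softened start-up': ['apertura suavizada', 'arranque gentil', 'inicio suavizado']}
```

A report like this does not say which rendering is correct. It only tells a reviewer where a single canonical translation needs to be chosen.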

Regional idioms: 78% accuracy

This was the segment with the most variation. Regional idioms include expressions and vocabulary that vary significantly between Spanish-speaking countries.

"Estoy muy jodido" can mean "I am in a bad state" in a Hispanic clinical context — but the literal English translation has different connotations and can sound stronger than the original intention. The system produced understandable translations but with nuance loss in 22% of these segments.

Work idioms presented similar problems. "Hay que quemar las naves" (you need to commit with no way back) has no direct idiomatic equivalent in English that captures the same tone of determination in an HR context.

What this means for the product

The numbers are solid for production use in the first two segments. At 92% on neutral phrases and 84% on clinical terms, accuracy is sufficient for the professional to understand the essential message of what the client is saying, which is the goal of bilingual mode.

The 78% on regional idioms requires an advisory layer. In sessions where the client uses intense regional idioms, which happens less frequently but does happen, the professional should treat the translation as an approximation, not an exact rendering of what was said.

We implemented three improvements based on this study:

When the system detects a segment with low translation confidence, it marks it visually for the professional.
The clinical terms glossary is now modality-specific — Gottman terms are treated differently from psychoanalytic terms.
We added a mechanism for the professional to correct a specific translation, and that correction is applied as a preference in future sessions.
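The three improvements above can be pictured with a small sketch. This is not CauceOS code: the threshold value, the glossary contents, and the names `LOW_CONFIDENCE_THRESHOLD`, `GLOSSARIES`, `TermResolver`, and `needs_flag` are all hypothetical, chosen only to show how a confidence flag, a modality-specific glossary, and remembered corrections might fit together.

```python
from dataclasses import dataclass, field

# Hypothetical cutoff; the post does not state the real value.
LOW_CONFIDENCE_THRESHOLD = 0.6

# Hypothetical EN -> ES glossaries, keyed by therapeutic modality.
GLOSSARIES = {
    "gottman": {"softened start-up": "inicio suavizado"},
    "general": {"anxiety": "ansiedad", "depression": "depresión"},
}


@dataclass
class TermResolver:
    modality: str
    # Corrections made by the professional, kept for future sessions.
    preferences: dict = field(default_factory=dict)

    def correct(self, term, preferred):
        """Record the professional's preferred rendering of a term."""
        self.preferences[term.lower()] = preferred

    def resolve(self, term):
        """Look up a term: professional preference first, then the
        modality glossary, then the general one. None if unknown."""
        key = term.lower()
        if key in self.preferences:
            return self.preferences[key]
        for name in (self.modality, "general"):
            if key in GLOSSARIES.get(name, {}):
                return GLOSSARIES[name][key]
        return None


def needs_flag(confidence):
    """True when a segment should be visually marked for the professional."""
    return confidence < LOW_CONFIDENCE_THRESHOLD


resolver = TermResolver(modality="gottman")
print(resolver.resolve("softened start-up"))  # inicio suavizado

# A professional who wants "burnout" left in English records that once.
resolver.correct("burnout", "burnout")
print(resolver.resolve("burnout"))            # burnout

print(needs_flag(0.45))                       # True
```

Checking preferences before any glossary means a professional's correction always wins, which matches the idea that a correction is applied as a preference in later sessions.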

Important disclaimer

This is an internal study, not peer-reviewed, with a small sample (50 sessions) and a single language pair (ES↔EN). The conclusions do not generalize to other language pairs or to all clinical contexts. We publish it because we believe transparency about the real performance of systems is more useful to professionals than marketing claims without data.

As we expand bilingual mode to more language pairs, we will publish similar studies.

Want to participate in the cross-language translation beta program with your specific language pair? Write to us at hola@cauceos.com.


