Enterprise – Krisp Blog
https://krisp.ai/blog/category/enterprise/

Audio-Only Turn-Taking Model v2
https://krisp.ai/blog/krisp-turn-taking-v2-voice-ai-viva-sdk/
Mon, 27 Oct 2025

Introducing Krisp’s Turn-Taking v2

We’ve already discussed the challenges of turn-taking in conversational AI in this blog post.
Now, we’re excited to announce our newest Turn-Taking model, available as part of Krisp’s VIVA SDK.

In this article, we’ll walk through the technology behind the new model and share our latest testing results. The new generation of models is more streamlined than ever—making it simple to integrate Voice Isolation, Turn-Taking, and VAD into your Voice AI pipelines.

If you’d like to see how Krisp’s VIVA SDK can enhance your Voice AI agent experience, apply now from our Developers page.


How the New Model Works

Our latest model predicts End-of-Turns using only audio input—perfect for real-time conversational systems like human-bot interactions.

Compared to v1, krisp-viva-tt-v2 represents a major step forward. It was trained on a more diverse and better-structured dataset, with richer data augmentations that help the model perform more reliably in real-world conditions.


Key Improvements in v2

  • Greater robustness in noisy environments
  • Higher accuracy when paired with Krisp’s Voice Isolation models
  • Faster and more stable turn detection in live conversations
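
To illustrate where such a model sits in a Voice AI pipeline, here is a minimal sketch of the per-frame flow the bullets above imply: isolate the voice first, then score each cleaned frame for end-of-turn. The `run_pipeline` helper and the two stub models are hypothetical placeholders for illustration only, not the VIVA SDK API.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # one short chunk of PCM samples (placeholder type)

@dataclass
class TurnEvent:
    frame_index: int
    end_of_turn: bool

def run_pipeline(frames: List[Frame],
                 isolate: Callable[[Frame], Frame],
                 predict_eot: Callable[[Frame], float],
                 threshold: float = 0.8) -> List[TurnEvent]:
    """Feed each frame through voice isolation, then the turn-taking
    predictor; flag a frame when the end-of-turn score crosses threshold."""
    events = []
    for i, frame in enumerate(frames):
        clean = isolate(frame)      # step 1: remove background noise/voices
        score = predict_eot(clean)  # step 2: audio-only end-of-turn score
        events.append(TurnEvent(i, score >= threshold))
    return events

# Stand-in models (the real ones are the VIVA SDK's neural models):
identity_isolation = lambda f: f
energy_based_eot = lambda f: 1.0 if max(abs(s) for s in f) < 0.01 else 0.0

frames = [[0.5] * 160] * 3 + [[0.0] * 160] * 2  # speech, then silence
events = run_pipeline(frames, identity_isolation, energy_based_eot)
```

The design point is that turn-taking consumes the *isolated* signal, which is why v2 is reported to be more accurate when paired with Voice Isolation.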

Testing Results

Testing on Clean Audio

We evaluated both model versions on ~1800 audio samples from real conversations, including ~1000 “hold” cases and ~800 “shift” cases, with mild background noise.

Although the numerical difference between versions is small on this clean dataset, the results show that v2 achieves a faster mean shift prediction time at the same false positive rate.

Model              Balanced Accuracy   AUC     F1 Score
krisp-viva-tt-v1   0.82                0.89    0.804
krisp-viva-tt-v2   0.823               0.904   0.813

Mean shift time vs false positive rate for Krisp TT

Insight: Even in clean audio conditions, krisp-viva-tt-v2 offers slightly better prediction stability and overall performance.
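For readers who want to reproduce figures like those in the table from raw hold/shift predictions, the standard definitions of balanced accuracy and F1 are a few lines of code. This is an illustrative implementation (with "shift" taken as the positive class), not Krisp's evaluation code.

```python
def confusion(y_true, y_pred):
    """Counts for binary labels: 1 = shift (positive), 0 = hold."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean of recall on shifts and recall on holds; robust to the
    ~1000 hold vs ~800 shift class imbalance."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def f1_score(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 shift cases and 4 hold cases
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
ba = balanced_accuracy(y_true, y_pred)   # 0.5 * (2/3 + 3/4)
f1 = f1_score(y_true, y_pred)            # precision = recall = 2/3
```

AUC additionally requires sweeping the decision threshold over the model's raw scores rather than hard predictions.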


Testing on Noisy Audio

Next, we evaluated the models on noisy audio mixes at 5 dB, 10 dB, and 15 dB noise levels. Two scenarios were tested:

  1. Directly on the noisy dataset
  2. On the same dataset after processing through the Krisp VIVA Voice Isolation model
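
Noisy mixes at a fixed level such as 5, 10, or 15 dB are typically produced by scaling the noise so the speech-to-noise power ratio hits the target SNR before adding it to the clean signal. Below is a minimal sketch of that standard procedure, assuming the dB values above are SNRs; it is not Krisp's dataset-generation code.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db,
    then add it to the speech sample-by-sample."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    scale = math.sqrt(target_p_noise / p_noise)
    return [s + scale * n for s, n in zip(speech, noise)]

# At 10 dB SNR, unit-power noise is attenuated by sqrt(1/10):
mixed = mix_at_snr([1.0] * 100, [1.0] * 100, 10.0)
```

Lower SNR (5 dB) means louder noise relative to speech, which is why accuracy degrades most there.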

In both scenarios, krisp-viva-tt-v2 consistently outperformed v1.

Model              Balanced Accuracy   AUC     F1 Score
krisp-viva-tt-v1   0.723               0.799   0.71
krisp-viva-tt-v2   0.768               0.842   0.757

Performance comparison on noisy datasets

Insight: krisp-viva-tt-v2 improves the F1 score by roughly 0.05 absolute (about 6.6% relative) under noisy conditions, demonstrating greater resilience in real-world environments.


Testing After Noise and Voice Removal

Finally, we tested both models on the same noisy dataset after applying background noise and voice removal with the krisp-viva-tel-v2 model.

Model              Balanced Accuracy   AUC     F1 Score
krisp-viva-tt-v1   0.787               0.854   0.775
krisp-viva-tt-v2   0.816               0.885   0.808

Performance after noise removal

Insight: When combined with Krisp’s Voice Isolation technology, v2 achieves even greater accuracy and stability.


Conclusion

The new krisp-viva-tt-v2 model marks a significant leap forward in real-time conversation handling for Voice AI. With improved robustness against noise and smoother integration with Krisp’s other models, developers can now build faster, smarter, and more natural-sounding conversational agents.

Explore the VIVA SDK today and see how Krisp’s advanced models can elevate your Voice AI experience.

Krisp Launches Accent Conversion for Africa
https://krisp.ai/blog/krisp-launches-accent-conversion-for-africa/
Thu, 16 Oct 2025

The real-time Voice AI leader brings tested Accent Conversion technology to Africa, one of the world’s fastest-growing customer experience hubs

Durban, South Africa — October 16, 2025 — Krisp, the leader in real-time voice AI technology, today announced the launch of Accent Conversion for Africa, enabling clearer and more natural conversations across the continent’s customer experience (CX) sector. As one of the world’s fastest-emerging outsourcing hubs, Africa has a highly skilled, English-speaking workforce with strong cultural alignment to Western markets, positioning it as a strategic bridge for CX operations across Africa and Europe. Krisp Accent Conversion for Africa supports African English accents, including South African, Ugandan, Kenyan, and Nigerian.

 

The launch builds on the success of Krisp Accent Conversion 3.7, which supports Indian, Pakistani, Filipino, and Latin American English accents. Powering CX operations at tier-1 banks, insurers, and BPOs worldwide, Krisp’s AI-powered solution continues to set the industry benchmark for speech clarity, phoneme precision, and naturalness. Krisp Accent Conversion for Africa delivers near-native comprehension between contact center agents and consumers, demonstrating a higher performance than both competitors and unprocessed voice.

“Even in the age of AI, human agents are at the front lines of every meaningful customer interaction and they deserve to be clearly understood,” said Davit Baghdasaryan, Co-Founder and CEO of Krisp. “As the CX industry evolves to become more AI-driven, one thing remains constant: human connection drives loyalty and trust. With Krisp, clarity becomes universal, not cultural, by removing accent bias and empowering every voice to connect globally.”

Advantages of Krisp Accent Conversion for Africa include:

  • Proven performance + measurable impact: Krisp is already trusted at scale, with 250,000+ enterprise seats deployed and 80B+ minutes processed monthly in real-time conversations. Customers using Krisp have seen +99% NPS from end-customers.
  • Eliminated accent bias: Krisp bridges clarity gaps across Africa’s diverse English accents and native languages.
  • Talent expansion + boosted retention: Krisp accent conversion expands access to CX jobs for agents who might otherwise be excluded and preserves agents’ authentic voices, building confidence and authenticity.
  • Cutting costs: Removes the need for expensive and limiting accent neutralization training.
  • Global competitiveness: Allows operators to hire broadly, without limitations due to accent, and compete more effectively with leading outsourcing hubs like India and the Philippines.

“By integrating Krisp’s AI platform, including Accent Conversion and noise cancellation, we’re amplifying the human touch at every interaction,” said Sudhir Agarwal, Founder & CEO of Everise. “Krisp’s technology has consistently outperformed in head-to-head evaluations across clarity, naturalness, and accent accuracy.”

Krisp’s mission is to enhance the productivity of every voice interaction, which includes eliminating bias and language barriers. By combining advanced voice AI with enterprise-scale reliability, Krisp enables global CX organizations to deliver consistent, high-quality interactions at every touchpoint. 

 

To learn more, visit https://krisp.ai/contact-center/accent-conversion/

 

Media Contact

Molly Leahy

krispPR@walkersands.com

Krisp Accent Conversion v3.7 Expands to Latin American Accent Pack
https://krisp.ai/blog/krisp-accent-conversion-v3-7-expands-to-the-latam-accent/
Mon, 06 Oct 2025

Following the successful rollout of Accent Conversion v3.7 for Indian and Filipino accent packs, we are excited to announce that the Latin American (LatAm) accent pack has now been upgraded to v3.7.

This update introduces enhancements to speaker similarity and overall voice stability. As a result, the converted speech sounds closer to the original voice, preserving its unique qualities while remaining clear, stable, and easy to understand.

Key Improvements in LatAm v3.7

  • Speaker Similarity: Noticeably stronger preservation of the original speaker’s voice. Objective evaluations showed a 10% improvement in similarity compared to v3.5.
  • Voice Stability: More consistent pitch and tone throughout speech, eliminating artificial fluctuations and producing a smoother, more natural output.
  • Naturalness: With enhanced similarity and stability, converted speech is perceived as more human-like and fluid. Crowdsourced model comparisons demonstrated a 9% increase in naturalness scores for v3.7.
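
Objective speaker-similarity metrics of the kind cited above are commonly computed as the cosine similarity between speaker embeddings of the original and converted audio (the exact metric Krisp uses is not specified here, so this is an assumption). The embedding extractor below is a placeholder; in practice it would be a speaker-verification model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1];
    for speaker embeddings, closer to 1 means more similar voices."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of the original and converted utterances:
emb_original = [0.2, 0.9, 0.4]
emb_converted = [0.25, 0.85, 0.45]
similarity = cosine_similarity(emb_original, emb_converted)
```

A reported jump from 0.7 to 0.77 would then mean the converted voice's embedding moved measurably closer to the original speaker's.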

Evaluation Results

Our evaluation combined both objective metrics and subjective, crowdsourced testing to ensure robust validation:

  • 37 real-world recordings were sampled for evaluation.
  • For the crowdsourced study, each recording received 40 independent votes, yielding a total of 1,480 votes and ensuring statistical confidence in the results.
  • The reported results represent aggregated averages across all recordings.

These findings consistently confirm the noticeable quality improvements delivered by v3.7.

Metric                                                                   LatAm AC 3.5   LatAm AC 3.7   Comment
Speaker Similarity (0 to 1)                                              0.7            0.77 (+10%)    Objective metric computing similarity between two voices; the higher, the better
Crowdsourced Evaluation – “How natural does the voice sound?” (1 to 5)   3.35           3.44 (+9%)     37 real-world audio recordings assessed by 30 participants

 

Comparative audio samples

Listening Tip: For the most accurate and immersive comparison between Accent Conversion v3.5 and v3.7, we recommend using quality headphones.

This highlights improvements in clarity, naturalness, and speaker identity preservation that may not be as noticeable on laptop or mobile speakers.

#   Improvement Category                      (audio players for Original, AC v3.5, and AC v3.7 omitted in this text version)
1   Speaker Similarity, Voice Stability
2   Speaker Similarity, Speech Naturalness
3   Speaker Similarity
4   Speaker Similarity, Speech Naturalness
5   Speaker Similarity
6   Speaker Similarity

 

 

 

Krisp Announces Accent Conversion for Africa
https://krisp.ai/blog/krisp-announces-accent-conversion-for-south-africa/
Thu, 02 Oct 2025

South Africa is rapidly becoming one of the world’s fastest-growing hubs for customer experience delivery. With its highly-skilled, English-speaking workforce, strong cultural alignment with Western markets, and cost-effective operations, the country has become a go-to region for global enterprises and BPOs. The market is projected to continue expanding rapidly, with South Africa positioned as a global gateway for CX across Africa and Europe.

 

Krisp will make Accent Conversion for African English accents—including South African, Ugandan, Kenyan, and Nigerian—available in October 2025, following the release of its v3.7 model for Indian, Pakistani, Filipino, and Latin American English accents. It will be fully tested and available for live deployment.

 

Proven performance

Krisp’s Voice AI platform is already trusted at scale:

 

  • 250,000+ enterprise seats deployed
  • 80B+ minutes processed monthly in real-time conversations

 

Krisp’s Accent Conversion delivers measurable impact:

 

  • +99% NPS from end-customers
  • +26.1% in sales conversions
  • +14.8% in revenue per booking

 

Accent clarity is often the deciding factor in customer satisfaction, repeat business, and agent productivity. Krisp Accent Conversion for South Africa ensures conversations remain natural, inclusive, and crystal clear—unlocking the country’s full potential as a CX powerhouse.

 

Accent Conversion takes these advantages further by:

  • Eliminating accent bias: Bridges clarity gaps across South Africa’s diverse English accents and native languages.
  • Unlocking opportunity: Expands access to CX jobs for agents who might otherwise be excluded.
  • Boosting retention: Preserves agents’ natural voices, building confidence and authenticity.
  • Cutting costs: Removes the need for expensive and limiting accent neutralization training.
  • Scaling without limits: Allows operators to hire broadly, without limitations due to accent, and compete globally with India and the Philippines.

 

Krisp continues to set the standard for real-time Voice AI at enterprise scale, giving global operators technology they can trust to deliver measurable results.

 

Krisp Accent Conversion for South Africa will be available for deployment in October. To learn more, visit https://krisp.ai/contact-center/accent-conversion/

Krisp Accent Conversion v3.7 Expands to Support Filipino Accents
https://krisp.ai/blog/krisp-accent-conversion-v3-7-expands-to-the-filipino-accent/
Thu, 25 Sep 2025

In August, we introduced Accent Conversion v3.7 with major improvements in naturalness, fluency, and voice stability, starting with the Indian accent pack. That release marked a turning point—showing how much closer we can get to native-like, stable, and intelligible speech for global contact centers.

Today, we’re excited to announce that the Filipino accent pack is now upgraded to v3.7.

Key Improvements in Filipino v3.7

Through rigorous testing and customer feedback, v3.7 shows clear gains over v3.5 across all major dimensions:

  • Naturalness: Converted speech is significantly more human-like and conversational. Crowdsourced model comparisons demonstrated a 32% stronger preference for v3.7 over the previous version.
  • Pronunciation Accuracy: Enhanced phoneme pronunciation and intelligibility, with a ~9.31% relative improvement in Phoneme Error Rate (PER) on customer datasets. This improvement is largely driven by incorporating more conversational data during training. Accent-specific gains include more native-like articulation of consonants such as “t,” “p,” and “r.”
  • Voice Stability: Greater consistency in pitch and tone throughout speech, reducing unnatural fluctuations. This contributes directly to more natural and stable-sounding output.
  • Speech & Audio Clarity: Clearer audio with fewer artifacts and distortions, particularly in cases of slurred or mumbled speech. Crowdsourced model comparisons showed a 37% stronger preference for v3.7 in terms of overall clarity and intelligibility.
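
Phoneme Error Rate, cited in the pronunciation bullet above, is conventionally the Levenshtein edit distance between reference and predicted phoneme sequences, divided by the reference length. The sketch below implements that standard definition; it is an illustration, not Krisp's benchmarking code, and the phoneme symbols are hypothetical examples.

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance counting
    substitutions, insertions, and deletions."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def phoneme_error_rate(ref_phonemes, hyp_phonemes):
    return edit_distance(ref_phonemes, hyp_phonemes) / len(ref_phonemes)

# "cat" with one substituted vowel: one error over three phonemes
per = phoneme_error_rate(["K", "AE", "T"], ["K", "AH", "T"])
```

A ~9.31% *relative* improvement means the new PER is about 9% lower than the old one, not 9 points lower.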

Evaluation Results

For subjective and objective evaluations, 57 real-world recordings were sampled.

For the crowdsourced evaluation, each recording received 40 independent votes (2,280 votes in total), ensuring statistical confidence.

The results shown in the table below represent aggregated averages across all recordings.

Metric                                                                               Filipino AC v3.5   Filipino AC v3.7     Comment
Crowdsourced Evaluation – “How natural does the voice sound?” (1 to 5)               3.56               3.71 (+4%)           57 real-world audio recordings assessed by 30 participants
Crowdsourced Models’ Comparison – Which option sounds more natural?                  982 votes          1,298 votes (+32%)   57 recording pairs, each assessed by 40 participants (2,280 votes in total)
Crowdsourced Models’ Comparison – Which speech sounds more clear and intelligible?   961 votes          1,319 votes (+37%)   57 recording pairs, each assessed by 40 participants (2,280 votes in total)

 

Comparative audio samples

Listening Tip: For the most accurate and immersive comparison between v3.5 and v3.7 Accent Conversion, we recommend using quality headphones.

This helps highlight the improvements in clarity, naturalness, and speaker identity preservation that may be less perceptible on laptop or mobile speakers.

 

#    Improvement Category                          (audio players for Original, AC v3.5, and AC v3.7 omitted in this text version)
1    Speech Naturalness, Speech Clarity
2    Speech Naturalness, Less Accent Leakage
3    Speech Naturalness, Speech Clarity
4    Speech Clarity, Better Phonemes (“we”, “for”)
5    Speech Naturalness, Voice Stability
6    Speech Clarity, Speech Naturalness
7    Speech Naturalness, Speech Clarity
8    Speech Clarity, Better Phonemes (“check support”)
9    Speech Naturalness, Less Accent Leakage
10   Speech Clarity, Better Phonemes (“questions”)

Introducing Krisp Accent Conversion v3.7
https://krisp.ai/blog/introducing-krisp-accent-conversion-v3-7/
Thu, 07 Aug 2025

Krisp Accent Conversion v3, released in March 2025, marked a breakthrough moment in the evolution of our accent conversion technology. For the first time in two years, we felt the system was mature enough for wide-scale production use.

 

In May 2025, we released Accent Conversion v3.5, bringing a major quality upgrade — with ~20% improvement across key metrics for both Filipino and Indian accents (details here). Thanks to Krisp desktop application’s auto-update mechanism, the rollout reached 95% of users within 2 days, and the feedback was overwhelmingly positive, both from agents and customers, driving sentiment and business KPIs.

 

In July 2025, we expanded the offering to support the Latin American accent pack. The launch quickly gained traction with several large customers and is now deployed across thousands of agents.

 

Throughout this period, we’ve worked closely with partners, agents, and customers to deeply understand corner cases — especially for the Indian accent, which is the most challenging due to its vast regional variation and phonetic complexity. This close collaboration, combined with relentless efforts from the world-class research and engineering teams at Krisp, has culminated in another major step forward now.

 

Today, we’re launching Accent Conversion v3.7, delivering significant improvements in naturalness and voice stability. This release is currently focused on the Indian accent pack, with support for other accents rolling out soon.

The following sections summarize the key improvements, benchmarking methodology, and a side-by-side comparison of Accent Conversion v3.7 with v3.5.

Key Improvements in AC v3.7

  1. Naturalness: The converted speech sounds even more human-like and natural, with much improved filler-sound handling. Here, expert-rated naturalness scores improved by +14%. Crowdsourced evaluations confirm it with a +6% gain.
  2. Voice Stability: Enhanced consistency in pitch and tone throughout the utterance, helping avoid unnatural fluctuations, especially for thick accents. This contributed to improved naturalness and clarity scores across all metrics.
  3. Speech & Audio Clarity: Improvements were noted in both intelligibility and the reduction of artifacts and distortions. Speech Clarity scores rose by 5% in expert assessments, with corresponding enhancements across Meta metrics.
  4. Pronunciation Accuracy: There’s a gain in objective metrics as well, about a 4% relative improvement in Phoneme Error Rate (PER), which can be attributed to more conversational data inclusion in the training. Here, some noticeable accent-specific enhancements in phoneme pronunciation, such as more native-like articulation of “R” and “L”, contribute to a +5% increase in the Accent Conversion score.

Evaluation Results

For subjective and objective evaluations, 78 real-world recordings were sampled.

For the crowdsourced evaluation, each recording received 30 independent votes (2,340 votes in total), ensuring statistical confidence.

The results shown in the table below represent aggregated averages across all recordings.

Metric                                                                   IN AC v3.5    IN AC v3.7           Comment
Expert Evaluation – Natural speech (1 to 5)                              3.7           4.2 (+14%)           Speech sounds even more human-like, with much improved filler-sound handling
Expert Evaluation – Speech Clarity (1 to 5)                              4.0           4.2 (+5%)            Fewer artifacts and clearer speech, especially in slurred and mumbled segments
Expert Evaluation – Accent Conversion (1 to 5)                           4.3           4.5 (+5%)            Accent-specific enhancements in phoneme pronunciation, such as more native-like articulation of “R” and “L”
Crowdsourced Evaluation – “How natural does the voice sound?” (1 to 5)   3.4           3.6 (+6%)            78 real-world audio recordings assessed by 30 participants
Crowdsourced Models’ Comparison – Which option sounds more natural?      1,242 votes   1,878 votes (+20%)   78 recording pairs, each assessed by 40 participants
Meta Aesthetic – Natural speech (1 to 10)                                5.6           5.8 (+4%)
Meta Aesthetic – Speech Clarity (1 to 10)                                7.5           7.6 (+1%)

 

Comparative audio samples

Listening Tip: For the most accurate and immersive comparison between v3.5 and v3.7 Accent Conversion, we recommend using quality headphones.

This helps highlight the improvements in clarity, naturalness, and speaker identity preservation that may be less perceptible on laptop or mobile speakers.

#   Improvement Category                                  (audio players for Original, AC v3.5, and AC v3.7 omitted in this text version)
1   Speech Naturalness
2   Speech Naturalness
3   Speech Naturalness, Speech Clarity
4   Speech Clarity
5   Speech Clarity, Speech Naturalness, Voice Stability
6   Speech Clarity, Speech Naturalness, Voice Stability
7   Speech Naturalness, Speech Clarity
8   Speech Naturalness, Speech Clarity

 

Appendix

Subjective Evaluation

Our evaluation was conducted across two structured tracks: expert panel ratings and crowdsourced listener preferences, designed to capture both technical precision and human perception.

Real-world agent calls were sampled to represent a diverse set of speakers and input conditions, including, but not limited to:

  • Accent level – high, medium, low
  • Speech rates and fluency
  • Background conditions (quiet, noisy, multi-speaker environments)

Evaluators scored each recording across four qualitative dimensions using a 5-point Likert scale:

Score Meaning
5 Excellent / Native-like
4 Very Good
3 Acceptable
2 Needs Improvement
1 Poor / Unintelligible

1. Expert Panel Evaluation

Six expert evaluators independently rated matching audio pairs — each pair consisting of the same original voice converted by AC v3.5 and AC v3.7.

To eliminate bias:

  • File names were anonymized (no version markers)
  • The order of samples was randomized
  • Scoring was blind and individual (no group discussion)

2. Crowdsourced Evaluation

To further simulate real-world user perception, a blind A/B test was run on pairs of recordings: AC v3.5 vs. AC v3.7.
78 real-world audio recording pairs were evaluated, with each pair assessed by 40 participants, resulting in 3,120 votes overall.

Participants were asked the following question:
“Which option sounds more natural (i.e., more human-like)?”

Results:

  • Version 3.5 was selected 1242 times
  • Version 3.7 was selected 1878 times
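
As a quick sanity check on such a vote split (our illustration, not part of Krisp's stated methodology), a normal approximation to the binomial test shows how far 1,878 wins out of 3,120 votes sits from the 50/50 null. Note this treats votes as independent, which is an approximation since 40 raters scored each pair.

```python
import math

def preference_z_score(wins, total):
    """Normal approximation to a binomial test of H0: p = 0.5.
    z far above ~2 indicates a preference unlikely to be chance."""
    expected = total / 2
    std_dev = math.sqrt(total * 0.25)  # binomial sd under p = 0.5
    return (wins - expected) / std_dev

z = preference_z_score(1878, 1878 + 1242)
```

Here z comes out above 11, so the preference for v3.7 is far outside what random voting would produce.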

Evaluation metrics

Accent Conversion performance was measured across four key dimensions. These were selected based on real-world call center priorities such as clarity, naturalness, and robustness.

Metric                   Description
Accent Conversion        How effectively the speaker’s original accent is transformed into a neutral or target accent. High scores mean minimal accent leakage or trace of the original pronunciation.
Speech Clarity           Evaluates articulation, intelligibility, and absence of audio distortions, such as mumbling, muffling, or low vocal energy.
Natural Speech           Measures how closely the output resembles fluid, human-like speech, including natural variations in pitch, tone, rhythm, and intonation.
Pronunciation Accuracy   Measures how closely the converted speech matches standard American English pronunciation at the phoneme level: whether individual sounds (vowels, consonants, syllables) are produced correctly and consistently, without distortion, misplacement, or omission, so that the converted voice sounds intelligible and native-like to a U.S. listener.

Objective Evaluation

For objective evaluation, the same set of recordings was processed using Meta Audiobox Aesthetics to capture metrics strongly correlated with Natural Speech and Speech Clarity. Additionally, to quantify how each system impacts phoneme accuracy, all recordings were also processed using the Facebook NN Phonemizer, whose output is strongly correlated with the Accent Conversion metric.

Objective Metric           Interpretation     Highly Correlated Subjective Metric   What It Captures
Production Quality*        Higher is better   Speech Clarity                        Fidelity, presence of audio artifacts, balance, and clarity of the output signal
Content Enjoyment*         Higher is better   Natural Speech                        Perceived naturalness, fluidity, and enjoyment of listening, akin to human listening satisfaction
Phoneme Error Rate (PER)   Lower is better    Accent Conversion                     Measures pronunciation distortion; lower scores mean more accurate, intelligible speech with better articulation

* These metrics are derived from waveform-level analysis and do not require transcript or linguistic alignment, making them ideal for evaluating accent conversion outputs that vary in delivery and prosody.

Audio-only, 6M weights Turn-Taking model for Voice AI Agents
https://krisp.ai/blog/turn-taking-for-voice-ai/
Mon, 04 Aug 2025

In this article we discuss an outstanding problem in today’s Voice AI Agents: turn-taking. We examine why it is a hard problem and present a solution in Krisp’s VIVA SDK. We also benchmark the Krisp solution against some of the established solutions on the market.

Note: The Turn-Taking model is included in the VIVA SDK offering at no additional charge.

What is turn-taking?

Turn-taking is the fundamental mechanism by which participants in a conversation coordinate who speaks when. While seemingly effortless in human interaction, modeling this process computationally for human-to-agent conversations is highly complex. In the context of Voice AI Agents (including voice assistants, customer support bots, and AI meeting agents), turn-taking decides when the agent should speak, listen, or remain silent.

Without effective turn-taking, even the most advanced dialogue systems can come across as unnatural, unresponsive, and frustrating to use. A precise and lightweight turn-taking model enables natural, seamless conversations by minimizing interruptions and awkward pauses while adapting in real time to human cues such as hesitations, prosody, and pauses.

In general, turn-taking includes the following tasks:

  • End-of-turn prediction – predicting when the current speaker is likely to finish their turn
  • Backchannel prediction – detecting moments where a listener may provide short verbal acknowledgments like “uh-huh”, “yeah”, etc. to show engagement, without intending to take over the speaking turn.

In this article, we present our first audio-based turn-taking model, which focuses on the end-of-turn prediction task using only audio input. We chose to release the audio-based turn-taking model first, as it enables faster response times and a lightweight solution compared to text-based models, which usually require large architectures and depend on the availability of a streamable ASR providing real-time, accurate transcriptions.

Approaches to Turn-Taking

Solutions to the turn-taking problem are usually implemented as AI models that operate on audio and/or text representations.

1. Audio-based

Audio-based approaches rely on analyzing acoustic and prosodic features of speech, including changes in pitch, energy levels, intonation, pauses, and speaking rate. By detecting silence or overlapping speech, the system predicts when the user has finished speaking and when it is safe to respond. For example, a sudden drop in energy followed by a pause can be interpreted as a turn-ending cue. Such models are effective in real-time, low-latency scenarios where immediate response timing is critical.
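
The simplest acoustic baseline of this family, which learned models like Krisp's improve upon, is a pause detector: declare end-of-turn after the frame energy stays below a threshold for some minimum duration. The sketch below assumes 10 ms frames and illustrative thresholds; real systems replace this heuristic with a neural predictor precisely because, as discussed later, pauses and hesitations are not reliable turn boundaries.

```python
from typing import List, Optional

def detect_end_of_turn(frame_energies: List[float],
                       silence_threshold: float = 0.01,
                       min_silence_frames: int = 30) -> Optional[int]:
    """Return the index of the frame at which the speaker has been
    silent for `min_silence_frames` consecutive frames, or None.
    With 10 ms frames, 30 frames corresponds to a 300 ms pause."""
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy < silence_threshold:
            silent_run += 1
            if silent_run >= min_silence_frames:
                return i  # pause long enough: treat as end of turn
        else:
            silent_run = 0  # speech resumed: reset the pause counter
    return None

# Ten frames of speech followed by sustained silence:
eot_frame = detect_end_of_turn([0.5] * 10 + [0.001] * 40)
```

The tension is visible in the single `min_silence_frames` knob: set it short and the agent interrupts hesitations; set it long and every response feels sluggish.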

2. Text-based

Text-based solutions analyze the transcribed content of speech rather than the raw audio. These models detect linguistic cues that indicate turn completion, such as sentence boundaries, punctuation, discourse markers (e.g., “so,” “anyway”), natural language patterns, or semantics (e.g., the user might directly ask the bot not to speak). Text-based systems are often integrated with dialogue state tracking and natural language processing (NLP) modules, making them effective for scenarios where accurate semantic interpretation of user intent is essential. However, they may require larger neural network architectures to effectively analyze the linguistic content.

3. Audio-Text Multimodal (Fusion)

Multimodal solutions combine both acoustic and textual inputs, leveraging the strengths of each. While audio-based methods capture real-time prosodic cues, text-based analysis provides deeper semantic understanding. By integrating both modalities, fusion models can make accurate and context-aware predictions of turn boundaries. These systems are effective in complex, multi-turn conversations where relying on either audio or text alone might lead to errors in timing or intent detection.

Challenges of Turn-Taking

Hesitation and filler words

In natural dialogue, speakers often take a pause using fillers like “um” or “you know” without intending to give up their turn. For instance:

“I think we should, um, maybe –” [The agent jumps in, assuming the sentence is over]

Here, a turn-taking system must distinguish hesitation from completion, or risk interrupting too early.

Natural pauses vs. true end-of-turns

Pauses are not always indicators that a speaker has finished. For example:

“Yesterday I woke up early, then… [pause] I went to work…”

A model might misinterpret the pause as a turn boundary, generating a premature response and breaking the conversational flow.

Quick turn prediction

Minimizing response latency is essential for maintaining natural conversational flow. Humans tend to respond quickly, sometimes even reactively, when the end of the speech is obvious. If a model fails to predict the turn boundary fast enough, the system may sound sluggish or unnatural. The challenge is to trigger responses at just the right moment – early enough to sound fluid, but not so early that it risks interrupting the speaker.

Varying speaking styles and accents

People speak in diverse rhythms, intonations, and speeds. A fast speaker with sharp pitch drops might appear to end a sentence even when they haven’t. Conversely, a slow, melodic speaker may stretch syllables in ways that confuse timing-based systems. Modeling these variations effectively requires a neural network–based approach.

Krisp’s audio-based Turn-Taking model

Krisp recently released AI models for effective noise cancellation and voice isolation for Voice AI Agent use cases, particularly to reduce premature turn-taking caused by background noise (see more details). This technology is widely deployed and recently passed a 1B minutes/month milestone.

It was only natural for us to take on the larger problem of turn-taking (TT). In this first iteration, we designed a lightweight, low-latency, audio-based turn-taking model optimized to run efficiently on a CPU. The Krisp TT model is built into Krisp’s VIVA SDK, where, using the Python SDK, you can easily chain it with the Voice Isolation models and place it in front of a voice agent to create a complete, end-to-end conversational flow, as shown in the following diagram.

 

Here, the TT model continuously outputs a confidence score (probability) ranging from 0 to 1, indicating the likelihood of a shift – a point where a speaker is expected to finish their turn. It operates on 100ms audio frames, assigning a shift confidence score to each frame. To convert this score into a binary decision, we apply a configurable threshold. If the score exceeds this threshold (Δ), we interpret it as a shift (end-of-turn) prediction; otherwise, the model assumes the current speaker is still holding the turn.

We also define a maximum hold duration, which defaults to 5 seconds. The model is designed such that, during uninterrupted silence, the confidence score gradually increases and reaches a value of 1 precisely at the end of this maximum hold period.
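As a rough illustration (not the VIVA SDK API – the helper name and parameters below are hypothetical), the frame-level thresholding and maximum-hold behavior described above can be sketched in Python:

```python
def detect_shift(scores, threshold=0.5, frame_ms=100, max_hold_s=5.0):
    """Scan per-frame shift confidences (one score per 100 ms frame)
    and return the index of the first frame that triggers a shift."""
    max_frames = int(max_hold_s * 1000 / frame_ms)
    for i, score in enumerate(scores):
        # A score above the threshold is interpreted as an end-of-turn.
        if score >= threshold:
            return i
        # During uninterrupted silence the score ramps toward 1, so it is
        # guaranteed to trigger by the end of the maximum hold period;
        # this cap makes that bound explicit.
        if i + 1 >= max_frames:
            return i
    return None  # turn still held at the end of the buffer


# With a 0.5 threshold, the third frame (index 2) triggers the shift.
print(detect_shift([0.1, 0.2, 0.8], threshold=0.5))
```

Lowering `threshold` trades faster responses for more false positives, which is exactly the latency-accuracy tradeoff measured later in this post.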

Comparison with other Turn-Taking models

Let’s take a closer look at how other solutions handle the turn-taking problem in comparison to Krisp.

Simple VAD (Voice Activity Detection)

The basic VAD-based approach is as straightforward as it gets: if you have paused your speech, you have probably finished your turn. Technically, once a few seconds of silence (the duration is usually configurable) is detected, the system assumes the speaker has finished and hands over the turn. While efficient, this method lacks awareness of conversational context and often struggles with natural pauses or hesitant speech. In our comparisons, we use the Silero VAD model with a 1-second silence detection window as a simple VAD-based turn-taking approach.
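For concreteness, the silence-window rule can be expressed in a few lines of Python. The function below is an illustrative sketch (the frame size and window length are assumptions), operating on per-frame speech/non-speech flags such as those a VAD like Silero produces:

```python
def vad_turn_end(vad_flags, frame_ms=30, silence_window_ms=1000):
    """Declare end-of-turn once a full silence window of consecutive
    non-speech frames has accumulated after some speech was heard."""
    needed = silence_window_ms // frame_ms  # frames of silence required
    silent = 0
    spoke = False
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            spoke, silent = True, 0
        else:
            silent += 1
            if spoke and silent >= needed:
                return i  # frame index where the turn is handed over
    return None  # no turn boundary detected
```

Because the rule only sees silence, a mid-sentence pause longer than the window is indistinguishable from a real end-of-turn – precisely the weakness noted above.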

SmartTurn

SmartTurn v1 and SmartTurn v2 by Pipecat are open-source AI models designed to detect exactly when a speaker has finished their turn. We picked them for in-depth comparison because, like Krisp TT, they are audio-based models.

Interestingly, SmartTurn models introduce a hybrid strategy. They first wait for 200ms of silence detected by Silero VAD, then evaluate whether a turn shift should occur. If the confidence is too low to switch, the system defers the decision. However, if silence persists for 3 seconds (default value, configurable parameter in SmartTurn), it forcefully initiates the turn transition. This layered approach aims to strike a balance between speed and caution in handling user pauses.
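A minimal sketch of that layered decision flow (illustrative only: `model_score` stands in for the actual end-of-turn classifier, the 0.6 confidence threshold is our assumption, and the timing constants mirror SmartTurn's documented 200 ms / 3 s defaults):

```python
def hybrid_turn_end(frames, model_score, threshold=0.6, frame_ms=100,
                    min_silence_ms=200, force_after_ms=3000):
    """frames: per-frame speech flags; model_score(i): end-of-turn
    confidence at frame i. Returns (frame index, reason) or (None, "holding")."""
    silent_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            silent_ms = 0
            continue
        silent_ms += frame_ms
        # Safety valve: force the turn transition after prolonged silence.
        if silent_ms >= force_after_ms:
            return i, "forced"
        # After the initial silence window, defer to the model's confidence;
        # if it is too low, keep waiting.
        if silent_ms >= min_silence_ms and model_score(i) >= threshold:
            return i, "model"
    return None, "holding"
```

The deferral path is what distinguishes this strategy from plain VAD: a hesitant pause with low end-of-turn confidence is simply carried forward until either the model agrees or the hard timeout fires.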

Tested Models

The following table gives a high-level comparison between the contenders:

Attribute | Krisp TT | SmartTurn v1 | SmartTurn v2 | VAD-based TT
Model Parameter Count | 6.1M | 581M | 95M | 260k
Model Size | 65 MB | 2.3 GB | 360 MB | 2.3 MB
Recommended Execution | On CPU | On GPU | On GPU | On CPU
Overall Accuracy | Good | Good | Good | Poor

Test Dataset

The test dataset was built from real conversational recordings, with manually labeled shift and hold scenarios. A shift marks a point where one speaker hands over the conversation, while a hold captures cases where the speaker continues after a brief pause, filler words, or unfinished context.

The dataset consists of 1,875 labeled audio samples, including a significant number of labeled shift and hold scenarios. Each audio file is annotated to include the silence at the end of a speaker’s segment – either resulting in a turn shift or a hold. The test data was annotated according to multiple criteria, including context, intonation, filler words (e.g., “um,” “am”), keywords (e.g., “but,” “and”), and breathing patterns.

Below are the statistics on silence duration for each scenario type, as well as the distribution of shift and hold cases based on the criteria mentioned above.

 

 

 

Training Dataset

Our training dataset comprises approximately 2,000 hours of conversational speech, containing around 700,000 speaker turns.

Evaluation: Prediction Quality Metrics

To assess the performance of the turn-taking model, we used a combination of classification metrics and timing-based analysis:

Metric | Description
TP | True Positives: correctly predicted positive-class cases
TN | True Negatives: correctly predicted negative-class cases
FP | False Positives: incorrectly predicted positive-class cases
FN | False Negatives: missed positive-class cases

Metric | Formula | Description
Precision | TP / (TP + FP) | Proportion of predicted positives that are actually positive
Recall | TP / (TP + FN) | Proportion of actual positives correctly predicted
Specificity | TN / (TN + FP) | Proportion of actual negatives correctly predicted
Balanced Accuracy | (Recall + Specificity) / 2 | Average performance across both classes (positive and negative)
F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of Precision and Recall; balances false positives and false negatives
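These formulas translate directly into code. Here is a small self-contained helper (ours, for illustration – not part of any SDK) that computes all five metrics from a confusion matrix:

```python
def turn_metrics(tp, tn, fp, fn):
    """Compute the classification metrics above from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # a.k.a. true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Example: turn_metrics(tp=8, tn=8, fp=2, fn=2) yields 0.8 for every metric.
```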

AUC: The AUC is the area under the ROC curve; a higher value indicates better classification performance. The ROC (receiver operating characteristic) curve shows the trade-off between the true positive rate and the false positive rate as the decision threshold is varied. For more details on AUC and the other metrics, read here.

Evaluation: Latency vs. Accuracy tradeoff (MST vs FPR)

There is a natural tradeoff between accuracy and latency, i.e., how quickly the system detects a true shift. We can reduce latency by lowering the threshold; however, this will likely increase the false-positive rate (FPR) and cause unwanted interruptions. On the other hand, we don’t want to wait too long to predict a shift, because the increased latency results in awkward interaction (see the chart below).

 

Therefore, the latency-to-accuracy relationship is important, and here we measure a TT system’s latency by mean shift time (MST). The shift time is defined as the duration between the onset of silence and the moment the end-of-turn (shift) is predicted. If the model outputs a confidence score, the end-of-turn prediction can be controlled via a threshold. This makes the threshold an important control lever in the trade-off between reaction speed and prediction accuracy:

  • Higher thresholds result in delayed shift predictions, which help reduce false positives (i.e., shift detections during the current speaker hold period which leads to interruption from the bot). However, this increases the mean shift time, making the system slower to respond.
  • Lower thresholds lead to faster responses, decreasing mean shift time, but at the cost of increased false positives, potentially causing the bot to interrupt speakers prematurely.

To visualize this trade-off, we plot the relationship between mean shift time (calculated over end-of-speech cases) and false positive (interruption) rate as the threshold varies from 0 to 1. A lower curve indicates a faster mean response time for the same interruption rate – or, from another perspective, fewer interruptions for the same mean response time. Here you can see the corresponding plots for Krisp TT, SmartTurn v1, and SmartTurn v2. Note that we can’t directly visualize such a chart for the VAD-based TT, as MST vs FPR requires a model that outputs a confidence score, whereas the VAD-based model produces binary outputs (0 or 1). The same limitation applies to the AUC-shift computation shown in the table below.

This means that the Krisp TT model has a considerably faster average response time than SmartTurn (0.9 vs. 1.3 seconds at a 0.06 FPR) for producing a true-positive answer.

To summarize the overall latency-accuracy tradeoff, we also compute the area under the MST vs FPR curve. This single scalar score captures the model’s ability to respond quickly while minimizing interruptions across different thresholds. A lower area indicates better performance.
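To make the procedure concrete, the sketch below (our illustrative code, not Krisp's evaluation harness) sweeps a threshold over per-frame confidence traces – silence-onset-aligned score sequences for shift cases and for hold cases – and computes the MST-vs-FPR curve plus the area under it with the trapezoidal rule:

```python
def mst_fpr_curve(shift_traces, hold_traces, frame_ms=100,
                  thresholds=(0.2, 0.4, 0.6, 0.8)):
    """Each trace is a list of shift-confidence scores, one per frame,
    starting at the onset of silence. Returns (curve points, area)."""
    curve = []
    for t in thresholds:
        # Mean shift time: silence onset to first threshold crossing,
        # averaged over the shift cases that actually cross it.
        times = [next(i for i, s in enumerate(tr) if s >= t) * frame_ms / 1000.0
                 for tr in shift_traces if max(tr) >= t]
        # False positive rate: fraction of hold cases that cross it.
        fpr = sum(max(tr) >= t for tr in hold_traces) / len(hold_traces)
        if times:
            curve.append((fpr, sum(times) / len(times)))
    # Area under the MST-vs-FPR curve (trapezoidal rule): lower is better.
    pts = sorted(curve)
    area = sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
    return curve, area
```

Each threshold yields one (FPR, MST) point; sweeping the threshold traces out the curve, and the area collapses the whole tradeoff into a single comparable score.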

Evaluation Results

Model | Balanced Accuracy | AUC Shift | F1 Score Shift | F1 Score Hold | AUC (MST vs FPR)
Krisp TT | 0.82 | 0.89 | 0.80 | 0.83 | 0.21
VAD-based TT | 0.59 | N/A | 0.48 | 0.70 | N/A
SmartTurn v1 | 0.78 | 0.86 | 0.73 | 0.84 | 0.39
SmartTurn v2 | 0.78 | 0.83 | 0.76 | 0.78 | 0.44

(N/A: the VAD-based model produces binary outputs, so confidence-based metrics cannot be computed for it.)

💡 It’s important to note that the Krisp TT model delivers comparable predictive quality and a significantly better latency-accuracy tradeoff while being 5-10x smaller and optimized to run efficiently on a CPU. The VAD-based turn-taking approach is even more lightweight, but it performs significantly worse than dedicated TT models – highlighting the importance of modeling the complex relationships between speech structure, acoustic features, and turn-taking behavior.

Demo

Here’s a simple dialogue showing how Krisp’s Turn-Taking model works in practice. In the demo, you’ll hear intentional utterances, pauses, filler words and interruptions. The response time you observe includes the Turn-Taking model’s speed, plus the latency from the speech-to-text (STT) system and the language model (LLM).

Krisp’s Turn-Taking Model

Krisp’s TT model vs Pipecat’s SmartTurn V2

This demo compares Krisp’s Turn-Taking model with Pipecat’s SmartTurn model (using SmartTurn’s configurable 3-second default silence limit). To highlight the differences visually, we’ve also overlaid a speech-to-text transcript on the video.

Future Plans

Improved Accuracy in TT

While this initial audio-based TT model provides balanced accuracy and latency, it is mainly limited to analyzing prosodic and acoustic features, such as changes in intonation, pitch, and rhythm. By also analyzing linguistic features, such as the syntactic completion of a sentence, we can further improve the accuracy of the TT model.

We plan to build the following features as well:

  • Text-based Turn-Taking: This model will use text-only input and predict end-of-turn with a custom neural network trained for this use case.
  • Audio-Text Multimodal (Fusion): This model will use both audio and text inputs to leverage the best from these two modalities and give the highest accuracy end-of-turn prediction.

Early prototypes show promising results, with the multimodal approach noticeably outperforming the audio-based turn-taking models.

Backchannel support

Backchannel detection is another challenge encountered during the development of Voice AI agents. A “backchannel” is the secondary or parallel form of communication that occurs alongside a primary conversation or presentation. It encompasses the responses a listener gives to a speaker to indicate they are paying attention, without taking over the main speaking role.

While interacting with an AI agent, in some cases the user may genuinely want to interrupt – to ask a question or shift the conversation. In others, they might simply be using backchannel cues like “right” or “okay” to signal that they’re actively listening. The core challenge lies in distinguishing meaningful interruptions from casual acknowledgments.

Our roadmap includes the release of a reliable dedicated backchannel prediction model.

The post Audio-only, 6M weights Turn-Taking model for Voice AI Agents appeared first on Krisp.

Krisp vs. Sanas – Inbound Noise Cancellation Comparison https://krisp.ai/blog/krisp-vs-sanas-inbound-noise-cancellation-comparison/ https://krisp.ai/blog/krisp-vs-sanas-inbound-noise-cancellation-comparison/#respond Tue, 22 Jul 2025 05:55:10 +0000 https://krisp.ai/blog/?p=21738 Introduction Call quality isn’t just about how clearly the agent speaks — it’s also about how clearly the agent can hear. In contact centers, where efficiency and accuracy drive performance, background noise from the customer side can have a major impact on agent productivity and experience. Customers often call in from noisy environments — traffic, […]

The post Krisp vs. Sanas – Inbound Noise Cancellation Comparison appeared first on Krisp.

Introduction

Call quality isn’t just about how clearly the agent speaks — it’s also about how clearly the agent can hear. In contact centers, where efficiency and accuracy drive performance, background noise from the customer side can have a major impact on agent productivity and experience. Customers often call in from noisy environments — traffic, households, public spaces — introducing acoustic clutter that leads to repetition, frustration, and longer handle times.

Traditional noise cancellation solutions have focused primarily on the outbound audio channel, removing noise from the agent’s side before it reaches the customer. But that only solves half the problem. Inbound noise cancellation — removing distractions from the customer side before it reaches the agent or the AI — is often just as important.

At Krisp, we’ve long recognized the importance of customer-side audio cleanup, and we’ve been solving it at scale for years. Our vision for inbound noise cancellation as a key enabler for better agent experiences is detailed in this article, where we highlight how noisy customer audio affects handle times, comprehension, and overall call quality.

Today, Krisp’s mature, production-grade inbound noise cancellation models power real-world applications:

  • Krisp AI Meeting Assistant — deployed for years in Krisp’s desktop app, helping professionals clearly hear their remote counterparts during online meetings — even when the other side is calling from a noisy café, home, or airport.
  • Krisp AI Contact Center — used by BPOs and customer support teams to clean up customer voices in live calls, boosting agent comprehension.
  • Krisp SDK — starting in March 2025, Krisp’s inbound Noise and Voice Cancellation technology became available through our SDKs for seamless integration with server-side, real-time voice AI systems. Today, Krisp powers some of the largest production Voice Bots, helping them solve critical challenges like turn-taking accuracy, background noise robustness, and ASR performance in real-world environments.

Sanas entered the market with an outbound noise cancellation solution, generally available since August 2024. On May 30, 2025, Sanas announced a new omnidirectional noise cancellation model, claiming support for both inbound and outbound audio cleanup, including real-time customer-side voice processing.

Given the importance of inbound noise cancellation, we decided to put this new offering to the test.

Just as we conducted an in-depth comparison of Sanas vs. Krisp for outbound noise cancellation, we ran a technical evaluation of Sanas’s inbound noise cancellation solution against Krisp’s production-grade models, focusing on real-world call center scenarios, voice quality, and effectiveness in handling background speech.

Understanding the Differences Between Inbound and Outbound Noise Cancellation

While both inbound and outbound noise cancellation (NC) aim to improve voice clarity, the conditions they operate under are fundamentally different. These constraints make inbound NC a technically more complex and demanding task, and not all noise cancellation models are designed to handle it effectively.

Aspect | Outbound Noise Cancellation | Inbound Noise Cancellation
Audio Source | Agent’s local microphone | Customer audio received over the network
Audio Quality | High-fidelity, uncompressed | Compressed, degraded audio (e.g., VoIP, PSTN)
Typical Sample Rate | 32 kHz | 8 kHz or 16 kHz
Use Cases | Improving how the customer hears the agent | Improving how the agent or AI hears the customer
Speaker Scenarios | Typically single-speaker | Single or multi-speaker (e.g., speakerphones, conference rooms)

With these fundamental differences between inbound and outbound noise cancellation in mind, we evaluated how Krisp and Sanas approach the inbound side of the problem.

 

Operational Differences: Krisp vs. Sanas

While both Krisp and Sanas aim to improve customer voice clarity for agents, their architectural choices, product maturity, and performance under real-world conditions vary significantly.

The table below summarizes the key differences between Krisp’s and Sanas’s inbound noise cancellation solutions based on our analysis.

Aspect | Krisp | Sanas
Model Design | Use-case-optimized models tailored for different inbound noise cancellation situations | Single, multi-purpose model for both inbound and outbound NC
Audio Quality | Up to 16 kHz | Up to 8 kHz
Use Case Coverage | krisp-viva-v6-lite – general-purpose AI model with integrated world-class Voice Isolation for WebRTC, mobile, and telephony (up to 16 kHz), resilient to codec artifacts (e.g., G.711); krisp-nc-i-v8-pro – multi-speaker model optimized for 16 kHz far-field use cases like conference rooms | Single, omnichannel AI model used across all conditions
Production Maturity | Mature, production-grade models used across enterprise, SDK, and desktop | Inbound noise cancellation announced in May 2025; production readiness unverified
Deployment through SDK | Available | Unknown

Note: All observations regarding Sanas’s inbound noise cancellation performance are based on publicly available Sanas version 3.2.72, conducted in July 2025.

Krisp vs. Sanas: In-depth inbound noise cancellation evaluation

In this section, we present a comparative summary of evaluations conducted on Krisp and Sanas inbound noise cancellation technologies. These evaluations reflect real-world usage scenarios and benchmark data commonly produced by enterprise customers and BPOs assessing solution fit and performance.

We cover comparison methodology, present objective evaluation results, crowdsourced subjective evaluation results, and share comparative audio samples.

Evaluation Methodology and Metrics

For quantitative evaluation, we used the POLQA (Perceptual Objective Listening Quality Analysis) metric — an industry-standard objective metric for predicting perceived listening quality. POLQA is suitable for evaluating narrowband and wideband speech affected by noise, compression artifacts, and signal degradation.

We also processed the outputs using Meta’s AudioBox Aesthetics model, a reference-free ML-based model that quantitatively assesses the listening experience and quality of audio. While not a direct replacement for human perception, it adds a complementary viewpoint to our analysis.

Objective Metric | Interpretation | Highly Correlated Subjective Metric | What It Captures
POLQA | Higher is better | Speech Intelligibility & MOS | Fidelity and clarity under real-world network conditions; penalizes distortion and noise artifacts in the speech
Production Quality | Higher is better | Speech Clarity | Fidelity, presence of audio artifacts, balance, and clarity of the output signal
Content Enjoyment | Higher is better | Natural Speech | Perceived naturalness, fluidity, and enjoyment of listening — akin to human listening satisfaction

In addition to the objective metrics, a subjective crowdsourced evaluation was conducted, where participants were asked to compare anonymized paired audio samples (e.g., Sanas vs. Krisp) and asked, “Which audio sounds more pleasant and clear?”.

Evaluated Models

To ensure a fair comparison, we focused our primary benchmark on single-speaker inbound noise cancellation scenarios, since Sanas’s model appears to perform some level of secondary background speech suppression — suggesting a form of voice isolation. As such, we compared it directly with Krisp’s Background Voice Cancellation (BVC) enabled inbound model, which is also optimized for single-speaker voice isolation.

However, to offer a more comprehensive view of Krisp’s capabilities, we also included Krisp’s multi-speaker inbound model in the evaluation. This demonstrates how Krisp performs in far-field environments such as speakerphones and group calls, where multiple speakers talk at a distance from the microphone.

Model | Sampling Rate | Speaker Scenario | Voice Isolation | Near Field / Far Field
krisp-viva-v6-lite | up to 16 kHz | Single speaker | Yes | Near field
krisp-nc-i-v8-pro | up to 16 kHz | Multi speaker | No | Far field
sanas-inbound | up to 8 kHz | Single speaker* | Limited | Both

* This assumption is based on the model’s performance in cases with background speech.

Evaluation Dataset

We created a controlled test dataset by mixing English utterances from the ITU-T P.501 dataset with 24 different real-world background noises at 0dB, 5dB, and 10dB SNR levels. To simulate realistic telephony transmission conditions, we applied common voice codecs — G.729, G.711, and OPUS — before feeding the degraded audio into each model.

Note: Krisp natively produces higher-quality audio at 16kHz sampling rate. For head-to-head comparison, though, we standardized the evaluation pipeline by downsampling Krisp’s output to 8 kHz, matching Sanas’s maximum supported sample rate. This ensured a fair reference test dataset and alignment for POLQA and other narrowband evaluations.
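As an aside, mixing speech and noise at a target SNR is a standard operation; below is a minimal NumPy sketch of how such a test mixture can be constructed (illustrative only, not the exact pipeline used here):

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then mix. Both inputs are float arrays of equal length."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power for the requested SNR (in dB).
    target = p_speech / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target / p_noise)
```

A codec pass (e.g., G.711, G.729, or OPUS encode/decode) would then be applied to the mixture before feeding the degraded audio into each model.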

Evaluation Results

The following table summarizes subjective and objective evaluation of Krisp vs. Sanas across key metrics.

Here, the original audio was mixed with various noise types and processed using the Krisp and Sanas models. For a fair comparison, the Krisp model’s output was downsampled to 8 kHz to enable direct comparison with Sanas.

Metric | Type | Krisp | Sanas | Winner
POLQA: Home noise | Objective | ✅ 3.7/5 | ❌ 3.1/5 | Krisp
POLQA: Street noise | Objective | ✅ 3.7/5 | ❌ 3.1/5 | Krisp
POLQA: Cafe noise | Objective | ✅ 3.8/5 | ❌ 2.9/5 | Krisp
POLQA: Distractor noise | Objective | ✅ 3.8/5 | ❌ 3.3/5 | Krisp
Meta Audiobox: Content Enjoyment | Objective | ✅ 4.7/10 | ❌ 3.8/10 | Krisp
Meta Audiobox: Production Quality | Objective | ✅ 5.2/10 | ❌ 3.9/10 | Krisp
Which audio sounds more pleasant and clear? (votes / total responses) | Subjective | ✅ 704/960 | ❌ 256/960 | Krisp

The following sections provide a deeper comparison of Krisp’s inbound models, evaluated at both 8 kHz and 16 kHz output resolutions, highlighting how sampling rate and model specialization impact voice quality, noise suppression, and the listener’s experience.

Objective Evaluation – POLQA

Key Takeaways

  1. Krisp krisp-viva-v6-lite model consistently outperforms all other models, delivering the highest POLQA score across all four noise environments.
    • It provides an average improvement of +0.59 POLQA points over Sanas, and +0.38 over Krisp’s model with multi-speaker support.
  2. Sanas’s inbound model shows gains over the original noisy audio (avg. +1.48 points), but lags behind Krisp in every scenario:
    • In café noise, krisp-viva-v6-lite is ahead by a very significant +0.87 POLQA points.
    • In distractor noise, where competing speech overlaps with the target voice, the krisp-viva-v6-lite model outperforms Sanas by a significant +0.47 points — highlighting the effectiveness of Krisp’s dedicated voice isolation design.
  3. Krisp krisp-nc-i-v8-pro model performs on par with krisp-viva-v6-lite in ambient noise conditions (home, street, café), with <0.2 difference — but drops sharply in distractor noise (scoring 2.53 vs. 3.82 for krisp-viva-v6-lite), confirming it’s not tuned for background voice suppression.

 

Objective Evaluation – Meta Audiobox Aesthetics

In this evaluation, we compared Krisp’s best-performing inbound model, krisp-viva-v6-lite, at both 8 kHz and 16 kHz output levels, against Sanas’ inbound model, which supports only 8 kHz output. To ensure a fair comparison, we downsampled Krisp’s output to 8 kHz when required.

ℹ Note: These objective metrics are measured on a 1-10 scale. Even studio-quality recordings with rich prosody and zero background noise typically score just under 9 in our experiments. As such, a delta of 0.3–0.5 points between models represents a meaningful difference in perceived speech quality.

Key Takeaways

  1. Krisp krisp-viva-v6-lite model leads in both subjective quality metrics.
    • Krisp at 16 kHz significantly outpaces all other variants — especially Sanas, which trails at 3.79 and 3.94, respectively. This represents a margin of +1.5 to +2.7 points, a substantial gap.
    • Krisp even at 8 kHz, retains its edge. When downsampled to 8 kHz to match Sanas’ max output rate, Krisp still delivers +0.87 higher Content Enjoyment and +1.3 higher Production Quality.
  2. Sanas struggles with perceived listening quality
    • Sanas’s lower scores indicate noticeably reduced speech fidelity and listener enjoyment.
    • Sanas’s output actually scores lower than the original audio

💡Interestingly, Sanas’s inbound noise cancellation output scores lower than the original noisy audio in both metrics — particularly in Production Quality (3.94 vs. 5.04). This can be explained by the fact that while the model removes the background noise, it introduces audible artifacts or residual noise that degrade the overall listening experience. These issues are clearly perceptible in the sample audios, even on low-quality built-in speakers, and especially with the USB headsets agents typically use.

Subjective Evaluation – Crowdsourced A/B testing

We processed 24 noisy audio samples using both krisp-viva-v6-lite and sanas-inbound, then submitted them for evaluation.

  • Each audio pair was compared 40 times, resulting in a total of 960 votes.
  • Listeners were asked: “Which audio sounds more pleasant and clear?”
  • For a fair comparison, a downsampled 8 kHz version of the krisp-viva-v6-lite model’s outputs was used.
  • To further eliminate bias, all branding information was removed from file names and other metadata.

Here are the results:

krisp-viva-v6-lite – 704 votes

sanas-inbound – 256 votes

 

Comparative Audios

🎧 Pro Tip: For the best listening experience, we recommend using USB or wired headphones to clearly pick up subtle audio artifacts.

 

Cafe noise

Original

krisp-viva-v6-lite at 8khz

krisp-viva-v6-lite at 16khz

Sanas-inbound at 8khz

Street noise

Original

krisp-viva-v6-lite at 8khz

krisp-viva-v6-lite at 16khz

Sanas-inbound at 8khz

Distractor noise

Original


krisp-viva-v6-lite at 8khz

krisp-viva-v6-lite at 16khz

Sanas-inbound at 8khz

Home noise

Original

krisp-viva-v6-lite at 8khz

krisp-viva-v6-lite at 16khz

Sanas-inbound at 8khz

Conclusion

Across both objective metrics (like POLQA and Meta Audiobox Aesthetics) and crowdsourced subjective A/B testing, Krisp consistently delivered better speech clarity, fewer audio artifacts, and a more natural listening experience. In fact, Krisp’s model outperformed Sanas in every evaluated scenario, including those with challenging noise types like background speech and telephony degradation.

If you need reliability, voice quality, and real-world performance that scales across your teams and customers — Krisp is the clear and proven choice.


Krisp Launches All-in-One Voice Productivity Platform for Contact Centers https://krisp.ai/blog/all-in-one-voice-productivity-platform/ https://krisp.ai/blog/all-in-one-voice-productivity-platform/#respond Mon, 09 Jun 2025 12:49:37 +0000 https://krisp.ai/blog/?p=21711 BERKELEY, CA, June 9, 2025— Krisp, the leader in AI-powered voice technology, announced today the launch of its real-time Voice AI Platform for call centers. Unveiled at CCW Vegas, the new platform offering marks a turning point in how contact centers equip their agents with AI by providing access to AI Noise Cancellation, AI Accent […]

The post Krisp Launches All-in-One Voice Productivity Platform for Contact Centers appeared first on Krisp.

BERKELEY, CA, June 9, 2025— Krisp, the leader in AI-powered voice technology, announced today the launch of its real-time Voice AI Platform for call centers. Unveiled at CCW Vegas, the new platform offering marks a turning point in how contact centers equip their agents with AI by providing access to AI Noise Cancellation, AI Accent Conversion, AI Live Interpreter for speech-to-speech translation, and AI Agent Assist tools in one seamless solution. 

 

Today’s contact center teams are under increasing pressure to deliver high-quality interactions while balancing complex customer expectations, rapid automation, and shrinking margins. With access to a single streamlined platform, Krisp empowers agents with the tools to work faster, speak clearly, and support customers worldwide without losing the human touch or compromising on security, latency, or voice quality. The platform also provides access to Krisp’s best-in-class noise cancellation engine and voice isolation, backed by enterprise-grade security, privacy, and scalability.

 

“We’re not launching a product, we’re launching a new standard,” said Davit Baghdasaryan, CEO and Co-Founder at Krisp. “This platform goes beyond another AI tool used in contact centers. Krisp is transforming human agent performance in contact centers by empowering agents with the technology to overcome key barriers to high-quality service. With the best real-time voice AI tech in the world, support for global teams, and a price point no one can match, this is how contact centers move forward.”


Platform users will also gain access to Krisp’s latest AI Accent Conversion updates, which now support five Latin American English accents that represent approximately 85% of Spanish speakers across the major dialect groups in Latin America, including Mexican and Central American, Caribbean Spanish, and Andean Spanish/Neutral Standard Spanish. 


The new Voice AI Platform provides access to the following features with recent updates:


  • AI Accent Conversion v3.5: Support for three accent packs, including Latin American English (new), Indian English, and Filipino English, delivering industry-leading phoneme precision, voice clarity, and natural-sounding speech.
  • AI Live Interpreter: Bidirectional speech-to-speech translation for 80+ languages, with real-time bilingual transcription shown to the agent for more productive conversation handling.
  • AI Agent Assist: Support across the full call lifecycle, including AI-powered Knowledge Chat that answers based on call context and a centralized knowledge base for agents, plus after-call summaries with follow-up actions, call statistics, and performance feedback.
  • AI Noise Cancellation: Bidirectional background voice and noise cancellation for unmatched call clarity on both sides of the call.


To learn more, visit https://krisp.ai/accent-conversion/


About Krisp

Founded in 2017, Krisp pioneered the world’s first AI-powered Voice Productivity software. Krisp’s Voice AI technology enhances digital voice communication through audio cleansing, noise cancellation, accent conversion, live speech-to-speech translation, and agent assist. Offering full privacy, Krisp works on-device, across all audio hardware configurations and applications that support digital voice communication. Today, Krisp is deployed on over 200 million devices, has transcribed over 80 million calls, and processes over 80 billion minutes of voice conversations every month, helping businesses harness the power of voice to unlock higher productivity and deliver better business outcomes. 


Media Contact

Molly Leahy

krispPR@walkersands.com



]]>
Krisp Unveils Industry-First AI Accent Conversion for LATAM https://krisp.ai/blog/krisp-unveils-industry-first-ai-accent-conversion-for-latam/ https://krisp.ai/blog/krisp-unveils-industry-first-ai-accent-conversion-for-latam/#respond Thu, 05 Jun 2025 10:03:00 +0000 https://krisp.ai/blog/?p=21693 BERKELEY, CA, June 5, 2025 — Krisp, the leader in AI-powered voice technology, announced today the launch of AI Accent Conversion for Latin America, the first AI-powered voice provider to offer accent conversion services to this region. Krisp supports five Latin American English accents that represent approximately 85% of Spanish speakers across the major dialect […]

The post Krisp Unveils Industry-First AI Accent Conversion for LATAM appeared first on Krisp.

]]>
BERKELEY, CA, June 5, 2025 — Krisp, the leader in AI-powered voice technology, announced today the launch of AI Accent Conversion for Latin America, the first AI-powered voice provider to offer accent conversion services to this region. Krisp supports five Latin American English accents that represent approximately 85% of Spanish speakers across the major dialect groups in Latin America, including Mexican and Central American, Caribbean Spanish and Andean Spanish/Neutral Standard Spanish. 


AI Accent Conversion helps customers understand agents better by dynamically converting an agent’s accent, in real time, into an accent the customer natively understands. By allowing agents to refine Latin American accents live on a call, Krisp helps agents and customers understand each other while keeping voices natural and authentic. As contact centers continue to adopt AI-powered tools to improve productivity, this update enables call center agents and customers to have clearer conversations without altering their identity.


“Krisp is proud to be the first to bring AI Accent Conversion to Latin America, an incredibly diverse region where authentic communication is key,” said Davit Baghdasaryan, CEO and co-founder of Krisp. “By supporting the most widely spoken Latin American English accents, we’re not just improving call clarity but helping to bridge cultural and linguistic gaps in real time. At Krisp, our goal has always been to enable human conversations that are clear, with nuance and without background noise.”


Other benefits of AI Accent Conversion include: 

  • Boosted agent productivity: Lets agents focus on the content of the conversation and solving the customer’s inquiry rather than mitigating the accent barrier, improving productivity.
  • Enhanced fairness in recruitment: Krisp removes the native English accent requirement, broadening the talent pool and promoting diversity in hiring.
  • Reduced bias: Mitigates customer bias against call center agents’ accents, boosting confidence and fostering a supportive work environment.
  • Improved agent interaction quality: Eliminates the need for accent faking, enhancing agent satisfaction and well-being.


With the Latin American update, Krisp’s AI Accent Conversion tool now supports Latin American, Indian, and Filipino English accents, with new dialects coming this year, including South African and other non-US English accents. The newest update also delivers improved naturalness and voice preservation for the Indian and Filipino accents, reduced accent leakage, and greater phoneme precision. No other provider delivers accent localization in LATAM with real-time accent conversion, significantly improving agent and customer satisfaction by enabling clearer conversations. 


To learn more, visit https://krisp.ai/

About Krisp

Founded in 2017, Krisp pioneered the world’s first AI-powered Voice Productivity software. Krisp’s Voice AI technology enhances digital voice communication through audio cleansing, noise cancellation, accent conversion, live speech-to-speech translation, and agent assist. Offering full privacy, Krisp works on-device, across all audio hardware configurations and applications that support digital voice communication. Today, Krisp is deployed on over 200 million devices, has transcribed over 40 million calls and processes over 80 billion minutes of voice conversations every month, helping businesses harness the power of voice to unlock higher productivity and deliver better business outcomes. 

Learn more at www.krisp.ai 


Media Contact

Molly Leahy

krispPR@walkersands.com


]]>