Engineering Blog - Krisp

Audio-Only Turn-Taking Model v2

Krisp Engineering Team — Mon, 27 Oct 2025 13:19:26 +0000

Introducing Krisp’s Turn-Taking v2

We’ve already discussed the challenges of turn-taking in conversational AI in this blog post.
Now, we’re excited to announce our newest Turn-Taking model, available as part of Krisp’s VIVA SDK.

In this article, we’ll walk through the technology behind the new model and share our latest testing results. The new generation of models is more streamlined than ever—making it simple to integrate Voice Isolation, Turn-Taking, and VAD into your Voice AI pipelines.

If you’d like to see how Krisp’s VIVA SDK can enhance your Voice AI agent experience, apply now from our Developers page.

How the New Model Works

Our latest model predicts End-of-Turns using only audio input—perfect for real-time conversational systems like human-bot interactions.

Compared to v1, krisp-viva-tt-v2 represents a major step forward. It was trained on a more diverse and better-structured dataset, with richer data augmentations that help the model perform more reliably in real-world conditions.

Key Improvements in v2

Greater robustness in noisy environments
Higher accuracy when paired with Krisp’s Voice Isolation models
Faster and more stable turn detection in live conversations

Testing Results

Testing on Clean Audio

We evaluated both model versions on ~1800 audio samples from real conversations, including ~1000 “hold” cases and ~800 “shift” cases, with mild background noise.

Although the numerical difference between versions is small on this clean dataset, the results show that v2 achieves faster mean shift prediction time at the same false positive rate.

Model	Balanced Accuracy	AUC	F1 Score
krisp-viva-tt-v1	0.82	0.89	0.804
krisp-viva-tt-v2	0.823	0.904	0.813

Insight: Even in clean audio conditions, krisp-viva-tt-v2 offers slightly better prediction stability and overall performance.

Testing on Noisy Audio

Next, we evaluated the models on noisy audio mixes at 5 dB, 10 dB, and 15 dB noise levels. Two scenarios were tested:

Directly on the noisy dataset
On the same dataset after processing through the Krisp VIVA Voice Isolation model

In both scenarios, krisp-viva-tt-v2 consistently outperformed v1.

Model	Balanced Accuracy	AUC	F1 Score
krisp-viva-tt-v1	0.723	0.799	0.71
krisp-viva-tt-v2	0.768	0.842	0.757

Insight: krisp-viva-tt-v2 delivers up to a 6% improvement in F1 score under noisy conditions, demonstrating greater resilience in real-world environments.

Testing After Noise and Voice Removal

Finally, we tested both models on the same noisy dataset after applying background noise and voice removal with the krisp-viva-tel-v2 model.

Model	Balanced Accuracy	AUC	F1 Score
krisp-viva-tt-v1	0.787	0.854	0.775
krisp-viva-tt-v2	0.816	0.885	0.808

Insight: When combined with Krisp’s Voice Isolation technology, v2 achieves even greater accuracy and stability.

Conclusion

The new krisp-viva-tt-v2 model marks a significant leap forward in real-time conversation handling for Voice AI. With improved robustness against noise and smoother integration with Krisp’s other models, developers can now build faster, smarter, and more natural-sounding conversational agents.

Explore the VIVA SDK today and see how Krisp’s advanced models can elevate your Voice AI experience.

The post Audio-Only Turn-Taking Model v2 appeared first on Krisp.

Introducing Krisp Accent Conversion v3.7

Krisp Engineering Team — Thu, 07 Aug 2025 07:55:59 +0000

Krisp Accent Conversion v3, released in March 2025, marked a breakthrough moment in the evolution of our accent conversion technology. For the first time in two years, we felt the system was mature enough for wide-scale production use.

In May 2025, we released Accent Conversion v3.5, bringing a major quality upgrade, with ~20% improvement across key metrics for both Filipino and Indian accents (details here). Thanks to the Krisp desktop application’s auto-update mechanism, the rollout reached 95% of users within 2 days, and the feedback from both agents and customers was overwhelmingly positive, driving sentiment and business KPIs.

In July 2025, we expanded the offering to support the Latin American accent pack. The launch quickly gained traction with several large customers and is now deployed across thousands of agents.

Throughout this period, we’ve worked closely with partners, agents, and customers to deeply understand corner cases, especially for the Indian accent, which is the most challenging due to its vast regional variation and phonetic complexity. This close collaboration, combined with relentless efforts from the world-class research and engineering teams at Krisp, has now culminated in another major step forward.

Today, we’re launching Accent Conversion v3.7, delivering significant improvements in naturalness and voice stability. This release is currently focused on the Indian accent pack, with support for other accents rolling out soon.

The following sections summarize the key improvements, benchmarking methodology, and a side-by-side comparison of Accent Conversion v3.7 with v3.5.

Key Improvements in AC v3.7

Naturalness: The converted speech sounds even more human-like and natural, with much improved filler-sound handling. Here, expert-rated naturalness scores improved by +14%. Crowdsourced evaluations confirm it with a +6% gain.
Voice Stability: Enhanced consistency in pitch and tone throughout the utterance, helping avoid unnatural fluctuations, especially for thick accents. This contributed to improved naturalness and clarity scores across all metrics.
Speech & Audio Clarity: Improvements were noted in both intelligibility and the reduction of artifacts and distortions. Speech Clarity scores rose by 5% in expert assessments, with corresponding enhancements across Meta metrics.
Pronunciation Accuracy: There’s a gain in objective metrics as well, about a 4% relative improvement in Phoneme Error Rate (PER), which can be attributed to more conversational data inclusion in the training. Here, some noticeable accent-specific enhancements in phoneme pronunciation, such as more native-like articulation of “R” and “L”, contribute to a +5% increase in the Accent Conversion score.

Evaluation Results

For subjective and objective evaluations, 78 real-world recordings were sampled.

For the crowdsourced evaluation, each recording received exactly 30 independent votes to ensure statistical confidence, for a total of 2,340 votes.

The results shown in the table below represent aggregated averages across all recordings.

Metric	IN AC v3.5	IN AC v3.7	Comment
Expert Evaluation – Natural speech (1 to 5)	3.7	4.2 (+14%)	Speech sounds even more human-like, with much improved filler-sound handling
Expert Evaluation – Speech Clarity (1 to 5)	4.0	4.2 (+5%)	Speech is clearer and has fewer artifacts, especially in slurred and mumbled segments
Expert Evaluation – Accent Conversion (1 to 5)	4.3	4.5 (+5%)	Accent-specific enhancements in phoneme pronunciation, such as more native-like articulation of “R” and “L”
Crowdsourced Evaluation – “How natural does the voice sound?” (1 to 5)	3.4	3.6 (+6%)	78 real-world audio recordings assessed by 30 participants
Crowdsourced Models’ Comparison – Which option sounds more natural?	1242	1878 (+20 percentage points of vote share)	78 real-world audio recording pairs were evaluated, with each pair assessed by 40 participants
Meta Aesthetic – Natural speech (1 to 10)	5.6	5.8 (+4%)
Meta Aesthetic – Speech Clarity (1 to 10)	7.5	7.6 (+1%)
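
The percentages in parentheses for the score rows appear to be relative changes against the v3.5 score. The sketch below recomputes them from the table values; the function name is illustrative, and rounding to whole percent is an assumption about how the table was prepared.

```python
def relative_change_percent(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


# (v3.5 score, v3.7 score) pairs copied from the score rows of the table.
score_rows = {
    "Expert: Natural speech": (3.7, 4.2),
    "Expert: Speech Clarity": (4.0, 4.2),
    "Expert: Accent Conversion": (4.3, 4.5),
    "Crowdsourced: naturalness": (3.4, 3.6),
    "Meta Aesthetic: Natural speech": (5.6, 5.8),
    "Meta Aesthetic: Speech Clarity": (7.5, 7.6),
}

for metric, (v35, v37) in score_rows.items():
    print(f"{metric}: {relative_change_percent(v35, v37):+.0f}%")
```

The vote-count row is left out on purpose, since it compares counts rather than scores on a rating scale.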

Comparative audio samples

Listening Tip: For the most accurate and immersive comparison between v3.5 and v3.7 Accent Conversion, we recommend using quality headphones.

This helps highlight the improvements in clarity, naturalness, and speaker identity preservation that may be less perceptible on laptop or mobile speakers.

#	Improvement Category	Original	Converted AC v3.5	Converted AC v3.7
1	Speech Naturalness
2	Speech Naturalness
3	Speech Naturalness, Speech Clarity
4	Speech Clarity
5	Speech Clarity, Speech Naturalness, Voice Stability
6	Speech Clarity, Speech Naturalness, Voice Stability
7	Speech Naturalness, Speech Clarity
8	Speech Naturalness, Speech Clarity

Appendix

Subjective Evaluation

Our evaluation was conducted across two structured tracks: expert panel ratings and crowdsourced listener preferences, designed to capture both technical precision and human perception.

Real-world agent calls were sampled to represent a diverse set of speakers and input conditions, including, but not limited to:

Accent level – high, medium, low
Speech rates and fluency
Background conditions (quiet, noisy, multi-speaker environments)

Evaluators scored each recording across four qualitative dimensions using a 5-point Likert scale:

Score	Meaning
5	Excellent / Native-like
4	Very Good
3	Acceptable
2	Needs Improvement
1	Poor / Unintelligible

1. Expert Panel Evaluation

Six expert evaluators independently rated matching audio pairs — each pair consisting of the same original voice converted by AC v3.5 and AC v3.7.

To eliminate bias:

File names were anonymized (no version markers)
The order of samples was randomized
Scoring was blind and individual (no group discussion)

2. Crowdsourced Evaluation

To further simulate real-world user perception, a blind A/B test was run with pairs of recordings: AC v3.5 vs. AC v3.7.
78 real-world audio recording pairs were evaluated, with each pair assessed by 40 participants, resulting in 3,120 votes overall.

Participants were asked the following question:
“Which option sounds more natural (i.e., more human-like)?”

Results:

Version 3.5 was selected 1242 times
Version 3.7 was selected 1878 times
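
As a quick check of these counts, the following sketch recomputes the vote total and each version's share of the votes. The function name is illustrative; the numbers are the ones reported above.

```python
from typing import Tuple


def preference_shares(votes_a: int, votes_b: int) -> Tuple[float, float]:
    """Share of all votes won by each option in a two-way A/B test."""
    total = votes_a + votes_b
    return votes_a / total, votes_b / total


recording_pairs = 78
participants_per_pair = 40
votes_v35, votes_v37 = 1242, 1878

# The two counts should add up to pairs x participants.
assert votes_v35 + votes_v37 == recording_pairs * participants_per_pair

share_v35, share_v37 = preference_shares(votes_v35, votes_v37)
print(f"v3.5: {share_v35:.1%}")                          # 39.8%
print(f"v3.7: {share_v37:.1%}")                          # 60.2%
print(f"gap:  {(share_v37 - share_v35) * 100:.1f} points")  # 20.4 points
```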

Evaluation metrics

Accent Conversion performance was measured across four key dimensions. These were selected based on real-world call center priorities such as clarity, naturalness, and robustness.

Metric	Description
Accent Conversion	How effectively the speaker’s original accent is transformed into a neutral or target accent. High scores mean minimal accent leakage or trace of the original pronunciation.
Speech Clarity	Evaluates articulation, intelligibility, and absence of audio distortions, such as mumbling, muffling, or low vocal energy.
Natural Speech	Measures how closely the output resembles fluid, human-like speech, including natural variations in pitch, tone, rhythm, and intonation.
Pronunciation Accuracy	Measures how closely the converted speech matches standard American English pronunciation at the phoneme level. It evaluates whether individual sounds (vowels, consonants, syllables) are produced correctly and consistently, without distortion, misplacement, or omission, ensuring that the converted voice sounds intelligible and native-like to a U.S. listener.

Objective Evaluation

For objective evaluation, the same set of recordings was processed with Meta Audiobox Aesthetics, which produces metrics strongly correlated with Natural Speech and Speech Clarity. Additionally, to quantify how each system impacts phoneme accuracy, all recordings were processed with the Facebook NN Phonemizer; the resulting Phoneme Error Rate is strongly correlated with the Accent Conversion metric.

Objective Metric	Interpretation	Highly Correlated to Subjective Metric	What It Captures
Production Quality*	Higher is better	Speech Clarity	Fidelity, presence of audio artifacts, balance, and clarity of the output signal
Content Enjoyment*	Higher is better	Natural Speech	Perceived naturalness, fluidity, and enjoyment of listening — akin to human listening satisfaction
Phoneme Error Rate (PER)	Lower is better	Accent Conversion	Measures pronunciation distortion. Lower scores mean more accurate, intelligible speech with better articulation.

* These metrics are derived from waveform-level analysis and do not require transcript or linguistic alignment, making them ideal for evaluating accent conversion outputs that vary in delivery and prosody.
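
For readers unfamiliar with Phoneme Error Rate, the sketch below uses the common definition: the edit distance (substitutions, deletions, insertions) between a reference and a predicted phoneme sequence, divided by the reference length. The article does not describe its exact computation, so the function and the example phoneme sequences are assumptions for illustration only.

```python
from typing import Sequence


def phoneme_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Levenshtein distance between phoneme sequences, divided by reference length."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j phonemes of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ph in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_ph == hyp_ph else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# Made-up example: an "R" realized as "L" is one substitution out of three phonemes.
print(phoneme_error_rate(["R", "AY", "T"], ["L", "AY", "T"]))
```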


Audio-only, 6M weights Turn-Taking model for Voice AI Agents

Krisp Engineering Team — Mon, 04 Aug 2025 23:20:04 +0000

In this article, we discuss an outstanding problem in today’s Voice AI Agents – turn-taking. We examine why it is a hard problem and present a solution in Krisp’s VIVA SDK. We also benchmark the Krisp solution against some of the established solutions in the market.

Note: The Turn-Taking model is included in the VIVA SDK offering at no additional charge.

What is turn-taking?

Turn-taking is the fundamental mechanism by which participants in a conversation coordinate who speaks when. While it seems effortless between humans, modeling this process computationally for conversations between a human and an AI agent is highly complex. In the context of Voice AI Agents (including voice assistants, customer support bots, and AI meeting agents), turn-taking decides when the agent should speak, listen, or remain silent.

Without effective turn-taking, even the most advanced dialogue systems can come across as unnatural, unresponsive, and frustrating to use. A precise and lightweight turn-taking model enables natural, seamless conversations by minimizing interruptions and awkward pauses while adapting in real time to human cues such as hesitations, prosody, and pauses.

In general, turn-taking includes the following tasks:

End-of-turn prediction – predicting when the current speaker is likely to finish their turn
Backchannel prediction – detecting moments where a listener may provide short verbal acknowledgments like “uh-huh”, “yeah”, etc. to show engagement, without intending to take over the speaking turn.

In this article, we present our first audio-based turn-taking model, which focuses on the end-of-turn prediction task using only audio input. We chose to release the audio-based turn-taking model first, as it enables faster response times and a lightweight solution compared to text-based models, which usually require large architectures and depend on the availability of a streamable ASR providing real-time, accurate transcriptions.

Approaches to Turn-Taking

Solutions to the turn-taking problem are usually implemented as AI models that use audio and/or text representations.

1. Audio-based

Audio-based approaches rely on analyzing acoustic and prosodic features of speech. These features include changes in pitch, energy levels, intonation, pauses, and speaking rate. By detecting silence or overlapping speech, the system predicts when the user has finished speaking and when it is safe to respond. For example, a sudden drop in energy followed by a pause can be interpreted as a turn-ending cue. Such models are effective in real-time, low-latency scenarios where immediate response timing is critical.

2. Text-based

Text-based solutions analyze the transcribed content of speech rather than the raw audio. These models detect linguistic cues that indicate turn completion, such as sentence boundaries, punctuation, discourse markers (e.g., “so,” “anyway”), natural language patterns or semantics (e.g., user might directly ask the bot not to speak). Text-based systems are often integrated with dialogue state tracking and natural language processing (NLP) modules, making them effective for scenarios where accurate semantic interpretation of user intent is essential. However, they may require larger neural network architectures to effectively analyze the linguistic content.

3. Audio-Text Multimodal (Fusion)

Multimodal solutions combine both acoustic and textual inputs, leveraging the strengths of each. While audio-based methods capture real-time prosodic cues, text-based analysis provides deeper semantic understanding. By integrating both modalities, fusion models can make accurate and context-aware predictions of turn boundaries. These systems are effective in complex, multi-turn conversations where relying on either audio or text alone might lead to errors in timing or intent detection.

Challenges of turn-taking

Hesitation and filler words

In natural dialogue, speakers often take a pause using fillers like “um” or “you know” without intending to give up their turn. For instance:

“I think we should, um, maybe –” [The agent jumps in, assuming the sentence is over]

Here, a turn-taking system must distinguish hesitation from completion, or risk interrupting too early.

Natural pauses vs. true end-of-turns

Pauses are not always indicators that a speaker has finished. For example:

“Yesterday I woke up early, then… [pause] I went to work…”

A model might misinterpret the pause as a turn boundary, generating a premature response and breaking the conversational flow.

Quick turn prediction

Minimizing response latency is essential for maintaining natural conversational flow. Humans tend to respond quickly, sometimes even reactively, when the end of the speech is obvious. If a model fails to predict the turn boundary fast enough, the system may sound sluggish or unnatural. The challenge is to trigger responses at just the right moment – early enough to sound fluid, but not so early that it risks interrupting the speaker.

Varying speaking styles and accents

People speak in diverse rhythms, intonations, and speeds. A fast speaker with sharp pitch drops might appear to end a sentence even when they haven’t. Conversely, a slow, melodic speaker may stretch syllables in ways that confuse timing-based systems. Modeling these variations effectively requires a neural network–based approach.

Krisp’s audio-based Turn-Taking model

Krisp recently released AI models for effective noise cancellation and voice isolation for Voice AI Agent use cases, particularly reducing premature turn-taking caused by background noise. See more details. This technology is widely deployed and has recently passed the milestone of 1B minutes per month.

It was only natural for us to take on the larger problem of turn-taking (TT). In this first iteration, we designed a lightweight, low-latency, audio-based turn-taking model optimized to run efficiently on a CPU. The Krisp TT model is built into Krisp’s VIVA SDK; using the Python SDK, you can easily chain it with the Voice Isolation models and place it in front of a voice agent to create a complete, end‑to‑end conversational flow, as shown in the following diagram.

Here, the TT model continuously outputs a confidence score (probability) ranging from 0 to 1, indicating the likelihood of a shift – a point where a speaker is expected to finish their turn. It operates on 100ms audio frames, assigning a shift confidence score to each frame. To convert this score into a binary decision, we apply a configurable threshold (Δ). If the score exceeds this threshold, we interpret it as a shift (end-of-turn) prediction; otherwise, the model considers that the current speaker is still holding the turn.

We also define a maximum hold duration, which defaults to 5 seconds. The model is designed such that, during uninterrupted silence, the confidence score gradually increases and reaches a value of 1 precisely at the end of this maximum hold period.
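
The thresholding step can be sketched as follows. This is not the VIVA SDK API: the function names are hypothetical, and the linear ramp is a made-up stand-in for the model's scores during uninterrupted silence (the article only states that the score rises gradually and reaches 1 at the maximum hold duration).

```python
from typing import List, Optional, Sequence

FRAME_MS = 100      # frame length stated in the article
MAX_HOLD_MS = 5000  # default maximum hold duration stated in the article


def first_shift_ms(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Time in ms (to the end of the frame) of the first score above the threshold."""
    for index, score in enumerate(scores):
        if score > threshold:
            return (index + 1) * FRAME_MS
    return None  # every frame stayed at or below the threshold: a hold


def linear_silence_ramp(max_hold_ms: int = MAX_HOLD_MS) -> List[float]:
    """Hypothetical scores for pure silence: a straight line ending at 1.0."""
    n_frames = max_hold_ms // FRAME_MS
    return [(i + 1) / n_frames for i in range(n_frames)]


ramp = linear_silence_ramp()
print(first_shift_ms(ramp, threshold=0.5))   # fires partway through the silence
print(first_shift_ms(ramp, threshold=0.99))  # fires only at the maximum hold limit
```

Because the score reaches 1 at the end of the maximum hold period, any threshold below 1 eventually produces a shift during uninterrupted silence, which bounds how long the agent can stay quiet.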

Comparison with other Turn-Taking models

Let’s take a closer look at how other solutions handle the turn-taking problem in comparison to Krisp.

Simple VAD (Voice Activity Detection)

The basic VAD-based approach is as straightforward as it gets – if you have taken a pause in your speech, you have probably finished your turn. Technically, once a few seconds of silence (usually configurable) is detected, the system assumes the speaker has finished and hands over the turn. While efficient, this method lacks awareness of conversational context and often struggles with natural pauses or hesitant speech. In our comparisons, we use the Silero-VAD model with a 1-second silence detection window as a simple VAD-based turn-taking approach.
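
A minimal sketch of this silence-window rule, assuming the VAD has already produced one speech/no-speech flag per frame. It does not call Silero-VAD; the 100 ms frame length and the function name are assumptions for illustration.

```python
from typing import Optional, Sequence


def vad_end_of_turn(
    speech_flags: Sequence[bool],
    frame_ms: int = 100,     # assumed frame length of the VAD output
    silence_ms: int = 1000,  # 1-second silence window used in the comparison
) -> Optional[int]:
    """Index of the frame that completes the silence window after speech, else None."""
    frames_needed = silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None


# Speech, a 0.4 s pause, more speech, then a long silence.
flags = [True] * 5 + [False] * 4 + [True] * 3 + [False] * 12
print(vad_end_of_turn(flags))  # the short pause is ignored, the long silence ends the turn
```

The rule looks only at silence length, so a thoughtful one-second pause in the middle of a sentence is treated the same way as a finished turn.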

SmartTurn

SmartTurn v1 and SmartTurn v2 by Pipecat are open-source AI models designed to detect exactly when a speaker has finished their turn. We picked them for in-depth comparison because, like Krisp TT, they are audio-based models.

Interestingly, SmartTurn models introduce a hybrid strategy. They first wait for 200ms of silence detected by Silero VAD, then evaluate whether a turn shift should occur. If the confidence is too low to switch, the system defers the decision. However, if silence persists for 3 seconds (default value, configurable parameter in SmartTurn), it forcefully initiates the turn transition. This layered approach aims to strike a balance between speed and caution in handling user pauses.
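
The layered decision described above can be sketched as a small function. It follows only the description in this article, not Pipecat's code; the function name, the return values, and the 0.5 confidence threshold are assumptions.

```python
def hybrid_turn_decision(
    silence_ms: int,
    shift_confidence: float,
    confidence_threshold: float = 0.5,  # assumed value, not taken from SmartTurn
    min_silence_ms: int = 200,          # silence required before the model is consulted
    max_silence_ms: int = 3000,         # default forced-transition timeout
) -> str:
    """Return 'shift' or 'hold' for the current amount of trailing silence."""
    if silence_ms >= max_silence_ms:
        return "shift"  # timeout: hand over the turn regardless of confidence
    if silence_ms < min_silence_ms:
        return "hold"   # too little silence to evaluate yet
    if shift_confidence > confidence_threshold:
        return "shift"
    return "hold"       # low confidence: defer the decision


print(hybrid_turn_decision(silence_ms=100, shift_confidence=0.9))
print(hybrid_turn_decision(silence_ms=400, shift_confidence=0.9))
print(hybrid_turn_decision(silence_ms=400, shift_confidence=0.2))
print(hybrid_turn_decision(silence_ms=3000, shift_confidence=0.2))
```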

Tested Models

The following table gives a high-level comparison between the contenders:

Attribute	Krisp TT	SmartTurn v1	SmartTurn v2	VAD-based TT
Model Parameters count	6.1M	581M	95M	260k
Model Size	65 MB	2.3 GB	360 MB	2.3 MB
Recommended Execution	On CPU	On GPU	On GPU	On CPU
Overall Accuracy	Good	Good	Good	Poor

Test Dataset

The test dataset was built from real conversational recordings, with manually labeled turn-taking (shift) and hold scenarios. A turn-taking instance, which we call a shift, marks a point where one speaker hands over the conversation, while a hold captures cases where the speaker continues after a brief pause, filler words, or unfinished context.

The dataset consists of 1,875 labeled audio samples, including a significant number of labeled shift and hold scenarios. Each audio file is annotated to include the silence at the end of a speaker’s segment – either resulting in a turn shift or a hold. The test data was annotated according to multiple criteria, including context, intonation, filler words (e.g., “um,” “am”), keywords (e.g., “but,” “and”), and breathing patterns.

Below are the statistics on silence duration for each scenario type as well as the distribution of shift and hold cases based on mentioned criteria.

Training Dataset

Our training dataset comprises approximately 2,000 hours of conversational speech, containing around 700,000 speaker turns.

Evaluation: Prediction Quality Metrics

To assess the performance of the turn-taking model, we used a combination of classification metrics and timing-based analysis:

Metric	Description
TP	True Positives: Correctly predicted positive class cases
TN	True Negatives: Correctly predicted negative class cases
FP	False Positives: Incorrectly predicted positive class cases
FN	False Negatives: Missed positive class cases

Metric	Formula	Description
Precision	TP / (TP + FP)	Proportion of predicted positives that are actually positive
Recall	TP / (TP + FN)	Proportion of actual positives correctly predicted
Specificity	TN / (TN + FP)	Proportion of actual negatives correctly predicted
Balanced Accuracy	(Recall + Specificity) / 2	Average performance across both classes (positive and negative)
F1 Score	2 × (Precision × Recall) / (Precision + Recall)	Harmonic mean of Precision and Recall; balances false positives and false negatives
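
The formulas in the table translate directly into code. The confusion counts below are made up for illustration and are not taken from the evaluation; returning 0.0 for an empty denominator is a convenience choice of this sketch.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division: 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, specificity, balanced accuracy and F1 from confusion counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, with "shift" as the positive class.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```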

AUC: The AUC is the area under the ROC (receiver operating characteristic) curve, which shows the trade-off between the true positive rate and the false positive rate as the decision threshold is varied. A higher AUC value indicates better classification performance. For more details on AUC and other metrics, read here.
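
One way to compute the ROC AUC without drawing the curve is through its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as half. The sketch below uses that equivalence with made-up scores; it is a simple quadratic-time illustration, not the evaluation code behind the reported numbers.

```python
from typing import Sequence


def roc_auc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical confidence scores for shift (positive) and hold (negative) samples.
shift_scores = [0.9, 0.8, 0.4]
hold_scores = [0.7, 0.3, 0.4]
print(f"{roc_auc(shift_scores, hold_scores):.3f}")
```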

Evaluation: Latency vs. Accuracy tradeoff (MST vs FPR)

We realized that there is a natural tradeoff between accuracy and latency, i.e., how quickly the system detects a true shift. We can reduce the latency by lowering the threshold; however, this will likely increase the false-positive rate (FPR) and cause unwanted interruptions. On the other hand, we don’t want to wait too long to predict a shift, because the increased latency will result in awkward interaction (see the chart below).

Therefore, the relationship between latency and accuracy is important, and here we measure the TT system’s latency by mean shift time (MST). The shift time is defined as the duration between the onset of silence and the moment of predicting end-of-turn (shift). If the model outputs a confidence score, the end-of-turn prediction can be controlled via a threshold. This makes the threshold an important control lever in the trade-off between reaction speed and prediction accuracy:

Higher thresholds result in delayed shift predictions, which help reduce false positives (i.e., shift detections during the current speaker’s hold period, which lead to interruptions from the bot). However, this increases the mean shift time, making the system slower to respond.
Lower thresholds lead to faster responses, decreasing mean shift time, but at the cost of increased false positives, potentially causing the bot to interrupt speakers prematurely.

To visualize this trade-off, we plot the relationship between mean shift time, calculated on end-of-speech cases, and false positive (interruption) rate as the threshold varies from 0 to 1. A lower curve indicates a faster mean response time for the same interruption rate – or, from another perspective, fewer interruptions for the same mean response time. Here you can see the corresponding plots for Krisp TT, SmartTurn v1 and SmartTurn v2. Note that we can’t directly visualize such a chart for the VAD-based TT, as MST vs FPR requires a model that outputs a confidence score, whereas the VAD-based model produces binary outputs (0 or 1). The same limitation applies to the AUC Shift computation in the Evaluation Results table.

This basically means that, compared to SmartTurn, the Krisp TT model has a considerably faster average response time to produce a true-positive answer (0.9 vs. 1.3 seconds at a 0.06 FPR).

To summarize the overall latency-accuracy tradeoff, we also compute the area under the MST vs FPR curve. This single scalar score captures the model’s ability to respond quickly while minimizing interruptions across different thresholds. A lower area indicates better performance.
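
The curve and its area can be sketched as follows, assuming each test sample is a list of per-frame confidence scores starting at the onset of silence. The function names, the trapezoidal integration, and the fallback to the maximum hold duration when no frame crosses the threshold are assumptions; the article does not spell out these details, and the sample scores are made up.

```python
from typing import List, Sequence, Tuple

FRAME_SEC = 0.1  # 100 ms frames, as stated in the article


def shift_time_sec(scores: Sequence[float], threshold: float, max_hold_sec: float = 5.0) -> float:
    """Seconds from silence onset until the first score above the threshold."""
    for index, score in enumerate(scores):
        if score > threshold:
            return (index + 1) * FRAME_SEC
    return max_hold_sec  # assumed fallback when the threshold is never crossed


def mst_fpr_point(
    shift_samples: Sequence[Sequence[float]],
    hold_samples: Sequence[Sequence[float]],
    threshold: float,
) -> Tuple[float, float]:
    """One (FPR, MST) point of the curve for a given threshold."""
    mst = sum(shift_time_sec(s, threshold) for s in shift_samples) / len(shift_samples)
    interruptions = sum(any(x > threshold for x in h) for h in hold_samples)
    return interruptions / len(hold_samples), mst


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under (x, y) points after sorting them by x."""
    ordered: List[Tuple[float, float]] = sorted(points)
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(ordered, ordered[1:])
    )


# Made-up per-frame scores for two shift samples and two hold samples.
shifts = [[0.1, 0.6, 0.9], [0.7, 0.8, 0.9]]
holds = [[0.1, 0.2], [0.3, 0.8]]
curve = [mst_fpr_point(shifts, holds, t) for t in (0.5, 0.85)]
print(curve)                    # the higher threshold gives lower FPR but higher MST
print(area_under_curve(curve))  # lower area means a better trade-off
```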

Evaluation Results

Model	Balanced Accuracy	AUC Shift	F1 Score Shift	F1 Score Hold	AUC (MST vs FPR)
Krisp TT	0.82	0.89	0.80	0.83	0.21
VAD based TT	0.59	–	0.48	0.70	–
SmartTurn V1	0.78	0.86	0.73	0.84	0.39
SmartTurn V2	0.78	0.83	0.76	0.78	0.44

It’s important to note that the Krisp TT model delivers comparable predictive quality and a significantly better latency vs. accuracy tradeoff while being 5-10x smaller and optimized to run efficiently on a CPU. The VAD-based turn-taking approach is more lightweight, but it performs significantly worse than dedicated TT models – highlighting the importance of modeling the complex relationships between speech structure, acoustic features, and turn-taking behavior.

Demo

Here’s a simple dialogue showing how Krisp’s Turn-Taking model works in practice. In the demo, you’ll hear intentional utterances, pauses, filler words and interruptions. The response time you observe includes the Turn-Taking model’s speed, plus the latency from the speech-to-text (STT) system and the language model (LLM).

Krisp’s Turn-Taking Model

Krisp’s TT model vs Pipecat’s SmartTurn V2

This demo compares Krisp’s Turn-Taking model with Pipecat’s SmartTurn model (using SmartTurn’s default 3-second silence timeout, a configurable parameter). To highlight the differences visually, we’ve also overlaid a speech-to-text transcript on the video.

Future Plans

Improved Accuracy in TT

While this initial, audio-based TT model provides balanced accuracy and latency, it is mainly limited to analyzing prosodic and acoustic features, such as changes in intonation, pitch and rhythm. By analyzing linguistic features like the syntactic completion of a sentence we can further improve the accuracy of the TT model.

We plan to build the following features as well:

Text-based Turn-Taking: This model will use text only input and predict end-of-turn with a custom Neural Network trained for this use case.
Audio-Text Multimodal (Fusion): This model will use both audio and text inputs to leverage the best from these two modalities and give the highest accuracy end-of-turn prediction.

Early prototypes show promising results, with the multimodal approach outperforming the audio-based turn-taking models noticeably.

Backchannel support

Backchannel detection is another challenge encountered during the development of Voice AI agents. A “backchannel” is a secondary or parallel form of communication that occurs alongside a primary conversation or presentation. It encompasses the responses a listener gives to a speaker to indicate they are paying attention, without taking over the main speaking role.

While interacting with an AI agent, in some cases, the user may genuinely want to interrupt – to ask a question or shift the conversation. In others, they might simply be using backchannel cues like “right” or “okay” to signal that they’re actively listening. The core challenge lies in distinguishing meaningful interruptions from casual acknowledgments.

Our roadmap includes the release of a reliable dedicated backchannel prediction model.


Krisp Launches VIVA SDK and Surpasses 1B Minutes of Voice AI Processing per Month Milestone

Krisp Team — Wed, 16 Jul 2025 14:02:24 +0000

VIVA powers voice AI agents with real-time voice isolation and background noise cancellation, delivering unmatched clarity and reliability at scale

BERKELEY, CA, July 16, 2025 — Krisp, the leader in real-time Voice AI technology, today announced the launch of VIVA, its voice isolation AI model and software development kit (SDK) built for Voice AI agents, while achieving 1 billion minutes of monthly Voice AI processing across global deployments. The milestone reinforces Krisp’s position in the industry as a leader in real-time voice isolation and noise cancellation, powering the most advanced voice AI products in the market. It also reflects a growing demand for real-time, low-latency voice infrastructure as voice becomes the dominant mode of human-AI interaction.

VIVA delivers server-side voice isolation by seamlessly integrating into an application’s audio path. It empowers voice AI agents by improving turn-taking, enhancing voice activity detection, and preventing false interruptions, leading to more natural and effective conversations.

Already integrated into Daily, Vodex.ai, Vapi, Ultravox.ai (formerly Fixie.ai), LiveKit, and the world’s largest AI labs, VIVA is driving measurable impact:

Improving turn-taking accuracy by 3.5x
Enabling smoother interactions, with 50% fewer dropped calls
Delivering a strong customer experience with 30% higher customer satisfaction scores (CSAT)

“When our development team demonstrated Krisp’s capabilities, we were blown away. Seeing our bot continue uninterrupted, even amidst loud office noise, was a game-changer for us,” said Kumar Saurav, CTO of Vodex. “It felt like a whole new level of innovation.”

“Reaching this volume of processed audio is a reflection of how broadly integrated Krisp’s Voice AI technology has become,” said Davit Baghdasaryan, CEO and Co-Founder of Krisp. “As voice agents take center stage, clarity is non-negotiable. VIVA delivers the voice isolation backbone these systems need to operate reliably and conversationally in the real world.”

Built for high-throughput, low-latency environments, VIVA processes billions of audio requests each month, enabling developers to build more responsive and natural AI agents, from customer support to virtual companions.

To learn more about VIVA, visit https://krisp.ai/developers/#technologies


Improving Turn-Taking of AI Voice Agents with Background Noise and Voice Cancellation

Krisp Engineering Team — Mon, 24 Mar 2025 17:08:31 +0000

Turn-Taking is a big challenge

AI Voice Agents are rapidly evolving, powering critical use-cases such as customer support automation, virtual assistants, gaming, and remote collaboration platforms. For these voice-driven interactions to feel natural and practical, the underlying audio pipeline must be resilient to noise, responsive, and accurate—especially in real-time scenarios.

In a typical deployment, audio streams originate from diverse endpoints like mobile applications, web browsers, or traditional telephony and are delivered via real-time communication protocols like WebRTC or WebSockets (WSS). This audio is aggregated and managed through specialized providers like LiveKit, Daily, or Agora, which ensure reliable, low-latency audio transport to the server-side pipeline.

Within the server pipeline, once the audio arrives, it undergoes optional preprocessing steps for formatting or basic adjustments, after which it moves directly into a Voice Activity Detection (VAD) module.

The VAD identifies active speech segments, driving automatic end-pointing and intelligent interruption handling. After the user finishes speaking and the VAD detects silence, relevant API events trigger downstream Voice AI models to generate and deliver responses. If the user resumes speaking during the voice bot’s response generation, the pipeline cancels the ongoing output and clears buffers, ensuring natural conversational turn-taking.
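The end-pointing and interruption behavior described above can be sketched as a small state machine driven by per-frame VAD decisions. This is a minimal illustration, not Krisp's or any provider's implementation: the `TurnTracker` class, the event names, and the three-frame silence threshold are all hypothetical, and the bot is assumed to start responding the moment an end of turn is detected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TurnTracker:
    """Toy end-pointing and barge-in logic fed by per-frame VAD decisions."""

    silence_frames_for_end: int = 3  # hypothetical threshold, tune per use case
    user_speaking: bool = False
    bot_speaking: bool = False
    silence_run: int = 0
    events: List[str] = field(default_factory=list)

    def on_frame(self, is_speech: bool) -> None:
        if is_speech:
            self.silence_run = 0
            if not self.user_speaking:
                self.user_speaking = True
                if self.bot_speaking:
                    # The user talks over the bot: cancel output, clear buffers.
                    self.bot_speaking = False
                    self.events.append("interrupt_bot")
                self.events.append("user_speech_start")
        elif self.user_speaking:
            self.silence_run += 1
            if self.silence_run >= self.silence_frames_for_end:
                # Enough trailing silence: treat it as the end of the user's turn.
                self.user_speaking = False
                self.silence_run = 0
                self.bot_speaking = True  # assumed: the bot starts responding now
                self.events.append("end_of_turn")


tracker = TurnTracker()
for vad_decision in [True, True, False, False, False, False, True]:
    tracker.on_frame(vad_decision)
print(tracker.events)
```

In this sketch a single false-positive speech frame while the bot is talking is enough to emit `interrupt_bot`, which is exactly the failure mode background noise causes.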

In this scenario, background noises—such as music, traffic sounds, TVs, or nearby conversations—remain embedded within the audio stream, reaching the VAD module unfiltered. Because VAD is designed to detect human speech activity, these background sounds often cause false-positive speech detections. As a result, the VAD mistakenly interprets noise or background voices as active user speech, triggering unintended interruptions. These false triggers negatively impact turn-taking, a core component of natural, human-like conversational interactions.

Here, by placing Krisp Background Voice and Noise Cancellation before the VAD, the pipeline substantially reduces false-positive triggers and prevents interruptions from common background distractions.

Additionally, Krisp significantly improves downstream speech processing accuracy by delivering cleaner audio.

Introducing Krisp Server SDK for AI Voice Agents

We’re excited to announce the launch of Krisp Server SDK, featuring two advanced AI models engineered explicitly for superior noise cancellation for AI Voice Agents.

Compared to our on-device AI models, these models are optimized to deliver unmatched performance and voice quality, especially in challenging corner cases.

Both models remove background noise, chatter, and secondary voices, ensuring the retention and clarity of only the primary speaker’s voice.

BVC-tel (General-Purpose Model):
- Designed as a robust, versatile solution for a wide variety of audio sources, including WebRTC, mobile, and traditional telephony inputs.
- Specifically engineered to be highly resilient against audio artifacts introduced by common telephony codecs, such as the G.711 codec widely used in telecommunication networks.
- Supports audio sampling rates up to 16 kHz, which is optimal for AI Voice Agents as it effectively captures the essential frequency ranges of human speech.

BVC-app (High-Fidelity Model):
- Specifically optimized for WebRTC use-cases where high-quality audio streams are required.
- Supports higher sampling rates up to 32 kHz, enabling clearer, more natural-sounding voice interactions for applications that require superior audio fidelity.

If the incoming audio source has a sampling rate higher than the model’s supported rate (e.g., 48 kHz), the SDK manages the audio processing automatically: it downsamples to the model’s working rate, applies the noise cancellation, and then upsamples back to the original sampling rate.
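As a rough illustration of the rate handling described above, the arithmetic for a single frame can be sketched as follows. The `resample_plan` helper and the 10 ms frame length are assumptions made for this example only, and the behavior for inputs below the model rate (no resampling) is also assumed; none of this reflects the SDK's actual API.

```python
from fractions import Fraction


def resample_plan(input_rate_hz: int, model_rate_hz: int, frame_ms: int = 10) -> dict:
    """Return the working rate, the downsampling ratio and per-frame sample counts.

    frame_ms is a hypothetical frame length chosen for the example.
    """
    # Assumption: inputs at or below the model rate are processed as-is.
    working_rate = min(input_rate_hz, model_rate_hz)
    return {
        "working_rate_hz": working_rate,
        "down_ratio": Fraction(working_rate, input_rate_hz),
        "input_samples_per_frame": input_rate_hz * frame_ms // 1000,
        "model_samples_per_frame": working_rate * frame_ms // 1000,
    }


# 48 kHz input into a 16 kHz model: every 480-sample frame becomes 160 samples.
print(resample_plan(48_000, 16_000))
# 48 kHz input into a 32 kHz model: every 480-sample frame becomes 320 samples.
print(resample_plan(48_000, 32_000))
```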

Despite significant quality enhancements, server-side models maintain a low algorithmic latency of just 15 milliseconds, identical to our on-device models. This ensures real-time responsiveness, which is critical for conversational interactions.

The new Krisp Server SDK models are CPU-optimized and support a range of platforms, including:

Linux (x64 and ARM64 architectures)
Windows (x64), with ARM64 support coming soon

Quantifying the Krisp BVC Impact

We comprehensively evaluated how the new Background Voice and Noise Cancellation (BVC) model improves turn-taking accuracy and speech recognition quality.

Using the BVC-tel model, we specifically tested two distinct audio pipeline scenarios:

BVC-VAD-STT: Audio processed by Krisp BVC and VAD is passed to the AI Voice Agent.
BVC-VAD: The original (unprocessed) audio is passed downstream to the AI Voice Agent, with Krisp BVC processed audio used solely for improved VAD accuracy.
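The difference between the two modes comes down to which copy of the audio is forwarded once the VAD has made its decision on the cleaned signal. The sketch below shows that routing with stand-in callables; `route_frame`, `denoise` and `is_speech` are hypothetical names for this example and do not correspond to the Krisp SDK or Silero VAD APIs.

```python
from typing import Callable, List, Optional

Frame = List[float]


def route_frame(
    frame: Frame,
    denoise: Callable[[Frame], Frame],
    is_speech: Callable[[Frame], bool],
    mode: str,
) -> Optional[Frame]:
    """Return the audio to forward downstream, or None if no speech was detected."""
    cleaned = denoise(frame)
    # In both modes the VAD decision is made on the cleaned audio.
    if not is_speech(cleaned):
        return None
    if mode == "BVC-VAD-STT":
        return cleaned  # downstream models receive the processed audio
    if mode == "BVC-VAD":
        return frame  # downstream models receive the original audio
    raise ValueError(f"unknown mode: {mode}")


# Stand-ins for illustration only: a crude gate and an energy check.
toy_denoise = lambda f: [x if abs(x) > 0.5 else 0.0 for x in f]
toy_vad = lambda f: any(abs(x) > 0.0 for x in f)

print(route_frame([0.1, 0.9, 0.2], toy_denoise, toy_vad, "BVC-VAD-STT"))
print(route_frame([0.1, 0.9, 0.2], toy_denoise, toy_vad, "BVC-VAD"))
print(route_frame([0.1, 0.2], toy_denoise, toy_vad, "BVC-VAD"))
```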

The following graphics and audio samples illustrate a typical case: Krisp BVC effectively canceling the background TV speech when interacting with the AI Voice Agent.

The red-circled areas represent the TV speech. The green-circled areas represent the primary speaker’s speech.

Turn-taking with VAD only	Turn-taking with BVC-VAD
TV speech passes through VAD, potentially interrupting the AI Voice Agent during its response.	TV speech is canceled by Krisp BVC before it reaches the VAD, so it no longer interrupts the AI Voice Agent during its response.

Original Audio https://krisp.ai/blog/wp-content/uploads/2025/03/Original-Recording-1.wav	Original Audio https://krisp.ai/blog/wp-content/uploads/2025/03/Original-Recording.wav
Audio after VAD processing only https://krisp.ai/blog/wp-content/uploads/2025/03/Original-Recording-No-BVC-VAD.wav	Audio after BVC processing https://krisp.ai/blog/wp-content/uploads/2025/03/Original-Recording-After-BVC.wav
	Audio after BVC + VAD processing https://krisp.ai/blog/wp-content/uploads/2025/03/Original-Recording-After-BVC-VAD.wav

In the following sections, we perform more comprehensive evaluations to quantify the improvements in turn-taking and in the Word Error Rate (WER) of STT.

Evaluation Setup:

Dataset: We selected the widely-used AMI corpus, specifically the individual headset recordings. This dataset is ideal due to its realistic mix of background conversations and noise, which is representative of many typical mobile and telephony scenarios.
Voice Activity Detection: Latest version of open-source SileroVAD
Speech-To-Text Models: Whisper V3 (base version). In our tests, the difference between the base and large versions was insignificant, so we present only the base model results.

Impact on Turn-Taking

Applying Krisp BVC upstream had a clear, positive impact on VAD precision within the AMI dataset—especially in reducing false-positive speech detections. Lower false positives are particularly critical for ensuring smooth, uninterrupted conversational experiences.

Our tests show that with Krisp BVC, false-positive triggers in VAD were reduced by 3.5x on average. This means the AI Voice Agent is significantly less likely to experience unintended interruptions caused by background speech or noise. Overall, the precision after Krisp BVC increases by over a quarter—a major improvement.
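To see how a 3.5x drop in false positives translates into a precision gain, here is a small worked example. The detection counts are invented purely to show the mechanics and are not taken from the AMI evaluation; the example also assumes the number of true positives stays the same after BVC.

```python
def precision(true_positives: float, false_positives: float) -> float:
    """Precision = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Hypothetical counts, chosen only to illustrate the calculation.
tp = 1000
fp_before = 500
fp_after = fp_before / 3.5  # false positives reduced by 3.5x

p_before = precision(tp, fp_before)
p_after = precision(tp, fp_after)
relative_gain = p_after / p_before - 1

print(f"precision before: {p_before:.3f}")
print(f"precision after:  {p_after:.3f}")
print(f"relative gain:    {relative_gain:.1%}")
```

With these made-up counts, precision rises from about 0.667 to 0.875. The actual gain depends on the starting precision, so other datasets will produce different figures.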

Impact on Speech Recognition Accuracy (WER)

Using Krisp BVC also markedly reduces the Word Error Rate (WER) of Whisper V3 models on the AMI dataset—achieving more than a 2x improvement. This result aligns with expectations, given Krisp’s effectiveness in eliminating distracting background speech.
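For readers less familiar with the metric, WER is the word-level edit distance between the reference transcript and the STT output, divided by the number of reference words. The function below is a minimal sketch of that standard definition; it is not the evaluation script used for these tests, and it skips the text normalization (casing, punctuation) that real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one substitution ("off" -> "of") over 5 words.
print(word_error_rate("turn the lights off please", "turn lights of please"))  # 0.4
```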

Interestingly, the WER improvements were consistent in both BVC-VAD and BVC-VAD-STT modes.

To further explore this, we evaluated an additional dataset with minimal background speech: the ITU-T P.501 dataset, which mixes single-speaker audio with 24 different noise types at three intensity levels (0 dB, 5 dB, 10 dB).

Modern STT models, including Whisper, generally have strong built-in noise robustness. We aimed to measure any further WER improvements achievable by applying Krisp BVC upstream.

Indeed, the WER metric was generally much lower in this case compared to the AMI dataset.

In the BVC-VAD mode, where Whisper operated on original audio while leveraging Krisp BVC-processed audio for enhanced VAD, we observed an 18% improvement in WER.

Conversely, in the BVC-VAD-STT mode, where Whisper processed Krisp-modified audio, the WER increased by about 2x, although the absolute WER remained relatively low. We attribute this increase to Whisper never having encountered Krisp NC-processed audio during its training, which could cause suboptimal performance on such modified audio.

Note that WER results in BVC-VAD-STT mode could be very different on other datasets and STT engines. We recommend experimenting with both BVC-VAD and BVC-VAD-STT modes to determine the optimal audio pipeline setup for you.

Overall, these evaluations demonstrate that incorporating Krisp BVC into AI Voice Agent pipelines substantially improves turn-taking and speech recognition quality, especially in real-world scenarios where background noise and secondary conversations are prevalent.


Elevate Your Contact Center Experience with Krisp Background Voice Cancellation (BVC)

Krisp Engineering Team — Wed, 19 Jun 2024 14:49:18 +0000

In the energetic environment of a contact center, maintaining clear and focused communication with customers is critical. Agents often face the challenge of background noise and overlapping voices, which not only distract customers but can also lead to inadvertent disclosure of sensitive information. Traditional headsets and hardware solutions fall short in addressing these issues effectively. Krisp’s Background Voice Cancellation (BVC) is a game-changer for contact center operations, materially improving average handle time (AHT), customer satisfaction (CSAT) and employee satisfaction (ESAT).

What is Krisp Background Voice Cancellation?

Krisp BVC is an advanced AI noise-canceling technology that eliminates all background noises and other competing voices nearby, including the voices of other agents. This breakthrough technology is enabled as soon as an agent plugs in their headset, without requiring individual voice enrollment or training. This innovative solution integrates smoothly with both native applications and browser-based calling applications via WebAssembly JavaScript (WASM JS), ensuring high performance and efficiency.

Why Choose Krisp BVC for Your Contact Center?

1. Enhanced Customer Experience

Customers often struggle with understanding agents when there’s background chatter, leading to frustration and reduced satisfaction. By using Krisp BVC, all extraneous voices and noises are filtered out, allowing customers to focus solely on the agent they are speaking with. This ensures a smooth and professional interaction every time, which directly contributes to higher CSAT scores.

2. Privacy and Confidentiality

In a contact center, the risk of customers overhearing personal information from other calls is a significant concern, especially for financial and healthcare customers. Krisp BVC addresses this by completely isolating the agent’s voice from the background, ensuring that sensitive information remains confidential.

3. Hardware Independence

While headsets and other hardware solutions provide some noise reduction, they do not eliminate background voices. Krisp BVC works independently of hardware, offering superior noise and background voice cancellation without the need for additional devices or complicated setups.

4. Plug-and-Play Functionality

Once the agent’s headset is plugged in, Krisp BVC is activated automatically. There’s no need for agents to enroll their voice or go through any training process, making it an effortless solution that saves time and resources.

5. Versatility Across Platforms

Krisp BVC is uniquely available for both native applications and browser-based calling applications through WASM JS. This means it can be integrated effortlessly into various platforms, ensuring consistent performance and reliability.

6. Efficient Performance

Krisp BVC is designed to run efficiently in the browser, making it an ideal solution for Contact Center as a Service (CCaaS) platforms. Its high-performance capabilities ensure minimal latency and a smooth user experience.

7. Improved CSAT Metrics

With the enhanced clarity of communication provided by Krisp BVC, customers are more likely to have positive interactions with agents. This leads to increased satisfaction, as reflected in improved CSAT metrics reported to us by a number of customers. Clear and effective communication is crucial in resolving issues promptly and accurately, which in turn boosts customer loyalty and satisfaction.

Integration Made Easy

Integrating Krisp BVC into your contact center application is straightforward. Here’s a sample code snippet to demonstrate how simple it is to get started:

Visualizing the Difference

The graphical representation above illustrates the clarity and focus achieved by using Krisp BVC. Notice how the agent’s speech is clear and distinct, free from background distractions.

Hear the Difference

Experience the transformative power of Krisp BVC with this audio comparison:

Without BVC – Competing Agent Voices

https://krisp.ai/blog/wp-content/uploads/2024/06/Original.wav

With BVC – Clear communication

https://krisp.ai/blog/wp-content/uploads/2024/06/Krisp-BVC-Processed.wav

Conclusion

Integrating Krisp BVC into your contact center solutions can significantly enhance the quality of interactions and customer satisfaction. Its ease of integration, combined with superior performance and versatility, makes Krisp BVC a must-have feature for modern contact centers. Upgrade your communication systems today with Krisp Background Voice Cancellation and experience the difference it makes, including improved CSAT metrics.

Ready to get started? Visit Krisp’s Developer Portal for more information and comprehensive integration guides.


Enhancing Browser App Experiences: Krisp JS SDK Pioneers In-browser AI Voice Processing for Desktop and Mobile

Krisp Engineering Team — Wed, 15 May 2024 07:17:29 +0000

In today’s connected world, where web browsers serve as gateways to an assortment of online experiences, ensuring a seamless and productive user experience is paramount. One crucial aspect often overlooked in browser-based communication applications is voice quality, especially in scenarios where clarity of communication is essential.

Diverse Applications of Noise Cancellation on the Web

From virtual meetings and online classes to contact center operations, the demand for clear audio communications has become ever more important, making AI Voice processing with noise and background voice cancellation an expected and highly sought-after feature. While standalone applications have provided this functionality, integrating this directly into browser-based applications has proven to be a challenge.

The need for noise and background voice cancellation extends beyond conventional communication platforms. In Telehealth, for instance, where accurate communication is vital for call-based diagnosis and consultation, background noise and voices can hinder effective communication. Another example is insurance companies taking calls from customers at the scene of an incident. Eliminating background noise ensures that critical information is accurately conveyed, leading to smoother claims processing and customer satisfaction. These, and many other use cases, often involve one-click web sessions for the calls.

Overcoming Challenges for Mobile Browser Integration

The growing demand for quality communications in browser-based applications extends to both desktop and mobile devices. Up until recently, achieving compatibility with mobile devices, particularly with iOS Safari, posed significant difficulties. Limitations within Apple’s WebKit framework and the inherently CPU-intensive nature of JavaScript solutions hindered bringing the power of Krisp’s technologies to mobile browser applications.

The introduction of Single Instruction, Multiple Data (SIMD) support marked a significant opening for Krisp to deliver its market-leading technology into Safari specifically, and mobile browsers generally. SIMD enables parallel processing of data, significantly boosting performance and efficiency, particularly on mobile devices with limited computational resources.

By leveraging SIMD, the Krisp JS SDK has achieved low levels of CPU usage, making its market-leading noise cancellation available for users on mobile browser applications. This breakthrough not only enhances the user experience but also opens up new possibilities for web-based applications across various industries.

As Krisp’s technologies continue to evolve and extend into new territories, the ability to make AI Voice features available for all users across desktop and mobile browser-based applications is fundamental and allows users to have seamless access to the best voice processing technologies in the market.

Try next-level audio and voice technologies

Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.


Krisp Delivers AI-Powered Voice Clarity to Symphony’s Trader Voice Products

Krisp Team — Wed, 17 Apr 2024 16:18:33 +0000

AI Noise Cancellation partnership redefines communication standards in financial markets

BERKELEY, Calif., April 17, 2024 – Krisp, the world’s leading AI-powered voice productivity software, announced today a new integration with Symphony’s trader voice platform, Cloud9, to enhance voice audio clarity. The partnership delivers Krisp’s advanced AI Noise Cancellation, enabling Cloud9 users to experience clear audio in challenging environments, such as trading floors and busy offices.

Through the integration, Symphony will also be able to enhance its built-for-purpose financial markets voice analytics in partnership with Google Cloud. By creating a space of uninterrupted audio between counterparties both on and off Cloud9, the integration enables more efficient communication with fewer disagreements. It also improves the accuracy of audio transcription, with compliance review in mind.

“We are thrilled to partner with Symphony and integrate our Voice AI technology into their products,” said Robert Schoenfield, EVP of Licensing and Partnerships at Krisp. “Symphony’s customers operate in difficult noisy environments, dealing with high value transactions. Improving their communication is of real value.”

Symphony’s chief product officer, Michael Lynch, said: “Cloud9 SaaS approach to trader voice gives our users the flexibility to work from anywhere, whether it’s a bustling trading desk or their living room, and KRISP improves our already best-in-class audio quality to help our users make the best possible real-time decisions while also improving post-trade analytics and compliance.”

This collaboration not only brings improved communication quality but also aligns with Krisp’s and Symphony’s commitment to bringing industry-leading solutions to their customers.

About Krisp

Founded in 2017, Krisp pioneered the world’s first AI-powered Voice Productivity software. Krisp’s Voice AI technology enhances digital voice communication through audio cleansing, noise cancelation, accent localization, and call transcription and summarization. Offering full privacy, Krisp works on-device, across all audio hardware configurations and applications that support digital voice communication. Today, Krisp processes over 75 billion minutes of voice conversations every month, eliminating background noise, echoes, and voices in real-time, helping businesses harness the power of voice to unlock higher productivity and deliver better business outcomes.

Learn more about Krisp’s SDK for developers.

Press contact:

Shara Maurer

Head of Corporate Marketing
smaurer@krisp.ai

About Symphony

Symphony is the most secure and compliant markets’ infrastructure and technology platform, where solutions are built or integrated to standardize, automate and innovate financial services workflows. It is a vibrant community of over half a million financial professionals with a trusted directory and serves over 1000 institutions. Symphony is powering over 2,000 community built applications and bots. For more information, visit www.symphony.com.

Press contact:

Odette Maher

Head of Communications and Corporate Affairs

+44 (0) 7747 420807 / odette.maher@symphony.com


Twilio Partners with Krisp to Provide AI Noise Cancellation to All Twilio Voice Customers

Krisp Team — Mon, 11 Mar 2024 17:02:05 +0000

Krisp and Twilio have partnered to provide Twilio customers a best-in-class audio experience, removing all unwanted background noise and voices from calls using the Krisp plug-in for Twilio Programmable Voice. This feature is available directly from Krisp, for integration and use with Twilio Voice.

How Krisp AI Noise Cancellation works

Trusted by more than 100 million users to process over 75 billion minutes of calls monthly, Krisp’s Voice AI SDKs are designed to identify human voice and cancel all background noise and voices to eliminate distractions on calls. Krisp SDKs are available for browsers (WASM JS), desktop apps (Win, Mac, Linux) and mobile apps (iOS, Android). The Krisp Audio Plugin for Twilio Voice is a lightweight audio processor that can run inside your client application and create crystal clear audio.

The plugin needs to be loaded alongside the Twilio SDK and runs as part of the audio pipeline between the microphone and audio encoder in a preprocessing step. During this step, the AI-based noise cancellation algorithm removes unwanted sounds like barking dogs, construction noises, honking horns, coffee shop chatter and even other voices.

After the preprocessing step, the audio is encoded and delivered to the end user. Note that all of these steps happen on device, with near zero latency and without any media sent to a server.

Requirements and considerations

Krisp’s AI Noise Cancellation requires you to host and serve the Krisp audio plugin for Twilio Voice as part of your web application. It also requires browser support of the WebAudio API (specifically Worklet.addModule). Krisp has a team ready to support your integration and optimization for great voice quality.

Learn more about Krisp here and apply for access to the world’s best Voice AI SDKs.

Get started with AI Noise Cancellation

Visit the Krisp for Twilio Voice Developers page to request access to the Krisp SDK Portal. Once access is granted, download Krisp Audio JS SDK and place it in the assets of your project. Use the following code snippet to integrate the SDK with your project. Read the comments inside the code snippets for additional details.

Visit Krisp for Twilio Voice Developers and get started today.


On-Device STT Transcriptions: Accurate, Secure and Less Expensive

Krisp Team — Tue, 05 Mar 2024 17:11:32 +0000

Krisp is synonymous with on-device technologies, led by its innovative Voice AI solutions. With a reputation built on delivering superior performance and privacy-centric features for voice clarity, including background noise and voice cancellation during calls, Krisp’s on-device approach has long been one of its competitive advantages and unique value propositions for users. The Krisp app expanded its offerings to include its AI Meeting Assistant, incorporating on-device speech-to-text (STT) and other voice productivity technologies. Leveraging the expertise acquired through developing in-house, state-of-the-art solutions, the aim was not merely to match server-side transcription providers, but surpass them in terms of accuracy, affordability and privacy, setting a new standard for on-device STT service.

The on-device requirement has in many ways shaped the technical specifications of the technology and posed a series of challenges that the team has been able to tackle head-on, working through various iterations. The path to achieving high quality on-device STT continues, as the Krisp app has now transcribed over 15 million hours of calls and the company is now making this technology available to its license partners via on-device SDKs. Let’s dive into the specific challenges Krisp worked through to bring this technology to market.

Challenges and solutions to on-device STT

Resource constraints

Without diving into the specifics of on-device STT technology and its architecture, one of the first and most obvious constraints guiding development was computational resources. On-device STT systems operate within the confines of limited resources, including CPU, memory, and power. Unlike cloud-based solutions, which can leverage expansive server infrastructure, on-device systems must deliver comparable performance with significantly restricted resources. This constraint necessitates the optimization of algorithms, models, and processing pipelines to ensure efficient resource utilization without compromising accuracy and responsiveness. In many use cases, STT needs to run alongside Noise Cancellation and other technologies, which further reduces the resources available to it.

Model complexity and size

The effectiveness of STT models hinges on their complexity and size, with larger models generally exhibiting superior accuracy and robustness. However, deploying large models on-device presents a formidable challenge, as it exacerbates memory and processing overheads. Balancing model complexity and size becomes paramount, requiring developers to employ techniques like model pruning, quantization, and compression to achieve optimal trade-offs between performance and resource utilization.
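As one concrete example of the size-versus-accuracy trade-off, symmetric 8-bit quantization stores each weight as a small integer plus one shared scale factor. The sketch below is a generic illustration of that technique in plain Python; it makes no claim about which quantization scheme Krisp's models actually use, and the function names are chosen for this example.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [1.27, -0.64, 0.0, 0.32]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, worst_error)
```

Each weight now needs one byte instead of four (for 32-bit floats), at the cost of a rounding error of at most half a quantization step per weight.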

In order to achieve high quality transcripts and feature-rich speech-to-text systems, there is a need to build complex network architectures consisting of a number of AI models and algorithms. Such models include language models, punctuation and text normalization, speaker diarization and personalization (custom vocabulary) models, each presenting unique technical challenges and performance considerations.

The technology that Krisp employs both in its app and SDKs includes a combination of all of the above-mentioned technologies, as well as other adjacent algorithms to ensure readability and grammatical coherence of the final output.

The language model enhances transcription accuracy by predicting the likelihood of word sequences based on contextual and syntactic information. It helps in disambiguating words and improving the coherence of transcribed text. The Punctuation & Capitalization Model predicts the appropriate punctuation marks and capitalization based on speech patterns and semantic cues, enhancing the readability and comprehension of transcribed text. The Inverse Text Normalization model standardizes and formats transcribed text to adhere to predefined conventions, such as converting numbers to textual representations or vice versa, expanding abbreviations, and correcting spelling errors. For cases where customers might have domain-specific terminology or proper names that are not widely recognized by the standard models, Krisp also provides Custom Vocabulary support.

Apart from the features ensuring text readability and accuracy, another important technology included in Krisp’s on-device STT is Speaker Diarization. This model segments speech into distinct speaker segments, enabling the identification and differentiation of multiple speakers within a conversation or audio stream. It is crucial for speaker-dependent processing and improving transcription accuracy in multi-speaker scenarios.
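To give a feel for what inverse text normalization does, the toy function below rewrites spoken-form numbers from zero to ninety-nine as digits. It is a deliberately simplistic, rule-based sketch written for this article; production ITN systems, including Krisp's, handle far more cases, and nothing here reflects their actual design.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def inverse_normalize(text: str) -> str:
    """Replace spoken-form numbers (0-99) with digits; leave other words as-is."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        word = tokens[i].lower()
        if word in TENS:
            value = TENS[word]
            nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
            # "twenty" followed by "five" becomes 25.
            if nxt in UNITS and 1 <= UNITS[nxt] <= 9:
                value += UNITS[nxt]
                i += 1
            out.append(str(value))
        elif word in UNITS:
            out.append(str(UNITS[word]))
        else:
            out.append(tokens[i])
        i += 1
    return " ".join(out)


print(inverse_normalize("send twenty five invoices by nine"))
# send 25 invoices by 9
```

A rule this naive also turns "I have one question" into "I have 1 question", which is why real systems rely on context rather than a lookup table.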

Real or near real-time processing for on-device STT

Depending on the use case, on-device STT technology might have to deliver real or near real-time processing capabilities to enable seamless user interactions across diverse applications. Achieving low-latency speech recognition necessitates streamlining inference pipelines, minimizing computational overheads, and optimizing signal processing algorithms. Moreover, the heterogeneity of device architectures and hardware accelerators further complicates real-time performance optimization, requiring tailored solutions for different platforms and configurations. Krisp developers have struck a delicate balance among minimizing latency, selecting optimal model combinations, ensuring processing synergy, and keeping the pipeline scalable and flexible enough to accommodate various use-cases.

Robustness to variability

With a global and multi-domain user-base, there is an inherent variability of speech arising from diverse accents, vocabularies, environments, and speaking styles. Our on-device STT technology must exhibit robustness to such variability to ensure consistent performance across disparate contexts. This entails training models on diverse datasets, augmenting training data to encompass various scenarios, and implementing robust feature extraction techniques capable of capturing salient speech characteristics while mitigating noise and various device or network-dependent distortions.

In addition to addressing resource constraints and optimizing algorithms for on-device STT, Krisp prioritizes rigorous speech recognition testing to ensure its technology’s robustness across diverse accents, environments, and speaking styles.

Integration & embeddability of on-device STT

Along with being on-device, the technologies underlying the Krisp app AI Meeting Assistant are also designed with embeddability in mind. Integrating on-device STT technology into communication applications and devices presents a range of additional challenges, all of which Krisp has tackled. Resources must be carefully allocated to ensure optimal performance without compromising existing customer infrastructure. Customization and configuration options are essential to meet the diverse needs of end-users while maintaining scalability and performance across large-scale deployments. Security and compliance considerations demand robust encryption and privacy measures to protect sensitive data. Seamless integration with existing infrastructure, including telephony systems and collaboration tools, requires interoperability standards, codec support and integration frameworks.

One prevailing requirement for communication services is for on-device STT technology to be functional on the web. This presents a new set of challenges in terms of further resource optimization, as well as compatibility across diverse web platforms, browsers, frameworks and devices.

Bringing it all together

While the integration of on-device STT technology into communication applications and devices presents challenges and requires meticulous resource utilization, customization, and seamless interoperability, Krisp has addressed these challenges and today delivers embedded STT solutions that enhance the functionality and value proposition for applications and their end-users.

Try next-level on-device STT, audio and voice technologies

Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.



AI Noise Cancellation partnership redefines communication standards in financial markets

Twilio Partners with Krisp to Provide AI Noise Cancellation to All Twilio Voice Customers

How Krisp AI Noise Cancellation works

Requirements and considerations

Get started with AI Noise Cancellation