On April 8, 2025, Amazon unveiled Nova Sonic, a generative AI model designed to process and generate natural-sounding speech. Amazon claims the model is competitive with voice models from industry leaders such as OpenAI and Google on speed, speech recognition, and conversational quality.
Nova Sonic represents a significant advance over earlier voice models, such as those powering Amazon’s Alexa. Traditional digital assistants have often been criticized as rigid and unnatural, but recent technical breakthroughs have made more fluid, human-like conversational agents possible. Nova Sonic is meant to address those shortcomings with a more natural and engaging user experience.
Developers can access Nova Sonic through Amazon Bedrock, the company’s platform for building enterprise AI applications. The model is available via a new bi-directional streaming API, in which the application and the model stream data to each other at the same time. Amazon touts Nova Sonic as the most cost-efficient AI voice model currently available, claiming it is approximately 80% less expensive than OpenAI’s GPT-4o.
Rohit Prasad, Amazon’s Senior Vice President and Head Scientist of Artificial General Intelligence, said that components of Nova Sonic are already used in Alexa+, the upgraded version of Amazon’s digital voice assistant. According to Prasad, Nova Sonic builds on Amazon’s expertise in large orchestration systems, the technical infrastructure that underpins Alexa. That expertise lets Nova Sonic route user requests to the appropriate APIs, whether the task is fetching real-time information from the internet, parsing proprietary data sources, or interacting with external applications.
One of Nova Sonic’s standout features is its handling of two-way dialogue: it waits for an appropriate moment to respond, taking the user’s pauses and interruptions into account. It also generates a text transcript of the user’s speech, which developers can use in their applications.
On speech recognition accuracy, Nova Sonic posts strong results. On the Multilingual LibriSpeech benchmark, which assesses speech recognition across various languages and dialects, Nova Sonic achieved a word error rate (WER) of just 4.2% across English, French, Italian, German, and Spanish. In other words, only about four out of every 100 words differed from the human transcriptions in these languages.
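To make the WER figure concrete, here is a minimal sketch of how a word error rate is typically computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function name and the sample sentences are illustrative assumptions; this is not Amazon's evaluation code, and real benchmarks usually apply additional text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, chosen only to illustrate the calculation.
reference = "please turn on the kitchen lights"
hypothesis = "please turn on the chicken light"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 33.3%
```

In this toy example two of the six reference words are wrong, giving a WER of 33.3%. A 4.2% WER corresponds to roughly four such errors per 100 reference words.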
In noisy settings with multiple participants, as measured by the Augmented Multi-Party Interaction (AMI) benchmark, Nova Sonic was 46.7% more accurate in terms of WER than OpenAI’s GPT-4o-transcribe model. On responsiveness, Nova Sonic has an average perceived latency of 1.09 seconds, slightly ahead of OpenAI’s GPT-4o model, which responds in 1.18 seconds, according to benchmarking by Artificial Analysis.
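The short calculation below puts these comparisons in perspective. The latency figures come from the paragraph above. The article does not give absolute WER values for the multi-party benchmark, so the 10% baseline is purely hypothetical, and reading "46.7% more accurate" as a relative reduction in WER is an assumption rather than something the source states.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline


def wer_after_relative_reduction(baseline_wer: float, reduction: float) -> float:
    """WER implied by cutting baseline_wer by the given relative fraction."""
    return baseline_wer * (1.0 - reduction)


# Latency figures quoted in the article, in seconds.
nova_latency = 1.09
gpt4o_latency = 1.18
gap = gpt4o_latency - nova_latency
print(f"absolute latency gap: {gap * 1000:.0f} ms")
print(f"relative latency gap: {relative_reduction(gpt4o_latency, nova_latency):.1%}")

# Hypothetical baseline WER (not from the article), used only to show
# what a 46.7% relative reduction would mean in absolute terms.
hypothetical_baseline_wer = 0.10
implied_wer = wer_after_relative_reduction(hypothetical_baseline_wer, 0.467)
print(f"implied WER: {implied_wer:.2%}")
```

On the quoted figures, the latency advantage works out to about 90 milliseconds, or roughly 7.6% of the competing model's response time.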
Prasad described Nova Sonic as a key part of Amazon’s broader strategy to develop artificial general intelligence (AGI), that is, AI systems capable of performing any task a human can carry out on a computer. Looking ahead, Amazon plans to release additional AI models.