Technology

ElevenLabs and Google lead updated speech-to-text benchmark

Word error rates fall toward two percent in agent speech tests; the transcript becomes a platform bill

ElevenLabs’ Scribe v2 and Google’s Gemini models now sit at the top of Artificial Analysis’ updated speech-to-text benchmark, a league table built around word error rate. In version 2.0 of the AA-WER benchmark, Scribe v2 leads at 2.3% WER, followed by Gemini 3 Pro at 2.9% and Mistral’s Voxtral Small at 3.0%, according to The Decoder.
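
For readers outside the speech world, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is illustrative only: real benchmark harnesses such as AA-WER apply text-normalization rules before scoring, which this toy version reduces to lowercasing, and the example sentences are invented.

```python
# Minimal sketch of word error rate (WER): the word-level edit distance
# between a reference transcript and a hypothesis, divided by the number
# of reference words. Real harnesses normalize punctuation, numbers, and
# casing first; this version only lowercases and splits on whitespace.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming:
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,       # deletion
                d[i][j - 1] + 1,       # insertion
                d[i - 1][j - 1] + sub  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of eight: WER = 0.125.
# A 2.3% WER works out to roughly one error per 43 reference words.
print(wer("the call was escalated to tier two support",
          "the call was escalated to tear two support"))  # 0.125
```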

Those numbers matter because speech-to-text is quietly becoming a default layer under customer support, compliance logging, media production, and meeting software. But the more immediate business consequence is not whether a model mishears one word in fifty; it is who gets to sell “the transcript” as a product and attach downstream services—summaries, sentiment tags, ticket routing, quality scoring, and searchable archives. Once transcription is good enough to be trusted most of the time, the value moves to packaging: latency that feels live, diarization that correctly separates speakers in messy calls, robustness in background noise, and predictable cost per hour of audio.

The benchmark also illustrates how the biggest platforms can win without building a single-purpose tool. The Decoder notes that Google did not train Gemini specifically for transcription, yet Gemini 3 Pro performs near the top—an advantage that comes from running a general multimodal model at scale and then letting product teams bolt it into Workspace, Android, and call-center stacks. That creates a familiar dependency: the model provider owns the updates, the pricing, and the failure modes, while the customer owns the consequences when a bad transcript becomes a bad decision.

Lower on the list, OpenAI’s open-source Whisper Large v3 posts a 4.2% WER—competitive, but no longer state of the art in this benchmark. That gap is small on paper and large in procurement: for a business replacing human transcription, a few percentage points can be the difference between “we still need a reviewer” and “we can run this unattended,” which is where the cost savings are.
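
To make that gap concrete, assume a conversational pace of roughly 150 words per minute, a round figure for illustration only: an hour of audio then carries about 9,000 words. At 4.2% WER, that is on the order of 380 misrecognized words per hour for a reviewer to catch; at 2.3%, about 210. Neither is zero, but only one may sit below the threshold at which human review stops paying for itself.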

Artificial Analysis also tested “agent talk” speech aimed at voice assistants, where Scribe v2 and Gemini 3 Pro again lead at 1.6% and 1.7% WER. In practice, that pushes more interactions from keypresses into voice—especially in cars, warehouses, and field service—while moving the record of what was said into the vendor’s pipeline.

The benchmark’s top line is a decimal race, but the operational reality is simpler: when transcription becomes cheap and reliable, the call recording stops being a file and becomes an API bill.