Technology

OpenAI expands Realtime API with new voice models

GPT-Realtime-2 adds conversational reasoning while Translate and Whisper handle live speech; voice assistants become metered infrastructure for customer contact


OpenAI launches new voice intelligence features in its API | TechCrunch (techcrunch.com)

OpenAI adds real-time voice and translation models to its API; GPT-Realtime-2 and GPT-Realtime-Whisper target call centres and live interfaces; the product shift turns conversations into billable minutes and auditable logs

OpenAI on Thursday expanded its Realtime API with new “voice intelligence” models aimed at developers building apps that can speak with users and transcribe and translate live conversations, according to TechCrunch. The release includes GPT‑Realtime‑2 for interactive voice, GPT‑Realtime‑Translate for real-time translation, and GPT‑Realtime‑Whisper for live speech-to-text transcription. OpenAI says the goal is to move voice systems beyond simple call-and-response into interfaces that can listen, reason, translate, transcribe and take action during a conversation.
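For developers, the mechanics likely resemble the existing Realtime API: open a WebSocket session, configure it, then stream audio and events in both directions. Here is a minimal sketch, assuming the new models use the current Realtime API's WebSocket protocol; the model string, the beta header and the event fields are extrapolations from today's API, not confirmed details of this release.

```python
# A minimal sketch, assuming the new models speak the existing Realtime
# API's WebSocket protocol. Model string, beta header and event fields
# are assumptions, not confirmed details of the new release.
import json
import os

import websocket  # pip install websocket-client

MODEL = "gpt-realtime-2"  # assumed API identifier for GPT-Realtime-2

ws = websocket.create_connection(
    f"wss://api.openai.com/v1/realtime?model={MODEL}",
    header=[
        f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta: realtime=v1",  # required by the current Realtime API
    ],
)

# Configure the session for speech in / speech out.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],
        "instructions": "You are a customer-service voice agent.",
    },
}))

# Audio capture loop omitted: the client streams base64 audio chunks,
# then asks the model for a spoken reply.
ws.send(json.dumps({"type": "response.create"}))

while True:
    event = json.loads(ws.recv())  # transcripts, audio deltas, errors
    print(event.get("type"))
    if event.get("type") in ("response.done", "error"):
        break

ws.close()
```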

The product packaging matters as much as the models. Two of the three new capabilities are billed by the minute, while GPT‑Realtime‑2 is billed by token consumption, TechCrunch reports, pushing voice interaction toward the same metered economics as cloud compute. That pricing structure fits customer-service use cases OpenAI is explicitly targeting, where companies can measure “handle time” and attach a cost to each human-like exchange. It also makes the raw material of the service—streams of user speech—hard to treat as incidental: audio becomes the unit of work and, by necessity, the unit of retention, debugging and compliance.
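The difference between the two billing modes is easy to make concrete. The sketch below compares a per-minute meter with a per-token meter for a single support call; all rates are hypothetical placeholders, since the article reports the billing units but not the prices.

```python
# Back-of-envelope cost model for metered voice. All rates are
# hypothetical; the article gives the billing units, not the prices.
PER_MINUTE_RATE = 0.06     # $/min, assumed, for the minute-billed models
TOKEN_RATE_PER_1K = 0.02   # $/1k tokens, assumed, for GPT-Realtime-2

def minute_billed_cost(call_minutes: float) -> float:
    """Cost when audio time itself is the unit of work."""
    return call_minutes * PER_MINUTE_RATE

def token_billed_cost(audio_tokens: int, text_tokens: int) -> float:
    """Cost when the session is metered by token consumption."""
    return (audio_tokens + text_tokens) / 1000 * TOKEN_RATE_PER_1K

# A six-minute support call: handle time maps directly to spend.
print(f"per-minute: ${minute_billed_cost(6):.2f}")
print(f"per-token:  ${token_billed_cost(9000, 1200):.2f}")
```

Under the per-minute model, every second a customer stays on the line is a line item, which is precisely why handle time becomes a metric vendors and buyers can both optimise against.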

OpenAI says it has guardrails intended to prevent misuse for spam, fraud or online abuse, including triggers that can halt conversations that violate its harmful-content guidelines. In practice, real-time enforcement requires real-time inspection: the system has to monitor what is being said to decide when to stop it. That is a different posture from a text chatbot where the user can choose to paste only what they want processed. Voice interfaces tend to pull in the surrounding world—background voices, names, addresses, account numbers—because that is what people say when they are not typing.
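In code, that inspection loop might look like the sketch below: a rolling transcript window classified as the conversation happens, with a halt when it is flagged. This is an illustration of the pattern the article describes, not OpenAI's documented enforcement design; the Moderations endpoint is real, but the cadence, the transcript window and the session-closing hook are assumptions.

```python
# A minimal sketch of the inspection loop real-time enforcement implies.
# The Moderations endpoint exists; the cadence, transcript window and
# session.close() halt hook are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def should_halt(transcript_window: str) -> bool:
    """Classify the latest slice of conversation against content policy."""
    result = client.moderations.create(input=transcript_window)
    return result.results[0].flagged

def on_transcript_delta(session, transcript_window: str) -> None:
    # To stop a harmful call mid-stream, every utterance has to be
    # inspected as it arrives; that is why the audio itself becomes a
    # retention and compliance question rather than incidental exhaust.
    if should_halt(transcript_window):
        session.close()  # hypothetical: terminate the voice session
```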

The company is also positioning the models for education, media, events and creator platforms, sectors where “live” often means high stakes and high sensitivity: classrooms, interviews, backstage comms and audience interactions. Translation at conversational pace—OpenAI says the model supports more than 70 input languages and 13 output languages—lowers the barrier to cross-border services, but it also centralises the intermediary. When the same vendor provides the voice, the transcript and the translation, the vendor becomes the choke point for latency, policy enforcement and failure modes.

OpenAI’s announcement is a reminder that the AI race is not only about bigger models; it is about owning the interface layer where users speak rather than click. In that world, the “app” is less a screen than a continuous audio session with a meter running.

OpenAI’s new voice models are sold as tools for smoother conversation. They are also sold by the minute.