{"id":758,"date":"2025-12-29T04:54:54","date_gmt":"2025-12-29T12:54:54","guid":{"rendered":"https:\/\/51ai.website\/?p=758"},"modified":"2026-01-01T21:21:17","modified_gmt":"2026-01-02T05:21:17","slug":"voice-developers","status":"publish","type":"post","link":"https:\/\/51ai.website\/en\/blog\/voice-developers\/","title":{"rendered":"OpenAI Updates for Voice Developers"},"content":{"rendered":"<p class=\"has-medium-font-size wp-block-paragraph\">New audio model snapshots and broader access to Custom Voices for production voice apps.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">AI audio capabilities unlock an exciting new frontier of user experiences. Earlier this year we released several new audio models, including gpt-realtime, along with new API features to enable developers to build these experiences.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Last week, we released new audio model snapshots designed to address some of the common challenges in building reliable audio agents by improving reliability and quality across production voice workflows\u2013from transcription and text-to-speech to real-time, natively speech-to-speech agents.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">These updates include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\"><a href=\"https:\/\/platform.openai.com\/docs\/models\/gpt-4o-mini-transcribe\" target=\"_blank\" rel=\"noopener\"><code>gpt-4o-mini-transcribe-2025-12-15<\/code><\/a>\u00a0for speech-to-text with the Transcription or Realtime API<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><a href=\"https:\/\/platform.openai.com\/docs\/models\/gpt-4o-mini-tts\" target=\"_blank\" rel=\"noopener\"><code>gpt-4o-mini-tts-2025-12-15<\/code><\/a>\u00a0for text-to-speech with the Speech API<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><a href=\"https:\/\/platform.openai.com\/docs\/models\/gpt-realtime-mini\" target=\"_blank\" 
rel=\"noopener\"><code>gpt-realtime-mini-2025-12-15<\/code><\/a>\u00a0for native, real-time speech-to-speech with the Realtime API<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><a href=\"https:\/\/platform.openai.com\/docs\/models\/gpt-audio-mini\" target=\"_blank\" rel=\"noopener\"><code>gpt-audio-mini-2025-12-15<\/code><\/a>\u00a0for native speech-to-speech with the Chat Completions API<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">The new snapshots share a few common improvements:<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\"><strong>With audio input:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">Lower word-error rates for real-world and noisy audio<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Fewer hallucinations during silence or with background noise<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\"><strong>With audio output:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">More natural and stable voice output, including when using Custom Voices<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Pricing remains the same as previous model snapshots, so we recommend switching to these new snapshots to benefit from improved performance for the same price.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">If you\u2019re building voice agents, customer support systems, or branded voice experiences, these updates will help you make production deployments more reliable. 
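As a quick reference, the four dated snapshots listed above map one-to-one onto OpenAI's audio APIs. The dictionary and helper below are illustrative conveniences (not SDK features) for pinning the snapshot to use per workload:

```python
# The four 2025-12-15 snapshots from this announcement, keyed by use case.
# The mapping mirrors the list above; names of the dict and helper are ours.
SNAPSHOTS_2025_12_15 = {
    "transcription": "gpt-4o-mini-transcribe-2025-12-15",  # Transcription / Realtime API
    "speech": "gpt-4o-mini-tts-2025-12-15",                # Speech API (text-to-speech)
    "realtime": "gpt-realtime-mini-2025-12-15",            # Realtime API (speech-to-speech)
    "chat_audio": "gpt-audio-mini-2025-12-15",             # Chat Completions API (audio in/out)
}

def pin_snapshot(use_case: str) -> str:
    """Return the dated model snapshot to pin for a given voice use case."""
    return SNAPSHOTS_2025_12_15[use_case]
```

Pinning dated snapshots (rather than floating aliases) keeps production behavior stable until you have re-run your own evaluations.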
Below, we\u2019ll break down what\u2019s new and how these improvements show up in real-world voice workflows.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-medium-font-size\" id=\"speech-to-speech\"><strong>Speech-to-speech<\/strong><\/h2>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">We\u2019re deploying new Realtime mini and Audio mini models that have been optimized for better tool calling and instruction following. These models reduce the intelligence gap between the mini and full-size models, enabling some applications to optimize cost by moving to the mini model.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\" id=\"gpt-realtime-mini-2025-12-15\"><code>gpt-realtime-mini-2025-12-15<\/code><\/h3>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">The <code>gpt-realtime-mini<\/code> model is meant to be used with the Realtime API, our API for low-latency, native multi-modal interactions. It supports features like streaming audio in and out, handling interruptions (with optional voice activity detection), and function calling in the background while the model keeps talking.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">The new Realtime mini snapshot is better suited for real-time agents, with clear gains in instruction following and tool calling. 
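To make the tool-calling setup concrete, here is a sketch of a <code>session.update</code> event configuring a Realtime agent on the new mini snapshot. The <code>lookup_order</code> tool is hypothetical, and the event and field names follow the Realtime API's documented session shape as we recall it; verify them against the current API reference:

```python
import json

# Sketch of a Realtime API session configuration (not a live connection).
# Field names follow the session.update event shape from the Realtime API
# docs; the tool below is a hypothetical app-side function.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-mini-2025-12-15",
        "instructions": "You are a concise voice assistant for order support.",
        "turn_detection": {"type": "server_vad"},  # server-side voice activity detection
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",  # hypothetical tool your backend implements
                "description": "Fetch an order's status by its ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }
        ],
    },
}

# Serialized, this is the event you would send over the Realtime WebSocket.
payload = json.dumps(session_update)
```

With server-side voice activity detection enabled, interruptions are handled for you, and the model can call <code>lookup_order</code> in the background while continuing to speak.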
On our internal speech-to-speech evaluations, we\u2019ve seen an improvement of 18.6 percentage points in instruction-following accuracy and 12.9 percentage points in tool-calling accuracy compared to the previous snapshot, as well as an improvement on the Big Bench Audio benchmark.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"886\" height=\"436\" src=\"https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image.png\" alt=\"\" class=\"wp-image-760\" srcset=\"https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image.png 886w, https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-300x148.png 300w, https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-768x378.png 768w, https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-18x9.png 18w\" sizes=\"auto, (max-width: 886px) 100vw, 886px\" \/><\/figure>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Together, these gains lead to more reliable multi-step interactions and more consistent function execution in live, low-latency settings.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">For scenarios where agent accuracy is worth a higher cost, gpt-realtime remains our best performing model. 
But when cost and latency matter most, gpt-realtime-mini is a great option, performing well in real-world scenarios.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">For example, Genspark stress-tested it on bilingual translation and intelligent intent routing, and in addition to the improved voice quality, they found latency near-instant and intent recognition spot-on throughout rapid exchanges.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\" id=\"gpt-audio-mini-2025-12-15\"><code>gpt-audio-mini-2025-12-15<\/code><\/h3>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">The gpt-audio-mini model can be used with the Chat Completions API for speech-to-speech use cases where real-time interaction isn\u2019t a requirement.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Both new snapshots also feature an upgraded decoder for more natural-sounding voices, and better maintain voice consistency when used with Custom Voices.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-medium-font-size\" id=\"text-to-speech\">Text-to-speech<\/h2>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Our latest text-to-speech model, gpt-4o-mini-tts-2025-12-15, delivers a significant jump in accuracy, with substantially lower word error rates across standard speech benchmarks compared to the previous generation. 
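Pinning the dated text-to-speech snapshot is a one-line change per request. A minimal sketch using the OpenAI Python SDK's <code>audio.speech.create</code> call shape; the voice name and text are placeholders, so check the Speech API reference for current voice options:

```python
# Build the request parameters for a TTS call pinned to the new snapshot.
# Kept as a plain dict so the shape can be inspected without network access.
def build_speech_request(text: str) -> dict:
    return {
        "model": "gpt-4o-mini-tts-2025-12-15",
        "voice": "alloy",  # or a Custom Voice, where your account is eligible
        "input": text,
    }

# Usage (requires OPENAI_API_KEY; not executed here):
#   from openai import OpenAI
#   client = OpenAI()
#   audio = client.audio.speech.create(**build_speech_request("Your order has shipped."))
```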
On Common Voice and FLEURS, we see roughly 35% lower WER, with consistent gains on Multilingual LibriSpeech as well.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"890\" height=\"411\" src=\"https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-1.png\" alt=\"\" class=\"wp-image-761\" srcset=\"https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-1.png 890w, https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-1-300x139.png 300w, https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-1-768x355.png 768w, https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-1-18x8.png 18w\" sizes=\"auto, (max-width: 890px) 100vw, 890px\" \/><\/figure>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Together, these results reflect improved pronunciation accuracy and robustness across a wide range of languages.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Similar to the new gpt-realtime-mini snapshot, this model sounds much more natural and performs better with Custom Voices.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-medium-font-size\" id=\"speech-to-text\">Speech-to-text<\/h2>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">The latest transcription model, gpt-4o-mini-transcribe-2025-12-15, shows strong gains in both accuracy and reliability. On standard ASR benchmarks like Common Voice and FLEURS (without language hints), it delivers lower word error rates than prior models. We\u2019ve optimized this model for real-world conversational settings, such as short user utterances and noisy backgrounds. 
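Word error rate, the metric behind these benchmark comparisons, is the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. A minimal reference implementation for spot-checking transcription quality on your own audio:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance over the (non-empty) reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

A perfect transcript scores 0.0; one missing word out of a three-word reference scores 1/3. Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is exactly what hallucination during silence looks like in this metric.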
In an internal hallucination-with-noise evaluation, where we played clips of real-world background noise and audio with varying speaking intervals (including silence), the model produced ~90% fewer hallucinations compared to Whisper v2 and ~70% fewer compared to previous GPT-4o-transcribe models.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"883\" height=\"287\" src=\"https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-2.png\" alt=\"\" class=\"wp-image-762\" srcset=\"https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-2.png 883w, https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-2-300x98.png 300w, https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-2-768x250.png 768w, https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/image-2-18x6.png 18w\" sizes=\"auto, (max-width: 883px) 100vw, 883px\" \/><\/figure>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">This model snapshot is particularly strong in Chinese (Mandarin), Hindi, Bengali, Japanese, Indonesian, and Italian.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-medium-font-size\" id=\"custom-voices\">Custom Voices<\/h2>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Custom Voices enable organizations to connect with customers in their unique brand voice. Whether you\u2019re building a customer support agent or a brand avatar, OpenAI\u2019s custom voice technology makes it easy to create distinct, realistic voices.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">These new speech-to-speech and text-to-speech models unlock improvements for custom voices such as more natural tones, increased faithfulness to the original sample, and improved accuracy across dialects.\u00a0<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">To ensure safe use of this technology, Custom Voices are limited to eligible customers. 
Contact your account director or reach out to our sales team to learn more.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-medium-font-size\" id=\"from-prototype-to-production\">From prototype to production<\/h2>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Voice apps tend to fail in the same places: long conversations, edge cases like silence, and tool-driven flows where the voice agent needs to be precise. These updates target those failure modes with lower error rates, fewer hallucinations, more consistent tool use, and better instruction following. As a bonus, we\u2019ve improved the stability of the output audio so your voice experiences can sound more natural.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">If you\u2019re shipping voice experiences today, we recommend moving to the new 2025-12-15 snapshots and re-running your key production test cases. Early testers have reported noticeable improvements from simply switching to the new snapshots without changing their instructions, but we still recommend experimenting with your own use cases and adjusting your prompts as needed.<\/p>","protected":false},"excerpt":{"rendered":"<p>New audio model snapshots, and broader access to Custom Voices for production voice apps. AI audio capabilities unlock an exciting new 
[&hellip;]<\/p>","protected":false},"author":2,"featured_media":759,"comment_status":"open","ping_status":"open","sticky":false,"template":"wp-custom-template","format":"standard","meta":{"_acf_changed":false,"_crdt_document":"","_uag_custom_page_level_css":"","footnotes":""},"categories":[9],"tags":[],"class_list":["post-758","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog"],"acf":[],"views":347,"uagb_featured_image_src":{"full":["https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/updates-audio-blog.webp",1600,590,false],"thumbnail":["https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/updates-audio-blog-150x150.webp",150,150,true],"medium":["https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/updates-audio-blog-300x111.webp",300,111,true],"medium_large":["https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/updates-audio-blog-768x283.webp",768,283,true],"large":["https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/updates-audio-blog-1024x378.webp",1024,378,true],"1536x1536":["https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/updates-audio-blog-1536x566.webp",1536,566,true],"2048x2048":["https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/updates-audio-blog.webp",1600,590,false],"trp-custom-language-flag":["https:\/\/51ai.website\/wp-content\/uploads\/2025\/12\/updates-audio-blog-18x7.webp",18,7,true]},"uagb_author_info":{"display_name":"stark, tony","author_link":"https:\/\/51ai.website\/en\/author\/admin\/"},"uagb_comment_info":10,"uagb_excerpt":"\u65b0\u7684\u97f3\u9891\u6a21\u578b\u5feb\u7167\u4ee5\u53ca\u751f\u4ea7\u8bed\u97f3\u5e94\u7528\u7a0b\u5e8f\u5bf9\u81ea\u5b9a\u4e49\u8bed\u97f3\u66f4\u5e7f\u6cdb\u7684\u8bbf\u95ee\u6743\u9650\u3002 
\u4eba\u5de5\u667a\u80fd\u97f3\u9891\u529f\u80fd\u5f00\u542f\u4e86\u7528\u6237\u4f53\u9a8c\u4ee4\u4eba\u5174\u594b\u7684\u65b0&hellip;","_links":{"self":[{"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/posts\/758","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/comments?post=758"}],"version-history":[{"count":7,"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/posts\/758\/revisions"}],"predecessor-version":[{"id":769,"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/posts\/758\/revisions\/769"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/media\/759"}],"wp:attachment":[{"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/media?parent=758"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/categories?post=758"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/51ai.website\/en\/wp-json\/wp\/v2\/tags?post=758"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}