OpenAI、推論・翻訳・文字起こしを実現する3つの新音声モデルをAPIで提供開始
YouTubeチャンネル「つくもち英語部」の連動コンテンツ置き場。
AI・データサイエンス分野の最新ニュースを題材に、エンジニア特有の英語表現を日英対訳で学べます。
動画スクリプトに加え、専門用語の解説や関連するG検定項目への内部リンクも併設。
「英語で技術情報をキャッチアップする力」を、技術知識と一緒に鍛えるのが狙いです。
理系のキャリアをグローバルに広げたい人のための、実務直結型の英語学習リソースです。
OpenAIは、GPT-5級の推論能力を持ち、自然な対話や複雑なツール実行が可能な「GPT-Realtime-2」など、3つの新しい音声モデルをAPIに追加しました。多言語のリアルタイム翻訳や低遅延の文字起こしに特化したモデルも含まれており、多言語カスタマーサポートや議事録作成など幅広い分野での活用が想定されています。これらのモデルは従来より高い精度と応答性能を誇り、有害コンテンツ検知などの安全性も備えながら、すでに不動産や教育などの実務で導入が進んでいます。
📖 英文と日本語訳(一文ずつ)
On May 7, 2026, OpenAI announced that it will add three new voice models to its developer API.
OpenAIは2026年5月7日、開発者向けAPIに3つの新しい音声モデルを追加すると発表しました
Three models are being introduced: "GPT-Realtime-2," the first voice model featuring GPT-5 level reasoning capabilities; "GPT-Realtime-Translate," which performs real-time translation from over 70 input languages into 13 output languages; and "GPT-Realtime-Whisper," a streaming speech recognition model that provides live transcription following the speaker’s speech.
今回投入されるのは、GPT-5級の推論能力を備えた初の音声モデル「GPT-Realtime-2」、70以上の入力言語から13の出力言語へリアルタイム翻訳を行う「GPT-Realtime-Translate」、そして話者の発話に追随してライブで文字起こしを行うストリーミング型音声認識モデル「GPT-Realtime-Whisper」の3種類です
The core GPT-Realtime-2 is designed to infer requests while maintaining a conversation, call tools, handle corrections and interruptions, and provide contextually appropriate responses.
中核となるGPT-Realtime-2は、会話を継続しながら要求を推論し、ツールを呼び出し、訂正や割り込みに対応し、状況にふさわしい応答を返せるよう設計されています
Specific features include a mechanism for inserting short preambles such as "Please wait a moment" before responding, a function to invoke multiple tools in parallel with audible transparency regarding their operations, and a recovery feature that naturally communicates messages like "It is taking a little longer to respond" instead of remaining silent during processing delays or failures.
具体的な機能として、応答前に「少々お待ちください」といった短い前置き(プリアンブル)を挿入できる仕組み、複数のツールを並列で呼び出しその動作を音声で透明化する機能、処理に失敗した際に黙り込まず「現在対応に少し時間がかかっています」などと自然に伝えるリカバリ機能が搭載されています
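上記の「前置きを挿入し、失敗時は黙り込まずリカバリ文を返す」という挙動は、たとえば次のようなパターンとして素描できます。これは記事のAPIそのものではなく、仕組みの考え方を示す最小限のスケッチです(タイムアウト閾値や文言は説明用の仮の値)。

```python
import asyncio

HOLD_MSG = "少々お待ちください"
RECOVER_MSG = "現在対応に少し時間がかかっています"

async def respond_with_preamble(tool_call, hold_after=0.1):
    """ツール呼び出しが長引いたら前置きを先に返し、失敗時はリカバリ文を返すスケッチ。"""
    messages = []
    task = asyncio.ensure_future(tool_call())
    try:
        # hold_after 秒以内に終わればそのまま結果だけを返す
        result = await asyncio.wait_for(asyncio.shield(task), timeout=hold_after)
        messages.append(result)
    except asyncio.TimeoutError:
        messages.append(HOLD_MSG)  # 応答前に短い前置き(プリアンブル)を挿入
        try:
            messages.append(await task)  # 本来の応答を待って追加
        except Exception:
            messages.append(RECOVER_MSG)  # 失敗しても黙り込まず自然に伝える
    except Exception:
        messages.append(RECOVER_MSG)
    return messages
```

即答できる場合は前置きなし、処理が長引く場合のみ前置きを挟む、という分岐を1つの関数にまとめた形です。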
The context window has been expanded from the previous 32K to 128K, supporting longer, more coherent sessions and complex task flows.
コンテキストウィンドウは従来の32Kから128Kへ拡張され、より長く一貫したセッションや複雑なタスクフローに対応します
Furthermore, the accuracy of preserving technical terms, proper nouns, and medical terminology has improved, and the controllability of tone has also been enhanced.
さらに、専門用語や固有名詞、医療用語などの保持精度が向上し、トーンの制御性も高まりました
Users can select from five levels of reasoning depth—Minimum, Low, Medium, High, and Maximum—with the default set to Low.
推論の深さについては最小・低・中・高・最高の5段階から選択でき、デフォルトは低に設定されています
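5段階の推論の深さをセッション設定として指定するイメージは、次のように書けます。ただし実際のフィールド名やモデル名の表記は記事からは確認できないため、ここでは仮に "reasoning_depth" や "gpt-realtime-2" としています(実装時は公式のAPIリファレンスを要確認)。

```python
# 仮定: フィールド名 "reasoning_depth"・モデル名表記 "gpt-realtime-2" は説明用の仮のもの
REASONING_LEVELS = ("minimum", "low", "medium", "high", "maximum")

def build_session_update(depth="low"):
    """推論の深さを含む session.update 風のペイロードを組み立てるスケッチ。"""
    if depth not in REASONING_LEVELS:
        raise ValueError(f"depth must be one of {REASONING_LEVELS}")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",   # 仮の表記
            "reasoning_depth": depth,    # 仮のフィールド名。デフォルトは低(low)
        },
    }
```

深さを上げるほど応答品質と引き換えに遅延が増える、というトレードオフを設定1つで切り替える想定です。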
In performance evaluations, it outperformed GPT-Realtime-1.5 by 15.2% on Big Bench Audio, which measures audio intelligence, and showed a 13.8% improvement on Audio MultiChallenge, which measures conversational instruction-following.
性能評価では、音声知能を測るBig Bench AudioでGPT-Realtime-1.5を15.2%上回り、対話の指示追従を測るAudio MultiChallengeでは13.8%の改善を示しています
GPT-Realtime-Translate enables the creation of multilingual audio experiences where each speaker can communicate in their preferred language while receiving real-time translated audio and transcriptions.
GPT-Realtime-Translateは、各話者がそれぞれ得意な言語で発話し、リアルタイムで翻訳音声と文字起こしを得られる多言語音声体験の構築を可能にします
It is expected to be utilized in areas such as customer support, cross-border sales, education, events, and media.
カスタマーサポートや越境セールス、教育、イベント、メディアなどでの活用が想定されています
Vimeo has released a demo that translates product introduction videos into multiple languages during playback, while Deutsche Telekom is proceeding with verification for its use in multilingual customer support.
Vimeoは製品紹介動画を再生中にそのまま多言語へ翻訳するデモを公開しており、Deutsche Telekomは多言語カスタマーサポートでの検証を進めています
GPT-Realtime-Whisper is a model specialized in low-latency speech-to-text conversion. It can be used for generating subtitles for meetings, lectures, broadcasts, and events, as well as for creating meeting minutes and developing voice agents that continuously understand user speech. Furthermore, it can be utilized to accelerate downstream workflows in high-frequency voice-based operations such as customer support, healthcare, sales, and recruitment.
GPT-Realtime-Whisperは低遅延の音声テキスト変換に特化したモデルで、会議や授業、放送、イベントの字幕生成、議事録作成、継続的にユーザー発話を理解する音声エージェント、カスタマーサポートやヘルスケア、営業、採用といった高頻度の音声業務での後続ワークフロー高速化などに用いることができます
OpenAI has identified three patterns for developer-built voice experiences: "voice-to-action," which involves reasoning through user requests and executing them via tools; "systems-to-voice," which proactively provides guidance based on system context; and "voice-to-voice," which maintains conversation continuity across different languages and contexts.
OpenAIは、開発者が構築する音声体験のパターンとして、ユーザーの依頼を推論しツールで実行する「voice-to-action」、システム側のコンテキストを能動的に音声で案内する「systems-to-voice」、言語や文脈をまたいで会話を継続させる「voice-to-voice」の3つを挙げています
Zillow is developing a voice assistant capable of responding to instructions such as "find a property on a low-traffic street within my BuyAbility range and book a Saturday viewing," while companies including Priceline, Intercom, Glean, and Foundation Health are also moving forward with its implementation.
Zillowは「BuyAbilityの範囲で交通量の少ない通り沿いの物件を探し、土曜の内見を予約する」といった指示に応える音声アシスタントを開発しており、Priceline、Intercom、Glean、Foundation Healthなども活用を進めています
Josh Weisberg of Zillow stated that prompt optimization improved call success rates from 69% to 95%—a 26-percentage-point increase—on the most challenging adversarial benchmarks.
Zillow社のJosh Weisberg氏は、最も難易度の高い敵対的ベンチマークにおいて、プロンプト最適化後のコール成功率が69%から95%へと26ポイント向上したと述べています
In terms of safety, the Realtime API features a built-in classifier that detects harmful content and terminates conversations, while developers can implement their own guardrails via the Agents SDK.
安全性の面では、Realtime APIに有害コンテンツを検知して会話を停止する分類器が組み込まれており、開発者はAgents SDKを通じて独自のガードレールを追加できます
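「有害コンテンツを検知したら会話を停止する」という分類器+ガードレールの考え方は、次のようなトイ実装で示せます。これはAgents SDKの実際のAPIではなく、仕組みを説明するための最小例です(ブロック語リストも説明用の仮のもの)。

```python
# 仮定: ブロック対象の語は説明用。実運用ではSDKのガードレール機構を使う想定
BLOCKLIST = ("暴力", "violence")

def guardrail(transcript: str) -> dict:
    """文字起こしを検査し、有害語を含む場合は会話停止(terminate)を指示するスケッチ。"""
    lowered = transcript.lower()
    hit = next((w for w in BLOCKLIST if w in lowered), None)
    if hit:
        return {"action": "terminate", "reason": f"blocked term: {hit}"}
    return {"action": "continue", "reason": None}
```

組み込みの分類器に加えて、この種のチェックをアプリ側の要件に合わせて重ねられる、というのがガードレールの位置づけです。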
EU data residency is also supported.
EUデータレジデンシーにも対応しています
Pricing for GPT-Realtime-2 is set at $32 per 1 million tokens for audio input ($0.40 for cached input) and $64 per 1 million tokens for audio output. GPT-Realtime-Translate is priced at $0.034 per minute, and GPT-Realtime-Whisper is set at $0.017 per minute.
価格はGPT-Realtime-2が音声入力100万トークンあたり32ドル(キャッシュ済み入力は0.40ドル)、音声出力100万トークンあたり64ドル、GPT-Realtime-Translateが1分あたり0.034ドル、GPT-Realtime-Whisperが1分あたり0.017ドルに設定されています
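記事中の単価からセッションコストを概算すると、次のようになります。トークン数や分数は説明用の仮の値です。

```python
# 記事中の単価(GPT-Realtime-2はトークン課金、他2モデルは分課金)
PRICE_AUDIO_IN = 32.0 / 1_000_000          # 音声入力 $/トークン
PRICE_AUDIO_IN_CACHED = 0.40 / 1_000_000   # キャッシュ済み入力 $/トークン
PRICE_AUDIO_OUT = 64.0 / 1_000_000         # 音声出力 $/トークン
PRICE_TRANSLATE_PER_MIN = 0.034            # GPT-Realtime-Translate $/分
PRICE_WHISPER_PER_MIN = 0.017              # GPT-Realtime-Whisper $/分

def realtime2_cost(in_tokens, out_tokens, cached_in_tokens=0):
    """GPT-Realtime-2 のセッションコスト(ドル)を概算する。"""
    return (in_tokens * PRICE_AUDIO_IN
            + cached_in_tokens * PRICE_AUDIO_IN_CACHED
            + out_tokens * PRICE_AUDIO_OUT)

# 例: 入力5万トークン+出力2.5万トークンのセッション
print(f"${realtime2_cost(50_000, 25_000):.2f}")   # $3.20
# 例: 60分のリアルタイム翻訳
print(f"${60 * PRICE_TRANSLATE_PER_MIN:.2f}")     # $2.04
```

キャッシュ済み入力が通常入力の1/80の単価なので、繰り返し参照されるコンテキストが多い用途ほどコスト差が大きく出る計算になります。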
All three models are available via the Realtime API, with support for trials in the Playground and implementation through Codex.
3モデルともRealtime APIで利用可能で、Playgroundでの試用やCodexからの導入もサポートされています
🎧 通し読み(全文)
リスニング・シャドーイング用の全文です。
On May 7, 2026, OpenAI announced that it will add three new voice models to its developer API. Three models are being introduced: "GPT-Realtime-2," the first voice model featuring GPT-5 level reasoning capabilities; "GPT-Realtime-Translate," which performs real-time translation from over 70 input languages into 13 output languages; and "GPT-Realtime-Whisper," a streaming speech recognition model that provides live transcription following the speaker’s speech. The core GPT-Realtime-2 is designed to infer requests while maintaining a conversation, call tools, handle corrections and interruptions, and provide contextually appropriate responses. Specific features include a mechanism for inserting short preambles such as "Please wait a moment" before responding, a function to invoke multiple tools in parallel with audible transparency regarding their operations, and a recovery feature that naturally communicates messages like "It is taking a little longer to respond" instead of remaining silent during processing delays or failures. The context window has been expanded from the previous 32K to 128K, supporting longer, more coherent sessions and complex task flows. Furthermore, the accuracy of preserving technical terms, proper nouns, and medical terminology has improved, and the controllability of tone has also been enhanced. Users can select from five levels of reasoning depth—Minimum, Low, Medium, High, and Maximum—with the default set to Low. In performance evaluations, it outperformed GPT-Realtime-1.5 by 15.2% on Big Bench Audio, which measures audio intelligence, and showed a 13.8% improvement on Audio MultiChallenge, which measures conversational instruction-following. GPT-Realtime-Translate enables the creation of multilingual audio experiences where each speaker can communicate in their preferred language while receiving real-time translated audio and transcriptions. 
It is expected to be utilized in areas such as customer support, cross-border sales, education, events, and media. Vimeo has released a demo that translates product introduction videos into multiple languages during playback, while Deutsche Telekom is proceeding with verification for its use in multilingual customer support. GPT-Realtime-Whisper is a model specialized in low-latency speech-to-text conversion. It can be used for generating subtitles for meetings, lectures, broadcasts, and events, as well as for creating meeting minutes and developing voice agents that continuously understand user speech. Furthermore, it can be utilized to accelerate downstream workflows in high-frequency voice-based operations such as customer support, healthcare, sales, and recruitment. OpenAI has identified three patterns for developer-built voice experiences: "voice-to-action," which involves reasoning through user requests and executing them via tools; "systems-to-voice," which proactively provides guidance based on system context; and "voice-to-voice," which maintains conversation continuity across different languages and contexts. Zillow is developing a voice assistant capable of responding to instructions such as "find a property on a low-traffic street within my BuyAbility range and book a Saturday viewing," while companies including Priceline, Intercom, Glean, and Foundation Health are also moving forward with its implementation. Josh Weisberg of Zillow stated that prompt optimization improved call success rates from 69% to 95%—a 26-percentage-point increase—on the most challenging adversarial benchmarks. In terms of safety, the Realtime API features a built-in classifier that detects harmful content and terminates conversations, while developers can implement their own guardrails via the Agents SDK. EU data residency is also supported. 
Pricing for GPT-Realtime-2 is set at $32 per 1 million tokens for audio input ($0.40 for cached input) and $64 per 1 million tokens for audio output. GPT-Realtime-Translate is priced at $0.034 per minute, and GPT-Realtime-Whisper is set at $0.017 per minute. All three models are available via the Realtime API, with support for trials in the Playground and implementation through Codex.
📝 学習のヒント
1. まず英文を読む — 知らない単語にあたりをつけてから音声へ。
2. 一文ずつ確認 — 日本語訳と照合し、構文を理解する。
3. 通し読み Normal で耳を作る — 内容を追いながらリピート。
4. Fast でシャドーイング — 口を慣らし、リスニング速度を上げる。
5. 翌日に復習 — 1日空けて再聴すると長期記憶に定着しやすい。
