5 Best Text-to-Speech Models Compared (ElevenLabs, OpenAI & More)

Last updated June 2026.

Text-to-speech has moved fast since this guide first ran. The frontier is now steerable, low-latency, multilingual speech: you describe the tone in plain language and get near-human delivery in well under a second. Below are five models worth knowing in 2026: four hosted APIs and one open-weight model you can self-host.

The five models at a glance

ElevenLabs: the expressiveness leader. Eleven v3 for emotional range across 70+ languages; Flash v2.5 for real-time agents at ~75 ms latency.
OpenAI gpt-4o-mini-tts: cheapest of the majors (~$0.015/min) and steerable by prompt — you instruct the tone in natural language, no SSML.
Deepgram Aura-2: enterprise real-time TTS with sub-200 ms time-to-first-byte; strong on numbers, dates and other spoken-form formatting.
Google Cloud TTS: the breadth play — Chirp 3 HD voices, Gemini-2.5 TTS, instant custom-voice cloning, and the widest language coverage.
Kokoro-82M: the open-weight surprise — an Apache-2.0, 82M-parameter model that runs on a small GPU (or CPU) and rivals far larger systems.

Quick comparison

Model	Latency	Languages	Indicative price	Best for
ElevenLabs v3 / Flash v2.5	Flash ~75 ms	v3 70+ / Flash 32	~$0.10 / 1K chars (v3); ~$0.05 (Flash)	Expressive narration, brand voices, voice cloning
OpenAI `gpt-4o-mini-tts`	Low (streaming)	50+	~$0.015 / min audio	Cheap, prompt-steerable speech in apps
Deepgram Aura-2	~90–200 ms TTFB	7 (EN, ES + 5)	$0.030 / 1K chars	Real-time voice agents, IVR, contact centres
Google Cloud TTS	Streaming	Chirp 3 ~31; Gemini-TTS 80+ locales	Usage-based (per char)	Multilingual reach, SSML control, custom voice
Kokoro-82M	Very fast (tiny model)	8	Free self-host; ~$1 / 1M chars hosted	On-device, cost-sensitive, self-hosted pipelines
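
To make the indicative prices easier to compare, the per-character rates in the table can be turned into a rough bill for a given workload. This is a minimal sketch, assuming the figures above are still current; the function name and the 1M-character workload are illustrative, and OpenAI is left out because it is priced per minute of audio rather than per character.

```python
# Indicative USD prices per 1,000 characters, copied from the comparison table.
# These are assumptions for illustration; check each vendor's pricing page.
PRICE_PER_1K_CHARS = {
    "ElevenLabs v3": 0.10,
    "ElevenLabs Flash v2.5": 0.05,
    "Deepgram Aura-2": 0.030,
    "Kokoro-82M (hosted)": 0.001,  # roughly $1 per 1M characters
}


def estimate_cost(model: str, characters: int) -> float:
    """Return the estimated cost in USD for synthesising `characters` characters."""
    return round(PRICE_PER_1K_CHARS[model] * characters / 1000, 2)


if __name__ == "__main__":
    for name in PRICE_PER_1K_CHARS:
        print(f"{name}: ${estimate_cost(name, 1_000_000):,.2f} per 1M characters")
```

At these rates a million characters costs about $100 on Eleven v3, $50 on Flash v2.5, $30 on Aura-2 and $1 on a hosted Kokoro endpoint, which is why volume tends to decide the choice before quality does.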

1. ElevenLabs — expressiveness and voice cloning

ElevenLabs remains the benchmark for emotional, human-like delivery. Its 2026 line-up is built around a few models for different trade-offs:

Eleven v3 — the most expressive model, with strong emotional range and contextual delivery across 70+ languages.
Multilingual v2 — the stable, high-quality workhorse (29 languages).
Flash v2.5 / Turbo v2.5 — latency-optimised (~75 ms and ~250–300 ms respectively) for voice agents and interactive use.

Voice cloning (both instant and professional) is the platform’s signature feature, letting you build a consistent brand or character voice from a short sample. Pricing is credit-based: on the API, Eleven v3 and Multilingual v2 run about $0.10 per 1K characters, while Flash and Turbo are roughly half that at $0.05 per 1K characters.

Best for: audiobooks, podcasts, e-learning, and any project where expressive, distinctive voices matter.

2. OpenAI `gpt-4o-mini-tts` — cheap and steerable

OpenAI’s current TTS model, gpt-4o-mini-tts (released March 2025), reframed the category around steerability: instead of SSML tags, you give a natural-language instruction (“speak like a calm, sympathetic support agent”) and the model adapts tone, pacing and emotion. It ships around a dozen named voices (Alloy, Ash, Ballad, Coral, Echo, Onyx, Nova, Sage, Shimmer and more) and covers 50+ languages.

It is also the cheapest of the major hosted APIs at roughly $0.015 per minute of generated audio (billed as about $0.60 per million input-text tokens, plus a separate charge for audio-output tokens). Output formats include MP3, Opus, AAC, FLAC, WAV and PCM. For full speech-to-speech conversations, OpenAI’s Realtime API uses the same voice stack.
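
As an illustration of prompt-level steering, here is a sketch that builds (but does not send) a request for Opus output. It assumes the speech endpoint lives at `https://api.openai.com/v1/audio/speech` and accepts `model`, `input`, `voice`, `instructions` and `response_format` fields; the helper name `build_speech_request` is hypothetical, so check the current API reference before relying on it.

```python
import json
import urllib.request

SPEECH_URL = "https://api.openai.com/v1/audio/speech"  # assumed endpoint
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}  # formats listed in this guide


def build_speech_request(text, instructions, api_key, voice="coral", response_format="opus"):
    """Build, without sending, a POST request asking for steered speech."""
    if response_format not in FORMATS:
        raise ValueError(f"unsupported format: {response_format}")
    payload = {
        "model": "gpt-4o-mini-tts",
        "input": text,
        "voice": voice,
        "instructions": instructions,  # plain-language tone direction, no SSML
        "response_format": response_format,
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` with a real key should yield the audio bytes; the tone lives entirely in `instructions`, for example "speak like a calm, sympathetic support agent".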

Best for: developers who want good, controllable English (and multilingual) speech at the lowest price, baked into an app.

3. Deepgram Aura-2 — built for real-time agents

Deepgram Aura-2 (launched April 2025) is purpose-built for production voice agents: sub-200 ms time-to-first-byte, optimised as low as ~90 ms. It is engineered to read spoken-form content correctly — numbers, dates, currencies, email addresses and the like — which matters for IVR, scheduling and support flows.
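
Time-to-first-byte is easy to measure for yourself on any streaming TTS response. The sketch below is vendor-neutral: it times how long an iterator of audio chunks takes to produce its first non-empty chunk. `fake_stream` is a hypothetical stand-in for a real streaming response, and no Deepgram API is called.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfb(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Consume a chunk stream; return (seconds until first non-empty chunk, all audio)."""
    start = time.perf_counter()
    ttfb = None
    audio = bytearray()
    for chunk in chunks:
        if chunk and ttfb is None:
            ttfb = time.perf_counter() - start
        audio.extend(chunk)
    if ttfb is None:
        raise ValueError("stream produced no audio")
    return ttfb, bytes(audio)


def fake_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(delay_s)  # simulated network and synthesis delay
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    ttfb, audio = measure_ttfb(fake_stream())
    print(f"TTFB: {ttfb * 1000:.0f} ms, {len(audio)} bytes")
```

For a fair number against a real service, the iterator should send the request lazily on its first step, so that the clock covers the network round trip as well as synthesis.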

The voice catalogue spans 40+ English voices (American, British, Australian, Irish and Filipino accents) and 10+ Spanish voices; as of December 2025 it also added Dutch, French, German, Italian and Japanese, for seven languages in total. Pricing is a simple flat $0.030 per 1K characters, all voices included.

Best for: low-latency conversational agents, contact centres and IVR where consistency and clear formatting beat theatrical expressiveness.

4. Google Cloud TTS — breadth and control

Google’s strength is coverage and integration. The 2026 stack includes Chirp 3: HD voices (generative, emotionally resonant voices across ~31 languages), Gemini-2.5 TTS (Flash and Pro, now GA, with prompt-controlled style across 80+ locales and multi-speaker output), and Chirp 3: Instant Custom Voice for rapid voice cloning.

It retains full SSML control (pauses, emphasis, pronunciation), long-form synthesis, and a wide format range. Output encodings include MP3, LINEAR16 (WAV) and OGG_OPUS — Google is one of the few that exposes Opus-in-Ogg directly (more on that below).
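
A sketch of what requesting Ogg Opus with SSML might look like over REST. It assumes the `text:synthesize` method takes a JSON body with `input`, `voice` and `audioConfig` fields and returns base64 audio in `audioContent`; the helper names are hypothetical and no request is sent.

```python
import base64
import json


def build_synthesize_body(text: str, language_code: str = "en-US", ssml: bool = False) -> dict:
    """Build a request body asking for OGG_OPUS output from plain text or SSML."""
    key = "ssml" if ssml else "text"
    return {
        "input": {key: text},
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": "OGG_OPUS"},
    }


def decode_audio(response_json: str) -> bytes:
    """Extract raw audio bytes from a synthesize response (assumed base64 field)."""
    return base64.b64decode(json.loads(response_json)["audioContent"])


if __name__ == "__main__":
    ssml = '<speak>Hello<break time="300ms"/>world</speak>'
    print(json.dumps(build_synthesize_body(ssml, ssml=True), indent=2))
```

The decoded bytes can be written straight to a `.ogg` file; swapping `OGG_OPUS` for `MP3` or `LINEAR16` should be the only change needed for the other encodings.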

Best for: multilingual products, accessibility tooling, and teams already on Google Cloud that want SSML precision and the broadest language list.

5. Kokoro-82M — the open-weight outlier

Kokoro-82M is the open-source story of the year. At just 82 million parameters with Apache-2.0 weights, it punches well above its size: the v1.0 release (January 2025) ships 54 voices across 8 languages, and it has held a ~44% win rate on the community TTS Arena — competitive with models many times larger.

Because it is tiny (weights under 1 GB; ~2–3 GB of GPU memory at inference, and it will even run on CPU or in-browser), Kokoro is ideal when you want to self-host and avoid per-character API costs. Hosted endpoints run around $1 per million characters; self-hosting is effectively free beyond compute. It outputs 24 kHz WAV, which you can transcode to MP3 or Opus with ffmpeg.
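
The transcoding step can be scripted. This is a minimal sketch, assuming an ffmpeg build with the `libopus` encoder on the PATH; the 32 kbps default and the function names are illustrative choices, not Kokoro recommendations.

```python
import shutil
import subprocess
from typing import List


def opus_command(wav_path: str, out_path: str, bitrate_kbps: int = 32) -> List[str]:
    """Build the ffmpeg argument list for WAV -> Opus (Ogg container via .ogg)."""
    return [
        "ffmpeg", "-y",              # overwrite the output file if it exists
        "-i", wav_path,              # input: the 24 kHz WAV from the model
        "-c:a", "libopus",           # encode audio with libopus
        "-b:a", f"{bitrate_kbps}k",  # target bitrate
        out_path,
    ]


def transcode(wav_path: str, out_path: str, bitrate_kbps: int = 32) -> None:
    """Run the conversion, failing clearly if ffmpeg is not installed."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(opus_command(wav_path, out_path, bitrate_kbps), check=True)
```

Calling `transcode("speech.wav", "speech.ogg")` would produce an Ogg Opus file; for MP3, the same pattern applies with a different encoder and extension.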

Best for: privacy-sensitive, offline, or high-volume pipelines where cost and control beat the polish of the hosted giants.

If you are exploring open models further, Dia (Nari Labs) is excellent for expressive, multi-speaker audio with nonverbal sounds, and Chatterbox (Resemble AI) adds voice cloning with tunable expressiveness.

How to choose

Most expressive / brand & character voices: ElevenLabs (v3), with voice cloning.
Cheapest, steerable, app-embedded: OpenAI gpt-4o-mini-tts.
Real-time voice agents & IVR: Deepgram Aura-2 (sub-200 ms time-to-first-byte, clean formatting).
Widest languages & SSML control: Google Cloud TTS.
Self-hosted, low-cost, private: Kokoro-82M (or Dia / Chatterbox).

The honest summary: there is no single “best” TTS model in 2026 — the right pick is set by your constraints on latency, language, expressiveness, price and where the audio is allowed to be processed.

Audio output formats (MP3, WAV, OGG Opus & more)

A recurring developer question is which TTS APIs support OGG Opus (the ogg_opus encoding) and other formats. Google Cloud TTS exposes it directly as the OGG_OPUS encoding; ElevenLabs, OpenAI and Deepgram Aura-2 all offer Opus output as well. Here is how the five compare:

Model	Supported output / export formats
ElevenLabs	MP3, PCM, WAV, Opus (`opus_48000_*`, 48 kHz, 32–192 kbps), µ-law, A-law
OpenAI `gpt-4o-mini-tts`	MP3, Opus, AAC, FLAC, WAV, PCM
Google Cloud TTS	MP3, LINEAR16 (WAV), `OGG_OPUS`, MULAW, ALAW
Deepgram Aura-2	Linear16 (WAV), MP3, Opus, FLAC, AAC, µ-law, A-law
Kokoro-82M	24 kHz WAV (self-host); transcode to MP3/Opus with ffmpeg

Does ElevenLabs support OGG Opus output?

Yes. ElevenLabs outputs Opus audio via its API using the opus_48000_* formats (48 kHz at 32, 64, 96, 128 or 192 kbps) — the same Opus codec used in .ogg/ogg_opus files — alongside MP3, PCM, WAV and µ-law/A-law.
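
The format names follow a simple pattern, so they can be built and validated in code. The sketch below assumes the format is passed as an `output_format` query parameter on a `/v1/text-to-speech/{voice_id}` endpoint; treat the URL, the parameter name and the helper as assumptions to confirm against the API reference. `VOICE_ID` is a placeholder.

```python
from urllib.parse import urlencode

BITRATES_KBPS = (32, 64, 96, 128, 192)  # bitrates listed in this guide


def elevenlabs_opus_url(voice_id: str, bitrate_kbps: int = 64) -> str:
    """Return a text-to-speech URL requesting 48 kHz Opus at the given bitrate."""
    if bitrate_kbps not in BITRATES_KBPS:
        raise ValueError(f"bitrate must be one of {BITRATES_KBPS}")
    query = urlencode({"output_format": f"opus_48000_{bitrate_kbps}"})
    # Assumed endpoint shape; voice_id identifies the voice to synthesise with.
    return f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?{query}"


if __name__ == "__main__":
    print(elevenlabs_opus_url("VOICE_ID", 96))
```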

FAQs

Which text-to-speech API is the best in 2026?

It depends on the constraint that matters most. For expressive narration and voice cloning, ElevenLabs (Eleven v3). For the lowest price and prompt-level control inside an app, OpenAI gpt-4o-mini-tts. For real-time voice agents, Deepgram Aura-2. For the widest language coverage and SSML control, Google Cloud TTS. For a free, self-hosted option, the open-weight Kokoro-82M.

What is the best open-source text-to-speech model?

Kokoro-82M is the standout for its size-to-quality ratio and permissive Apache-2.0 licence. For expressive, multi-speaker audio consider Dia (Nari Labs); for voice cloning with tunable expressiveness, Chatterbox (Resemble AI).