Last updated June 2026.
Text-to-speech has moved fast since this guide first ran. The frontier is now steerable, low-latency, multilingual speech: you describe the tone in plain language and get near-human delivery in well under a second. Below are five models worth knowing in 2026: four hosted APIs and one open-weight model you can self-host.
The five models at a glance
- ElevenLabs: the expressiveness leader. Eleven v3 for emotional range across 70+ languages; Flash v2.5 for ~75 ms real-time agents.
- OpenAI gpt-4o-mini-tts: cheapest of the majors (~$0.015/min) and steerable by prompt; you instruct the tone in natural language, no SSML.
- Deepgram Aura-2: enterprise real-time TTS with sub-200 ms time-to-first-byte; strong on numbers, dates and other spoken-form formatting.
- Google Cloud TTS: the breadth play — Chirp 3 HD voices, Gemini-2.5 TTS, instant custom-voice cloning, and the widest language coverage.
- Kokoro-82M: the open-weight surprise — an Apache-2.0, 82M-parameter model that runs on a small GPU (or CPU) and rivals far larger systems.
Quick comparison
| Model | Latency | Languages | Indicative price | Best for |
|---|---|---|---|---|
| ElevenLabs v3 / Flash v2.5 | Flash ~75 ms | v3 70+ / Flash 32 | ~$0.10 / 1K chars (v3); ~$0.05 (Flash) | Expressive narration, brand voices, voice cloning |
| OpenAI gpt-4o-mini-tts | Low (streaming) | 50+ | ~$0.015 / min audio | Cheap, prompt-steerable speech in apps |
| Deepgram Aura-2 | ~90–200 ms TTFB | 7 (EN, ES + 5) | $0.030 / 1K chars | Real-time voice agents, IVR, contact centres |
| Google Cloud TTS | Streaming | Chirp 3 ~31; Gemini-TTS 80+ locales | Usage-based (per char) | Multilingual reach, SSML control, custom voice |
| Kokoro-82M | Very fast (tiny model) | 8 | Free self-host; ~$1 / 1M chars hosted | On-device, cost-sensitive, self-hosted pipelines |
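To turn the indicative prices above into a rough budget, a back-of-envelope calculator can help. The sketch below is illustrative only: the per-1K-character rates are copied from the table, while the 900 characters-per-minute speaking rate used to convert OpenAI's per-minute price is an assumption (roughly 150 words per minute), not a published figure. Google Cloud TTS is left out because the table gives no single rate for it.

```python
# Indicative prices from the comparison table, in USD per 1,000 characters.
PRICE_PER_1K_CHARS = {
    "ElevenLabs v3": 0.10,
    "ElevenLabs Flash v2.5": 0.05,
    "Deepgram Aura-2": 0.030,
    "Kokoro-82M (hosted)": 0.001,  # ~$1 per 1M characters
}

OPENAI_PRICE_PER_MINUTE = 0.015  # gpt-4o-mini-tts, USD per minute of audio
ASSUMED_CHARS_PER_MINUTE = 900   # ASSUMPTION: ~150 words/min at ~6 chars/word


def estimate_costs(num_chars: int) -> dict[str, float]:
    """Return a rough USD cost per model for synthesising num_chars characters."""
    costs = {
        name: num_chars / 1000 * rate
        for name, rate in PRICE_PER_1K_CHARS.items()
    }
    # Convert characters to minutes of audio using the assumed speaking rate.
    minutes = num_chars / ASSUMED_CHARS_PER_MINUTE
    costs["OpenAI gpt-4o-mini-tts"] = minutes * OPENAI_PRICE_PER_MINUTE
    return costs


if __name__ == "__main__":
    # Print the cheapest option first for one million characters.
    ranked = sorted(estimate_costs(1_000_000).items(), key=lambda kv: kv[1])
    for name, usd in ranked:
        print(f"{name:<26} ${usd:,.2f}")
```

Changing `ASSUMED_CHARS_PER_MINUTE` shifts only the OpenAI estimate, so it is worth measuring on your own text before relying on the comparison.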
1. ElevenLabs — expressiveness and voice cloning
ElevenLabs remains the benchmark for emotional, human-like delivery. Its 2026 line-up is built around a few models for different trade-offs:
- Eleven v3 — the most expressive model, with strong emotional range and contextual delivery across 70+ languages.
- Multilingual v2 — the stable, high-quality workhorse (29 languages).
- Flash v2.5 / Turbo v2.5 — latency-optimised (~75 ms and ~250–300 ms respectively) for voice agents and interactive use.
Voice cloning (both instant and professional) is the platform’s signature feature, letting you build a consistent brand or character voice from a short sample. Pricing is credit-based: on the API, Multilingual v2 and Eleven v3 run about $0.10 per 1K characters, while Flash and Turbo are roughly half that at $0.05 per 1K characters.
Best for: audiobooks, podcasts, e-learning, and any project where expressive, distinctive voices matter.
2. OpenAI gpt-4o-mini-tts — cheap and steerable
OpenAI’s current TTS model, gpt-4o-mini-tts (released March 2025), reframed the category around steerability: instead of SSML tags, you give a natural-language instruction (“speak like a calm, sympathetic support agent”) and the model adapts tone, pacing and emotion. It ships around a dozen named voices (Alloy, Ash, Ballad, Coral, Echo, Onyx, Nova, Sage, Shimmer and more) and covers 50+ languages.
It is also the cheapest of the major hosted APIs at roughly $0.015 per minute of generated audio (about $0.60 per million input-text tokens plus audio-output tokens). Output formats include MP3, Opus, AAC, FLAC, WAV and PCM. For full speech-to-speech conversations, OpenAI’s Realtime API uses the same voice stack.
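As a rough illustration of prompt-level steering, the sketch below builds a request in which the tone is a plain-language `instructions` string rather than SSML. It uses only the standard library. The endpoint URL and field names reflect OpenAI's public speech endpoint as commonly documented and should be checked against the current API reference; the helper names `build_speech_request` and `synthesize` are hypothetical.

```python
import json
import os
import urllib.request

# ASSUMPTION: public speech endpoint; verify against the current API reference.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, instructions: str, voice: str = "coral",
                         audio_format: str = "mp3") -> dict:
    """Build the JSON body for a steerable gpt-4o-mini-tts request."""
    return {
        "model": "gpt-4o-mini-tts",
        "input": text,
        "voice": voice,
        "instructions": instructions,  # plain-language tone direction, no SSML
        "response_format": audio_format,
    }


def synthesize(text: str, instructions: str, out_path: str) -> None:
    """Send the request and save the audio bytes (needs OPENAI_API_KEY set)."""
    body = json.dumps(build_speech_request(text, instructions)).encode("utf-8")
    request = urllib.request.Request(
        SPEECH_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

A call such as `synthesize("Your refund is on its way.", "Speak like a calm, sympathetic support agent.", "reply.mp3")` would write the audio to disk; swapping the instruction string changes the delivery without touching the input text.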
Best for: developers who want good, controllable English (and multilingual) speech at the lowest price, baked into an app.
3. Deepgram Aura-2 — built for real-time agents
Deepgram Aura-2 (launched April 2025) is purpose-built for production voice agents: sub-200 ms time-to-first-byte, optimised as low as ~90 ms. It is engineered to read spoken-form content correctly — numbers, dates, currencies, email addresses and the like — which matters for IVR, scheduling and support flows.
The voice catalogue spans 40+ English voices (American, British, Australian, Irish and Filipino accents) and 10+ Spanish voices; in December 2025 it added Dutch, French, German, Italian and Japanese, for seven languages in total. Pricing is a simple flat $0.030 per 1K characters, all voices included.
Best for: low-latency conversational agents, contact centres and IVR where consistency and clear formatting beat theatrical expressiveness.
4. Google Cloud TTS — breadth and control
Google’s strength is coverage and integration. The 2026 stack includes Chirp 3: HD voices (generative, emotionally resonant voices across ~31 languages), Gemini-2.5 TTS (Flash and Pro, now GA, with prompt-controlled style across 80+ locales and multi-speaker output), and Chirp 3: Instant Custom Voice for rapid voice cloning.
It retains full SSML control (pauses, emphasis, pronunciation), long-form synthesis, and a wide format range. Output encodings include MP3, LINEAR16 (WAV) and OGG_OPUS — Google is one of the few that exposes Opus-in-Ogg directly (more on that below).
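For readers new to SSML, here is a minimal sketch of the kind of markup that gives control over pauses and emphasis. It only builds the SSML string using standard `<speak>`, `<break>` and `<emphasis>` elements; the helper name `build_ssml` is hypothetical, and support for individual tags can vary by voice, so treat it as a starting point rather than a guaranteed recipe.

```python
from typing import Optional
from xml.sax.saxutils import escape


def build_ssml(sentences: list[str], pause_ms: int = 400,
               emphasize: Optional[str] = None) -> str:
    """Join sentences into one SSML document with a pause between each."""
    parts = []
    for sentence in sentences:
        text = escape(sentence)  # escape &, < and > so the XML stays valid
        if emphasize:
            word = escape(emphasize)
            # Simplification: wraps every occurrence of the substring.
            text = text.replace(
                word, f'<emphasis level="strong">{word}</emphasis>'
            )
        parts.append(text)
    pause = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + pause.join(parts) + "</speak>"


if __name__ == "__main__":
    print(build_ssml(["Terms & conditions apply.", "Call now."],
                     pause_ms=500, emphasize="now"))
```

The resulting string would be sent as the SSML input of a synthesis request in place of plain text.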
Best for: multilingual products, accessibility tooling, and teams already on Google Cloud that want SSML precision and the broadest language list.
5. Kokoro-82M — the open-weight outlier
Kokoro-82M is the open-source story of the year. At just 82 million parameters with Apache-2.0 weights, it punches well above its size: the v1.0 release (January 2025) ships 54 voices across 8 languages, and it has held a ~44% win rate on the community TTS Arena — competitive with models many times larger.
Because it is tiny (weights under 1 GB; ~2–3 GB GPU at inference, and it will even run on CPU or in-browser), Kokoro is ideal when you want to self-host and avoid per-character API costs. Hosted endpoints run around $1 per million characters; self-hosting is effectively free beyond compute. It outputs 24 kHz WAV, which you can transcode to MP3 or Opus with ffmpeg.
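The transcoding step can be sketched as a thin wrapper around ffmpeg. This assumes an `ffmpeg` binary on the PATH that was built with the `libmp3lame` and `libopus` encoders; the default bitrates are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path

# Encoder and an illustrative default bitrate for each target extension.
CODECS = {
    ".mp3": ("libmp3lame", "128k"),
    ".ogg": ("libopus", "64k"),
    ".opus": ("libopus", "64k"),
}


def ffmpeg_command(wav_path: str, out_path: str) -> list[str]:
    """Build an ffmpeg command that transcodes a WAV file to MP3 or Opus."""
    suffix = Path(out_path).suffix.lower()
    if suffix not in CODECS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    codec, bitrate = CODECS[suffix]
    # -y overwrites an existing output file; -c:a and -b:a set codec and bitrate.
    return ["ffmpeg", "-y", "-i", wav_path,
            "-c:a", codec, "-b:a", bitrate, out_path]


def transcode(wav_path: str, out_path: str) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(ffmpeg_command(wav_path, out_path), check=True)
```

For example, `transcode("speech.wav", "speech.ogg")` would produce an Opus-in-Ogg file from Kokoro's WAV output.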
Best for: privacy-sensitive, offline, or high-volume pipelines where cost and control beat the polish of the hosted giants.
If you are exploring open models further, Dia (Nari Labs) is excellent for expressive, multi-speaker audio with nonverbal sounds, and Chatterbox (Resemble AI) adds voice cloning with tunable expressiveness.
How to choose
- Most expressive / brand & character voices: ElevenLabs (v3), with voice cloning.
- Cheapest, steerable, app-embedded: OpenAI gpt-4o-mini-tts.
- Real-time voice agents & IVR: Deepgram Aura-2 (lowest latency, clean formatting).
- Widest languages & SSML control: Google Cloud TTS.
- Self-hosted, low-cost, private: Kokoro-82M (or Dia / Chatterbox).
The honest summary: there is no single “best” TTS model in 2026 — the right pick is set by your constraints on latency, language, expressiveness, price and where the audio is allowed to be processed.
Audio output formats (MP3, WAV, OGG Opus & more)
A recurring developer question is which TTS APIs support OGG Opus (the ogg_opus encoding) and other formats. Google Cloud TTS exposes it directly as the OGG_OPUS encoding; ElevenLabs, OpenAI and Deepgram Aura-2 all offer Opus output as well. Here is how the five compare:
| Model | Supported output / export formats |
|---|---|
| ElevenLabs | MP3, PCM, WAV, Opus (opus_48000, 48 kHz, 32–192 kbps), µ-law, A-law |
| OpenAI gpt-4o-mini-tts | MP3, Opus, AAC, FLAC, WAV, PCM |
| Google Cloud TTS | MP3, LINEAR16 (WAV), OGG_OPUS, MULAW, ALAW |
| Deepgram Aura-2 | Linear16 (WAV), MP3, Opus, FLAC, AAC, µ-law, A-law |
| Kokoro-82M | 24 kHz WAV (self-host); transcode to MP3/Opus with ffmpeg |
Does ElevenLabs support OGG Opus output?
Yes. ElevenLabs outputs Opus audio via its API using the opus_48000_* formats (48 kHz at 32, 64, 96, 128 or 192 kbps) — the same Opus codec used in .ogg/ogg_opus files — alongside MP3, PCM, WAV and µ-law/A-law.
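If you need to pick one of those bitrates programmatically, a small helper can map a target bitrate to a format identifier. The sketch below assumes the `*` in `opus_48000_*` expands to the bitrate in kbps, as the list above implies; the helper name `elevenlabs_opus_format` is hypothetical and the exact identifiers should be confirmed in the API reference.

```python
# Bitrates listed for the 48 kHz Opus formats, in kbps.
OPUS_BITRATES_KBPS = (32, 64, 96, 128, 192)


def elevenlabs_opus_format(target_kbps: int) -> str:
    """Return the smallest listed Opus format at or above target_kbps."""
    for kbps in OPUS_BITRATES_KBPS:
        if kbps >= target_kbps:
            return f"opus_48000_{kbps}"
    # Requests above the highest listed bitrate are capped at 192 kbps.
    return f"opus_48000_{OPUS_BITRATES_KBPS[-1]}"


if __name__ == "__main__":
    for wanted in (24, 64, 70, 500):
        print(wanted, "->", elevenlabs_opus_format(wanted))
```

Rounding up rather than down favours audio quality over bandwidth; flip the comparison if the opposite trade-off suits your use case.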
FAQs
Which text-to-speech API is the best in 2026?
It depends on the constraint that matters most. For expressive narration and voice cloning, ElevenLabs (Eleven v3). For the lowest price and prompt-level control inside an app, OpenAI gpt-4o-mini-tts. For real-time voice agents, Deepgram Aura-2. For the widest language coverage and SSML control, Google Cloud TTS. For a free, self-hosted option, the open-weight Kokoro-82M.
What is the best open-source text-to-speech model?
Kokoro-82M is the standout for its size-to-quality ratio and permissive Apache-2.0 licence. For expressive, multi-speaker audio consider Dia (Nari Labs); for voice cloning with tunable expressiveness, Chatterbox (Resemble AI).
