The Higgs Audio v3 API Service

Expressive, controllable text-to-speech (100+ languages, inline emotion and prosody controls) and speech-to-text, served through an OpenAI-compatible API on an RTX 3090. A drop-in replacement for /v1/audio/speech and /v1/audio/transcriptions.

Base URL …

Get started

API Docs

Full reference — endpoints, parameters, the inline control-token catalog, languages, and copy-paste samples.

Live Demo

“Coco”: a bilingual Chinese/English early-childhood oral-practice and emotional-companion app that shows off emotion and pacing control.

Models

higgs-audio-v3-tts-4b & higgs-audio-v3-stt — weights, model cards, and architecture.

Endpoints

Text → Speech

Synthesize expressive audio. Inline control tokens such as <|emotion:…|> set emotion, prosody, speed, and sound effects. Supports zero-shot voice cloning.

POST /v1/audio/speech

Speech → Text

Whisper-compatible transcription in 60+ languages, with a thinking mode for higher accuracy. A drop-in replacement for the OpenAI transcriptions API.

POST /v1/audio/transcriptions
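
A minimal Python sketch of a transcription upload, using only the standard library. It assumes the endpoint follows the OpenAI convention of a multipart form with a `file` field and a JSON response containing `text`; the page only states that the endpoint is a drop-in, so treat both as assumptions. The `BASE` default, the helper names, and passing `model` as a form field are hypothetical.

```python
import json
import os
import urllib.request
import uuid

# Assumption: the base URL comes from a BASE environment variable.
# The localhost fallback is a placeholder, not the service's real address.
BASE = os.environ.get("BASE", "http://localhost:8000")


def build_transcription_request(audio_bytes, filename="audio.wav", fields=None, base=BASE):
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields, e.g. {"model": "higgs-audio-v3-stt"} (assumed, optional).
    for name, value in (fields or {}).items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    # The audio itself, sent under the OpenAI-style "file" field name (assumed).
    header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    parts.append(header + audio_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        f"{base}/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path, **kwargs):
    """Upload an audio file and return the transcript text (assumes a JSON 'text' field)."""
    with open(path, "rb") as f:
        req = build_transcription_request(f.read(), filename=os.path.basename(path), **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))["text"]

# Example call (requires a reachable server):
# print(transcribe("hello.wav", fields={"model": "higgs-audio-v3-stt"}))
```

Separating request construction from sending makes it possible to inspect the body before any network call. If the official OpenAI Python client is available, pointing its base URL at this service may be simpler, provided the compatibility claim holds for the fields you need.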

Quick start

# Synthesize speech with an emotion tag (set BASE to the service base URL first)
curl "$BASE/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{"input":"<|emotion:elation|>Hello from Higgs!","response_format":"mp3"}' \
  --output hello.mp3
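
The same request can be sketched in Python with only the standard library. The JSON body mirrors the curl call above (`input` and `response_format`); the `BASE` default, the helper names, and the `emotion` convenience argument are hypothetical additions for illustration, not part of the documented API.

```python
import json
import os
import urllib.request

# Assumption: the base URL comes from a BASE environment variable, as in the curl example.
# The localhost fallback is a placeholder, not the service's real address.
BASE = os.environ.get("BASE", "http://localhost:8000")


def build_speech_request(text, emotion=None, response_format="mp3", base=BASE):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    if emotion:
        # Prefix the text with an inline emotion tag, e.g. <|emotion:elation|>.
        text = f"<|emotion:{emotion}|>{text}"
    payload = {"input": text, "response_format": response_format}
    return urllib.request.Request(
        f"{base}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path

# Example call (requires a reachable server):
# synthesize("Hello from Higgs!", "hello.mp3", emotion="elation")
```

Other request fields, such as a model or voice name, are left out because the curl example does not use them; consult the API docs before adding any.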

Resources

Boson AI Docs

Official Higgs Audio v3 documentation — API spec, voices, control tokens.

SGLang-Omni

The serving engine powering TTS — cookbook & deployment recipes.

ModelScope

Model weights mirror (CN-fast) for the TTS-4B and STT models.

Higgs Audio v3 · powered by SGLang-Omni (TTS) + transformers (ASR). Demo brain: DeepSeek V4 Flash via AI Gateway.