Higgs Audio v3: API Reference
Self-hosted, OpenAI-compatible text-to-speech and speech-to-text powered by
Boson AI's Higgs Audio v3 models, plus a realtime full-duplex voice WebSocket. The TTS API mirrors the
Boson /v1/audio/speech spec; ASR mirrors the OpenAI /v1/audio/transcriptions spec. TTS supports both
non-streaming (full file) and streaming (low-latency PCM) responses;
the voice socket chains STT → LLM → TTS for live conversation.
Authentication
This is a self-hosted internal deployment: endpoints are open on the service host, and no API key is required.
(Boson's hosted API uses Authorization: Bearer <key>; clients written for it work here unchanged,
because the header is ignored and may be omitted.)
Base URL
…
All endpoints below are relative to this base URL. The examples refer to it as $BASE.
Streaming vs non-streaming
The service exposes three endpoints. Two are request/response (the speech and transcription HTTP routes); the third is a full-duplex realtime voice socket. Whether you get audio incrementally or all-at-once depends on the endpoint and flags below.
| Endpoint | Mode | What you get |
|---|---|---|
| POST /v1/audio/speech | non-streaming (default) | One response with the complete audio file in the requested format (mp3/wav/opus/…). Returns once synthesis finishes. |
| POST /v1/audio/speech with stream:true | streaming | Raw PCM chunks are streamed as they decode (first audio in ~0.2–0.5 s). Requires response_format:"pcm": 24 kHz mono Int16LE, headerless. |
| POST /v1/audio/transcriptions | batch only | The whole file is transcribed and returned as one JSON body. There is no streaming flag because the STT model is not a streaming architecture (see Transcriptions). |
| WS /ws/voice | realtime | Full-duplex voice agent: live STT partials plus true PCM TTS streamed back per sentence. See Realtime voice. |
Non-PCM response_format values are only
valid for the non-streaming response (the encoder needs the full waveform). To play raw PCM:

    ffplay -f s16le -ar 24000 -ac 1 out.pcm

Create speech
POST /v1/audio/speech
Generate expressive speech from text. The input may embed inline
control tokens for emotion, prosody, speed, and sound effects.
Request body
| Parameter | Type | Description |
|---|---|---|
| input (required) | string | Text to synthesize (1–5000 chars). May contain inline <\|…\|> control tags. |
| model (optional) | string | Model id / alias. Default higgs-audio-v3-tts (the served model). |
| voice (optional) | string | Preset voice name or custom voice id. Default default. Mutually exclusive with ref_audio. |
| response_format (optional) | string | One of mp3 (default), opus, pcm, wav, aac, flac. Streaming requires pcm. |
| stream (optional) | boolean | Stream raw PCM chunks as they decode. Requires response_format: "pcm". Default false. |
| ref_audio (optional) | string | Zero-shot voice cloning: an http(s) URL, data URI, or base64 audio (≤10 MB). See Voice cloning. |
| ref_text (optional) | string | Transcript of ref_audio (recommended for quality). |
| temperature, top_k, top_p, max_new_tokens (optional) | number | Sampling controls forwarded to the engine. Recommended for cloning: temperature 0.8, top_k 50, max_new_tokens 1024. |
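The constraints in the table (input length, the voice / ref_audio exclusivity, and the pcm requirement for streaming) can be checked on the client before a request is sent. The following is a minimal sketch of such a check, not part of the service: build_speech_payload and FORMATS are hypothetical names, and it assumes the 1–5000 limit counts characters including any control tags.

```python
from typing import Any, Dict, Optional

# Formats listed in the request table above.
FORMATS = {"mp3", "opus", "pcm", "wav", "aac", "flac"}


def build_speech_payload(
    text: str,
    voice: Optional[str] = None,
    ref_audio: Optional[str] = None,
    ref_text: Optional[str] = None,
    response_format: str = "mp3",
    stream: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: validate the documented rules, then build the JSON body."""
    # Assumption: the limit is applied to the raw string, tags included.
    if not 1 <= len(text) <= 5000:
        raise ValueError("input must be 1-5000 characters")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if stream and response_format != "pcm":
        raise ValueError('stream=True requires response_format="pcm"')
    if voice is not None and ref_audio is not None:
        raise ValueError("voice and ref_audio are mutually exclusive")

    payload: Dict[str, Any] = {"input": text, "response_format": response_format}
    if stream:
        payload["stream"] = True
    # Only include optional fields the caller actually set.
    for key, value in (("voice", voice), ("ref_audio", ref_audio), ("ref_text", ref_text)):
        if value is not None:
            payload[key] = value
    return payload
```

The returned dict can be passed as the json= argument of requests.post. Failing early on the client avoids a round trip that would otherwise end in a 400.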
Responses
- 200: audio in the requested format (audio/mpeg, audio/wav, audio/ogg, audio/L16 for pcm, …).
- 400: invalid or missing input.
- 502: upstream engine error.
Non-streaming (default): returns the full file
Omit stream (or set it to false). The server synthesizes the whole
utterance and returns it as a single response body in your chosen response_format.
    # Emotional, slow, expressive: saved as MP3
    curl $BASE/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{
        "input": "<|prosody:speed_slow|><|emotion:affection|>Hello little friend, you did great today!",
        "response_format": "mp3"
      }' --output hello.mp3
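    import os
    import requests

    BASE = os.environ["BASE"]  # service base URL, exported in the shell as for curl

    r = requests.post(BASE + "/v1/audio/speech", json={
        "input": "<|emotion:elation|><|sfx:laughter|>Haha, wonderful job!",
        "response_format": "wav",
    })
    r.raise_for_status()  # surface 400/502 instead of writing an error body to disk
    with open("out.wav", "wb") as f:
        f.write(r.content)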
    # Works with the OpenAI SDK (api_key is ignored here)
    import os
    from openai import OpenAI

    client = OpenAI(base_url=os.environ["BASE"] + "/v1", api_key="not-needed")
    with client.audio.speech.with_streaming_response.create(
        model="higgs-audio-v3-tts",
        voice="default",
        input="<|emotion:contentment|>Good night, sleep tight.",
    ) as response:
        response.stream_to_file("out.mp3")
Streaming: low-latency raw PCM
Set stream:true and response_format:"pcm". The server forwards
raw PCM chunks as the engine decodes them; first audio typically arrives in ~0.2–0.5 s, so
you can start playback before synthesis completes. The stream is headerless
24 kHz mono Int16LE (audio/L16); concatenate all chunks to get the full waveform.
stream:true with any non-pcm response_format will not
stream correctly, because encoded formats (mp3/opus/…) need the complete waveform. Always pair streaming with
"response_format":"pcm", and use curl -N to disable output buffering.

    # -N disables curl buffering so PCM bytes land as they arrive
    curl -N $BASE/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{
        "input": "<|emotion:enthusiasm|>Streaming straight to your speakers!",
        "stream": true,
        "response_format": "pcm"
      }' --output out.pcm

    # Headerless 24 kHz mono s16le: play it with ffplay
    ffplay -f s16le -ar 24000 -ac 1 out.pcm
    import os
    import requests

    BASE = os.environ["BASE"]  # service base URL, exported in the shell as for curl

    # stream=True on the request so chunks are yielded as they decode
    with requests.post(BASE + "/v1/audio/speech", stream=True, json={
        "input": "<|emotion:enthusiasm|>Streaming straight to your speakers!",
        "stream": True,
        "response_format": "pcm",  # required for streaming
    }) as r, open("out.pcm", "wb") as f:
        for chunk in r.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)  # 24 kHz mono Int16LE samples
    # play: ffplay -f s16le -ar 24000 -ac 1 out.pcm
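Because the streamed PCM is headerless, most players cannot open it without being told the sample format. As a sketch of one way around that, the standard-library wave module can wrap the concatenated chunks in a WAV container using the documented 24 kHz mono Int16LE layout. The function names pcm_to_wav and pcm_duration_seconds are illustrative, not part of the service.

```python
import io
import wave

# Stream format stated above: 24 kHz, mono, 16-bit little-endian samples.
SAMPLE_RATE = 24000
SAMPLE_WIDTH = 2  # bytes per sample (Int16)
CHANNELS = 1


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap headerless PCM in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Playback length of a PCM buffer in seconds."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
```

For example, reading out.pcm, passing the bytes through pcm_to_wav, and writing the result to out.wav gives a file that ordinary audio players accept.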
Control tokens
Embed tags as <|category:value|>. Sentence-level tags (emotion, style, prosody
speed/pitch/expressive) lead the line and color the whole sentence; inline tags (sfx, prosody pause)
are placed at the exact spot where they should take effect. Only catalog values are recognized; anything else is read aloud literally.
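Since every tag has the same <|category:value|> shape, building and inspecting tagged input is easy to script. The sketch below is illustrative only: tag, find_tags, strip_tags and TAG_RE are hypothetical helpers, and the regular expression assumes categories and values use lowercase letters, digits and underscores, which matches the examples on this page but is not a confirmed grammar. It does not check values against the catalog.

```python
import re
from typing import List, Tuple

# Assumption: names are lowercase letters, digits and underscores.
TAG_RE = re.compile(r"<\|([a-z_]+):([a-z0-9_]+)\|>")


def tag(category: str, value: str) -> str:
    """Format one control tag, e.g. tag("emotion", "elation")."""
    return f"<|{category}:{value}|>"


def find_tags(text: str) -> List[Tuple[str, str]]:
    """Return (category, value) pairs in the order they appear."""
    return TAG_RE.findall(text)


def strip_tags(text: str) -> str:
    """Remove all control tags, leaving only the spoken text."""
    return TAG_RE.sub("", text)


# Sentence-level tags lead the line; the text follows.
line = tag("prosody", "speed_slow") + tag("emotion", "affection") + "Hello little friend!"
```

strip_tags is handy for showing a caption or log line without the markup, while find_tags lets a client check its own tags against a copy of the catalog before sending.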
- Emotion (sentence-level): <|emotion:…|>
- Prosody: <|prosody:…|>
- Style: <|style:…|>
- Sound effects (inline): <|sfx:…|>

Example: <|sfx:laughter|>Haha, glad you made it! For very slow delivery, insert
<|prosody:long_pause|> between phrases. Full catalog and samples:
Boson · Tags.
Voice cloning
Pass a short reference clip via ref_audio (ideally with its transcript in ref_text) for zero-shot cloning.
This service translates them to the engine's references field.
    curl $BASE/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{
        "input": "Have a wonderful day!",
        "ref_audio": "https://example.com/sample.wav",
        "ref_text": "Hi, this is a sample of my voice.",
        "temperature": 0.8,
        "top_k": 50,
        "max_new_tokens": 1024
      }' --output cloned.mp3
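When the reference clip is a local file and not a URL, ref_audio can carry it inline as a data URI. The following sketch shows one way to build that string. ref_audio_data_uri and MAX_REF_BYTES are hypothetical names, and two details are assumptions that the page does not confirm: that the 10 MB limit means 10 × 1024 × 1024 bytes measured before base64 encoding, and that audio/wav is an accepted MIME type in the URI.

```python
import base64

# Assumption: "10 MB" is 10 MiB of raw audio, measured before encoding.
MAX_REF_BYTES = 10 * 1024 * 1024


def ref_audio_data_uri(audio: bytes, mime: str = "audio/wav") -> str:
    """Encode a reference clip as a data URI suitable for the ref_audio field."""
    if len(audio) > MAX_REF_BYTES:
        raise ValueError("reference clip exceeds the 10 MB limit")
    encoded = base64.b64encode(audio).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

A typical use reads the clip with open("sample.wav", "rb").read() and places the returned string in the ref_audio field of the request body.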
Languages
Higgs Audio v3 TTS speaks 100+ languages and code-switches within a single utterance: write the
input in any mix (e.g. "哇，你真棒！Let's try again, 准备好了吗?") and it is spoken
naturally. Full list: Boson · Languages (102).
Transcriptions
POST /v1/audio/transcriptions
Transcribe speech to text with an OpenAI Whisper-compatible API. Built on Higgs Audio v3 STT (Whisper-large-v3 encoder + Qwen3 decoder), with a "thinking" mode for accuracy.
Request (multipart/form-data)
| Field | Type | Description |
|---|---|---|
| file (required) | file | Audio to transcribe (wav/mp3/webm/…, decoded server-side). Resampled to 16 kHz mono internally. |
| model (optional) | string | Model id. Default higgs-stt. |
| response_format (optional) | string | json (default) → {"text": "…"}. |
Example
    curl $BASE/v1/audio/transcriptions \
      -F "file=@speech.wav" \
      -F "model=higgs-stt"
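    import os
    import requests

    BASE = os.environ["BASE"]  # service base URL, exported in the shell as for curl

    # Open the file in a with-block so the handle is closed after upload
    with open("speech.wav", "rb") as audio:
        r = requests.post(BASE + "/v1/audio/transcriptions",
                          files={"file": audio},
                          data={"model": "higgs-stt"})
    r.raise_for_status()
    print(r.json()["text"])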
Realtime voice
WS /ws/voice
A full-duplex realtime voice agent over a single WebSocket. You stream microphone PCM up; the server returns live transcript partials, a streamed LLM reply, and synthesized speech PCM back down, all on the same socket. It chains the three Higgs/LLM stages into one low-latency loop.
Pipeline: what is truly streaming
| Stage | Streaming? | How |
|---|---|---|
| STT (speech → text) | chunked pseudo-streaming | The growing mic buffer is re-transcribed every ~0.7 s and emitted as partial events; a clean final is produced on stop. Whisper-class models can't do true frame-by-frame streaming, so this approximates it by re-running on the accumulated audio. |
| LLM (DeepSeek V4 Flash) | true SSE token streaming | The reply is consumed token-by-token over server-sent events. Reasoning is disabled (thinking:{"type":"disabled"}) for low latency, and the text is emitted as per-sentence reply_delta events. |
| TTS (text → speech) | true PCM streaming | Each reply sentence is synthesized as soon as it arrives and pushed as raw 24 kHz mono Int16LE PCM, framed by audio_start / audio_end. |
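The chunked pseudo-streaming in the STT row can be illustrated with a small accumulator that re-runs a batch transcriber on the whole buffer each time roughly 0.7 s of new audio has arrived. This is a sketch of the idea, not the server's code: PartialTranscriber is a hypothetical class, the transcribe callable stands in for the real STT model, and it measures the interval in audio duration (16 kHz mono Int16LE, so 22400 bytes per 0.7 s) where the server may well use wall-clock time.

```python
from typing import Callable, Optional


class PartialTranscriber:
    """Accumulate mic PCM and re-transcribe the whole buffer at fixed intervals."""

    def __init__(
        self,
        transcribe: Callable[[bytes], str],  # hypothetical batch STT function
        interval_s: float = 0.7,
        sample_rate: int = 16000,
    ) -> None:
        self.transcribe = transcribe
        # 2 bytes per Int16 sample, mono.
        self.bytes_per_interval = round(interval_s * sample_rate) * 2
        self.buffer = bytearray()
        self._next_mark = self.bytes_per_interval

    def feed(self, chunk: bytes) -> Optional[str]:
        """Append mic audio; return a partial transcript when an interval has elapsed."""
        self.buffer += chunk
        if len(self.buffer) >= self._next_mark:
            self._next_mark = len(self.buffer) + self.bytes_per_interval
            # Re-run on everything captured so far, so earlier words may be revised.
            return self.transcribe(bytes(self.buffer))
        return None

    def finish(self) -> str:
        """Produce the final transcript from the complete buffer (the stop step)."""
        return self.transcribe(bytes(self.buffer))
```

Each partial costs a full pass over the accumulated audio, which is why partials can change as more speech arrives and why the work per partial grows with the length of the turn.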
Approximate latency: the final transcript arrives ~0.8 s after you stop speaking,
and time-to-first-audio is ~1.7 s. Because TTS streams per sentence, playback begins well before the
full reply is generated.
Message protocol
The socket carries both JSON text frames (control + events) and binary frames (raw PCM audio). Direction matters: you send mic audio up as binary, the server sends speech down as binary.
Client → server
| Message | Type | Meaning |
|---|---|---|
| {"type":"start", …} | JSON | Begin a recording turn. Fields: scenario_id (default "daily"), history (array of {role,content}). Resets the mic buffer and starts STT partials. |
| binary frame | bytes | Microphone audio: raw Int16LE PCM @16 kHz mono. Appended to the buffer while recording. |
| {"type":"stop"} | JSON | End the turn. Server runs a final transcription, emits final, then generates and speaks the reply. |
| {"type":"text", …} | JSON | Skip STT entirely and reply to typed text. Fields: text (required), scenario_id, history. Triggers the LLM → TTS reply immediately. |
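The client-side messages in this table are simple enough to build with a few helpers. The sketch below is one possible shape: the function names are hypothetical, history is omitted from the JSON when not supplied (an assumption about what the server accepts), and the 100 ms frame size for mic audio is an arbitrary choice since the page does not state a required frame length.

```python
import json
from typing import Dict, Iterator, List, Optional

MIC_SAMPLE_RATE = 16000  # mic audio is Int16LE PCM at 16 kHz mono
BYTES_PER_SAMPLE = 2

History = Optional[List[Dict[str, str]]]


def _encode(msg: Dict[str, object], scenario_id: str, history: History) -> str:
    msg["scenario_id"] = scenario_id
    if history is not None:  # assumption: the field may be left out
        msg["history"] = history
    return json.dumps(msg)


def start_message(scenario_id: str = "daily", history: History = None) -> str:
    """JSON frame that begins a recording turn."""
    return _encode({"type": "start"}, scenario_id, history)


def stop_message() -> str:
    """JSON frame that ends the turn and triggers the reply."""
    return json.dumps({"type": "stop"})


def text_message(text: str, scenario_id: str = "daily", history: History = None) -> str:
    """JSON frame for a typed turn that skips STT."""
    return _encode({"type": "text", "text": text}, scenario_id, history)


def mic_frames(pcm: bytes, frame_ms: int = 100) -> Iterator[bytes]:
    """Split mic PCM into binary frames of frame_ms each (last one may be shorter)."""
    size = MIC_SAMPLE_RATE * frame_ms // 1000 * BYTES_PER_SAMPLE
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

A spoken turn would then send start_message(), each frame from mic_frames(...) as a binary WebSocket message, and finally stop_message().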
Server → client
| Message | Type | Meaning |
|---|---|---|
| {"type":"partial","text"} | JSON | Live STT: a re-transcription of the audio captured so far (may be revised as more arrives). |
| {"type":"final","text"} | JSON | The settled transcript of the turn after stop. |
| {"type":"delivery", …} | JSON | Resolved Higgs delivery directive for the reply: tokens (the control-token prefix) plus any of emotion, speed, expressive, pitch, sfx. |
| {"type":"reply_delta","text"} | JSON | One sentence of the reply (emitted as the LLM streams; immediately precedes its audio). |
| {"type":"audio_start","sample_rate":24000} | JSON | The next binary frames are speech PCM at this sample rate (24 kHz mono Int16LE). |
| binary frame | bytes | Synthesized speech: raw Int16LE PCM @24 kHz mono. Concatenate to play. |
| {"type":"audio_end"} | JSON | No more audio frames for this reply. |
| {"type":"reply_done","text"} | JSON | The reply turn is complete; text is the full spoken reply. |
| {"type":"error","detail"} | JSON | An error occurred during the turn. |
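To make the ordering of these events concrete, the sketch below folds a sequence of received frames into a single result. It is an illustration under stated assumptions and not a reference client: collect_reply is a hypothetical function, it takes any iterable of str (JSON) and bytes (PCM) frames so it can be exercised without a socket, and it treats an error event as fatal for the turn.

```python
import json
from typing import Dict, Iterable, List, Optional, Union


def collect_reply(frames: Iterable[Union[str, bytes]]) -> Dict[str, object]:
    """Fold server frames into sentences, concatenated PCM, and the final text."""
    sentences: List[str] = []
    audio = bytearray()
    sample_rate: Optional[int] = None
    full_text: Optional[str] = None

    for frame in frames:
        if isinstance(frame, (bytes, bytearray)):
            audio += frame  # binary frame: synthesized speech PCM
            continue
        evt = json.loads(frame)
        kind = evt.get("type")
        if kind == "reply_delta":
            sentences.append(evt["text"])
        elif kind == "audio_start":
            sample_rate = evt.get("sample_rate")
        elif kind == "error":
            # Assumption: an error ends the turn, so stop waiting for reply_done.
            raise RuntimeError(evt.get("detail", "unknown error"))
        elif kind == "reply_done":
            full_text = evt.get("text")
            break  # the turn is complete; later frames belong to another turn
        # partial, final, delivery and audio_end need no bookkeeping here.

    return {
        "sentences": sentences,
        "audio": bytes(audio),
        "sample_rate": sample_rate,
        "text": full_text,
    }
```

With a live connection the same loop body can run inside async for msg in ws, since the websockets library delivers text frames as str and binary frames as bytes.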
Python client
A minimal verifiable client using the websockets
library (pip install websockets). It sends a typed turn (skipping STT), prints every event,
and collects the streamed speech PCM into a file you can play with ffplay.
    import asyncio
    import json
    import os

    import websockets

    # wss:// for https hosts, ws:// for http
    URL = os.environ["BASE"].replace("https://", "wss://").replace("http://", "ws://") + "/ws/voice"

    async def main():
        async with websockets.connect(URL) as ws:
            # Typed turn: skips STT, goes straight to LLM → TTS
            await ws.send(json.dumps({
                "type": "text",
                "scenario_id": "daily",
                "text": "hello",
            }))
            audio = bytearray()
            async for msg in ws:
                if not isinstance(msg, str):
                    audio += msg  # binary frame = 24 kHz mono Int16LE PCM
                    continue
                evt = json.loads(msg)
                print(evt["type"], evt.get("text", ""))
                # Stop on completion, and on error so the loop cannot hang
                if evt["type"] in ("reply_done", "error"):
                    break
        with open("reply.pcm", "wb") as f:
            f.write(audio)
        print(f"got {len(audio)} bytes of audio")
        # play: ffplay -f s16le -ar 24000 -ac 1 reply.pcm

    asyncio.run(main())
Original model documentation
- Boson AI: Documentation overview
- Higgs Audio v3 TTS: model overview · create-speech API · control tokens · languages
- HF · higgs-audio-v3-tts-4b · HF · higgs-audio-v3-stt
- ModelScope · TTS-4B · ModelScope · STT
- SGLang-Omni: Higgs TTS cookbook · GitHub · boson-ai/higgs-audio
- Boson · streaming TTS: the stream + PCM contract this service mirrors for streaming speech
- websockets (Python): client library used in the realtime voice sample