Higgs Audio v3 — API Reference

Self-hosted, OpenAI-compatible text-to-speech and speech-to-text powered by Boson AI's Higgs Audio v3 models, plus a realtime full-duplex voice WebSocket. The TTS API mirrors the Boson /v1/audio/speech spec; ASR mirrors the OpenAI /v1/audio/transcriptions spec. TTS supports both non-streaming (full file) and streaming (low-latency PCM); the voice socket chains STT → LLM → TTS for live conversation.

Authentication

This is a self-hosted internal deployment: endpoints are open on the service host and no API key is required. (Boson's hosted API uses Authorization: Bearer <key>; clients written for it work here unchanged because the header is ignored, and it can also be omitted.)

Base URL

…

All endpoints below are relative to this base URL.

Streaming vs non-streaming

The service exposes three endpoints. Two are request/response (the speech and transcription HTTP routes); the third is a full-duplex realtime voice socket. Whether you get audio incrementally or all-at-once depends on the endpoint and flags below.

Endpoint	Mode	What you get
`POST /v1/audio/speech`	non-streaming (default)	One response with the complete audio file in the requested format (mp3/wav/opus/…). Returns once synthesis finishes.
`POST /v1/audio/speech` `stream:true`	streaming	Raw PCM chunks are streamed as they decode (first audio ~0.2–0.5 s). Requires `response_format:"pcm"` — 24 kHz mono Int16LE, headerless.
`POST /v1/audio/transcriptions`	batch only	The whole file is transcribed and returned as one JSON body. No streaming flag — the STT model is not a streaming architecture (see ASR).
`WS /ws/voice`	realtime	Full-duplex voice agent: live STT partials + true PCM TTS streamed back per sentence. See Voice WebSocket.

⚡ Streaming TTS requires PCM. The streamed bytes are headerless 24 kHz mono signed-16-bit little-endian samples. Other response_format values are only valid for the non-streaming response (the encoder needs the full waveform). To play raw PCM: ffplay -f s16le -ar 24000 -ac 1 out.pcm.
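
Players that do not accept raw PCM need a container around the samples. The sketch below wraps headerless 24 kHz mono Int16LE bytes in a WAV header using only the Python standard library; the function names `pcm_to_wav_bytes` and `pcm_duration_seconds` are illustrative and not part of this service.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap headerless mono Int16LE PCM in a WAV container."""
    if len(pcm) % 2:
        raise ValueError("Int16 PCM must contain an even number of bytes")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(pcm: bytes, sample_rate: int = 24000) -> float:
    """Duration of mono Int16 PCM: 2 bytes per sample."""
    return len(pcm) / (2 * sample_rate)
```

For example, `open("out.wav", "wb").write(pcm_to_wav_bytes(open("out.pcm", "rb").read()))` turns a streamed capture into a file that any audio player can open.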

Create speech POST

POST /v1/audio/speech

Generate expressive speech from text. The input may embed inline control tokens for emotion, prosody, speed, and sound effects.

Request body

Parameter	Type	Description
`input` required	string	Text to synthesize (1–5000 chars). May contain inline `<|…|>` control tags.
`model` optional	string	Model id / alias. Default `higgs-audio-v3-tts` (the served model).
`voice` optional	string	Preset voice name or custom voice id. Default `default`. Mutually exclusive with `ref_audio`.
`response_format` optional	string	One of `mp3` (default), `opus`, `pcm`, `wav`, `aac`, `flac`. Streaming requires `pcm`.
`stream` optional	boolean	Stream raw PCM chunks as they decode. Requires `response_format: "pcm"`. Default `false`.
`ref_audio` optional	string	Zero-shot voice cloning: an http(s) URL, data URI, or base64 audio (≤10 MB). See cloning.
`ref_text` optional	string	Transcript of `ref_audio` (recommended for quality).
`temperature`, `top_k`, `top_p`, `max_new_tokens` optional	number	Sampling controls forwarded to the engine. Recommended for cloning: `temperature 0.8`, `top_k 50`, `max_new_tokens 1024`.
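
The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch, not the server's validation logic: `build_speech_request` is a hypothetical helper, and counting characters with Python's `len` is an assumption about how the 1–5000 limit is measured.

```python
from typing import Any, Dict, Optional

FORMATS = {"mp3", "opus", "pcm", "wav", "aac", "flac"}


def build_speech_request(
    input_text: str,
    *,
    voice: Optional[str] = None,
    response_format: str = "mp3",
    stream: bool = False,
    ref_audio: Optional[str] = None,
    ref_text: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a JSON body for POST /v1/audio/speech, enforcing the documented rules."""
    if not 1 <= len(input_text) <= 5000:
        raise ValueError("input must be 1-5000 characters")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if stream and response_format != "pcm":
        raise ValueError('stream=True requires response_format="pcm"')
    if voice is not None and ref_audio is not None:
        raise ValueError("voice and ref_audio are mutually exclusive")

    body: Dict[str, Any] = {"input": input_text, "response_format": response_format}
    if stream:
        body["stream"] = True
    # Only send optional fields that were actually set.
    for key, value in (("voice", voice), ("ref_audio", ref_audio), ("ref_text", ref_text)):
        if value is not None:
            body[key] = value
    return body
```

The returned dict can be passed directly as `json=` to `requests.post`.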

Responses

200: audio in the requested format (audio/mpeg, audio/wav, audio/ogg, audio/L16 for pcm, …).
400: invalid or missing input.
502: upstream engine error.
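
A client may want to distinguish a caller mistake (400) from an engine failure (502). The sketch below maps the three documented status codes to typed Python errors. The exception names and `audio_or_raise` are hypothetical, and since the error body format is not documented here, the body is only quoted back as text.

```python
class SpeechInputError(ValueError):
    """HTTP 400: the request body was invalid or `input` was missing."""


class SpeechEngineError(RuntimeError):
    """HTTP 502: the upstream synthesis engine failed."""


def audio_or_raise(status: int, body: bytes) -> bytes:
    """Return the audio bytes for a 200, raise a typed error otherwise."""
    if status == 200:
        return body
    # Error body format is an unknown here, so show at most 200 bytes of it as text.
    detail = body[:200].decode("utf-8", errors="replace")
    if status == 400:
        raise SpeechInputError(detail)
    if status == 502:
        raise SpeechEngineError(detail)
    raise RuntimeError(f"unexpected status {status}: {detail}")
```

With requests, this would be called as `audio_or_raise(r.status_code, r.content)`.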

Non-streaming — default, returns the full file

Omit stream (or set it to false). The server synthesizes the whole utterance and returns it as a single response body in your chosen response_format.

curl

# Emotional, slow, expressive; saved as MP3
curl "$BASE/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "<|prosody:speed_slow|><|emotion:affection|>Hello little friend, you did great today!",
    "response_format": "mp3"
  }' --output hello.mp3

Python

import os
import requests

BASE = os.environ["BASE"]  # base URL of this deployment (same value as $BASE in the curl examples)
r = requests.post(BASE + "/v1/audio/speech", json={
    "input": "<|emotion:elation|><|sfx:laughter|>Haha, wonderful job!",
    "response_format": "wav",
})
r.raise_for_status()  # 400 = invalid input, 502 = engine error
with open("out.wav", "wb") as f:
    f.write(r.content)

OpenAI SDK

# Works with the OpenAI SDK (api_key is ignored here)
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["BASE"] + "/v1", api_key="not-needed")
with client.audio.speech.with_streaming_response.create(
    model="higgs-audio-v3-tts",
    voice="default",
    input="<|emotion:contentment|>Good night, sleep tight.",
) as response:
    response.stream_to_file("out.mp3")

Streaming — low-latency raw PCM

Set stream:true and response_format:"pcm". The server forwards raw PCM chunks as the engine decodes them — first audio typically arrives in ~0.2–0.5 s, so you can start playback before synthesis completes. The stream is headerless 24 kHz mono Int16LE (audio/L16); concatenate all chunks to get the full waveform.

⚠️ stream:true with any non-pcm response_format will not stream correctly — encoded formats (mp3/opus/…) need the complete waveform. Always pair streaming with "response_format":"pcm", and use curl -N to disable output buffering.
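
HTTP chunk boundaries are not guaranteed to fall on sample boundaries, so a chunk can end halfway through a 2-byte sample. If you feed chunks straight into an audio device rather than a file, a small re-chunking step keeps every block aligned. This is a sketch; `whole_samples` is an illustrative helper, not part of the service.

```python
from typing import Iterable, Iterator


def whole_samples(chunks: Iterable[bytes], sample_width: int = 2) -> Iterator[bytes]:
    """Re-chunk a byte stream so every yielded block holds only whole samples."""
    carry = b""
    for chunk in chunks:
        data = carry + chunk
        usable = len(data) - (len(data) % sample_width)
        if usable:
            yield data[:usable]
        carry = data[usable:]  # 0 or 1 leftover byte, completed by the next chunk
    # A trailing partial sample (if any) is dropped.
```

With requests, wrap the iterator: `for block in whole_samples(r.iter_content(chunk_size=4096)): ...`.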

curl

# -N disables curl buffering so PCM bytes land as they arrive
curl -N "$BASE/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "<|emotion:enthusiasm|>Streaming straight to your speakers!",
    "stream": true,
    "response_format": "pcm"
  }' --output out.pcm

# Headerless 24 kHz mono s16le; play it with ffplay:
ffplay -f s16le -ar 24000 -ac 1 out.pcm

Python

import os
import requests

BASE = os.environ["BASE"]  # base URL of this deployment (same value as $BASE in the curl examples)
# stream=True on the request so chunks are yielded as they decode
with requests.post(BASE + "/v1/audio/speech", stream=True, json={
    "input": "<|emotion:enthusiasm|>Streaming straight to your speakers!",
    "stream": True,
    "response_format": "pcm",        # required for streaming
}) as r, open("out.pcm", "wb") as f:
    r.raise_for_status()             # fail early on 400/502 instead of saving an error body
    for chunk in r.iter_content(chunk_size=4096):
        if chunk:
            f.write(chunk)           # 24 kHz mono Int16LE samples
# play: ffplay -f s16le -ar 24000 -ac 1 out.pcm

Control tokens

Embed tags as <|category:value|>. Sentence-level tags (emotion, style, prosody speed/pitch/expressive) lead the line and color the whole sentence; inline tags (sfx, prosody pause) are placed at the exact spot. Only catalog values are recognized — anything else is read aloud literally.

Emotion — sentence-level, `<|emotion:…|>`

elation, amusement, enthusiasm, determination, pride, contentment, affection, relief, contemplation, confusion, surprise, awe, longing, arousal, anger, fear, disgust, bitterness, sadness, shame, helplessness

Prosody — `<|prosody:…|>`

speed_very_slow, speed_slow, speed_fast, speed_very_fast, pitch_low, pitch_high, expressive_high, expressive_low, pause, long_pause

Style — `<|style:…|>` · Sound effects — inline, `<|sfx:…|>`

Style: singing, shouting, whispering
Sound effects: laughter, cough, crying, screaming, burping, humming, sigh, sniff, sneeze

💡 sfx: put the tag immediately before a matching onomatopoeia, no space — e.g. <|sfx:laughter|>Haha, glad you made it!. For very slow delivery, insert <|prosody:long_pause|> between phrases. Full catalog & samples: Boson · Tags.
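
Because unrecognized tags are read aloud literally, it can help to check text against the catalog before sending it. The sketch below builds tags and lists any that are not in the catalog. `tag` and `unknown_tags` are hypothetical helpers, and the split of values between `style` and `sfx` is inferred from the headings above, so verify it against the Boson tag catalog.

```python
import re
from typing import List

CATALOG = {
    "emotion": {
        "elation", "amusement", "enthusiasm", "determination", "pride",
        "contentment", "affection", "relief", "contemplation", "confusion",
        "surprise", "awe", "longing", "arousal", "anger", "fear", "disgust",
        "bitterness", "sadness", "shame", "helplessness",
    },
    "prosody": {
        "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
        "pitch_low", "pitch_high", "expressive_high", "expressive_low",
        "pause", "long_pause",
    },
    # Assumption: style vs sfx membership inferred from the section headings.
    "style": {"singing", "shouting", "whispering"},
    "sfx": {
        "laughter", "cough", "crying", "screaming", "burping",
        "humming", "sigh", "sniff", "sneeze",
    },
}

TAG_RE = re.compile(r"<\|([a-z_]+):([a-z_]+)\|>")


def tag(category: str, value: str) -> str:
    """Return '<|category:value|>' or raise if the pair is not in the catalog."""
    if value not in CATALOG.get(category, ()):
        raise ValueError(f"unknown control token {category}:{value}")
    return f"<|{category}:{value}|>"


def unknown_tags(text: str) -> List[str]:
    """List tags in `text` that the catalog does not contain."""
    return [
        m.group(0)
        for m in TAG_RE.finditer(text)
        if m.group(2) not in CATALOG.get(m.group(1), ())
    ]
```

For example, `tag("prosody", "speed_slow") + tag("emotion", "affection") + "Hello little friend"` reproduces the prefix used in the curl sample above.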

Voice cloning

Pass a short reference clip via ref_audio (plus ref_text, its transcript) for zero-shot cloning from a single clip. The service translates both fields to the engine's references field.

curl $BASE/v1/audio/speech -H "Content-Type: application/json" -d '{
  "input": "Have a wonderful day!",
  "ref_audio": "https://example.com/sample.wav",
  "ref_text": "Hi, this is a sample of my voice.",
  "temperature": 0.8, "top_k": 50, "max_new_tokens": 1024
}' --output cloned.mp3
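
When the reference clip is a local file rather than a URL, it can be sent inline as a data URI. The sketch below encodes the bytes and enforces the size limit on the client. `ref_audio_data_uri` is a hypothetical helper; reading "10 MB" as 10 MiB of raw (pre-encoding) audio and the default `audio/wav` MIME type are both assumptions.

```python
import base64

MAX_REF_BYTES = 10 * 1024 * 1024  # assumption: limit applies to the raw audio bytes


def ref_audio_data_uri(audio: bytes, mime: str = "audio/wav") -> str:
    """Encode a reference clip as a data URI suitable for the ref_audio field."""
    if not audio:
        raise ValueError("reference audio is empty")
    if len(audio) > MAX_REF_BYTES:
        raise ValueError("reference audio exceeds 10 MB")
    encoded = base64.b64encode(audio).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Usage: `{"input": "...", "ref_audio": ref_audio_data_uri(open("sample.wav", "rb").read()), "ref_text": "..."}`.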

Languages

Higgs Audio v3 TTS speaks 100+ languages and code-switches within a single utterance — write the input in any mix (e.g. "哇，你真棒！Let's try again, 准备好了吗?") and it is spoken naturally. Full list: Boson · Languages (102).

Transcriptions POST

POST /v1/audio/transcriptions

Transcribe speech to text — OpenAI Whisper-compatible. Built on Higgs Audio v3 STT (Whisper-large-v3 encoder + Qwen3 decoder), with a "thinking" mode for accuracy.

📦 Batch only: there is no streaming flag. The whole file is uploaded, transcribed, and returned as one JSON body. The encoder-decoder STT stack is not a frame-streaming architecture, so it cannot emit partial transcripts mid-utterance the way an RNN-T or other streaming model can. For live transcription, the voice WebSocket approximates streaming by repeatedly re-transcribing the growing audio buffer (chunked pseudo-streaming).

Request (multipart/form-data)

Field	Type	Description
`file` required	file	Audio to transcribe (wav/mp3/webm/…), decoded server-side and converted to 16 kHz mono internally.
`model` optional	string	Model id. Default `higgs-stt`.
`response_format` optional	string	`json` (default) → `{"text": "…"}`.

Example

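curl

curl "$BASE/v1/audio/transcriptions" \
  -F "file=@speech.wav" \
  -F "model=higgs-stt"

Python

import os
import requests

BASE = os.environ["BASE"]  # base URL of this deployment (same value as $BASE in the curl example)
with open("speech.wav", "rb") as audio:
    r = requests.post(BASE + "/v1/audio/transcriptions",
        files={"file": audio},
        data={"model": "higgs-stt"})
r.raise_for_status()
print(r.json()["text"])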

🈶 Chinese speech is transcribed as pinyin (the ASR is English-centric) — still useful as input to an LLM. English transcribes verbatim.

Realtime voice WS

WS /ws/voice

A full-duplex realtime voice agent over a single WebSocket. You stream microphone PCM up; the server returns live transcript partials, a streamed LLM reply, and synthesized speech PCM back down — all on the same socket. It chains the three Higgs/LLM stages into one low-latency loop.

Pipeline — what is truly streaming

Stage	Streaming?	How
STT (speech → text)	chunked pseudo-streaming	The growing mic buffer is re-transcribed every ~0.7 s → emitted as `partial` events; a clean `final` is produced on `stop`. Whisper-class models can't do true frame-by-frame streaming, so this approximates it by re-running on the accumulated audio.
LLM (DeepSeek V4 Flash)	true SSE token streaming	The reply is consumed token-by-token over server-sent events. Reasoning is disabled (`thinking:{"type":"disabled"}`) for low latency, and the text is emitted as per-sentence `reply_delta` events.
TTS (text → speech)	true PCM streaming	Each reply sentence is synthesized as soon as it arrives and pushed as raw 24 kHz mono Int16LE PCM, framed by `audio_start` / `audio_end`.
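
The chunked pseudo-streaming in the STT row can be pictured as follows. This is a sketch of the general technique, not the server's code: `PseudoStreamingSTT` is a hypothetical class, `transcribe` stands in for any batch STT call, and the re-transcription is triggered by the amount of buffered audio rather than by a wall-clock timer.

```python
from typing import Callable, Optional

MIC_RATE = 16000       # mic audio is 16 kHz mono Int16LE
BYTES_PER_SAMPLE = 2


class PseudoStreamingSTT:
    """Re-run a batch transcriber on the growing buffer to emit partials."""

    def __init__(self, transcribe: Callable[[bytes], str], interval_s: float = 0.7):
        self.transcribe = transcribe
        self.step = round(interval_s * MIC_RATE) * BYTES_PER_SAMPLE
        self.buffer = bytearray()
        self.next_at = self.step

    def feed(self, frame: bytes) -> Optional[str]:
        """Append mic audio; return a partial once another interval has accumulated."""
        self.buffer += frame
        if len(self.buffer) < self.next_at:
            return None
        self.next_at = len(self.buffer) + self.step
        # The whole buffer is transcribed again, so a partial may revise earlier text.
        return self.transcribe(bytes(self.buffer))

    def finish(self) -> str:
        """Final pass over everything captured (the `final` event)."""
        return self.transcribe(bytes(self.buffer))
```

Each pass costs more than the previous one because the buffer keeps growing, which is the trade-off of approximating streaming with a batch model.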

⏱ Measured on the live box: STT final ~0.8 s after you stop speaking, and time-to-first-audio ~1.7 s. Because TTS streams per sentence, playback begins well before the full reply is generated.
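
Per-sentence TTS depends on cutting the LLM token stream into sentences as it arrives. The sketch below shows one naive way to do that; the server's actual splitting rule is not documented here, and `sentences_from_tokens` is a hypothetical helper that splits on basic Latin and CJK terminators only (so it would also split a decimal such as "3.14").

```python
import re
from typing import Iterable, Iterator

# A sentence is the shortest run of text that ends in one or more terminators.
_SENTENCE = re.compile(r"(.+?[.!?。！？]+)\s*", re.S)


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream closes them."""
    pending = ""
    for token in tokens:
        pending += token
        while True:
            match = _SENTENCE.match(pending)
            if not match:
                break
            yield match.group(1).strip()
            pending = pending[match.end():]
    if pending.strip():
        yield pending.strip()  # flush text that never got a terminator
```

Each yielded sentence could be sent to synthesis immediately, which is why the first audio can start before the LLM has finished the reply.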

Message protocol

The socket carries both JSON text frames (control + events) and binary frames (raw PCM audio). Direction matters: you send mic audio up as binary, the server sends speech down as binary.

Client → server

Message	Type	Meaning
`{"type":"start", …}`	JSON	Begin a recording turn. Fields: `scenario_id` (default `"daily"`), `history` (array of `{role,content}`). Resets the mic buffer and starts STT partials.
binary frame	bytes	Microphone audio: raw Int16LE PCM @16 kHz mono. Appended to the buffer while recording.
`{"type":"stop"}`	JSON	End the turn. Server runs a final transcription, emits `final`, then generates and speaks the reply.
`{"type":"text", …}`	JSON	Skip STT entirely and reply to typed text. Fields: `text` (required), `scenario_id`, `history`. Triggers the LLM→TTS reply immediately.
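
A recorded turn is therefore a `start` message, a run of binary mic frames, and a `stop` message. The sketch below produces that sequence from a buffer of 16 kHz mono Int16LE audio. `recorded_turn` is a hypothetical helper, and the 20 ms frame size is an assumption, since the table does not specify one.

```python
import json
from typing import Iterator, List, Optional, Union

MIC_RATE = 16000  # Hz, mono Int16LE


def recorded_turn(
    pcm: bytes,
    frame_ms: int = 20,
    scenario_id: str = "daily",
    history: Optional[List[dict]] = None,
) -> Iterator[Union[str, bytes]]:
    """Yield the messages for one recorded turn: start, mic frames, stop."""
    frame_bytes = MIC_RATE * 2 * frame_ms // 1000  # 2 bytes per sample
    yield json.dumps({
        "type": "start",
        "scenario_id": scenario_id,
        "history": history or [],
    })
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]  # bytes are sent as a binary frame
    yield json.dumps({"type": "stop"})
```

With the websockets library, `for msg in recorded_turn(pcm): await ws.send(msg)` sends `str` items as text frames and `bytes` items as binary frames.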

Server → client

Message	Type	Meaning
`{"type":"partial","text"}`	JSON	Live STT — a re-transcription of the audio captured so far (may revise as more arrives).
`{"type":"final","text"}`	JSON	The settled transcript of the turn after `stop`.
`{"type":"delivery", …}`	JSON	Resolved Higgs delivery directive for the reply: `tokens` (the control-token prefix) plus any of `emotion`, `speed`, `expressive`, `pitch`, `sfx`.
`{"type":"reply_delta","text"}`	JSON	One sentence of the reply (emitted as the LLM streams; immediately precedes its audio).
`{"type":"audio_start","sample_rate":24000}`	JSON	The next binary frames are speech PCM at this sample rate (24 kHz mono Int16LE).
binary frame	bytes	Synthesized speech: raw Int16LE PCM @24 kHz mono. Concatenate to play.
`{"type":"audio_end"}`	JSON	No more audio frames for this reply.
`{"type":"reply_done","text"}`	JSON	The reply turn is complete; `text` is the full spoken reply.
`{"type":"error","detail"}`	JSON	An error occurred during the turn.
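
On the receiving side, a client has to sort text frames (events) from binary frames (speech PCM). The sketch below is one way to accumulate a reply from the events in the table. `ReplyCollector` is a hypothetical class; it keeps every binary frame and ignores event types it does not need, such as `partial` and `delivery`.

```python
import json
from typing import List, Optional, Union


class ReplyCollector:
    """Accumulate one reply turn from /ws/voice server messages."""

    def __init__(self) -> None:
        self.sentences: List[str] = []
        self.audio = bytearray()
        self.sample_rate: Optional[int] = None
        self.final_text: Optional[str] = None
        self.error: Optional[str] = None
        self.done = False

    def handle(self, msg: Union[str, bytes]) -> None:
        if isinstance(msg, (bytes, bytearray)):
            self.audio += msg                      # speech PCM, Int16LE mono
            return
        event = json.loads(msg)
        kind = event.get("type")
        if kind == "reply_delta":
            self.sentences.append(event.get("text", ""))
        elif kind == "audio_start":
            self.sample_rate = event.get("sample_rate")
        elif kind == "reply_done":
            self.final_text = event.get("text")
            self.done = True
        elif kind == "error":
            self.error = event.get("detail")
            self.done = True                       # stop waiting after an error

    def duration_seconds(self) -> float:
        rate = self.sample_rate or 24000           # documented default rate
        return len(self.audio) / (2 * rate)
```

In a receive loop this becomes `collector.handle(msg)` followed by `if collector.done: break`.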

Python client

A minimal verifiable client using the websockets library (pip install websockets). It sends a typed turn (skipping STT), prints every event, and collects the streamed speech PCM into a file you can play with ffplay.

Python

import asyncio
import json
import os

import websockets

BASE = os.environ["BASE"]  # base URL of this deployment (same value as $BASE in the curl examples)
# wss:// for https hosts, ws:// for http
URL = BASE.replace("https://", "wss://").replace("http://", "ws://") + "/ws/voice"

async def main():
    async with websockets.connect(URL) as ws:
        # Typed turn: skips STT, goes straight to LLM → TTS
        await ws.send(json.dumps({
            "type": "text",
            "scenario_id": "daily",
            "text": "hello",
        }))

        audio = bytearray()
        async for msg in ws:
            if not isinstance(msg, str):
                audio += msg                 # binary frame = 24 kHz mono Int16LE PCM
                continue
            evt = json.loads(msg)
            print(evt["type"], evt.get("text", evt.get("detail", "")))
            if evt["type"] in ("reply_done", "error"):
                break                        # stop on completion or on a server error

        with open("reply.pcm", "wb") as f:
            f.write(audio)
        print(f"got {len(audio)} bytes of audio")
        # play: ffplay -f s16le -ar 24000 -ac 1 reply.pcm

asyncio.run(main())

Original model documentation

Boson AI — Documentation overview
Higgs Audio v3 TTS — model overview · create-speech API · control tokens · languages
HF · higgs-audio-v3-tts-4b · HF · higgs-audio-v3-stt
ModelScope · TTS-4B · ModelScope · STT
SGLang-Omni — Higgs TTS cookbook · GitHub · boson-ai/higgs-audio
Boson · streaming TTS — the stream + PCM contract this service mirrors for streaming speech
websockets (Python) — client library used in the realtime voice sample
