Higgs Audio v3: API Reference

Self-hosted, OpenAI-compatible text-to-speech and speech-to-text powered by Boson AI's Higgs Audio v3 models, plus a realtime full-duplex voice WebSocket. The TTS API mirrors the Boson /v1/audio/speech spec; ASR mirrors the OpenAI /v1/audio/transcriptions spec. TTS supports both non-streaming (full file) and streaming (low-latency PCM); the voice socket chains STT → LLM → TTS for live conversation.

Authentication

This is a self-hosted internal deployment: endpoints are open on the service host and no API key is required. (Boson's hosted API uses Authorization: Bearer <key>; clients written for it work here because the header is simply ignored, so it can also be omitted.)

Base URL

…

All endpoints below are relative to this base URL.

Streaming vs non-streaming

The service exposes three endpoints. Two are request/response (the speech and transcription HTTP routes); the third is a full-duplex realtime voice socket. Whether you get audio incrementally or all-at-once depends on the endpoint and flags below.

Endpoint | Mode | What you get
POST /v1/audio/speech | non-streaming (default) | One response with the complete audio file in the requested format (mp3/wav/opus/…). Returns once synthesis finishes.
POST /v1/audio/speech with stream:true | streaming | Raw PCM chunks are streamed as they decode (first audio in ~0.2–0.5 s). Requires response_format:"pcm" (24 kHz mono Int16LE, headerless).
POST /v1/audio/transcriptions | batch only | The whole file is transcribed and returned as one JSON body. There is no streaming flag because the STT model is not a streaming architecture (see ASR).
WS /ws/voice | realtime | Full-duplex voice agent: live STT partials plus true PCM TTS streamed back per sentence. See Voice WebSocket.
⚡ Streaming TTS requires PCM. The streamed bytes are headerless 24 kHz mono signed-16-bit little-endian samples. Other response_format values are only valid for the non-streaming response (the encoder needs the full waveform). To play raw PCM: ffplay -f s16le -ar 24000 -ac 1 out.pcm.
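Because the streamed bytes carry no header, most players will not open them directly. The sketch below shows one way to wrap such a buffer in a WAV container using only the Python standard library; the function names are illustrative and not part of the service, and the code assumes a little-endian host (the `wave` module byte-swaps on big-endian machines).

```python
import io
import wave

SAMPLE_RATE = 24_000   # Hz, as stated for the streaming response
CHANNELS = 1           # mono
SAMPLE_WIDTH = 2       # bytes per sample (signed 16-bit)


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap headerless Int16LE PCM in a WAV container and return the file bytes."""
    if len(pcm) % SAMPLE_WIDTH:
        raise ValueError("PCM length is not a whole number of 16-bit samples")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration implied by the byte count at 24 kHz mono 16-bit."""
    return len(pcm) / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)
```

One second of audio in this format is 48,000 bytes, which is a quick sanity check when a saved stream sounds too fast or too slow.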

Create speech

POST /v1/audio/speech

Generate expressive speech from text. The input may embed inline control tokens for emotion, prosody, speed, and sound effects.

Request body

Parameter | Type | Description
input (required) | string | Text to synthesize (1–5000 chars). May contain inline <|…|> control tags.
model (optional) | string | Model id / alias. Default higgs-audio-v3-tts (the served model).
voice (optional) | string | Preset voice name or custom voice id. Default default. Mutually exclusive with ref_audio.
response_format (optional) | string | One of mp3 (default), opus, pcm, wav, aac, flac. Streaming requires pcm.
stream (optional) | boolean | Stream raw PCM chunks as they decode. Requires response_format: "pcm". Default false.
ref_audio (optional) | string | Zero-shot voice cloning: an http(s) URL, data URI, or base64 audio (≤10 MB). See cloning.
ref_text (optional) | string | Transcript of ref_audio (recommended for quality).
temperature, top_k, top_p, max_new_tokens (optional) | number | Sampling controls forwarded to the engine. Recommended for cloning: temperature 0.8, top_k 50, max_new_tokens 1024.

Responses

200: audio in the requested format (audio/mpeg, audio/wav, audio/ogg, audio/L16 for pcm, …).
400: invalid or missing input.
502: upstream engine error.

Non-streaming (default): returns the full file

Omit stream (or set it to false). The server synthesizes the whole utterance and returns it as a single response body in your chosen response_format.

curl
# Emotional, slow, expressive; saved as MP3
curl $BASE/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "<|prosody:speed_slow|><|emotion:affection|>Hello little friend, you did great today!",
    "response_format": "mp3"
  }' --output hello.mp3
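For callers who prefer Python to curl, the same non-streaming request can be sketched with the standard library alone. The helper names and the example host `http://localhost:8000` are assumptions made for illustration; substitute the deployment's actual base URL.

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str,
                         response_format: str = "mp3") -> urllib.request.Request:
    """Build (but do not send) a POST /v1/audio/speech request."""
    if not 1 <= len(text) <= 5000:
        raise ValueError("input must be 1-5000 characters")
    payload = {"input": text, "response_format": response_format}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(base_url: str, text: str, out_path: str,
                       response_format: str = "mp3") -> None:
    """Send the request and write the complete audio body to out_path."""
    req = build_speech_request(base_url, text, response_format)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())


# Example (hypothetical host):
# synthesize_to_file("http://localhost:8000",
#                    "<|emotion:affection|>Hello little friend!", "hello.mp3")
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.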

Streaming: low-latency raw PCM

Set stream:true and response_format:"pcm". The server forwards raw PCM chunks as the engine decodes them; first audio typically arrives in ~0.2–0.5 s, so you can start playback before synthesis completes. The stream is headerless 24 kHz mono Int16LE (audio/L16); concatenate all chunks to get the full waveform.

⚠️ stream:true with any non-pcm response_format will not stream correctly, because encoded formats (mp3/opus/…) need the complete waveform. Always pair streaming with "response_format":"pcm", and use curl -N to disable output buffering.
curl
# -N disables curl buffering so PCM bytes land as they arrive
curl -N $BASE/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "<|emotion:enthusiasm|>Streaming straight to your speakers!",
    "stream": true,
    "response_format": "pcm"
  }' --output out.pcm

# Headerless 24 kHz mono s16le; play it with ffplay:
ffplay -f s16le -ar 24000 -ac 1 out.pcm
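When consuming the stream programmatically, a network read can end in the middle of a 16-bit sample. The following sketch re-chunks any binary stream so that every piece handed to a player contains whole samples. It is a generic helper under that assumption, not something the service provides, and the `chunk_size` default is arbitrary.

```python
import io
from typing import BinaryIO, Iterator

BYTES_PER_SAMPLE = 2  # Int16LE


def iter_pcm_samples(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield PCM chunks that always contain a whole number of 16-bit samples."""
    carry = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = carry + chunk
        usable = len(data) - (len(data) % BYTES_PER_SAMPLE)
        carry = data[usable:]          # hold back a dangling odd byte
        if usable:
            yield data[:usable]
    # A trailing odd byte, if any, is dropped: it is not a complete sample.


# Example with an in-memory stream; an HTTP response object from
# urllib.request.urlopen() can be passed the same way.
for piece in iter_pcm_samples(io.BytesIO(b"\x01\x00\x02\x00"), chunk_size=3):
    assert len(piece) % BYTES_PER_SAMPLE == 0
```

A small `chunk_size` keeps playback latency low, since a blocking read waits until that many bytes are available or the stream ends.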

Control tokens

Embed tags as <|category:value|>. Sentence-level tags (emotion, style, prosody speed/pitch/expressive) lead the line and color the whole sentence; inline tags (sfx, prosody pause) are placed at the exact spot. Only catalog values are recognized; anything else is read aloud literally.

Emotion (sentence-level): <|emotion:…|>

elation, amusement, enthusiasm, determination, pride, contentment, affection, relief, contemplation, confusion, surprise, awe, longing, arousal, anger, fear, disgust, bitterness, sadness, shame, helplessness

Prosody: <|prosody:…|>

speed_very_slow, speed_slow, speed_fast, speed_very_fast, pitch_low, pitch_high, expressive_high, expressive_low, pause, long_pause

Style: <|style:…|>   ·   Sound effects (inline): <|sfx:…|>

singing, shouting, whispering
laughter, cough, crying, screaming, burping, humming, sigh, sniff, sneeze
💡 sfx: put the tag immediately before a matching onomatopoeia, with no space, e.g. <|sfx:laughter|>Haha, glad you made it! For very slow delivery, insert <|prosody:long_pause|> between phrases. Full catalog and samples: Boson · Tags.
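Since unrecognized values are read aloud literally, it can help to validate tags before sending text. The sketch below builds tagged input from the emotion and prosody catalogs listed above; the helper names `tag` and `with_delivery` are hypothetical, and only those two catalogs are checked.

```python
EMOTIONS = {
    "elation", "amusement", "enthusiasm", "determination", "pride",
    "contentment", "affection", "relief", "contemplation", "confusion",
    "surprise", "awe", "longing", "arousal", "anger", "fear", "disgust",
    "bitterness", "sadness", "shame", "helplessness",
}
PROSODY = {
    "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
    "pitch_low", "pitch_high", "expressive_high", "expressive_low",
    "pause", "long_pause",
}
CATALOG = {"emotion": EMOTIONS, "prosody": PROSODY}


def tag(category: str, value: str) -> str:
    """Return <|category:value|>, rejecting values outside the known catalogs."""
    if category in CATALOG and value not in CATALOG[category]:
        raise ValueError(f"unknown {category} value: {value!r}")
    return f"<|{category}:{value}|>"


def with_delivery(text: str, emotion: str = "", speed: str = "") -> str:
    """Prefix a sentence with sentence-level speed and emotion tags."""
    prefix = ""
    if speed:
        prefix += tag("prosody", f"speed_{speed}")
    if emotion:
        prefix += tag("emotion", emotion)
    return prefix + text


print(with_delivery("Hello little friend, you did great today!",
                    emotion="affection", speed="slow"))
```

The printed string matches the input used in the non-streaming curl example.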

Voice cloning

Pass a short reference clip via ref_audio (plus ref_text) for zero-shot cloning from a single sample. This service translates both fields to the engine's references field.

curl $BASE/v1/audio/speech -H "Content-Type: application/json" -d '{
  "input": "Have a wonderful day!",
  "ref_audio": "https://example.com/sample.wav",
  "ref_text": "Hi, this is a sample of my voice.",
  "temperature": 0.8, "top_k": 50, "max_new_tokens": 1024
}' --output cloned.mp3
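When the reference clip is a local file rather than a URL, it has to be inlined. The sketch below encodes raw audio bytes as a data URI for ref_audio. Two assumptions are made here: that the 10 MB limit applies to the raw clip size, and that the server accepts the `audio/wav` media type in the URI. Neither is confirmed by this reference.

```python
import base64

MAX_REF_BYTES = 10 * 1024 * 1024  # assumed to apply to the raw clip


def ref_audio_data_uri(audio: bytes, mime: str = "audio/wav") -> str:
    """Encode a reference clip as a data URI suitable for the ref_audio field."""
    if not audio:
        raise ValueError("reference clip is empty")
    if len(audio) > MAX_REF_BYTES:
        raise ValueError("reference clip exceeds 10 MB")
    encoded = base64.b64encode(audio).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def cloning_payload(text: str, audio: bytes, ref_text: str) -> dict:
    """Request body using the sampling values recommended for cloning."""
    return {
        "input": text,
        "ref_audio": ref_audio_data_uri(audio),
        "ref_text": ref_text,
        "temperature": 0.8,
        "top_k": 50,
        "max_new_tokens": 1024,
    }
```

The resulting dict can be JSON-encoded and posted to /v1/audio/speech in place of the URL-based body shown in the curl example.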

Languages

Higgs Audio v3 TTS speaks 100+ languages and code-switches within a single utterance: write the input in any mix (e.g. "哇,你真棒!Let's try again, 准备好了吗?") and it is spoken naturally. Full list: Boson · Languages (102).

Transcriptions

POST /v1/audio/transcriptions

Transcribe speech to text; OpenAI Whisper-compatible. Built on Higgs Audio v3 STT (Whisper-large-v3 encoder + Qwen3 decoder), with a "thinking" mode for accuracy.

📦 Batch only, no streaming flag. The whole file is uploaded, transcribed, and returned as one JSON body. The Higgs STT stack is a Whisper-large-v3 encoder + Qwen3 decoder, not a frame-streaming architecture, so it cannot emit partial transcripts mid-utterance the way an RNN-T / streaming model can. For live transcription, the voice WebSocket approximates it by repeatedly re-transcribing the growing audio buffer (chunked pseudo-streaming).

Request (multipart/form-data)

Field | Type | Description
file (required) | file | Audio to transcribe (wav/mp3/webm/…), decoded server-side. Processed as 16 kHz mono internally.
model (optional) | string | Model id. Default higgs-stt.
response_format (optional) | string | json (default) → {"text": "…"}.

Example

curl
curl $BASE/v1/audio/transcriptions \
  -F "file=@speech.wav" \
  -F "model=higgs-stt"
🈶 Chinese speech is transcribed as pinyin (the ASR is English-centric), which is still useful as input to an LLM. English transcribes verbatim.
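The multipart upload can also be done from Python without third-party packages. This is a minimal sketch that hand-builds the multipart/form-data body with the two fields from the table above; the helper names and the `application/octet-stream` part type are illustrative choices, and the base URL is whatever the deployment uses.

```python
import json
import urllib.request
import uuid


def build_transcription_request(base_url: str, audio: bytes,
                                filename: str = "speech.wav",
                                model: str = "higgs-stt") -> urllib.request.Request:
    """Build (but do not send) a multipart POST /v1/audio/transcriptions request."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"--{boundary}--\r\n".encode("utf-8")
    body = model_part + file_header + audio + b"\r\n" + closing
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(base_url: str, audio: bytes, filename: str = "speech.wav") -> str:
    """Upload the audio and return the "text" field of the JSON response."""
    req = build_transcription_request(base_url, audio, filename)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]
```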

Realtime voice

WS /ws/voice

A full-duplex realtime voice agent over a single WebSocket. You stream microphone PCM up; the server returns live transcript partials, a streamed LLM reply, and synthesized speech PCM back down, all on the same socket. It chains the three Higgs/LLM stages into one low-latency loop.

Pipeline: what is truly streaming

Stage | Streaming? | How
STT (speech → text) | chunked pseudo-streaming | The growing mic buffer is re-transcribed every ~0.7 s and emitted as partial events; a clean final is produced on stop. Whisper-class models can't do true frame-by-frame streaming, so this approximates it by re-running on the accumulated audio.
LLM (DeepSeek V4 Flash) | true SSE token streaming | The reply is consumed token-by-token over server-sent events. Reasoning is disabled (thinking:{"type":"disabled"}) for low latency, and the text is emitted as per-sentence reply_delta events.
TTS (text → speech) | true PCM streaming | Each reply sentence is synthesized as soon as it arrives and pushed as raw 24 kHz mono Int16LE PCM, framed by audio_start / audio_end.
⏱ Measured on the live box: STT final ~0.8 s after you stop speaking, and time-to-first-audio ~1.7 s. Because TTS streams per sentence, playback begins well before the full reply is generated.
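The chunked pseudo-streaming idea can be illustrated in a few lines. This is a conceptual sketch only, not the server's implementation: it triggers a re-transcription each time another ~0.7 s of audio has accumulated (the real service may use a wall-clock timer instead), and `transcribe` is a stand-in callable for a batch STT call.

```python
from typing import Callable, Iterable, Iterator

MIC_RATE = 16_000          # Hz, mic input format
BYTES_PER_SAMPLE = 2       # Int16LE
PARTIAL_INTERVAL_MS = 700  # re-transcribe roughly every 0.7 s of audio


def pseudo_stream(frames: Iterable[bytes],
                  transcribe: Callable[[bytes], str]) -> Iterator[dict]:
    """Emit partial events by re-transcribing the whole buffer as it grows."""
    buffer = bytearray()
    step = MIC_RATE * BYTES_PER_SAMPLE * PARTIAL_INTERVAL_MS // 1000
    next_partial_at = step
    for frame in frames:
        buffer += frame
        if len(buffer) >= next_partial_at:
            # Each partial covers ALL audio so far, so earlier text may be revised.
            yield {"type": "partial", "text": transcribe(bytes(buffer))}
            next_partial_at = len(buffer) + step
    # On stop: one last pass over the complete buffer.
    yield {"type": "final", "text": transcribe(bytes(buffer))}
```

Every partial re-processes the whole buffer, so the cost of each pass grows with turn length. This is the trade-off of approximating streaming with a batch model.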

Message protocol

The socket carries both JSON text frames (control + events) and binary frames (raw PCM audio). Direction matters: you send mic audio up as binary, the server sends speech down as binary.

Client β†’ server

Message | Type | Meaning
{"type":"start", …} | JSON | Begin a recording turn. Fields: scenario_id (default "daily"), history (array of {role,content}). Resets the mic buffer and starts STT partials.
binary frame | bytes | Microphone audio: raw Int16LE PCM @16 kHz mono. Appended to the buffer while recording.
{"type":"stop"} | JSON | End the turn. Server runs a final transcription, emits final, then generates and speaks the reply.
{"type":"text", …} | JSON | Skip STT entirely and reply to typed text. Fields: text (required), scenario_id, history. Triggers the LLM → TTS reply immediately.
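A spoken turn is therefore a start message, a run of binary mic frames, and a stop message. The sketch below produces that sequence from a buffer of 16 kHz mono Int16LE PCM; the 100 ms frame size is an arbitrary choice for illustration, as this reference does not specify one.

```python
import json
from typing import Iterator, List, Optional, Union

MIC_RATE = 16_000      # Hz
BYTES_PER_SAMPLE = 2   # Int16LE
FRAME_MS = 100         # hypothetical frame duration


def voice_turn_messages(pcm: bytes, scenario_id: str = "daily",
                        history: Optional[List[dict]] = None
                        ) -> Iterator[Union[str, bytes]]:
    """Yield the client messages for one spoken turn, in send order."""
    yield json.dumps({"type": "start", "scenario_id": scenario_id,
                      "history": history or []})
    frame_bytes = MIC_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]   # bytes -> binary frame
    yield json.dumps({"type": "stop"})


# With the websockets library, str is sent as a text frame and bytes as a
# binary frame, so each item can be passed directly to `await ws.send(item)`.
```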

Server β†’ client

Message | Type | Meaning
{"type":"partial","text"} | JSON | Live STT: a re-transcription of the audio captured so far (may revise as more arrives).
{"type":"final","text"} | JSON | The settled transcript of the turn after stop.
{"type":"delivery", …} | JSON | Resolved Higgs delivery directive for the reply: tokens (the control-token prefix) plus any of emotion, speed, expressive, pitch, sfx.
{"type":"reply_delta","text"} | JSON | One sentence of the reply (emitted as the LLM streams; immediately precedes its audio).
{"type":"audio_start","sample_rate":24000} | JSON | The next binary frames are speech PCM at this sample rate (24 kHz mono Int16LE).
binary frame | bytes | Synthesized speech: raw Int16LE PCM @24 kHz mono. Concatenate to play.
{"type":"audio_end"} | JSON | No more audio frames for this reply.
{"type":"reply_done","text"} | JSON | The reply turn is complete; text is the full spoken reply.
{"type":"error","detail"} | JSON | An error occurred during the turn.
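On the receiving side, a client has to tell JSON events apart from binary audio and track when the turn is over. The class below is a small sketch of that bookkeeping based on the message types listed above. `ReplyCollector` is a hypothetical name, and it simply keeps every binary frame rather than enforcing the audio_start / audio_end framing.

```python
import json
from typing import Union


class ReplyCollector:
    """Accumulates one reply turn from /ws/voice server messages."""

    def __init__(self) -> None:
        self.transcript = None      # text of the "final" event, if any
        self.sentences = []         # reply_delta texts, in arrival order
        self.audio = bytearray()    # concatenated speech PCM
        self.sample_rate = None     # from audio_start
        self.done = False           # set by reply_done

    def feed(self, msg: Union[str, bytes]) -> None:
        if isinstance(msg, (bytes, bytearray)):
            self.audio += msg       # binary frame = speech PCM
            return
        evt = json.loads(msg)
        kind = evt["type"]
        if kind == "final":
            self.transcript = evt["text"]
        elif kind == "reply_delta":
            self.sentences.append(evt["text"])
        elif kind == "audio_start":
            self.sample_rate = evt["sample_rate"]
        elif kind == "reply_done":
            self.done = True
        elif kind == "error":
            raise RuntimeError(evt.get("detail", "unknown error"))
        # partial, delivery, and audio_end need no state in this sketch
```

Feeding each received message to `feed()` and stopping once `done` is true yields the transcript, the per-sentence reply text, and the playable PCM in one place.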

Python client

A minimal verifiable client using the websockets library (pip install websockets). It sends a typed turn (skipping STT), prints every event, and collects the streamed speech PCM into a file you can play with ffplay.

Python
import asyncio, json, websockets

# Replace $BASE with the deployment's base URL before running.
# wss:// for https hosts, ws:// for http
URL = "$BASE".replace("https://", "wss://").replace("http://", "ws://") + "/ws/voice"

async def main():
    async with websockets.connect(URL) as ws:
        # Typed turn: skips STT, goes straight to LLM → TTS
        await ws.send(json.dumps({
            "type": "text",
            "scenario_id": "daily",
            "text": "hello",
        }))

        audio = bytearray()
        async for msg in ws:
            if not isinstance(msg, str):
                audio += msg                 # binary frame = 24kHz mono Int16LE PCM
                continue
            evt = json.loads(msg)
            print(evt["type"], evt.get("text", ""))
            if evt["type"] == "reply_done":
                break

        # Use a context manager so the file is flushed and closed
        with open("reply.pcm", "wb") as f:
            f.write(audio)
        print(f"got {len(audio)} bytes of audio")
        # play: ffplay -f s16le -ar 24000 -ac 1 reply.pcm

asyncio.run(main())

Original model documentation