API reference
All endpoints are mounted at /v1/modules/scaispeak/ and authenticate with the standard ScaiGrid bearer token. Responses use ScaiGrid's standard envelope ({ "data": ... } for success, { "error": ... } for failures).
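As a rough illustration of the mount path, bearer authentication, and envelope described above, the following sketch builds an authenticated request and unwraps a decoded response. The host, the `ScaiSpeakError` class, and the helper names are assumptions made for this example; they are not part of ScaiSpeak or ScaiGrid.

```python
import urllib.request

BASE_PATH = "/v1/modules/scaispeak"


class ScaiSpeakError(Exception):
    """Hypothetical exception for responses carrying the { "error": ... } envelope."""

    def __init__(self, error):
        super().__init__(str(error))
        self.error = error


def build_request(host, path, token, method="GET"):
    """Build (but do not send) a request carrying the bearer token."""
    return urllib.request.Request(
        f"{host}{BASE_PATH}{path}",
        headers={"Authorization": f"Bearer {token}"},
        method=method,
    )


def unwrap(envelope):
    """Return the payload under "data", or raise if the envelope holds "error"."""
    if "error" in envelope:
        raise ScaiSpeakError(envelope["error"])
    return envelope["data"]
```

Sending the request (for example with `urllib.request.urlopen`) and decoding the JSON body is left to the caller; `unwrap` only handles the already-decoded envelope.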
Health
GET /healthz
Liveness — process is responding. Cheap; no I/O.
GET /readyz
Readiness — module can serve requests. Returns 200 when the module's upstream dependencies (managed TTS relay, ScaiInfer, Redis) are reachable enough to dispatch.
Voices: read
GET /voices
List voices visible to the caller (global + own tenant + own user). Query parameters:
| Parameter | Notes |
|---|---|
| language | 2-letter ISO code (en, fr, de...). |
| scope | global, tenant, user. |
| gender | female, male, neutral, unspecified. |
| embedding_status | pending, processing, ready, failed, evicted. |
| q | Free-text search over display_name, description, style_tags. |
| limit | 1-200, default 50. |
Permission: scaispeak:voice.read.
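A client can reject out-of-range filters before calling GET /voices. The sketch below encodes the parameters listed above into a query string; the function name and the choice to validate client-side are assumptions for illustration, and the server remains the authority on what it accepts.

```python
from urllib.parse import urlencode

SCOPES = {"global", "tenant", "user"}
GENDERS = {"female", "male", "neutral", "unspecified"}
EMBEDDING_STATUSES = {"pending", "processing", "ready", "failed", "evicted"}


def build_voices_query(language=None, scope=None, gender=None,
                       embedding_status=None, q=None, limit=50):
    """Return the query string for GET /voices, failing early on bad values."""
    if scope is not None and scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    if gender is not None and gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    if embedding_status is not None and embedding_status not in EMBEDDING_STATUSES:
        raise ValueError(f"unknown embedding_status: {embedding_status!r}")
    if not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")
    params = {"language": language, "scope": scope, "gender": gender,
              "embedding_status": embedding_status, "q": q, "limit": limit}
    # Drop unset filters so only the caller's explicit choices are sent.
    return urlencode({k: v for k, v in params.items() if v is not None})
```

For example, `build_voices_query(language="en", scope="tenant", limit=20)` yields `language=en&scope=tenant&limit=20`.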
GET /voices/{voice_id}
Fetch one voice's full record. Returns 404 if the voice doesn't exist OR isn't visible to the caller (existence isn't disclosed across scopes).
Voices: write
POST /voices
Create a voice. Two modes — pick one per request:
Cloned voice (from reference + consent). Multipart form fields:
| Field | Required | Notes |
|---|---|---|
| reference | one of | Multipart file part with the reference audio. |
| reference_scaidrive_json | one of | JSON {file_id, mcp_uri, share_url} pointing at a ScaiDrive file. |
| consent | one of | Multipart file part with the consent audio. |
| consent_scaidrive_json | one of | ScaiDrive reference for the consent recording. |
| consent_user_full_name | yes | Speaker's full name; written to the consent row. |
| consent_stated_purpose | yes | What the cloned voice will be used for; verbatim audit. |
| consent_text | yes | The exact scripted statement the speaker reads in the consent clip. |
Designed voice (text-only, no audio):
| Field | Required | Notes |
|---|---|---|
| voice_design_prompt | yes | Natural-language description of the speaker (roughly 12 characters or more). When set, reference/consent fields are forbidden; supplying both returns SCAISPEAK_AMBIGUOUS_INTAKE_MODE. |
Common fields (both modes):
| Field | Required | Notes |
|---|---|---|
| display_name | yes | Human-readable label. |
| language_primary | yes | 2-letter ISO code. |
| language_supported_json | no | JSON array of 2-letter codes the voice can speak. |
| gender_hint, age_hint, style_tags_json | no | Library metadata; advisory. |
| description | no | Free-text description (separate from voice_design_prompt). |
Returns 201 Created with the new voice. Cloned voices include a preflight block; designed voices land at embedding_status: ready immediately.
Errors: SCAISPEAK_VOICE_PREFLIGHT_FAILED (audio rejected), SCAISPEAK_AMBIGUOUS_SOURCE (inline + ScaiDrive for the same file), SCAISPEAK_AMBIGUOUS_INTAKE_MODE (cloned + designed in the same request), SCAISPEAK_CONSENT_REQUIRED (cloned mode missing consent text fields), SCAISPEAK_CONSENT_INVALID (consent audio missing or doesn't match the script).
Permission: scaispeak:voice.write.
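The intake rules for POST /voices can be mirrored in a client-side pre-check. The sketch below is an assumption-laden illustration: the function name is hypothetical, the order in which the server evaluates these conditions is not documented here, "reference/consent fields" is read as including the consent text fields, and a missing reference clip is not covered because the reference does not name an error for it.

```python
def check_intake_fields(fields):
    """Return the error code a request would likely trigger, or None.

    `fields` is the collection of form-field names the caller intends to send.
    """
    fields = set(fields)
    audio = {"reference", "reference_scaidrive_json",
             "consent", "consent_scaidrive_json"}
    consent_text = {"consent_user_full_name", "consent_stated_purpose",
                    "consent_text"}

    if "voice_design_prompt" in fields:
        # Designed mode: no reference or consent fields may accompany the prompt.
        if fields & (audio | consent_text):
            return "SCAISPEAK_AMBIGUOUS_INTAKE_MODE"
        return None

    # Cloned mode: each file comes from exactly one source.
    for inline, drive in (("reference", "reference_scaidrive_json"),
                          ("consent", "consent_scaidrive_json")):
        if inline in fields and drive in fields:
            return "SCAISPEAK_AMBIGUOUS_SOURCE"
    if not consent_text <= fields:
        return "SCAISPEAK_CONSENT_REQUIRED"
    if not fields & {"consent", "consent_scaidrive_json"}:
        return "SCAISPEAK_CONSENT_INVALID"
    return None
```

A pre-check like this only saves a round trip; the server still performs audio preflight and script matching, which cannot be reproduced from field names alone.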
PATCH /voices/{voice_id}
Partial update. Settable fields:
- display_name, description, language_supported, gender_hint, age_hint, style_tags: library metadata.
- voxcpm2_reference_transcript: verbatim transcript of the reference clip. When set, upgrades the voice to higher-quality cloning. Send "" to clear.
- voice_design_prompt: natural-language voice description (for designed voices, or to switch a cloned voice into design-only mode). Send "" to clear.
- clear_reference: boolean. When true (alongside a non-empty voice_design_prompt), the voice's reference clip + consent record are tombstoned and the voice becomes design-only. Returns SCAISPEAK_DESIGN_PROMPT_REQUIRED_ON_CLEAR if the prompt is missing.
- pronunciation_overrides: list of whole-word substitution rules applied before language-specific text expansion. See Text normalisation and pronunciation overrides below. Send [] to clear all rules; omit to leave unchanged.
Scope mutation is not allowed here — use /share. Permission: scaispeak:voice.write.
POST /voices/{voice_id}/reference
Replace a cloned voice's reference clip with a fresh recording, capturing new consent. Multipart form fields: reference, consent, consent_user_full_name, consent_stated_purpose, consent_text (same shape as the cloned POST /voices body).
Old reference + consent blobs are tombstoned by the backend. New consent recording is required because the audio is changing (we can't verify it's still the same person without re-capturing).
Permission: scaispeak:voice.write + ownership (or SuperAdmin). 200 OK with the updated voice on success.
DELETE /voices/{voice_id}
Erase the voice (GDPR Art. 17). Tombstones the row, fans out EvictVoice to every warm replica, clears the Redis registry, deletes reference + consent blobs, writes an immutable erasure_audit row.
Permission: scaispeak:voice.write.
POST /voices/{voice_id}/share
Promote a user-scope voice to tenant scope. Permission: scaispeak:voice.share (separate from voice.write so sharing can be granted independently).
POST /voices/{voice_id}/preview
Render a short preview clip (max 300 chars). Form fields: text, response_format. Uses the same dispatcher as /speak. Permission: scaispeak:voice.read.
POST /voices/{voice_id}/repromote
Re-run intake processing for a voice. Idempotent — no-op if ready, no-op if already processing. Used to bring legacy voices (created under the previous-generation cloning engine) onto the current zero-shot path. Returns 202 Accepted. Permission: scaispeak:voice.write.
WS /voices/record
Live-record voice intake — WebSocket alternative to POST /voices. Two-phase: first reference audio frames + phase_complete, then consent audio frames + finalize. Auth via ?token= query or Authorization header. Permission: scaispeak:voice.write.
Speak
POST /speak
Batch synthesis. Body:
| Field | Required | Notes |
|---|---|---|
| voice_id | yes | A voice the caller can see. |
| text | yes | Up to ~500 chars sync, longer async. |
| language_hint | no | 2-letter code to disambiguate multilingual voices. |
| speed | no | 0.5–2.0, default 1.0. |
| response_format | no | mp3, opus, wav, flac, aac, pcm. Default mp3. Self-hosted backend currently emits 48 kHz WAV regardless of this field and logs a downgrade warning if the requested format differs; see Troubleshooting. |
| backend_preference | no | prefer_self_hosted, prefer_relay, any. Advisory; tenant policy wins. |
| idempotency_key | no | Caller-supplied retry key for the output cache. |
| force_async | no | Force the job path regardless of text length. |
| save_to | no | ScaiDrive destination block (see below). JWT auth required. |
| inline_response | no | When save_to is set, return audio bytes too (default true). |
| instructions | no | Free-text style guidance (emotion / pace / affect). Example: "cheerful and energetic" or "slowly and carefully". Meaningful for cloned voices; preset speakers and the relay backend ignore this field. |
| cfg_value | no | Cloning-fidelity vs naturalness tradeoff. Range 0.5–5.0. Higher values stay closer to the reference voice at the cost of naturalness. Engine default ~2.0 when omitted. Meaningful for cloned voices only. |
| warmup_trim_ms | no | Strip the first N ms of generated audio to absorb the warm-up artefact at the start of cloned-voice output. Typical: 150. Use 0 to disable. Meaningful for cloned voices only. |
| normalize_text | no | Run the text-prep pipeline (strip emoji / markdown / URLs, apply tenant + voice pronunciation overrides, expand dates / times / numbers / currency for the voice's language). true / false overrides per request; omit to use the tenant default set via PUT /admin/policy. Supported expander languages: en, nl, de, fr (others pass through the strip + overrides stages only). |
Short text (default ≤500 chars) returns 200 OK with audio_base64 inline. Longer text returns 202 Accepted with job_id — poll /speak/jobs/{job_id}.
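The save_to block identifies the ScaiDrive destination (share and folder) and carries an overwrite flag; the SCAISPEAK_SCAIDRIVE_* codes under Errors describe how each can fail.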
Permission: scaispeak:synthesize.
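The split between the 200 and 202 responses of POST /speak can be handled with a small dispatcher. This is a minimal sketch under assumptions: the body is taken to be JSON, `audio_base64` and `job_id` are assumed to sit directly under the envelope's `data` key, and both function names are hypothetical.

```python
import base64

FORMATS = {"mp3", "opus", "wav", "flac", "aac", "pcm"}


def build_speak_body(voice_id, text, speed=1.0, response_format="mp3", **extra):
    """Assemble a /speak body, checking the documented ranges first."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be within 0.5-2.0")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    cfg_value = extra.get("cfg_value")
    if cfg_value is not None and not 0.5 <= cfg_value <= 5.0:
        raise ValueError("cfg_value must be within 0.5-5.0")
    return {"voice_id": voice_id, "text": text, "speed": speed,
            "response_format": response_format, **extra}


def interpret_speak_response(status, data):
    """Map a /speak response to ("audio", bytes) or ("job", job_id)."""
    if status == 200:
        # Synchronous path: audio arrives inline, base64-encoded.
        return "audio", base64.b64decode(data["audio_base64"])
    if status == 202:
        # Asynchronous path: the caller polls /speak/jobs/{job_id}.
        return "job", data["job_id"]
    raise ValueError(f"unexpected status {status}")
```

Branching on the status code rather than on text length keeps the client correct when `force_async` is set or when the deployment's sync threshold differs from the default.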
GET /speak/jobs/{job_id}
Poll an async synth job. Returns status (queued, running, completed, failed), and when complete, audio_base64 inline (for small outputs) or audio_bytes + S3 URI for larger ones. If the job was submitted with save_to, the response also carries save_to.file_id once the upload finishes. Permission: scaispeak:synthesize, scoped to (user, tenant) — you can't poll another user's job by ID guess.
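Polling can be sketched as a loop that stops on either terminal status. The `fetch_status` callable is a hypothetical, caller-supplied function that performs GET /speak/jobs/{job_id} and returns the unwrapped payload; the interval and attempt limit are arbitrary illustrative defaults, not documented values.

```python
import time


def poll_job(fetch_status, job_id, interval_s=1.0, max_attempts=60,
             sleep=time.sleep):
    """Poll until the job is completed or failed; return the final payload."""
    for _ in range(max_attempts):
        job = fetch_status(job_id)
        if job["status"] in ("completed", "failed"):
            return job
        # queued or running: wait before asking again.
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

Injecting `sleep` makes the loop testable without real delays; a production client might also add backoff between attempts.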
Streaming: WebSocket
WS /stream/speak
Real-time TTS over WebSocket. Wire protocol:
| Client → Server | Fields |
|---|---|
| {"type":"open"} | voice_id, language_hint, speed, output.codec, backend_preference |
| {"type":"text"} | delta |
| {"type":"flush"} | (none) |
| {"type":"interrupt"} | (none) |
| {"type":"close"} | (none) |

| Server → Client | Fields |
|---|---|
| {"type":"ready"} | voice_id, backend_used |
| binary frame | audio bytes in the negotiated codec |
| {"type":"interrupted"} | (none) |
| {"type":"closed"} | stats.chars, stats.backend_used |
| {"type":"error"} | code, message |
Close codes: 4401 unauthorized, 4403 forbidden, 4400 bad request, 4502 backend unavailable, 4500 server error. Auth via ?token= or header. Permission: scaispeak:synthesize.
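The wire protocol above can be exercised without committing to a particular WebSocket library by separating message construction from transport. In this sketch the helper names are hypothetical, fields are assumed to sit at the top level of each message next to `type`, and `output.codec` is assumed to denote a nested `output` object; none of that is confirmed by the reference.

```python
import json

# Close codes listed in the reference, keyed by numeric code.
CLOSE_CODES = {
    4400: "bad request",
    4401: "unauthorized",
    4403: "forbidden",
    4500: "server error",
    4502: "backend unavailable",
}


def open_message(voice_id, codec, language_hint=None, speed=None,
                 backend_preference=None):
    """Serialise the opening message; unset optional fields are left out."""
    msg = {"type": "open", "voice_id": voice_id, "output": {"codec": codec}}
    optional = {"language_hint": language_hint, "speed": speed,
                "backend_preference": backend_preference}
    msg.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(msg)


def text_message(delta):
    """Serialise one chunk of text to synthesise."""
    return json.dumps({"type": "text", "delta": delta})


def control_message(kind):
    """Serialise a field-less control message: flush, interrupt, or close."""
    if kind not in ("flush", "interrupt", "close"):
        raise ValueError(f"unknown control message: {kind!r}")
    return json.dumps({"type": kind})


def classify_frame(frame):
    """Binary frames are audio; text frames are JSON events keyed by type."""
    if isinstance(frame, (bytes, bytearray)):
        return "audio", bytes(frame)
    event = json.loads(frame)
    return event["type"], event
```

A typical exchange would send `open_message(...)`, wait for a frame classified as `ready`, then alternate `text_message` and `control_message("flush")` while collecting `audio` frames.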
Streaming: WebRTC
Status: signalling and lifecycle ship end-to-end. The audio plane (aiortc MediaStreamTrack.recv) raises NotImplementedError today, so once a peer connection negotiates, no audio drains to the backend. Use the WebSocket streaming endpoints for production until this caveat is removed.
POST /stream/speak/webrtc/sessions
Create a WebRTC session. Body:
| Field | Notes |
|---|---|
| voice_id | required |
| language_hint | optional 2-letter code |
| speed | 0.5–2.0 |
| output.codec | opus or pcm |
| output.sample_rate | 8000–48000 |
| control.transport | websocket or datachannel |
| ice_servers | optional tenant-supplied ICE config |
| backend_preference | same vocabulary as /speak |
Returns session_id, ice_servers, expires_at, control_ws_url. Permission: scaispeak:synthesize.
POST /stream/speak/webrtc/sessions/{session_id}/offer
Apply client SDP offer, return server's SDP answer.
POST /stream/speak/webrtc/sessions/{session_id}/ice-candidates
Trickle ICE candidate from client. Returns 204 No Content.
DELETE /stream/speak/webrtc/sessions/{session_id}
Tear down the peer + mark session closed.
WS /stream/speak/webrtc/sessions/{session_id}/control
Control plane for an active WebRTC session — same text/flush/interrupt/close vocabulary as the WebSocket streaming path, no binary audio frames (audio rides RTP).
Voice warming
GET /voices/{voice_id}/warm
Inspect current warm state. Returns warm_node_ids, candidate_node_ids, stale_node_ids. Permission: scaispeak:voice.read.
POST /voices/{voice_id}/warm
Fan-out PrepareVoice to candidate replicas. Body: { "node_ids": [...] } (empty means "all candidates"). Returns outcomes array with per-node ok, cache_key, load_ms, error. Permission: scaispeak:voice.write.
POST /voices/{voice_id}/evict
Drop the voice from every currently-warm replica. Always clears the registry. Permission: scaispeak:voice.write.
Tenant policy
GET /admin/policy
Read the caller's tenant policy. Returns:
- allowed_backends: subset of ["A","B"].
- default_backend: A or B.
- tokeniser_backend: legacy or scaiinfer.
- text_normalization_default: tenant default for the per-request normalize_text flag on POST /speak.
- pronunciation_overrides: tenant-wide pronunciation rules (or null when none are set). Same shape as the per-voice list.
Permission: scaispeak:synthesize — readable by any caller who can synthesise so UIs can show "your tenant routes through Backend B".
PUT /admin/policy
Update the tenant policy. Body (all fields optional; omitted = leave unchanged):
- allowed_backends: string shorthand "A" / "B" / "AB" or list.
- default_backend: A or B. Must be in allowed_backends.
- tokeniser_backend: legacy or scaiinfer.
- text_normalization_default: boolean.
- pronunciation_overrides: list of rules. Send [] to clear all rules; non-empty list replaces the whole set; omit to leave unchanged.
Validation rejects default_backend not in allowed_backends. Permission: scaispeak:admin.
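The shorthand for allowed_backends and the default-in-allowed rule can be checked client-side before a PUT. The sketch below uses hypothetical function names and takes the tenant's current allowed set as an argument; what the server does when an update narrows allowed_backends without restating default_backend is not documented here and is not modelled.

```python
def normalise_allowed_backends(value):
    """Accept the string shorthand ("A", "B", "AB") or a list; return a list."""
    backends = list(value)  # list("AB") -> ["A", "B"]
    if not backends:
        raise ValueError("allowed_backends must not be empty")
    for backend in backends:
        if backend not in ("A", "B"):
            raise ValueError(f"unknown backend: {backend!r}")
    return backends


def validate_policy_update(update, current_allowed):
    """Return a cleaned update dict, or raise ValueError on a bad combination."""
    cleaned = dict(update)
    allowed = list(current_allowed)
    if "allowed_backends" in update:
        allowed = normalise_allowed_backends(update["allowed_backends"])
        cleaned["allowed_backends"] = allowed
    default = update.get("default_backend")
    if default is not None and default not in allowed:
        raise ValueError("default_backend must be in allowed_backends")
    return cleaned
```

Omitted fields stay out of the cleaned dict, matching the "omitted = leave unchanged" semantics of the endpoint.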
Text normalisation and pronunciation overrides
ScaiSpeak ships an optional text-preprocessing pipeline that runs before dispatch when normalize_text is true (per request or via tenant default). Three stages, in order:
- Strip noise: emoji, zero-width characters, markdown emphasis (**bold**, _italic_, ~~strike~~, backticks), bare URLs (rewritten to "link"), bullet glyphs at line start. Language-agnostic.
- Pronunciation overrides: whole-word substitution rules from the tenant policy, then per-voice rules layered on top. Tenant rules run first; voice rules can refine or override for one specific voice.
- Expand: date / time / number / currency rendering for the voice's primary language. Supported: en, nl, de, fr. Unsupported languages skip this stage but still benefit from strip + overrides.
Pronunciation rule shape:
- pattern: required, matched as a whole word (Unicode word boundaries; k8s won't match inside k8short).
- replacement: required, written into the text verbatim. Multiple words allowed.
- case_sensitive: optional, defaults to true. Set false for acronyms / brand names where casing varies in caller input.
Bad rules (missing fields, empty patterns) are skipped silently — one operator typo doesn't break a tenant's synth pipeline.
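The override stage can be approximated with regular expressions. This is a sketch of the described behaviour, not ScaiSpeak's implementation: the function names are hypothetical, whole-word matching is approximated with `\w` lookarounds, and "layering" is modelled simply as running voice rules after tenant rules on the already-substituted text.

```python
import re


def compile_rules(rules):
    """Turn rule dicts into (regex, replacement) pairs, skipping malformed ones."""
    compiled = []
    for rule in rules or []:
        if not isinstance(rule, dict):
            continue
        pattern = rule.get("pattern")
        replacement = rule.get("replacement")
        if not pattern or replacement is None:
            continue  # bad rule: skipped silently
        flags = 0 if rule.get("case_sensitive", True) else re.IGNORECASE
        # Lookarounds on \w approximate whole-word matching; \w is
        # Unicode-aware for str patterns in Python 3.
        regex = re.compile(rf"(?<!\w){re.escape(pattern)}(?!\w)", flags)
        compiled.append((regex, replacement))
    return compiled


def apply_overrides(text, tenant_rules, voice_rules):
    """Apply tenant rules first, then voice rules on top."""
    for regex, replacement in compile_rules(tenant_rules) + compile_rules(voice_rules):
        # A function replacement inserts the string verbatim, with no
        # backslash or group-reference expansion.
        text = regex.sub(lambda _match, r=replacement: r, text)
    return text
```

With the rule `{"pattern": "k8s", "replacement": "kubernetes"}`, the input "Deploy k8s now, not k8short." becomes "Deploy kubernetes now, not k8short."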
Examples of what the expansion stage produces for a voice with language_primary="en":
- 2026-05-23 → "the twenty third of May, two thousand twenty six"
- 17:30 → "seventeen thirty"
- 5:30 PM → "five thirty PM"
- $42.50 → "forty-two dollars and fifty cents"
- £1,234.56 → "one thousand, two hundred and thirty-four pounds and fifty-six pence"
Equivalent input for a voice with language_primary="nl":
- 2026-05-23 → "drieëntwintig mei tweeduizendzesentwintig"
- 17:30 → "zeventien uur dertig"
- €42,50 → "tweeënveertig euro en vijftig cent"
Locale convention: DD/MM/YYYY is the assumed convention (matches nl, de, fr, UK English; US callers should use ISO YYYY-MM-DD to disambiguate). Decimal/thousands separators follow the voice's language — $42.50 for en, €42,50 for nl/de/fr.
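The date convention is easiest to see with an ambiguous token. The following sketch parses slash-separated dates day-first and accepts ISO dates unambiguously; the function name and the exact patterns accepted are assumptions for illustration and may differ from what the expander recognises.

```python
import re
from datetime import date


def parse_date_token(token):
    """Parse ISO YYYY-MM-DD or day-first DD/MM/YYYY; return None otherwise."""
    iso = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", token)
    if iso:
        year, month, day = map(int, iso.groups())
    else:
        slashed = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", token)
        if not slashed:
            return None
        # Day-first: 03/04/2026 is 3 April, not 4 March.
        day, month, year = map(int, slashed.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # e.g. month 13
```

Here `parse_date_token("03/04/2026")` gives 3 April 2026, which is why US callers who mean 4 March should send `2026-03-04` instead.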
The pipeline runs after voice resolution + blocklist + the output-cache check, before dispatch — so cache keys stay on the raw caller-provided text and the synthesised audio reflects the normalised text.
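The ordering described above (cache lookup on raw text, normalisation only just before dispatch) can be sketched as follows. The key layout, the in-memory dict cache, the write-back after dispatch, and the `normalise` / `dispatch` callables are all hypothetical; the reference only states that cache keys use the raw caller-provided text.

```python
import hashlib


def synthesise(voice_id, raw_text, cache, normalise, dispatch):
    """Check the cache on raw text; normalise only on a miss, before dispatch."""
    # Hypothetical key layout built from the voice and the raw caller text.
    key = hashlib.sha256(f"{voice_id}\x00{raw_text}".encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key]
    # Cache miss: the backend receives the normalised text.
    audio = dispatch(voice_id, normalise(raw_text))
    cache[key] = audio
    return audio
```

Keying on the raw text means a repeated request hits the cache without paying for normalisation again.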
ScaiDrive proxy
GET /admin/scaidrive/shares
Read-only forwarding to ScaiDrive — list shares the caller can see. Used by the synth page destination picker. Requires JWT auth (not sgk_). Returns 404 with SCAISPEAK_SCAIDRIVE_NOT_AVAILABLE when ScaiDrive isn't configured in the deployment.
GET /admin/scaidrive/shares/{share_id}/folders
Lazy-browse folders inside a share. Query: folder_id (omit for the share root). Returns folder children only.
Admin lifecycle
POST /admin/lifecycle/install
First-time install hook called by the module-host. Idempotent. SuperAdmin-only.
POST /admin/lifecycle/upgrade
Version upgrade hook. Idempotent. SuperAdmin-only.
POST /admin/lifecycle/uninstall
Module uninstall — soft-deletes every non-global voice in the deployment, signals the erasure worker to fan out. Requires confirmation_token + expected_module_id. SuperAdmin-only.
POST /admin/lifecycle/tenant/{tenant_id}/enable
Per-tenant enable. SuperAdmin-only.
POST /admin/lifecycle/tenant/{tenant_id}/disable
Per-tenant disable — soft-deletes all the tenant's user + tenant scope voices and signals erasure. Global voices untouched. SuperAdmin-only.
Blocklist + audit
POST /admin/blocklist
Add a blocklist entry. Body: scope (tenant, user, voice), target_id, reason, optional expires_at. Permission: scaispeak:admin.
GET /admin/blocklist
List active blocklist entries. Query: scope, tenant_id, limit. Permission: scaispeak:admin.
DELETE /admin/blocklist/{block_id}
Remove a blocklist entry (manual unblock). Returns 204 No Content. Permission: scaispeak:admin.
GET /admin/erasure/audit
List erasure audit rows. Query: tenant_id, voice_id, limit. Returns most-recent-first. Permission: scaispeak:admin.
Global voices (SuperAdmin)
POST /admin/voices/global
Create a platform-scope (scope='global') voice — no consent, license-based. SuperAdmin-only. Form fields:
| Field | Required | Notes |
|---|---|---|
| reference | yes | Multipart reference audio. ScaiDrive references not accepted for globals. |
| display_name, language_primary | yes | Same shape as user voices. |
| licensor_name | yes | Who licensed the voice to ScaiLabs. |
| license_type | yes | perpetual, time_bound, usage_bound. |
| valid_until | when time_bound | ISO-8601 timestamp. |
| usage_limit_chars | when usage_bound | Integer cap. |
| licensor_reference | no | Contract reference. |
| valid_from | no | ISO-8601 start. |
| terms_summary | no | Operator-facing summary of the terms. |
| license_document | no | Optional PDF; stored alongside the voice. |
Returns the new voice_id, license_id, and intake note.
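The conditional requirements on the license fields lend themselves to a pre-submit check. This sketch only mirrors the table above; the function name and the problem messages are hypothetical, and the server may apply further checks before returning SCAISPEAK_LICENSE_FIELD_INVALID.

```python
LICENSE_TYPES = ("perpetual", "time_bound", "usage_bound")


def check_license_fields(license_type, valid_until=None, usage_limit_chars=None):
    """Return a list of problems; an empty list means the fields look consistent."""
    problems = []
    if license_type not in LICENSE_TYPES:
        problems.append(f"unknown license_type: {license_type!r}")
    if license_type == "time_bound" and not valid_until:
        problems.append("valid_until is required when license_type is time_bound")
    if license_type == "usage_bound" and usage_limit_chars is None:
        problems.append("usage_limit_chars is required when license_type is usage_bound")
    return problems
```

Returning a list rather than raising lets an operator UI show every problem at once.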
DELETE /admin/voices/global/{voice_id}
Revoke a global voice. SuperAdmin-only. Form field trigger (license_revoked, license_expired, platform_decision). Bypasses the owner-equality check that protects user/tenant voices. Updates the license row's status to match the trigger. Runs the full erasure pipeline.
Errors
All endpoints return ScaiGrid's standard error envelope ({ "error": ... }), as noted at the top of this reference.
ScaiSpeak-specific codes:
| Code | Meaning |
|---|---|
| SCAISPEAK_VOICE_NOT_FOUND | Voice id doesn't exist or isn't visible. |
| SCAISPEAK_VOICE_ACCESS_DENIED | Caller can't perform this operation on this voice. |
| SCAISPEAK_VOICE_PREFLIGHT_FAILED | Reference audio failed quality checks. Body includes preflight. |
| SCAISPEAK_CONSENT_INVALID | Consent recording missing or doesn't match the scripted text. |
| SCAISPEAK_AMBIGUOUS_SOURCE | Both inline upload and ScaiDrive reference supplied for the same file. |
| SCAISPEAK_VOICE_SHARE_FORBIDDEN | Only the owner with voice.share can promote to tenant scope. |
| SCAISPEAK_BACKEND_UNAVAILABLE | No allowed backend currently available. |
| SCAISPEAK_TENANT_POLICY_INVALID | Policy update rejected (e.g. default not in allowed set). |
| SCAISPEAK_JOB_NOT_FOUND | Job id doesn't exist or doesn't belong to this caller. |
| SCAISPEAK_VOICE_NOT_READY_FOR_WARMING | Legacy warming path returns this when the voice doesn't have the cached state the previous-gen engine needed. No-op on the current zero-shot engine; safe to ignore for new code. |
| SCAISPEAK_SAVE_TO_REQUIRES_JWT | save_to attempted with sgk_ API key auth. |
| SCAISPEAK_SAVE_TO_EXCHANGE_FAILED | ScaiKey token exchange against ScaiDrive failed. |
| SCAISPEAK_SCAIDRIVE_NOT_AVAILABLE | ScaiDrive integration not configured. |
| SCAISPEAK_SCAIDRIVE_FORBIDDEN | Caller lacks write access on the destination share. |
| SCAISPEAK_SCAIDRIVE_NOT_FOUND | Destination share or folder doesn't exist. |
| SCAISPEAK_SCAIDRIVE_CONFLICT | File exists at destination and overwrite is false. |
| SCAISPEAK_SCAIDRIVE_QUOTA_EXCEEDED | Destination share over quota (HTTP 507). |
| SCAISPEAK_LICENSE_FIELD_INVALID | License-type / bound mismatch on global voice create. |
| SCAISPEAK_GLOBAL_VOICE_NOT_FOUND | Global voice doesn't exist or already deleted. |
| SCAISPEAK_BLOCKLIST_NOT_FOUND | Blocklist entry id doesn't exist. |
| SCAISPEAK_UNINSTALL_TOKEN_MISMATCH | Uninstall hook called without a matching token. |