API reference

All endpoints are mounted at /v1/modules/scaispeak/ and authenticate with the standard ScaiGrid bearer token. Responses use ScaiGrid's standard envelope ({ "data": ... } for success, { "error": ... } for failures).
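
A minimal client-side sketch of the envelope handling described above, using only the Python standard library. The names `ScaiSpeakError`, `build_request`, and `unwrap` are hypothetical helpers introduced for illustration, the host name is a placeholder, and the `Authorization: Bearer ...` header form is an assumption based on the phrase "bearer token".

```python
import json
import urllib.request

BASE_PATH = "/v1/modules/scaispeak"  # mount point stated in this reference


class ScaiSpeakError(Exception):
    """Hypothetical exception carrying the fields of the error envelope."""

    def __init__(self, code, message, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details or {}


def build_request(host, path, token, method="GET"):
    """Build (but do not send) an authenticated request.

    Assumption: the bearer token travels in a standard Authorization header.
    """
    url = f"https://{host}{BASE_PATH}{path}"
    return urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {token}"}
    )


def unwrap(body):
    """Return the `data` payload, or raise ScaiSpeakError for an error envelope."""
    envelope = json.loads(body)
    if "error" in envelope:
        err = envelope["error"]
        raise ScaiSpeakError(err.get("code"), err.get("message"), err.get("details"))
    return envelope["data"]
```

Callers can then branch on `ScaiSpeakError.code` instead of parsing message text.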

Health#

`GET /healthz`#

Liveness — process is responding. Cheap; no I/O.

`GET /readyz`#

Readiness — module can serve requests. Returns 200 when the module's upstream dependencies (managed TTS relay, ScaiInfer, Redis) are reachable enough to dispatch.

Voices — read#

`GET /voices`#

List voices visible to the caller (global + own tenant + own user). Query parameters:

Parameter	Notes
`language`	2-letter ISO code (`en`, `fr`, `de`...).
`scope`	`global`, `tenant`, `user`.
`gender`	`female`, `male`, `neutral`, `unspecified`.
`embedding_status`	`pending`, `processing`, `ready`, `failed`, `evicted`.
`q`	Free-text search over `display_name`, `description`, `style_tags`.
`limit`	1-200, default 50.

Permission: scaispeak:voice.read.
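
As an illustration of the query parameters above, the sketch below builds a query string and rejects out-of-range values before a request is sent. `voices_query` is a hypothetical helper, not part of any ScaiSpeak SDK; the allowed values are copied from the table.

```python
from urllib.parse import urlencode

# Allowed values copied from the parameter table above.
ENUMS = {
    "scope": {"global", "tenant", "user"},
    "gender": {"female", "male", "neutral", "unspecified"},
    "embedding_status": {"pending", "processing", "ready", "failed", "evicted"},
}


def voices_query(limit=50, **filters):
    """Return a query string for GET /voices, validating values client-side."""
    if not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")
    params = {}
    for name in ("language", "scope", "gender", "embedding_status", "q"):
        value = filters.pop(name, None)
        if value is None:
            continue
        if name in ENUMS and value not in ENUMS[name]:
            raise ValueError(f"{name} must be one of {sorted(ENUMS[name])}")
        params[name] = value
    if filters:
        # Anything left over is not a documented parameter.
        raise ValueError(f"unknown parameters: {sorted(filters)}")
    params["limit"] = limit
    return urlencode(params)
```

For example, `voices_query(language="en", scope="tenant", limit=10)` yields a string that can be appended to `/voices?`.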

`GET /voices/{voice_id}`#

Fetch one voice's full record. Returns 404 if the voice doesn't exist or isn't visible to the caller (existence isn't disclosed across scopes).

Voices — write#

`POST /voices`#

Create a voice. Two modes — pick one per request:

Cloned voice (from reference + consent). Multipart form fields:

Field	Required	Notes
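`reference`	one of `reference` / `reference_scaidrive_json`	Multipart file part with the reference audio.
`reference_scaidrive_json`	one of `reference` / `reference_scaidrive_json`	JSON `{file_id, mcp_uri, share_url}` pointing at a ScaiDrive file.
`consent`	one of `consent` / `consent_scaidrive_json`	Multipart file part with the consent audio.
`consent_scaidrive_json`	one of `consent` / `consent_scaidrive_json`	ScaiDrive reference for the consent recording.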
`consent_user_full_name`	yes	Speaker's full name; written to the consent row.
`consent_stated_purpose`	yes	What the cloned voice will be used for; stored verbatim for audit.
`consent_text`	yes	The exact scripted statement the speaker reads in the consent clip.

Designed voice (text-only, no audio):

Field	Required	Notes
`voice_design_prompt`	yes	Natural-language description of the speaker (roughly 12 characters or more). When set, the reference and consent fields are forbidden; supplying both returns `SCAISPEAK_AMBIGUOUS_INTAKE_MODE`.

Common fields (both modes):

Field	Required	Notes
`display_name`	yes	Human-readable label.
`language_primary`	yes	2-letter ISO code.
`language_supported_json`	no	JSON array of 2-letter codes the voice can speak.
`gender_hint`, `age_hint`, `style_tags_json`	no	Library metadata; advisory.
`description`	no	Free-text description (separate from `voice_design_prompt`).

Returns 201 Created with the new voice. Cloned voices include a preflight block; designed voices land at embedding_status: ready immediately.

Errors: SCAISPEAK_VOICE_PREFLIGHT_FAILED (audio rejected), SCAISPEAK_AMBIGUOUS_SOURCE (inline + ScaiDrive for the same file), SCAISPEAK_AMBIGUOUS_INTAKE_MODE (cloned + designed in the same request), SCAISPEAK_CONSENT_REQUIRED (cloned mode missing consent text fields), SCAISPEAK_CONSENT_INVALID (consent audio missing or doesn't match the script).

Permission: scaispeak:voice.write.
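
The intake rules above can be sketched as a client-side pre-check that predicts which documented error code a request would draw. `precheck_intake` is a hypothetical helper; the order of the checks and the treatment of the consent text fields as "consent fields" in designed mode are assumptions, not confirmed server behaviour. The reference does not name a code for a missing reference clip, so that case is not checked here.

```python
CLONE_AUDIO = ("reference", "reference_scaidrive_json",
               "consent", "consent_scaidrive_json")
CONSENT_TEXT = ("consent_user_full_name", "consent_stated_purpose", "consent_text")


def precheck_intake(fields):
    """Return the error code this reference describes, or None if none applies.

    `fields` maps form field names to values; empty values count as absent.
    """
    present = {name for name, value in fields.items() if value}
    designed = "voice_design_prompt" in present
    cloned = any(name in present for name in CLONE_AUDIO + CONSENT_TEXT)

    if designed and cloned:
        return "SCAISPEAK_AMBIGUOUS_INTAKE_MODE"
    if designed:
        return None

    # Cloned mode: inline upload and ScaiDrive pointer are mutually exclusive.
    for inline, drive in (("reference", "reference_scaidrive_json"),
                          ("consent", "consent_scaidrive_json")):
        if inline in present and drive in present:
            return "SCAISPEAK_AMBIGUOUS_SOURCE"
    if not all(name in present for name in CONSENT_TEXT):
        return "SCAISPEAK_CONSENT_REQUIRED"
    if "consent" not in present and "consent_scaidrive_json" not in present:
        return "SCAISPEAK_CONSENT_INVALID"
    return None
```

A pre-check like this only saves a round trip; the server remains the authority, in particular for whether the consent audio matches the script.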

`PATCH /voices/{voice_id}`#

Partial update. Settable fields:

display_name, description, language_supported, gender_hint, age_hint, style_tags — library metadata.
voxcpm2_reference_transcript — verbatim transcript of the reference clip. When set, upgrades the voice to higher-quality cloning. Send "" to clear.
voice_design_prompt — natural-language voice description (for designed voices, or to switch a cloned voice into design-only mode). Send "" to clear.
clear_reference — boolean. When true (alongside a non-empty voice_design_prompt), the voice's reference clip and consent record are tombstoned and the voice becomes design-only. Returns SCAISPEAK_DESIGN_PROMPT_REQUIRED_ON_CLEAR if the prompt is missing.
pronunciation_overrides — list of whole-word substitution rules applied before language-specific text expansion. See Text normalisation and pronunciation overrides below. Send [] to clear all rules; omit to leave unchanged.
default_instructions — free-text style / emotion / delivery guidance that applies to every synthesis call using this voice, unless overridden per request or per ScaiVoice session. Example: "warm and conversational". Send "" to clear.
default_speed — default speaking speed (0.5–2.0) for this voice. Omit or send null to use the engine default (1.0).
default_cfg_value — default cloning-fidelity tradeoff (0.5–5.0) for this voice. Higher values stay closer to the reference at the cost of naturalness. Omit or send null to use the engine default (~2.0). Meaningful for cloned voices only.
default_warmup_trim_ms — default number of milliseconds to trim from the start of generated audio for this voice. Absorbs the warm-up artefact common in cloned-voice output. Omit or send null to use the engine default. Meaningful for cloned voices only.

Scope mutation is not allowed here — use /share. Permission: scaispeak:voice.write.

`POST /voices/{voice_id}/reference`#

Replace a cloned voice's reference clip with a fresh recording, capturing new consent. Multipart form fields: reference, consent, consent_user_full_name, consent_stated_purpose, consent_text (same shape as the cloned POST /voices body).

The backend tombstones the old reference and consent blobs. A new consent recording is required because the audio is changing: without re-capturing consent, there is no way to verify that it is still the same person.

Permission: scaispeak:voice.write + ownership (or SuperAdmin). 200 OK with the updated voice on success.

`DELETE /voices/{voice_id}`#

Erase the voice (GDPR Art. 17). Tombstones the row, fans out EvictVoice to every warm replica, clears the Redis registry, deletes reference + consent blobs, writes an immutable erasure_audit row.

```json
{
  "data": {
    "audit_id": "aud_...",
    "voice_id": "vc_...",
    "warm_replicas_evicted": 3,
    "blob_bytes_deleted": 1240832,
    "error_summary": null,
    "completed_at": "2026-05-17T14:01:00Z"
  }
}
```

Permission: scaispeak:voice.write.

`POST /voices/{voice_id}/share`#

Promote a user-scope voice to tenant scope. Permission: scaispeak:voice.share (separate from voice.write so sharing can be granted independently).

`POST /voices/{voice_id}/preview`#

Render a short preview clip (max 300 chars). Form fields: text, response_format. Uses the same dispatcher as /speak. Permission: scaispeak:voice.read.

`POST /voices/{voice_id}/repromote`#

Re-run intake processing for a voice. Idempotent — no-op if ready, no-op if already processing. Used to bring legacy voices (created under the previous-generation cloning engine) onto the current zero-shot path. Returns 202 Accepted. Permission: scaispeak:voice.write.

`WS /voices/record`#

Live-record voice intake — WebSocket alternative to POST /voices. Two-phase: first reference audio frames + phase_complete, then consent audio frames + finalize. Auth via ?token= query or Authorization header. Permission: scaispeak:voice.write.

Speak#

`POST /speak`#

Batch synthesis. Body:

Field	Required	Notes
`voice_id`	yes	A voice the caller can see.
`text`	yes	Up to about 500 characters is synthesised synchronously; longer text goes through the async job path.
`language_hint`	no	2-letter code to disambiguate multilingual voices.
`speed`	no	0.5–2.0, default 1.0.
`response_format`	no	`mp3`, `opus`, `wav`, `flac`, `aac`, `pcm`. Default `mp3`. Self-hosted backend currently emits 48 kHz WAV regardless of this field and logs a downgrade warning if the requested format differs — see Troubleshooting.
`backend_preference`	no	`prefer_self_hosted`, `prefer_relay`, `any`. Advisory; tenant policy wins.
`idempotency_key`	no	Caller-supplied retry key for the output cache.
`force_async`	no	Force the job path regardless of text length.
`save_to`	no	ScaiDrive destination block (see below). JWT auth required.
`inline_response`	no	When `save_to` is set, return audio bytes too (default true).
`instructions`	no	Free-text style guidance (emotion / pace / affect). Example: `"cheerful and energetic"` or `"slowly and carefully"`. Meaningful for cloned voices; preset speakers and the relay backend ignore this field.
`cfg_value`	no	Cloning-fidelity vs naturalness tradeoff. Range 0.5–5.0. Higher values stay closer to the reference voice at the cost of naturalness. Engine default ~2.0 when omitted. Meaningful for cloned voices only.
`warmup_trim_ms`	no	Strip the first N ms of generated audio to absorb the warm-up artefact at the start of cloned-voice output. Typical: 150. Use 0 to disable. Meaningful for cloned voices only.
`normalize_text`	no	Run the text-prep pipeline (strip emoji / markdown / URLs, apply tenant + voice pronunciation overrides, expand dates / times / numbers / currency for the voice's language). `true` / `false` overrides per request; omit to use the tenant default set via `PUT /admin/policy`. Supported expander languages: en, nl, de, fr (others pass through the strip + overrides stages only).

Short text (default ≤500 chars) returns 200 OK with audio_base64 inline. Longer text returns 202 Accepted with job_id — poll /speak/jobs/{job_id}.

save_to block:

```json
{
  "share_id": "shr_xyz",
  "folder_id": "fld_abc",
  "filename": "chapter-01.mp3",
  "overwrite": false
}
```

Permission: scaispeak:synthesize.
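
A hedged sketch of assembling a `POST /speak` body from the fields above, with the documented ranges checked client-side. `build_speak_body` and `expects_job` are hypothetical helpers, and the body is assumed to be JSON; the 500-character threshold is the documented default and may differ per deployment.

```python
RESPONSE_FORMATS = {"mp3", "opus", "wav", "flac", "aac", "pcm"}
SYNC_CHAR_LIMIT = 500  # documented default threshold; a deployment may differ


def _check_range(name, value, low, high):
    """Raise ValueError when an optional numeric field is out of range."""
    if value is not None and not low <= value <= high:
        raise ValueError(f"{name} must be within {low}-{high}")


def build_speak_body(voice_id, text, *, speed=None, cfg_value=None,
                     response_format=None, force_async=False, save_to=None):
    """Return a dict for the request body, omitting unset optional fields."""
    _check_range("speed", speed, 0.5, 2.0)
    _check_range("cfg_value", cfg_value, 0.5, 5.0)
    if response_format is not None and response_format not in RESPONSE_FORMATS:
        raise ValueError(f"response_format must be one of {sorted(RESPONSE_FORMATS)}")
    body = {"voice_id": voice_id, "text": text}
    optional = {"speed": speed, "cfg_value": cfg_value,
                "response_format": response_format, "save_to": save_to}
    body.update({k: v for k, v in optional.items() if v is not None})
    if force_async:
        body["force_async"] = True
    return body


def expects_job(body, sync_limit=SYNC_CHAR_LIMIT):
    """True when this reference says to expect 202 Accepted with a job_id."""
    return body.get("force_async", False) or len(body["text"]) > sync_limit
```

`expects_job` is only a prediction from the documented threshold; the HTTP status of the actual response decides which path to follow.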

`GET /speak/jobs/{job_id}`#

Poll an async synth job. Returns status (queued, running, completed, failed), and when complete, audio_base64 inline (for small outputs) or audio_bytes + S3 URI for larger ones. If the job was submitted with save_to, the response also carries save_to.file_id once the upload finishes. Permission: scaispeak:synthesize, scoped to (user, tenant) — you can't poll another user's job by ID guess.
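
The polling flow can be sketched as a small loop. `poll_job` is a hypothetical helper; `fetch_job` stands in for whatever function performs `GET /speak/jobs/{job_id}` and returns the unwrapped `data` object, and the interval and timeout values are arbitrary choices, not documented limits.

```python
import time

TERMINAL_STATES = {"completed", "failed"}  # from the status list above


def poll_job(fetch_job, job_id, interval_s=1.0, timeout_s=60.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call fetch_job(job_id) until the job reaches a terminal status.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout_s}s")
        sleep(interval_s)
```

The caller still has to inspect the returned `status`, since a `failed` job ends the loop just as a `completed` one does.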

Streaming — WebSocket#

`WS /stream/speak`#

Real-time TTS over WebSocket. Wire protocol:

Client → Server	Fields
`{"type":"open"}`	`voice_id`, `language_hint`, `speed`, `output.codec`, `backend_preference`
`{"type":"text"}`	`delta`
`{"type":"flush"}`	—
`{"type":"interrupt"}`	—
`{"type":"close"}`	—

Server → Client	Fields
`{"type":"ready"}`	`voice_id`, `backend_used`
binary frame	audio bytes in the negotiated codec
`{"type":"interrupted"}`	—
`{"type":"closed"}`	`stats.chars`, `stats.backend_used`
`{"type":"error"}`	`code`, `message`

Close codes: 4401 unauthorized, 4403 forbidden, 4400 bad request, 4502 backend unavailable, 4500 server error. Auth via ?token= or header. Permission: scaispeak:synthesize.
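
To make the wire protocol concrete, the sketch below generates the client frames for one utterance and classifies incoming server frames. It deliberately leaves out the socket itself. Two assumptions are made: message fields sit at the top level next to `type`, and the dotted name `output.codec` denotes a nested `output` object. `client_frames` and `handle_server_frame` are hypothetical names.

```python
import json


def client_frames(voice_id, chunks, codec="opus", language_hint=None):
    """Yield the JSON text frames a client would send for one utterance."""
    open_msg = {"type": "open", "voice_id": voice_id, "output": {"codec": codec}}
    if language_hint is not None:
        open_msg["language_hint"] = language_hint
    yield json.dumps(open_msg)
    for chunk in chunks:
        yield json.dumps({"type": "text", "delta": chunk})
    yield json.dumps({"type": "flush"})
    yield json.dumps({"type": "close"})


def handle_server_frame(frame, audio_out):
    """Append binary audio to `audio_out`; return the kind of frame seen.

    Raises RuntimeError for a server `error` message.
    """
    if isinstance(frame, (bytes, bytearray)):
        audio_out.extend(frame)
        return "audio"
    msg = json.loads(frame)
    if msg["type"] == "error":
        raise RuntimeError(f"{msg['code']}: {msg['message']}")
    return msg["type"]  # "ready", "interrupted", or "closed"
```

A real client would feed these frames through a WebSocket library of its choice and stop reading once `handle_server_frame` returns `"closed"`.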

Streaming — WebRTC#

Status: signalling and lifecycle ship end-to-end. The audio plane (aiortc MediaStreamTrack.recv) raises NotImplementedError today — once a peer connection negotiates, no audio drains to the backend. Use the WebSocket streaming endpoints for production until this caveat is removed.

`POST /stream/speak/webrtc/sessions`#

Create a WebRTC session. Body:

Field	Notes
`voice_id`	required
`language_hint`	optional 2-letter code
`speed`	0.5–2.0
`output.codec`	`opus` or `pcm`
`output.sample_rate`	8000–48000
`control.transport`	`websocket` or `datachannel`
`ice_servers`	optional tenant-supplied ICE config
`backend_preference`	same vocabulary as `/speak`

Returns session_id, ice_servers, expires_at, control_ws_url. Permission: scaispeak:synthesize.

`POST /stream/speak/webrtc/sessions/{session_id}/offer`#

Apply client SDP offer, return server's SDP answer.

`POST /stream/speak/webrtc/sessions/{session_id}/ice-candidates`#

Trickle ICE candidate from client. Returns 204 No Content.

`DELETE /stream/speak/webrtc/sessions/{session_id}`#

Tear down the peer + mark session closed.

`WS /stream/speak/webrtc/sessions/{session_id}/control`#

Control plane for an active WebRTC session — same text/flush/interrupt/close vocabulary as the WebSocket streaming path, no binary audio frames (audio rides RTP).

Voice warming#

`GET /voices/{voice_id}/warm`#

Inspect current warm state. Returns warm_node_ids, candidate_node_ids, stale_node_ids. Permission: scaispeak:voice.read.

`POST /voices/{voice_id}/warm`#

Fan-out PrepareVoice to candidate replicas. Body: { "node_ids": [...] } (empty means "all candidates"). Returns outcomes array with per-node ok, cache_key, load_ms, error. Permission: scaispeak:voice.write.

`POST /voices/{voice_id}/evict`#

Drop the voice from every currently-warm replica. Always clears the registry. Permission: scaispeak:voice.write.

Tenant policy#

`GET /admin/policy`#

Read the caller's tenant policy. Returns:

allowed_backends — subset of ["A","B"].
default_backend — A or B.
tokeniser_backend — legacy or scaiinfer.
text_normalization_default — tenant default for the per-request normalize_text flag on POST /speak.
pronunciation_overrides — tenant-wide pronunciation rules (or null when none are set). Same shape as the per-voice list.

Permission: scaispeak:synthesize — readable by any caller who can synthesise so UIs can show "your tenant routes through Backend B".

`PUT /admin/policy`#

Update the tenant policy. Body (all fields optional; omitted = leave unchanged):

allowed_backends — string shorthand "A"/"B"/"AB" or list.
default_backend — A or B. Must be in allowed_backends.
tokeniser_backend — legacy or scaiinfer.
text_normalization_default — boolean.
pronunciation_overrides — list of rules. Send [] to clear all rules; non-empty list replaces the whole set; omit to leave unchanged.

Validation rejects default_backend not in allowed_backends. Permission: scaispeak:admin.
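
The validation rule can be sketched as follows. `normalise_allowed` and `check_policy_update` are hypothetical helpers that mirror the documented rule client-side; how the server treats an update that changes `allowed_backends` without restating `default_backend` is not covered by this reference and is not modelled here.

```python
VALID_BACKENDS = ("A", "B")


def normalise_allowed(value):
    """Accept the "A" / "B" / "AB" shorthand or a list; return an ordered list."""
    items = list(value)  # a string splits into its letters, a list is copied
    if any(item not in VALID_BACKENDS for item in items):
        raise ValueError(f"allowed_backends must be a subset of {list(VALID_BACKENDS)}")
    return [backend for backend in VALID_BACKENDS if backend in items]


def check_policy_update(update, current_allowed):
    """Return the documented error code if the update would be rejected, else None.

    Omitted fields keep their current value, as described above.
    """
    if "allowed_backends" in update:
        allowed = normalise_allowed(update["allowed_backends"])
    else:
        allowed = list(current_allowed)
    default = update.get("default_backend")
    if default is not None and default not in allowed:
        return "SCAISPEAK_TENANT_POLICY_INVALID"
    return None
```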

Text normalisation and pronunciation overrides#

ScaiSpeak ships an optional text-preprocessing pipeline that runs before dispatch when normalize_text is true (per request or via tenant default). Three stages, in order:

Strip noise — emoji, zero-width characters, markdown emphasis (**bold**, _italic_, ~~strike~~, backticks), bare URLs (rewritten to "link"), bullet glyphs at line start. Language-agnostic.
Pronunciation overrides — whole-word substitution rules from the tenant policy, then per-voice rules layered on top. Tenant rules run first; voice rules can refine or override for one specific voice.
Expand — date / time / number / currency rendering for the voice's primary language. Supported: en, nl, de, fr. Unsupported languages skip this stage but still benefit from strip + overrides.

Pronunciation rule shape:

```json
{
  "pattern": "Kubernetes",
  "replacement": "koo-ber-net-eez",
  "case_sensitive": true
}
```

pattern — required, matched as a whole word (Unicode word boundaries; k8s won't match inside k8short).
replacement — required, written into the text verbatim. Multiple words allowed.
case_sensitive — optional, defaults to true. Set false for acronyms / brand names where casing varies in caller input.

Bad rules (missing fields, empty patterns) are skipped silently — one operator typo doesn't break a tenant's synth pipeline.
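
The rule semantics above can be approximated with regular expressions. This is a sketch, not the module's implementation: `apply_overrides` is a hypothetical function, Unicode word boundaries are approximated with `\w` lookarounds, and "layered on top" is assumed to mean that voice rules simply run after tenant rules on the already-substituted text.

```python
import re


def apply_overrides(text, tenant_rules=None, voice_rules=None):
    """Apply whole-word substitutions: tenant rules first, then voice rules."""
    for rule in list(tenant_rules or []) + list(voice_rules or []):
        if not isinstance(rule, dict):
            continue  # malformed rule: skip silently
        pattern = rule.get("pattern")
        replacement = rule.get("replacement")
        if not pattern or replacement is None:
            continue  # missing field or empty pattern: skip silently
        flags = 0 if rule.get("case_sensitive", True) else re.IGNORECASE
        # Lookarounds keep "k8s" from matching inside "k8short".
        regex = re.compile(r"(?<!\w)" + re.escape(pattern) + r"(?!\w)", flags)
        # A function replacement inserts the text verbatim (no backslash escapes).
        text = regex.sub(lambda _match, r=replacement: r, text)
    return text
```

For instance, a tenant rule mapping `Kubernetes` to `koo-ber-net-eez` rewrites "Deploy on Kubernetes" but leaves "Kubernetesque" alone.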

Examples of what the expansion stage produces, voice with language_primary="en":

2026-05-23 → "the twenty third of May, two thousand twenty six"
17:30 → "seventeen thirty"
5:30 PM → "five thirty PM"
$42.50 → "forty-two dollars and fifty cents"
£1,234.56 → "one thousand, two hundred and thirty-four pounds and fifty-six pence"

Equivalent inputs with language_primary="nl":

2026-05-23 → "drieëntwintig mei tweeduizendzesentwintig"
17:30 → "zeventien uur dertig"
€42,50 → "tweeënveertig euro en vijftig cent"

Locale convention: DD/MM/YYYY is the assumed convention (matches nl, de, fr, UK English; US callers should use ISO YYYY-MM-DD to disambiguate). Decimal/thousands separators follow the voice's language — $42.50 for en, €42,50 for nl/de/fr.

The pipeline runs after voice resolution + blocklist + the output-cache check, before dispatch — so cache keys stay on the raw caller-provided text and the synthesised audio reflects the normalised text.

ScaiDrive proxy#

`GET /admin/scaidrive/shares`#

Read-only forwarding to ScaiDrive: lists the shares the caller can see. Used by the synth page destination picker. Requires JWT auth (an `sgk_` API key is not accepted). Returns 404 with SCAISPEAK_SCAIDRIVE_NOT_AVAILABLE when ScaiDrive isn't configured in the deployment.

`GET /admin/scaidrive/shares/{share_id}/folders`#

Lazy-browse folders inside a share. Query: folder_id (omit for the share root). Returns folder children only.

Admin lifecycle#

`POST /admin/lifecycle/install`#

First-time install hook called by the module-host. Idempotent. SuperAdmin-only.

`POST /admin/lifecycle/upgrade`#

Version upgrade hook. Idempotent. SuperAdmin-only.

`POST /admin/lifecycle/uninstall`#

Module uninstall — soft-deletes every non-global voice in the deployment, signals the erasure worker to fan out. Requires confirmation_token + expected_module_id. SuperAdmin-only.

`POST /admin/lifecycle/tenant/{tenant_id}/enable`#

Per-tenant enable. SuperAdmin-only.

`POST /admin/lifecycle/tenant/{tenant_id}/disable`#

Per-tenant disable — soft-deletes all the tenant's user + tenant scope voices and signals erasure. Global voices untouched. SuperAdmin-only.

Blocklist + audit#

`POST /admin/blocklist`#

Add a blocklist entry. Body: scope (tenant, user, voice), target_id, reason, optional expires_at. Permission: scaispeak:admin.

`GET /admin/blocklist`#

List active blocklist entries. Query: scope, tenant_id, limit. Permission: scaispeak:admin.

`DELETE /admin/blocklist/{block_id}`#

Remove a blocklist entry (manual unblock). Returns 204 No Content. Permission: scaispeak:admin.

`GET /admin/erasure/audit`#

List erasure audit rows. Query: tenant_id, voice_id, limit. Returns most-recent-first. Permission: scaispeak:admin.

Global voices (SuperAdmin)#

`POST /admin/voices/global`#

Create a platform-scope (scope='global') voice — no consent, license-based. SuperAdmin-only. Form fields:

Field	Required	Notes
`reference`	yes	Multipart reference audio. ScaiDrive references not accepted for globals.
`display_name`, `language_primary`	yes	Same shape as user voices.
`licensor_name`	yes	Who licensed the voice to ScaiLabs.
`license_type`	yes	`perpetual`, `time_bound`, `usage_bound`.
`valid_until`	when `time_bound`	ISO-8601 timestamp.
`usage_limit_chars`	when `usage_bound`	Integer cap.
`licensor_reference`	no	Contract reference.
`valid_from`	no	ISO-8601 start.
`terms_summary`	no	Operator-facing summary of the terms.
`license_document`	no	Optional PDF; stored alongside the voice.

Returns the new voice_id, license_id, and intake note.
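
The conditional requirements in the table can be sketched as a client-side check. `check_license_fields` is a hypothetical helper; mapping a missing bound to `SCAISPEAK_LICENSE_FIELD_INVALID` follows the description of that code in the error table, while the handling of an unknown `license_type` (a local `ValueError`) is an assumption.

```python
LICENSE_TYPES = {"perpetual", "time_bound", "usage_bound"}


def check_license_fields(form):
    """Return the documented error code for a type/bound mismatch, else None."""
    license_type = form.get("license_type")
    if license_type not in LICENSE_TYPES:
        raise ValueError(f"license_type must be one of {sorted(LICENSE_TYPES)}")
    # time_bound licences need an expiry timestamp.
    if license_type == "time_bound" and not form.get("valid_until"):
        return "SCAISPEAK_LICENSE_FIELD_INVALID"
    # usage_bound licences need an integer character cap.
    if license_type == "usage_bound" and not isinstance(form.get("usage_limit_chars"), int):
        return "SCAISPEAK_LICENSE_FIELD_INVALID"
    return None
```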

`DELETE /admin/voices/global/{voice_id}`#

Revoke a global voice. SuperAdmin-only. Form field: trigger, one of license_revoked, license_expired, platform_decision. Bypasses the owner-equality check that protects user/tenant voices. Updates the license row's status to match the trigger. Runs the full erasure pipeline.

Errors#

All endpoints return ScaiGrid's standard error envelope:

```json
{
  "error": {
    "code": "SCAISPEAK_VOICE_NOT_FOUND",
    "message": "Voice does not exist or isn't visible to the caller",
    "details": { "voice_id": "vc_..." }
  },
  "meta": { "request_id": "req_..." }
}
```

ScaiSpeak-specific codes:

Code	Meaning
`SCAISPEAK_VOICE_NOT_FOUND`	Voice id doesn't exist or isn't visible.
`SCAISPEAK_VOICE_ACCESS_DENIED`	Caller can't perform this operation on this voice.
`SCAISPEAK_VOICE_PREFLIGHT_FAILED`	Reference audio failed quality checks. Body includes `preflight`.
`SCAISPEAK_CONSENT_INVALID`	Consent recording missing or doesn't match the scripted text.
`SCAISPEAK_AMBIGUOUS_SOURCE`	Both inline upload and ScaiDrive reference supplied for the same file.
`SCAISPEAK_VOICE_SHARE_FORBIDDEN`	Only the owner with `voice.share` can promote to tenant scope.
`SCAISPEAK_BACKEND_UNAVAILABLE`	No allowed backend currently available.
`SCAISPEAK_TENANT_POLICY_INVALID`	Policy update rejected (e.g. default not in allowed set).
`SCAISPEAK_JOB_NOT_FOUND`	Job id doesn't exist or doesn't belong to this caller.
`SCAISPEAK_VOICE_NOT_READY_FOR_WARMING`	Legacy warming path returns this when the voice doesn't have the cached state the previous-gen engine needed. No-op on the current zero-shot engine; safe to ignore for new code.
`SCAISPEAK_SAVE_TO_REQUIRES_JWT`	save_to attempted with `sgk_` API key auth.
`SCAISPEAK_SAVE_TO_EXCHANGE_FAILED`	ScaiKey token exchange against ScaiDrive failed.
`SCAISPEAK_SCAIDRIVE_NOT_AVAILABLE`	ScaiDrive integration not configured.
`SCAISPEAK_SCAIDRIVE_FORBIDDEN`	Caller lacks write access on the destination share.
`SCAISPEAK_SCAIDRIVE_NOT_FOUND`	Destination share or folder doesn't exist.
`SCAISPEAK_SCAIDRIVE_CONFLICT`	File exists at destination and `overwrite` is false.
`SCAISPEAK_SCAIDRIVE_QUOTA_EXCEEDED`	Destination share over quota (HTTP 507).
`SCAISPEAK_LICENSE_FIELD_INVALID`	License-type / bound mismatch on global voice create.
`SCAISPEAK_GLOBAL_VOICE_NOT_FOUND`	Global voice doesn't exist or already deleted.
`SCAISPEAK_BLOCKLIST_NOT_FOUND`	Blocklist entry id doesn't exist.
`SCAISPEAK_UNINSTALL_TOKEN_MISMATCH`	Uninstall hook called without a matching token.
