API reference

All endpoints are mounted at /v1/modules/scaispeak/ and authenticate with the standard ScaiGrid bearer token. Responses use ScaiGrid's standard envelope ({ "data": ... } for success, { "error": ... } for failures).
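
The envelope convention above can be handled in one place on the client. The following is a minimal sketch, not an official client: the `ScaiGridError` class, the `unwrap` helper, and the `"UNKNOWN"` fallback code are hypothetical names introduced for illustration.

```python
from typing import Any, Optional


class ScaiGridError(Exception):
    """Hypothetical exception type for a response that carries an error envelope."""

    def __init__(self, code: str, message: str, details: Optional[dict] = None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details or {}


def unwrap(envelope: dict) -> Any:
    """Return the payload under "data", or raise if "error" is present."""
    if "error" in envelope:
        err = envelope["error"]
        # "UNKNOWN" is a placeholder for a malformed envelope, not a real code.
        raise ScaiGridError(
            err.get("code", "UNKNOWN"),
            err.get("message", ""),
            err.get("details"),
        )
    return envelope["data"]
```

Callers can then branch on `exc.code` rather than on HTTP status alone.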

Health#

GET /healthz#

Liveness — process is responding. Cheap; no I/O.

GET /readyz#

Readiness — module can serve requests. Returns 200 when the module's upstream dependencies (managed TTS relay, ScaiInfer, Redis) are reachable enough to dispatch.

Voices — read#

GET /voices#

List voices visible to the caller (global + own tenant + own user). Query parameters:

| Parameter | Notes |
| --- | --- |
| language | 2-letter ISO code (en, fr, de...). |
| scope | global, tenant, user. |
| gender | female, male, neutral, unspecified. |
| embedding_status | pending, processing, ready, failed, evicted. |
| q | Free-text search over display_name, description, style_tags. |
| limit | 1-200, default 50. |

Permission: scaispeak:voice.read.
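
As an illustration of the query parameters, here is a small sketch that builds a `GET /voices` URL and rejects values outside the documented sets before any request is sent. The `voices_url` helper and the `https://grid.example` host are assumptions made for the example; only the parameter names and value sets come from the table above.

```python
from urllib.parse import urlencode

BASE_PATH = "/v1/modules/scaispeak"

# Enumerated values copied from the parameter table.
ALLOWED = {
    "scope": {"global", "tenant", "user"},
    "gender": {"female", "male", "neutral", "unspecified"},
    "embedding_status": {"pending", "processing", "ready", "failed", "evicted"},
}


def voices_url(host: str, **params) -> str:
    """Build a GET /voices URL (hypothetical helper)."""
    for name, value in params.items():
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not a documented value")
    if not 1 <= int(params.get("limit", 50)) <= 200:
        raise ValueError("limit must be between 1 and 200")
    # Sorting keeps the generated URL stable between calls.
    query = urlencode(sorted(params.items()))
    return f"{host}{BASE_PATH}/voices" + (f"?{query}" if query else "")


print(voices_url("https://grid.example", language="en", scope="tenant", limit=20))
```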

GET /voices/{voice_id}#

Fetch one voice's full record. Returns 404 if the voice doesn't exist OR isn't visible to the caller (existence isn't disclosed across scopes).

Voices — write#

POST /voices#

Create a voice. Two modes — pick one per request:

Cloned voice (from reference + consent). Multipart form fields:

| Field | Required | Notes |
| --- | --- | --- |
| reference | one of | Multipart file part with the reference audio. |
| reference_scaidrive_json | one of | JSON {file_id, mcp_uri, share_url} pointing at a ScaiDrive file. |
| consent | one of | Multipart file part with the consent audio. |
| consent_scaidrive_json | one of | ScaiDrive reference for the consent recording. |
| consent_user_full_name | yes | Speaker's full name; written to the consent row. |
| consent_stated_purpose | yes | What the cloned voice will be used for; verbatim audit. |
| consent_text | yes | The exact scripted statement the speaker reads in the consent clip. |

Designed voice (text-only, no audio):

| Field | Required | Notes |
| --- | --- | --- |
| voice_design_prompt | yes | Natural-language description of the speaker (~12+ chars). When set, reference/consent fields are forbidden; the request fails with SCAISPEAK_AMBIGUOUS_INTAKE_MODE if both are supplied. |

Common fields (both modes):

| Field | Required | Notes |
| --- | --- | --- |
| display_name | yes | Human-readable label. |
| language_primary | yes | 2-letter ISO code. |
| language_supported_json | no | JSON array of 2-letter codes the voice can speak. |
| gender_hint, age_hint, style_tags_json | no | Library metadata; advisory. |
| description | no | Free-text description (separate from voice_design_prompt). |

Returns 201 Created with the new voice. Cloned voices include a preflight block; designed voices land at embedding_status: ready immediately.

Errors: SCAISPEAK_VOICE_PREFLIGHT_FAILED (audio rejected), SCAISPEAK_AMBIGUOUS_SOURCE (inline + ScaiDrive for the same file), SCAISPEAK_AMBIGUOUS_INTAKE_MODE (cloned + designed in the same request), SCAISPEAK_CONSENT_REQUIRED (cloned mode missing consent text fields), SCAISPEAK_CONSENT_INVALID (consent audio missing or doesn't match the script).

Permission: scaispeak:voice.write.
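
The mode rules and error codes for POST /voices can be mirrored in a client-side precheck. This is a sketch under assumptions: the order of the checks is a guess (the server may evaluate them differently), audio content is not inspected, and `precheck_intake` is a hypothetical helper that only looks at which form fields are present.

```python
from typing import Optional, Set

AUDIO_FIELDS = {"reference", "reference_scaidrive_json",
                "consent", "consent_scaidrive_json"}
CONSENT_TEXT_FIELDS = {"consent_user_full_name", "consent_stated_purpose",
                       "consent_text"}


def precheck_intake(fields: Set[str]) -> Optional[str]:
    """Return the error code the request would plausibly hit, or None."""
    designed = "voice_design_prompt" in fields
    cloned = bool(fields & (AUDIO_FIELDS | CONSENT_TEXT_FIELDS))
    if designed and cloned:
        return "SCAISPEAK_AMBIGUOUS_INTAKE_MODE"
    if designed:
        return None
    # Inline upload and ScaiDrive reference are mutually exclusive per file.
    for inline, drive in (("reference", "reference_scaidrive_json"),
                          ("consent", "consent_scaidrive_json")):
        if inline in fields and drive in fields:
            return "SCAISPEAK_AMBIGUOUS_SOURCE"
    if not CONSENT_TEXT_FIELDS <= fields:
        return "SCAISPEAK_CONSENT_REQUIRED"
    if not fields & {"consent", "consent_scaidrive_json"}:
        return "SCAISPEAK_CONSENT_INVALID"
    return None
```

A precheck like this only saves a round trip; the server response remains authoritative, for example when the consent audio does not match the script.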

PATCH /voices/{voice_id}#

Partial update. Settable fields:

  • display_name, description, language_supported, gender_hint, age_hint, style_tags — library metadata.
  • voxcpm2_reference_transcript — verbatim transcript of the reference clip. When set, upgrades the voice to higher-quality cloning. Send "" to clear.
  • voice_design_prompt — natural-language voice description (for designed voices, or to switch a cloned voice into design-only mode). Send "" to clear.
  • clear_reference — boolean. When true (alongside a non-empty voice_design_prompt), the voice's reference clip + consent record are tombstoned and the voice becomes design-only. SCAISPEAK_DESIGN_PROMPT_REQUIRED_ON_CLEAR if the prompt is missing.
  • pronunciation_overrides — list of whole-word substitution rules applied before language-specific text expansion. See Text normalisation and pronunciation overrides below. Send [] to clear all rules; omit to leave unchanged.

Scope mutation is not allowed here — use /share. Permission: scaispeak:voice.write.

POST /voices/{voice_id}/reference#

Replace a cloned voice's reference clip with a fresh recording, capturing new consent. Multipart form fields: reference, consent, consent_user_full_name, consent_stated_purpose, consent_text (same shape as the cloned POST /voices body).

The backend tombstones the old reference + consent blobs. A new consent recording is required because the audio is changing (we can't verify it's still the same person without re-capturing).

Permission: scaispeak:voice.write + ownership (or SuperAdmin). 200 OK with the updated voice on success.

DELETE /voices/{voice_id}#

Erase the voice (GDPR Art. 17). Tombstones the row, fans out EvictVoice to every warm replica, clears the Redis registry, deletes reference + consent blobs, writes an immutable erasure_audit row.

```json
{
  "data": {
    "audit_id": "aud_...",
    "voice_id": "vc_...",
    "warm_replicas_evicted": 3,
    "blob_bytes_deleted": 1240832,
    "error_summary": null,
    "completed_at": "2026-05-17T14:01:00Z"
  }
}
```

Permission: scaispeak:voice.write.

POST /voices/{voice_id}/share#

Promote a user-scope voice to tenant scope. Permission: scaispeak:voice.share (separate from voice.write so sharing can be granted independently).

POST /voices/{voice_id}/preview#

Render a short preview clip (max 300 chars). Form fields: text, response_format. Uses the same dispatcher as /speak. Permission: scaispeak:voice.read.

POST /voices/{voice_id}/repromote#

Re-run intake processing for a voice. Idempotent — no-op if ready, no-op if already processing. Used to bring legacy voices (created under the previous-generation cloning engine) onto the current zero-shot path. Returns 202 Accepted. Permission: scaispeak:voice.write.

WS /voices/record#

Live-record voice intake — WebSocket alternative to POST /voices. Two-phase: first reference audio frames + phase_complete, then consent audio frames + finalize. Auth via ?token= query or Authorization header. Permission: scaispeak:voice.write.

Speak#

POST /speak#

Batch synthesis. Body:

| Field | Required | Notes |
| --- | --- | --- |
| voice_id | yes | A voice the caller can see. |
| text | yes | Up to ~500 chars sync, longer async. |
| language_hint | no | 2-letter code to disambiguate multilingual voices. |
| speed | no | 0.5–2.0, default 1.0. |
| response_format | no | mp3, opus, wav, flac, aac, pcm. Default mp3. Self-hosted backend currently emits 48 kHz WAV regardless of this field and logs a downgrade warning if the requested format differs (see Troubleshooting). |
| backend_preference | no | prefer_self_hosted, prefer_relay, any. Advisory; tenant policy wins. |
| idempotency_key | no | Caller-supplied retry key for the output cache. |
| force_async | no | Force the job path regardless of text length. |
| save_to | no | ScaiDrive destination block (see below). JWT auth required. |
| inline_response | no | When save_to is set, return audio bytes too (default true). |
| instructions | no | Free-text style guidance (emotion / pace / affect). Example: "cheerful and energetic" or "slowly and carefully". Meaningful for cloned voices; preset speakers and the relay backend ignore this field. |
| cfg_value | no | Cloning-fidelity vs naturalness tradeoff. Range 0.5–5.0. Higher values stay closer to the reference voice at the cost of naturalness. Engine default ~2.0 when omitted. Meaningful for cloned voices only. |
| warmup_trim_ms | no | Strip the first N ms of generated audio to absorb the warm-up artefact at the start of cloned-voice output. Typical: 150. Use 0 to disable. Meaningful for cloned voices only. |
| normalize_text | no | Run the text-prep pipeline (strip emoji / markdown / URLs, apply tenant + voice pronunciation overrides, expand dates / times / numbers / currency for the voice's language). true / false overrides per request; omit to use the tenant default set via PUT /admin/policy. Supported expander languages: en, nl, de, fr (others pass through the strip + overrides stages only). |

Short text (default ≤500 chars) returns 200 OK with audio_base64 inline. Longer text returns 202 Accepted with job_id — poll /speak/jobs/{job_id}.

save_to block:

```json
{
  "share_id": "shr_xyz",
  "folder_id": "fld_abc",
  "filename": "chapter-01.mp3",
  "overwrite": false
}
```

Permission: scaispeak:synthesize.
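
To make the sync/async split concrete, the sketch below builds a POST /speak body with the documented ranges checked locally, then interprets the two success statuses. It assumes that `audio_base64` and `job_id` sit directly under `data` in the envelope; the helper names `build_speak_body` and `read_speak_response` are hypothetical.

```python
import base64
from typing import Optional, Tuple, Union

FORMATS = {"mp3", "opus", "wav", "flac", "aac", "pcm"}


def build_speak_body(voice_id: str, text: str, *, speed: float = 1.0,
                     response_format: str = "mp3",
                     cfg_value: Optional[float] = None) -> dict:
    """Assemble a request body, validating the documented ranges first."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be within 0.5 to 2.0")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if cfg_value is not None and not 0.5 <= cfg_value <= 5.0:
        raise ValueError("cfg_value must be within 0.5 to 5.0")
    body = {"voice_id": voice_id, "text": text, "speed": speed,
            "response_format": response_format}
    if cfg_value is not None:
        body["cfg_value"] = cfg_value
    return body


def read_speak_response(status: int, envelope: dict) -> Tuple[str, Union[bytes, str]]:
    """Return ("audio", bytes) for 200 OK or ("job", job_id) for 202 Accepted."""
    data = envelope["data"]  # assumed location of the payload fields
    if status == 200:
        return "audio", base64.b64decode(data["audio_base64"])
    if status == 202:
        return "job", data["job_id"]
    raise RuntimeError(f"unexpected status {status}")
```

Handling both statuses in every caller is worthwhile because `force_async` and the length threshold mean the same code path can receive either one.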

GET /speak/jobs/{job_id}#

Poll an async synth job. Returns status (queued, running, completed, failed), and when complete, audio_base64 inline (for small outputs) or audio_bytes + S3 URI for larger ones. If the job was submitted with save_to, the response also carries save_to.file_id once the upload finishes. Permission: scaispeak:synthesize, scoped to (user, tenant) — you can't poll another user's job by ID guess.
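
A polling loop for this endpoint might look like the sketch below. The HTTP call is injected as `fetch` so the loop stays independent of any client library; the one-second interval and 60-poll cap are arbitrary assumptions, not documented limits.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}


def wait_for_job(fetch: Callable[[], dict], *, interval_s: float = 1.0,
                 max_polls: int = 60,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() until the job reaches a terminal status, then return it.

    fetch is expected to return the job record (the "data" part of the
    envelope) for GET /speak/jobs/{job_id}.
    """
    for _ in range(max_polls):
        job = fetch()
        if job["status"] in TERMINAL:
            return job
        sleep(interval_s)
    raise TimeoutError(f"job not finished after {max_polls} polls")
```

The caller still has to inspect the returned `status`, since a `failed` job ends the loop just as a `completed` one does.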

Streaming — WebSocket#

WS /stream/speak#

Real-time TTS over WebSocket. Wire protocol:

| Client → Server | Fields |
| --- | --- |
| {"type":"open"} | voice_id, language_hint, speed, output.codec, backend_preference |
| {"type":"text"} | delta |
| {"type":"flush"} | |
| {"type":"interrupt"} | |
| {"type":"close"} | |

| Server → Client | Fields |
| --- | --- |
| {"type":"ready"} | voice_id, backend_used |
| binary frame | audio bytes in the negotiated codec |
| {"type":"interrupted"} | |
| {"type":"closed"} | stats.chars, stats.backend_used |
| {"type":"error"} | code, message |

Close codes: 4401 unauthorized, 4403 forbidden, 4400 bad request, 4502 backend unavailable, 4500 server error. Auth via ?token= or header. Permission: scaispeak:synthesize.
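
The client side of the wire protocol reduces to a handful of JSON text frames. The sketch below only serialises messages and maps close codes, so it runs without a WebSocket library. One assumption to note: `output.codec` is modelled as a nested `output` object, which the table does not confirm. All function names are hypothetical.

```python
import json

# Close codes as listed above.
CLOSE_CODES = {
    4400: "bad request",
    4401: "unauthorized",
    4403: "forbidden",
    4500: "server error",
    4502: "backend unavailable",
}


def open_frame(voice_id: str, codec: str = "opus", **options) -> str:
    """Serialise the opening message; options may carry language_hint, speed, etc."""
    message = {"type": "open", "voice_id": voice_id,
               "output": {"codec": codec}}  # nesting is an assumption
    message.update(options)
    return json.dumps(message)


def text_frame(delta: str) -> str:
    """Serialise one chunk of text to synthesise."""
    return json.dumps({"type": "text", "delta": delta})


def control_frame(kind: str) -> str:
    """Serialise a field-less control message."""
    if kind not in {"flush", "interrupt", "close"}:
        raise ValueError(f"unknown control message: {kind}")
    return json.dumps({"type": kind})


def describe_close(code: int) -> str:
    """Translate a close code into the documented meaning."""
    return CLOSE_CODES.get(code, "unknown close code")
```

On the receiving side, a client would treat binary frames as audio and parse text frames as JSON, dispatching on `type`.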

Streaming — WebRTC#

Status: signalling and lifecycle ship end-to-end. The audio plane (aiortc MediaStreamTrack.recv) raises NotImplementedError today — once a peer connection negotiates, no audio drains to the backend. Use the WebSocket streaming endpoints for production until this caveat is removed.

POST /stream/speak/webrtc/sessions#

Create a WebRTC session. Body:

| Field | Notes |
| --- | --- |
| voice_id | required |
| language_hint | optional 2-letter code |
| speed | 0.5–2.0 |
| output.codec | opus or pcm |
| output.sample_rate | 8000–48000 |
| control.transport | websocket or datachannel |
| ice_servers | optional tenant-supplied ICE config |
| backend_preference | same vocabulary as /speak |

Returns session_id, ice_servers, expires_at, control_ws_url. Permission: scaispeak:synthesize.
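
A session-creation body can be validated locally against the ranges in the table. In this sketch the dotted names (`output.codec`, `control.transport`) are interpreted as nested objects, and the default values are illustrative; neither is confirmed by the reference. `webrtc_session_body` is a hypothetical helper.

```python
def webrtc_session_body(voice_id: str, *, codec: str = "opus",
                        sample_rate: int = 48000,
                        transport: str = "websocket",
                        speed: float = 1.0) -> dict:
    """Build a POST /stream/speak/webrtc/sessions body (defaults are assumptions)."""
    if codec not in ("opus", "pcm"):
        raise ValueError("output.codec must be opus or pcm")
    if not 8000 <= sample_rate <= 48000:
        raise ValueError("output.sample_rate must be within 8000 to 48000")
    if transport not in ("websocket", "datachannel"):
        raise ValueError("control.transport must be websocket or datachannel")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be within 0.5 to 2.0")
    return {
        "voice_id": voice_id,
        "speed": speed,
        "output": {"codec": codec, "sample_rate": sample_rate},
        "control": {"transport": transport},
    }
```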

POST /stream/speak/webrtc/sessions/{session_id}/offer#

Apply client SDP offer, return server's SDP answer.

POST /stream/speak/webrtc/sessions/{session_id}/ice-candidates#

Trickle ICE candidate from client. Returns 204 No Content.

DELETE /stream/speak/webrtc/sessions/{session_id}#

Tear down the peer + mark session closed.

WS /stream/speak/webrtc/sessions/{session_id}/control#

Control plane for an active WebRTC session — same text/flush/interrupt/close vocabulary as the WebSocket streaming path, no binary audio frames (audio rides RTP).

Voice warming#

GET /voices/{voice_id}/warm#

Inspect current warm state. Returns warm_node_ids, candidate_node_ids, stale_node_ids. Permission: scaispeak:voice.read.

POST /voices/{voice_id}/warm#

Fan-out PrepareVoice to candidate replicas. Body: { "node_ids": [...] } (empty means "all candidates"). Returns outcomes array with per-node ok, cache_key, load_ms, error. Permission: scaispeak:voice.write.

POST /voices/{voice_id}/evict#

Drop the voice from every currently-warm replica. Always clears the registry. Permission: scaispeak:voice.write.

Tenant policy#

GET /admin/policy#

Read the caller's tenant policy. Returns:

  • allowed_backends: subset of ["A","B"].
  • default_backend: A or B.
  • tokeniser_backend: legacy or scaiinfer.
  • text_normalization_default: tenant default for the per-request normalize_text flag on POST /speak.
  • pronunciation_overrides: tenant-wide pronunciation rules (or null when none are set). Same shape as the per-voice list.

Permission: scaispeak:synthesize — readable by any caller who can synthesise so UIs can show "your tenant routes through Backend B".

PUT /admin/policy#

Update the tenant policy. Body (all fields optional; omitted = leave unchanged):

  • allowed_backends: string shorthand "A"/"B"/"AB" or list.
  • default_backend: A or B. Must be in allowed_backends.
  • tokeniser_backend: legacy or scaiinfer.
  • text_normalization_default: boolean.
  • pronunciation_overrides: list of rules. Send [] to clear all rules; non-empty list replaces the whole set; omit to leave unchanged.

Validation rejects default_backend not in allowed_backends. Permission: scaispeak:admin.
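
The backend validation rule can be checked before sending an update. This sketch expands the string shorthand into a list and applies the "default must be allowed" rule locally; the helper names are hypothetical, and the server remains the authority on what it accepts.

```python
from typing import Iterable, List, Union


def normalise_allowed(value: Union[str, Iterable[str]]) -> List[str]:
    """Expand the "A"/"B"/"AB" shorthand (or a list) into a sorted list."""
    # Iterating a string yields its characters, so "AB" becomes ["A", "B"].
    backends = sorted(set(value))
    if not backends or any(b not in ("A", "B") for b in backends):
        raise ValueError(f"allowed_backends must be a subset of A and B: {value!r}")
    return backends


def check_policy(allowed_backends: Union[str, Iterable[str]],
                 default_backend: str) -> dict:
    """Return a policy fragment, or raise where the server would reject it."""
    allowed = normalise_allowed(allowed_backends)
    if default_backend not in allowed:
        raise ValueError("default_backend must be in allowed_backends "
                         "(SCAISPEAK_TENANT_POLICY_INVALID)")
    return {"allowed_backends": allowed, "default_backend": default_backend}


print(check_policy("AB", "B"))
```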

Text normalisation and pronunciation overrides#

ScaiSpeak ships an optional text-preprocessing pipeline that runs before dispatch when normalize_text is true (per request or via tenant default). Three stages, in order:

  1. Strip noise — emoji, zero-width characters, markdown emphasis (**bold**, _italic_, ~~strike~~, backticks), bare URLs (rewritten to "link"), bullet glyphs at line start. Language-agnostic.
  2. Pronunciation overrides — whole-word substitution rules from the tenant policy, then per-voice rules layered on top. Tenant rules run first; voice rules can refine or override for one specific voice.
  3. Expand — date / time / number / currency rendering for the voice's primary language. Supported: en, nl, de, fr. Unsupported languages skip this stage but still benefit from strip + overrides.

Pronunciation rule shape:

```json
{
  "pattern": "Kubernetes",
  "replacement": "koo-ber-net-eez",
  "case_sensitive": true
}
```

  • pattern — required, matched as a whole word (Unicode word boundaries; k8s won't match inside k8short).
  • replacement — required, written into the text verbatim. Multiple words allowed.
  • case_sensitive — optional, defaults to true. Set false for acronyms / brand names where casing varies in caller input.

Bad rules (missing fields, empty patterns) are skipped silently — one operator typo doesn't break a tenant's synth pipeline.
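
The rule semantics described above can be approximated with Python's `re` module. This is an illustrative sketch, not ScaiSpeak's implementation: it assumes the layering is plain sequential application (tenant rules, then voice rules), and it uses `\b`, which will not behave as a whole-word boundary for patterns that begin or end with a non-word character.

```python
import re
from typing import List


def apply_overrides(text: str, tenant_rules: List[dict],
                    voice_rules: List[dict]) -> str:
    """Apply tenant rules first, then voice rules, skipping malformed rules."""
    for rule in list(tenant_rules) + list(voice_rules):
        if not isinstance(rule, dict):
            continue
        pattern = rule.get("pattern")
        replacement = rule.get("replacement")
        if not pattern or not isinstance(replacement, str):
            continue  # bad rule: skipped silently
        flags = 0 if rule.get("case_sensitive", True) else re.IGNORECASE
        # re.escape keeps the pattern literal; \b is Unicode-aware for str input.
        regex = re.compile(r"\b" + re.escape(pattern) + r"\b", flags)
        # A function replacement inserts the text verbatim (no backslash escapes).
        text = regex.sub(lambda _m, r=replacement: r, text)
    return text


tenant = [{"pattern": "Kubernetes", "replacement": "koo-ber-net-eez"}]
voice = [{"pattern": "k8s", "replacement": "kates", "case_sensitive": False}]
print(apply_overrides("Kubernetes, K8S and k8short", tenant, voice))
# prints: koo-ber-net-eez, kates and k8short
```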

Examples of what the expansion stage produces, voice with language_primary="en":

  • 2026-05-23 → "the twenty third of May, two thousand twenty six"
  • 17:30 → "seventeen thirty"
  • 5:30 PM → "five thirty PM"
  • $42.50 → "forty-two dollars and fifty cents"
  • £1,234.56 → "one thousand, two hundred and thirty-four pounds and fifty-six pence"

Equivalent input with language_primary="nl":

  • 2026-05-23 → "drieëntwintig mei tweeduizendzesentwintig"
  • 17:30 → "zeventien uur dertig"
  • €42,50 → "tweeënveertig euro en vijftig cent"

Locale convention: DD/MM/YYYY is the assumed convention (matches nl, de, fr, UK English; US callers should use ISO YYYY-MM-DD to disambiguate). Decimal/thousands separators follow the voice's language — $42.50 for en, €42,50 for nl/de/fr.

The pipeline runs after voice resolution + blocklist + the output-cache check, before dispatch — so cache keys stay on the raw caller-provided text and the synthesised audio reflects the normalised text.

ScaiDrive proxy#

GET /admin/scaidrive/shares#

Read-only forwarding to ScaiDrive — list shares the caller can see. Used by the synth page destination picker. Requires JWT auth (not sgk_). Returns 404 with SCAISPEAK_SCAIDRIVE_NOT_AVAILABLE when ScaiDrive isn't configured in the deployment.

GET /admin/scaidrive/shares/{share_id}/folders#

Lazy-browse folders inside a share. Query: folder_id (omit for the share root). Returns folder children only.

Admin lifecycle#

POST /admin/lifecycle/install#

First-time install hook called by the module-host. Idempotent. SuperAdmin-only.

POST /admin/lifecycle/upgrade#

Version upgrade hook. Idempotent. SuperAdmin-only.

POST /admin/lifecycle/uninstall#

Module uninstall — soft-deletes every non-global voice in the deployment, signals the erasure worker to fan out. Requires confirmation_token + expected_module_id. SuperAdmin-only.

POST /admin/lifecycle/tenant/{tenant_id}/enable#

Per-tenant enable. SuperAdmin-only.

POST /admin/lifecycle/tenant/{tenant_id}/disable#

Per-tenant disable — soft-deletes all the tenant's user + tenant scope voices and signals erasure. Global voices untouched. SuperAdmin-only.

Blocklist + audit#

POST /admin/blocklist#

Add a blocklist entry. Body: scope (tenant, user, voice), target_id, reason, optional expires_at. Permission: scaispeak:admin.

GET /admin/blocklist#

List active blocklist entries. Query: scope, tenant_id, limit. Permission: scaispeak:admin.

DELETE /admin/blocklist/{block_id}#

Remove a blocklist entry (manual unblock). Returns 204 No Content. Permission: scaispeak:admin.

GET /admin/erasure/audit#

List erasure audit rows. Query: tenant_id, voice_id, limit. Returns most-recent-first. Permission: scaispeak:admin.

Global voices (SuperAdmin)#

POST /admin/voices/global#

Create a platform-scope (scope='global') voice — no consent, license-based. SuperAdmin-only. Form fields:

| Field | Required | Notes |
| --- | --- | --- |
| reference | yes | Multipart reference audio. ScaiDrive references not accepted for globals. |
| display_name, language_primary | yes | Same shape as user voices. |
| licensor_name | yes | Who licensed the voice to ScaiLabs. |
| license_type | yes | perpetual, time_bound, usage_bound. |
| valid_until | when time_bound | ISO-8601 timestamp. |
| usage_limit_chars | when usage_bound | Integer cap. |
| licensor_reference | no | Contract reference. |
| valid_from | no | ISO-8601 start. |
| terms_summary | no | Operator-facing summary of the terms. |
| license_document | no | Optional PDF; stored alongside the voice. |

Returns the new voice_id, license_id, and intake note.
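
The conditional requirements on the license fields can be expressed as a small check. This sketch covers only the two documented conditions (valid_until for time_bound, usage_limit_chars for usage_bound); treating a non-positive cap as invalid is an assumption, as is the helper name `check_license_fields`.

```python
from typing import Optional

LICENSE_TYPES = {"perpetual", "time_bound", "usage_bound"}
INVALID = "SCAISPEAK_LICENSE_FIELD_INVALID"


def check_license_fields(license_type: str,
                         valid_until: Optional[str] = None,
                         usage_limit_chars: Optional[int] = None) -> Optional[str]:
    """Return an error code if the bound fields do not fit the license type."""
    if license_type not in LICENSE_TYPES:
        return INVALID
    if license_type == "time_bound" and not valid_until:
        return INVALID
    if license_type == "usage_bound" and (
            usage_limit_chars is None or usage_limit_chars <= 0):
        # Rejecting a non-positive cap is an assumption, not documented.
        return INVALID
    return None
```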

DELETE /admin/voices/global/{voice_id}#

Revoke a global voice. SuperAdmin-only. Form field trigger (license_revoked, license_expired, platform_decision). Bypasses the owner-equality check that protects user/tenant voices. Updates the license row's status to match the trigger. Runs the full erasure pipeline.

Errors#

All endpoints return ScaiGrid's standard error envelope:

```json
{
  "error": {
    "code": "SCAISPEAK_VOICE_NOT_FOUND",
    "message": "Voice does not exist or isn't visible to the caller",
    "details": { "voice_id": "vc_..." }
  },
  "meta": { "request_id": "req_..." }
}
```

ScaiSpeak-specific codes:

| Code | Meaning |
| --- | --- |
| SCAISPEAK_VOICE_NOT_FOUND | Voice id doesn't exist or isn't visible. |
| SCAISPEAK_VOICE_ACCESS_DENIED | Caller can't perform this operation on this voice. |
| SCAISPEAK_VOICE_PREFLIGHT_FAILED | Reference audio failed quality checks. Body includes preflight. |
| SCAISPEAK_CONSENT_INVALID | Consent recording missing or doesn't match the scripted text. |
| SCAISPEAK_AMBIGUOUS_SOURCE | Both inline upload and ScaiDrive reference supplied for the same file. |
| SCAISPEAK_VOICE_SHARE_FORBIDDEN | Only the owner with voice.share can promote to tenant scope. |
| SCAISPEAK_BACKEND_UNAVAILABLE | No allowed backend currently available. |
| SCAISPEAK_TENANT_POLICY_INVALID | Policy update rejected (e.g. default not in allowed set). |
| SCAISPEAK_JOB_NOT_FOUND | Job id doesn't exist or doesn't belong to this caller. |
| SCAISPEAK_VOICE_NOT_READY_FOR_WARMING | Legacy warming path returns this when the voice doesn't have the cached state the previous-gen engine needed. No-op on the current zero-shot engine; safe to ignore for new code. |
| SCAISPEAK_SAVE_TO_REQUIRES_JWT | save_to attempted with sgk_ API key auth. |
| SCAISPEAK_SAVE_TO_EXCHANGE_FAILED | ScaiKey token exchange against ScaiDrive failed. |
| SCAISPEAK_SCAIDRIVE_NOT_AVAILABLE | ScaiDrive integration not configured. |
| SCAISPEAK_SCAIDRIVE_FORBIDDEN | Caller lacks write access on the destination share. |
| SCAISPEAK_SCAIDRIVE_NOT_FOUND | Destination share or folder doesn't exist. |
| SCAISPEAK_SCAIDRIVE_CONFLICT | File exists at destination and overwrite is false. |
| SCAISPEAK_SCAIDRIVE_QUOTA_EXCEEDED | Destination share over quota (HTTP 507). |
| SCAISPEAK_LICENSE_FIELD_INVALID | License-type / bound mismatch on global voice create. |
| SCAISPEAK_GLOBAL_VOICE_NOT_FOUND | Global voice doesn't exist or already deleted. |
| SCAISPEAK_BLOCKLIST_NOT_FOUND | Blocklist entry id doesn't exist. |
| SCAISPEAK_UNINSTALL_TOKEN_MISMATCH | Uninstall hook called without a matching token. |

Updated 2026-05-23 23:54:33 (rev 15)