Changelog
User-visible changes only. Internal refactors and infrastructure work omitted.
v0.x: Phase rollout
ScaiSpeak ships in phases. Each phase adds endpoints and capabilities; the module ID and URL prefix have been stable since Phase 0.
- Phase 1: Voice library. List / get / clone / update / delete voices. Preflight checks on intake. Consent capture. ScaiDrive references for reference + consent audio. Permissions split into `synthesize`, `voice.read`, `voice.write`, `voice.share`, `admin`.
- Phase 2: Batch synth. `POST /speak` with Backend B (managed TTS relay) wired. Tenant backend policy at `/admin/policy`. Voice preview endpoint.
- Phase 2B: Self-host backend. Backend A (ScaiInfer-hosted TTS engine) added behind the same `/speak` path. Backend policy picks per-tenant.
- Phase 3: Voice warming. `voice_prefix_tokens` from the previous-generation cloning pipeline. Warm / evict / repromote endpoints. Redis-backed warm registry. Superseded 2026-05-22 by the zero-shot cloning engine; the endpoints remain for compatibility but are no-ops on the new engine.
- Phase 4: WebSocket streaming. `WS /stream/speak` with the text/flush/interrupt/close vocabulary. Opus + PCM output codecs.
- Phase 5: WebRTC. Session lifecycle at `/stream/speak/webrtc/sessions/*` plus control WebSocket. Requires `aiortc` + `av` in the deployment.
- Phase 6: Async long-form. `POST /speak` returns `202` + `job_id` for text over the threshold. `GET /speak/jobs/{id}` for polling. Caller can force the path with `force_async`.
- Phase 7: GDPR + safety. Erasure pipeline with audit rows. Blocklist endpoints. Lifecycle hooks (install / upgrade / uninstall / tenant enable / disable) wired into the erasure worker.
- 2026-05-13: `save_to` ScaiDrive. `POST /speak` accepts a `save_to` block; sync + async paths upload to the caller's ScaiDrive share via token exchange. Synth admin page at `/admin/scaispeak/synthesise` ships with the ScaiDrive folder picker and localStorage presets. Global voices: `POST /admin/voices/global` + `DELETE /admin/voices/global/{id}`, SuperAdmin-only, licensed-not-consent-based.
- 2026-05-22: Zero-shot cloning engine. Self-hosted cloning is now zero-shot: the reference clip is consumed at synth time directly, no separate training step. New voices land at `embedding_status: ready` immediately after intake clears preflight. Three new optional fields on `POST /speak` (`instructions`, `cfg_value`, `warmup_trim_ms`) let callers tune per-call delivery for cloned voices. Output sample rate is now 48 kHz on the self-hosted path, up from 24 kHz. The warm / repromote endpoints stay in place as no-ops for compatibility.
- 2026-05-23: Voice Design + reference editing. Two new intake / editing capabilities:
  - Voice Design: create a voice from a natural-language description alone, with no reference clip and no consent recording. `POST /voices` now accepts `voice_design_prompt` as a top-level field; cloned-mode `reference` / `consent` become forbidden when it's set (`SCAISPEAK_AMBIGUOUS_INTAKE_MODE`). See the new "Design a voice" tutorial.
  - Switch a cloned voice to design-only: `PATCH /voices/{id}` with `clear_reference: true` plus a fresh `voice_design_prompt` tombstones the reference + consent and keeps the voice ID stable.
  - Replace a reference clip: new `POST /voices/{id}/reference` multipart endpoint to swap a cloned voice's reference + capture new consent without deleting the row. Old blobs tombstoned.
  - Per-call `instructions` now actually moves the voice (was a no-op at engine ship time; engine-side wiring landed this cycle). Field renamed on the wire to `control_instruction` for clarity; caller-side `instructions` field unchanged.
- 2026-05-23: Text normalisation and pronunciation overrides. Optional text-preprocessing pipeline for `POST /speak`:
  - New `normalize_text` field per request (`true` / `false` / omit for tenant default). When on, the pipeline strips emoji + markdown noise, applies tenant + per-voice pronunciation rules, then expands dates / times / numbers / currency for the voice's primary language (en, nl, de, fr supported).
  - Tenant default settable via `PUT /admin/policy` → `text_normalization_default`.
  - Tenant-wide pronunciation rules on the same endpoint: `pronunciation_overrides` accepts a list of `{pattern, replacement, case_sensitive}` rules. Per-voice rules can be set via `PATCH /voices/{id}` and layer on top of the tenant rules.
  - Currency expansion supports `$`, `€`, `£` with locale-appropriate decimal/thousands separators and lang-specific subunit words (cents / pence / centimes / Cent).
  - Cache keys stay on the raw caller text, so toggling normalisation per request doesn't break idempotency.
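
As a rough illustration of how tenant and per-voice pronunciation rules might layer, the sketch below applies a list of `{pattern, replacement, case_sensitive}` rules in order, tenant rules first and per-voice rules second. The changelog does not say whether `pattern` is a literal string or a regular expression, or in which order the two rule sets run. This sketch assumes literal matching and tenant-then-voice ordering, and the `Rule` class and `apply_rules` function are hypothetical names, not part of the ScaiSpeak API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical client-side mirror of one pronunciation override."""
    pattern: str
    replacement: str
    case_sensitive: bool = False


def apply_rules(text: str, tenant_rules: list[Rule], voice_rules: list[Rule]) -> str:
    """Apply tenant rules, then per-voice rules (the ordering is an assumption)."""
    for rule in [*tenant_rules, *voice_rules]:
        flags = 0 if rule.case_sensitive else re.IGNORECASE
        # Assumption: patterns are literal text, so regex metacharacters are escaped.
        # A callable replacement keeps any backslashes in the replacement literal.
        text = re.sub(
            re.escape(rule.pattern),
            lambda _match, r=rule: r.replacement,
            text,
            flags=flags,
        )
    return text


tenant = [Rule("SQL", "sequel")]
voice = [Rule("ScaiSpeak", "sky speak", case_sensitive=True)]
result = apply_rules("ScaiSpeak talks to sql.", tenant, voice)
print(result)
```

Running rules sequentially means a later rule can see the output of an earlier one, which is one plausible reading of "layer on top"; the real pipeline may resolve conflicts differently.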
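
The 2026-05-23 Voice Design entry makes `voice_design_prompt` mutually exclusive with the cloned-mode `reference` / `consent` fields on `POST /voices`. A client might catch the conflict before sending the request. The sketch below is a minimal illustration only: `check_intake_mode` is a hypothetical helper, the payload is treated as a plain dict, the `name` field in the usage line is made up, and the exact server-side validation (including whether a cloned intake needs both fields) is assumed rather than taken from the API reference.

```python
from typing import Any


def check_intake_mode(payload: dict[str, Any]) -> str:
    """Return "design" or "cloned", or raise if the payload mixes both modes."""
    has_prompt = payload.get("voice_design_prompt") is not None
    cloned_fields = [k for k in ("reference", "consent") if payload.get(k) is not None]

    if has_prompt and cloned_fields:
        # Mirrors the error code named in the changelog; the server stays the authority.
        raise ValueError(
            f"SCAISPEAK_AMBIGUOUS_INTAKE_MODE: drop {', '.join(cloned_fields)}"
        )
    if has_prompt:
        return "design"
    if cloned_fields:
        return "cloned"
    raise ValueError("payload has neither voice_design_prompt nor reference/consent")


# "name" is a placeholder field for illustration only.
mode = check_intake_mode({"name": "narrator", "voice_design_prompt": "warm, low, unhurried"})
print(mode)
```

A local check like this only saves a round trip; the response from `POST /voices` remains the source of truth for which combinations are accepted.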