Changelog

User-visible changes only. Internal refactors and infrastructure work omitted.

v0.x — Phase rollout

ScaiSpeak ships in phases. Each phase adds endpoints and capabilities; the module ID and URL prefix have been stable since Phase 0.

Phase 1 — Voice library. List / get / clone / update / delete voices. Preflight checks on intake. Consent capture. ScaiDrive references for reference + consent audio. Permissions split into synthesize, voice.read, voice.write, voice.share, admin.
Phase 2 — Batch synth. POST /speak with Backend B (managed TTS relay) wired. Tenant backend policy at /admin/policy. Voice preview endpoint.
Phase 2B — Self-host backend. Backend A (ScaiInfer-hosted TTS engine) added behind the same /speak path. Backend policy picks per-tenant.
Phase 3 — Voice warming. voice_prefix_tokens from the previous-generation cloning pipeline. Warm / evict / repromote endpoints. Redis-backed warm registry. Superseded 2026-05-22 by the zero-shot cloning engine; the endpoints remain for compatibility but are no-ops on the new engine.
Phase 4 — WebSocket streaming. WS /stream/speak with the text/flush/interrupt/close vocabulary. Opus + PCM output codecs.
Phase 5 — WebRTC. Session lifecycle at /stream/speak/webrtc/sessions/* plus control WebSocket. Requires aiortc + av in the deployment.
Phase 6 — Async long-form. POST /speak returns 202 + job_id for text over the threshold. GET /speak/jobs/{id} for polling. Caller can force the path with force_async.
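A caller-side sketch of the Phase 6 flow: detect the 202 response, then poll the job until it finishes. The 202 status, job_id, and GET /speak/jobs/{id} come from the entry above. The status field, its "done" / "failed" values, and the injected fetch_job callable are assumptions made for illustration, not documented response shapes.

```python
import time
from typing import Callable, Optional


def job_id_if_async(status_code: int, body: dict) -> Optional[str]:
    """Return the job_id when POST /speak answered 202, else None (sync path)."""
    if status_code == 202:
        return body["job_id"]
    return None


def poll_job(
    job_id: str,
    fetch_job: Callable[[str], dict],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state.

    fetch_job(job_id) stands in for GET /speak/jobs/{id}; the "status"
    key and its terminal values are hypothetical.
    """
    for _ in range(max_attempts):
        job = fetch_job(job_id)
        if job.get("status") in ("done", "failed"):  # assumed terminal states
            return job
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} still pending after {max_attempts} polls")
```

Passing fetch_job and sleep as parameters keeps the loop independent of any HTTP client and makes it easy to exercise without a network.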
Phase 7 — GDPR + safety. Erasure pipeline with audit rows. Blocklist endpoints. Lifecycle hooks (install / upgrade / uninstall / tenant enable / disable) wired into the erasure worker.
2026-05-13 — save_to ScaiDrive. POST /speak accepts a save_to block; sync + async paths upload to the caller's ScaiDrive share via token exchange. Synth admin page at /admin/scaispeak/synthesise ships with the ScaiDrive folder picker and localStorage presets. Global voices: POST /admin/voices/global + DELETE /admin/voices/global/{id}, SuperAdmin-only, licensed-not-consent-based.
2026-05-22 — Zero-shot cloning engine. Self-hosted cloning is now zero-shot: the reference clip is consumed at synth time directly, no separate training step. New voices land at embedding_status: ready immediately after intake clears preflight. Three new optional fields on POST /speak (instructions, cfg_value, warmup_trim_ms) let callers tune per-call delivery for cloned voices. Output sample rate is now 48 kHz on the self-hosted path, up from 24 kHz. The warm / repromote endpoints stay in place as no-ops for compatibility.
2026-05-23 — Voice Design + reference editing. New intake and editing capabilities, plus a fix to per-call instructions:
- Voice Design: create a voice from a natural-language description alone — no reference clip, no consent recording. POST /voices now accepts voice_design_prompt as a top-level field; cloned-mode reference/consent become forbidden when it's set (SCAISPEAK_AMBIGUOUS_INTAKE_MODE). See the new Design a voice tutorial.
- Switch a cloned voice to design-only: PATCH /voices/{id} with clear_reference: true plus a fresh voice_design_prompt tombstones the reference + consent and keeps the voice ID stable.
- Replace a reference clip: new POST /voices/{id}/reference multipart endpoint to swap a cloned voice's reference + capture new consent without deleting the row. Old blobs tombstoned.
- Per-call instructions now takes effect on the synthesized voice (it was a no-op when the engine shipped; the engine-side wiring landed this cycle). The field is renamed on the wire to control_instruction for clarity; the caller-side instructions field is unchanged.
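The intake rule for POST /voices (a voice_design_prompt excludes reference and consent audio) can be pre-checked on the client. This is a minimal sketch: only the voice_design_prompt field and the SCAISPEAK_AMBIGUOUS_INTAKE_MODE code come from the entries above. The IntakeError class, the intake_mode helper, and the "design" / "cloned" labels are hypothetical names.

```python
from typing import Optional


class IntakeError(ValueError):
    """Hypothetical client-side error carrying the service's error code."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


def intake_mode(
    voice_design_prompt: Optional[str],
    has_reference: bool,
    has_consent: bool,
) -> str:
    """Classify an intake request, rejecting mixed design + cloned input."""
    if voice_design_prompt:
        if has_reference or has_consent:
            # Design mode forbids cloned-mode reference/consent audio.
            raise IntakeError(
                "SCAISPEAK_AMBIGUOUS_INTAKE_MODE",
                "voice_design_prompt cannot be combined with reference or consent audio",
            )
        return "design"
    return "cloned"
```

The server remains the authority on validation; a check like this only avoids a round trip for an obviously ambiguous request.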
2026-05-25 — Per-voice TTS defaults. Voices can now carry default synthesis parameters (default_instructions, default_speed, default_cfg_value, default_warmup_trim_ms) settable via PATCH /voices/{id}. These flow as defaults into every POST /speak call and every ScaiVoice session using the voice, unless overridden per request or per session. Lets voice owners bake in tuning that works well for a particular voice without requiring every consumer to specify it.
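The precedence described above (per-request over per-session over voice default) can be sketched as a layered lookup. The default_* names come from the entry. The per-call speed field, the treatment of None as "not set", and the resolve_params helper are assumptions for illustration only.

```python
from typing import Any, Dict, Optional

# Per-call names; each maps to a "default_<name>" field on the voice.
FIELDS = ("instructions", "speed", "cfg_value", "warmup_trim_ms")


def resolve_params(
    voice: Dict[str, Any],
    session: Optional[Dict[str, Any]] = None,
    request: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Pick each parameter from the most specific layer that sets it."""
    resolved: Dict[str, Any] = {}
    for name in FIELDS:
        layers = (
            (request, name),            # highest priority: the single call
            (session, name),            # then the session
            (voice, f"default_{name}"), # finally the voice's baked-in default
        )
        for layer, key in layers:
            if layer and layer.get(key) is not None:
                resolved[name] = layer[key]
                break
    return resolved
```

For example, a voice with default_speed 0.9 and a request that sets speed to 1.1 resolves to 1.1, while any parameter the request omits falls back to the session value or the voice default.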
2026-05-23 — Text normalisation and pronunciation overrides. Optional text-preprocessing pipeline for POST /speak:
- New normalize_text field per request (true / false / omit for tenant default). When on, the pipeline strips emoji + markdown noise, applies tenant + per-voice pronunciation rules, then expands dates / times / numbers / currency for the voice's primary language (en, nl, de, fr supported).
- Tenant default settable via PUT /admin/policy → text_normalization_default.
- Tenant-wide pronunciation rules on the same endpoint: pronunciation_overrides accepts a list of {pattern, replacement, case_sensitive} rules. Per-voice rules can be set via PATCH /voices/{id} and layer on top of the tenant rules.
- Currency expansion supports $, €, £ with locale-appropriate decimal/thousands separators and language-specific subunit words (cents / pence / centimes / Cent).
- Cache keys stay on the raw caller text — toggling normalisation per request doesn't break idempotency.
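The rule layering and cache-key behaviour above can be sketched as follows. The {pattern, replacement, case_sensitive} rule shape comes from the entry. Treating pattern as a literal string (not a regex), applying tenant rules before per-voice rules, defaulting case_sensitive to false, and the SHA-256 key format are all assumptions for illustration.

```python
import hashlib
import re
from typing import Iterable, Mapping


def apply_overrides(
    text: str,
    tenant_rules: Iterable[Mapping],
    voice_rules: Iterable[Mapping] = (),
) -> str:
    """Apply tenant rules first, then per-voice rules on top."""
    for rule in list(tenant_rules) + list(voice_rules):
        flags = 0 if rule.get("case_sensitive", False) else re.IGNORECASE
        pattern = re.escape(rule["pattern"])  # assumed: literal match, not regex
        replacement = rule["replacement"]
        # A function replacement avoids backslash-escape processing.
        text = re.sub(pattern, lambda _m, r=replacement: r, text, flags=flags)
    return text


def cache_key(voice_id: str, raw_text: str) -> str:
    """Key on the raw caller text, so toggling normalisation keeps the key stable."""
    return hashlib.sha256(f"{voice_id}\x00{raw_text}".encode("utf-8")).hexdigest()
```

Because cache_key never sees the normalised text, the same caller input maps to the same key whether or not normalize_text is on.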