---
summary: User-visible changes to ScaiSpeak.
title: Changelog
path: changelog
status: published
---

User-visible changes only. Internal refactors and infrastructure work are omitted.

## v0.x — Phase rollout

ScaiSpeak ships in phases. Each phase adds endpoints and capabilities; the module ID and URL prefix have been stable since Phase 0.

- **Phase 1 — Voice library.** List / get / clone / update / delete voices. Preflight checks on intake. Consent capture. ScaiDrive references for reference + consent audio. Permissions split into `synthesize`, `voice.read`, `voice.write`, `voice.share`, `admin`.
- **Phase 2 — Batch synth.** `POST /speak` with Backend B (managed TTS relay) wired. Tenant backend policy at `/admin/policy`. Voice preview endpoint.
- **Phase 2B — Self-host backend.** Backend A (ScaiInfer-hosted TTS engine) added behind the same `/speak` path. Backend policy selects the backend per tenant.
- **Phase 3 — Voice warming.** `voice_prefix_tokens` from the previous-generation cloning pipeline. Warm / evict / repromote endpoints. Redis-backed warm registry. *Superseded 2026-05-22 by the zero-shot cloning engine; the endpoints remain for compatibility but are no-ops on the new engine.*
- **Phase 4 — WebSocket streaming.** `WS /stream/speak` with the text/flush/interrupt/close vocabulary. Opus + PCM output codecs.
- **Phase 5 — WebRTC.** Session lifecycle at `/stream/speak/webrtc/sessions/*` plus control WebSocket. Requires `aiortc` + `av` in the deployment.
- **Phase 6 — Async long-form.** `POST /speak` returns `202` + `job_id` for text over the threshold. `GET /speak/jobs/{id}` for polling. Caller can force the path with `force_async`.
- **Phase 7 — GDPR + safety.** Erasure pipeline with audit rows. Blocklist endpoints. Lifecycle hooks (install / upgrade / uninstall / tenant enable / disable) wired into the erasure worker.
- **2026-05-13 — save_to ScaiDrive.** `POST /speak` accepts a `save_to` block; sync + async paths upload to the caller's ScaiDrive share via token exchange. Synth admin page at `/admin/scaispeak/synthesise` ships with the ScaiDrive folder picker and localStorage presets. Global voices: `POST /admin/voices/global` + `DELETE /admin/voices/global/{id}`, SuperAdmin-only, licensed rather than consent-based.
- **2026-05-22 — Zero-shot cloning engine.** Self-hosted cloning is now zero-shot: the reference clip is consumed at synth time directly, no separate training step. New voices land at `embedding_status: ready` immediately after intake clears preflight. Three new optional fields on `POST /speak` (`instructions`, `cfg_value`, `warmup_trim_ms`) let callers tune per-call delivery for cloned voices. Output sample rate is now 48 kHz on the self-hosted path, up from 24 kHz. The warm / repromote endpoints stay in place as no-ops for compatibility.
- **2026-05-23 — Voice Design + reference editing.** New intake / editing capabilities:
    - **Voice Design**: create a voice from a natural-language description alone, with no reference clip and no consent recording. `POST /voices` now accepts `voice_design_prompt` as a top-level field; when it is set, the cloned-mode `reference` / `consent` fields are rejected with `SCAISPEAK_AMBIGUOUS_INTAKE_MODE`. See the new *Design a voice* tutorial.
    - **Switch a cloned voice to design-only**: `PATCH /voices/{id}` with `clear_reference: true` plus a fresh `voice_design_prompt` tombstones the reference + consent and keeps the voice ID stable.
    - **Replace a reference clip**: new `POST /voices/{id}/reference` multipart endpoint to swap a cloned voice's reference + capture new consent without deleting the row. Old blobs tombstoned.
    - Per-call `instructions` now actually moves the voice (was a no-op at engine ship time; engine-side wiring landed this cycle). Field renamed on the wire to `control_instruction` for clarity; caller-side `instructions` field unchanged.
- **2026-05-23 — Text normalisation and pronunciation overrides.** Optional text-preprocessing pipeline for `POST /speak`:
    - New `normalize_text` field per request (`true` / `false` / omit for tenant default). When on, the pipeline strips emoji + markdown noise, applies tenant + per-voice pronunciation rules, then expands dates / times / numbers / currency for the voice's primary language (en, nl, de, fr supported).
    - Tenant default settable via `PUT /admin/policy` → `text_normalization_default`.
    - Tenant-wide pronunciation rules on the same endpoint: `pronunciation_overrides` accepts a list of `{pattern, replacement, case_sensitive}` rules. Per-voice rules can be set via `PATCH /voices/{id}` and layer on top of the tenant rules.
    - Currency expansion supports `$`, `€`, `£` with locale-appropriate decimal/thousands separators and lang-specific subunit words (cents / pence / centimes / Cent).
    - Cache keys stay on the raw caller text — toggling normalisation per request doesn't break idempotency.
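
As a rough illustration of how the optional `POST /speak` fields from the entries above fit together, the sketch below builds a request body and interprets the async outcome (`202` + `job_id`). The `text` and `voice_id` keys, both helper names, and the contents of `save_to` are assumptions made for illustration; only `instructions`, `cfg_value`, `warmup_trim_ms`, `normalize_text`, `force_async`, `save_to`, `job_id`, and the `/speak/jobs/{id}` path come from this changelog.

```python
from typing import Any, Optional


def build_speak_body(
    text: str,
    voice_id: str,
    *,
    instructions: Optional[str] = None,
    cfg_value: Optional[float] = None,
    warmup_trim_ms: Optional[int] = None,
    normalize_text: Optional[bool] = None,
    force_async: bool = False,
    save_to: Optional[dict] = None,
) -> dict[str, Any]:
    """Build a JSON-serialisable body for POST /speak (hypothetical helper)."""
    # "text" and "voice_id" are assumed key names, not confirmed by the changelog.
    body: dict[str, Any] = {"text": text, "voice_id": voice_id}
    optional = {
        "instructions": instructions,
        "cfg_value": cfg_value,
        "warmup_trim_ms": warmup_trim_ms,
        "normalize_text": normalize_text,
        "save_to": save_to,
    }
    # Leave unset fields out entirely. For normalize_text, omission means
    # "use the tenant default", so an explicit False must be kept.
    body.update({key: value for key, value in optional.items() if value is not None})
    if force_async:
        body["force_async"] = True
    return body


def job_poll_path(status_code: int, payload: dict) -> Optional[str]:
    """Return the polling path when the server answered 202, else None."""
    if status_code == 202:
        return f"/speak/jobs/{payload['job_id']}"
    return None
```

A caller would send the body with whatever HTTP client the deployment uses, then poll the returned path only when `job_poll_path` is not `None`.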
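
The layering of tenant and per-voice pronunciation rules can be pictured with a small sketch. It assumes that `pattern` is matched as literal text (the changelog does not say whether patterns are regular expressions), that tenant rules run before per-voice rules, and that a missing `case_sensitive` is treated as false. All three are assumptions, and `apply_rules` is a hypothetical helper, not part of ScaiSpeak.

```python
import re


def apply_rules(text: str, tenant_rules: list[dict], voice_rules: list[dict]) -> str:
    """Apply {pattern, replacement, case_sensitive} rules in order (sketch)."""
    # Assumed order: tenant rules first, then per-voice rules on top.
    for rule in [*tenant_rules, *voice_rules]:
        flags = 0 if rule.get("case_sensitive", False) else re.IGNORECASE
        # Assumption: patterns are literal strings, so escape them.
        pattern = re.escape(rule["pattern"])
        replacement = rule["replacement"]
        # A function replacement keeps backslashes in the replacement literal.
        text = re.sub(pattern, lambda _match, r=replacement: r, text, flags=flags)
    return text


tenant = [{"pattern": "ScaiSpeak", "replacement": "sky speak", "case_sensitive": True}]
voice = [{"pattern": "sql", "replacement": "sequel", "case_sensitive": False}]
result = apply_rules("ScaiSpeak speaks SQL", tenant, voice)
print(result)
```

Because per-voice rules run last in this sketch, a voice can re-map text that a tenant rule has already rewritten; whether the real pipeline behaves that way is not stated in the changelog.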
