Voice library and consent

A voice in ScaiSpeak is a row in the library with a reference clip, a consent or license trail, and an embedding state. When you call /speak, you pick a voice_id from a list filtered by what's visible to you.

Three scopes, one list

Every list call (GET /voices) merges three pools:

Global voices are platform-managed. SuperAdmins import them through POST /admin/voices/global against a commercial license — the licensor name, license type, and (for time- or usage-bound licenses) the expiry or character cap live on the voice row. No consent recording. Every tenant sees every global voice.
Tenant voices are shared inside one tenant. They're created by a user, then promoted via POST /voices/{id}/share (which needs scaispeak:voice.share). All users in the tenant see them.
User voices are private to the user who cloned them. Other users in the same tenant don't see them. This is the default scope for any cloned voice.

The visibility ACL is enforced server-side in VoiceService.list_visible and get_visible. You can't bypass it by guessing a voice_id — cross-scope reads return 404, not 403, so the existence of voices outside your scope isn't disclosed.
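A minimal sketch of how the three-scope filter and the 404-on-cross-scope behaviour might fit together. Only the names list_visible and get_visible come from the text above; the Voice dataclass, its field names, and the VoiceNotFound exception are illustrative assumptions, and the real VoiceService may be structured differently.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Voice:
    # Hypothetical row shape; the real voice row has more columns.
    voice_id: str
    scope: str                      # "global", "tenant", or "user"
    tenant_id: Optional[str] = None
    owner_id: Optional[str] = None


class VoiceNotFound(Exception):
    """Would map to HTTP 404 for both missing and out-of-scope voices."""


def is_visible(voice: Voice, tenant_id: str, user_id: str) -> bool:
    if voice.scope == "global":
        return True                                   # every tenant sees these
    if voice.scope == "tenant":
        return voice.tenant_id == tenant_id           # shared inside one tenant
    if voice.scope == "user":
        return voice.tenant_id == tenant_id and voice.owner_id == user_id
    return False


def list_visible(voices: Iterable[Voice], tenant_id: str, user_id: str) -> List[Voice]:
    # One list, merged from the three pools.
    return [v for v in voices if is_visible(v, tenant_id, user_id)]


def get_visible(voices: Iterable[Voice], voice_id: str, tenant_id: str, user_id: str) -> Voice:
    for v in voices:
        if v.voice_id == voice_id and is_visible(v, tenant_id, user_id):
            return v
    # Same error whether the voice is absent or merely out of scope,
    # so a caller cannot probe for the existence of other users' voices.
    raise VoiceNotFound(voice_id)
```

Raising the same exception for "does not exist" and "exists but is not yours" is what keeps the response from leaking existence.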

Two ways to create a voice

ScaiSpeak supports two distinct intake modes:

Clone from reference: record a real speaker, capture their consent, and the engine reproduces that person's voice at synthesis time. Requires a 5-30 second reference clip plus a recorded consent statement.
Design from text: describe the voice in natural language and the engine generates a synthetic speaker that matches. No reference clip, no consent recording, no real-person likeness. See the Design a voice tutorial.

Both paths land voices in the same library and use the same /speak API afterwards. Callers don't know (and don't need to know) which kind they're synthesising against.

Cloned voices

Voice cloning takes two audio inputs:

A reference clip: 5-30 seconds of the speaker's voice in a representative speaking style. This becomes the voice's identity at synthesis time.
A consent clip: the same speaker reading a verbatim scripted statement (the consent_text field) that records what they're agreeing to. It authenticates that the reference voice belongs to the person consenting.

Both arrive through POST /voices as either multipart file uploads or as ScaiDrive references (one of {file_id, mcp_uri, share_url}). You can mix sources across the two files (for example, reference inline and consent from ScaiDrive), but each file must come from exactly one source: supplying both an inline upload and a ScaiDrive reference for the same audio fails fast with SCAISPEAK_AMBIGUOUS_SOURCE.
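The one-source-per-file rule can be sketched as a small resolver. This is an illustration only: the function name, its arguments, and the AmbiguousSource exception are hypothetical, and the plain ValueError cases stand in for whatever error codes the service actually returns for a missing or malformed source.

```python
from typing import Optional, Tuple

# The three ScaiDrive reference forms named in the text.
SCAIDRIVE_KEYS = ("file_id", "mcp_uri", "share_url")


class AmbiguousSource(ValueError):
    code = "SCAISPEAK_AMBIGUOUS_SOURCE"


def resolve_source(label: str,
                   inline: Optional[bytes],
                   scaidrive: Optional[dict]) -> Tuple[str, object]:
    """Return (source_kind, value) for one audio file ("reference" or "consent")."""
    if inline is not None and scaidrive:
        # Inline upload plus ScaiDrive reference for the same file: fail fast.
        raise AmbiguousSource(f"{label}: inline upload and ScaiDrive reference both given")
    if inline is not None:
        return ("inline", inline)
    if scaidrive:
        given = [k for k in SCAIDRIVE_KEYS if scaidrive.get(k)]
        if len(given) != 1:
            # Assumed behaviour: exactly one reference form per file.
            raise ValueError(f"{label}: expected exactly one of {SCAIDRIVE_KEYS}")
        return (given[0], scaidrive[given[0]])
    raise ValueError(f"{label}: no audio source supplied")
```

Each file is resolved independently, which is why a request can send the reference inline and the consent clip from ScaiDrive without tripping the check.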

There's also a WebSocket alternative at WS /voices/record for live-recording in the browser. Same two-phase flow (reference, then consent), same validation, no file handling.

Cloning is zero-shot: the reference clip is consumed directly at synthesis time, and there's no separate training step. Voices become usable as soon as intake clears preflight and the consent record is committed.

Designed voices

Designed voices skip every audio step. The caller sends voice_design_prompt (a natural-language description of the speaker) to the same POST /voices endpoint and the voice lands at embedding_status: ready immediately. No reference, no consent, no preflight — the engine generates a synthetic speaker from the prompt alone.

This is mutually exclusive with the cloned-voice intake — sending voice_design_prompt and a reference upload in the same request is rejected with SCAISPEAK_AMBIGUOUS_INTAKE_MODE. Pick a mode per voice.
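As a rough sketch, the mode selection can be thought of as a single check at the top of the create handler. The function name, its arguments, and the exception class are hypothetical; only the voice_design_prompt field and the SCAISPEAK_AMBIGUOUS_INTAKE_MODE code come from the text.

```python
from typing import Optional


class AmbiguousIntakeMode(ValueError):
    code = "SCAISPEAK_AMBIGUOUS_INTAKE_MODE"


def pick_intake_mode(voice_design_prompt: Optional[str], has_reference_upload: bool) -> str:
    """Return "design" or "clone", rejecting requests that try to do both."""
    has_prompt = bool(voice_design_prompt and voice_design_prompt.strip())
    if has_prompt and has_reference_upload:
        raise AmbiguousIntakeMode("send either voice_design_prompt or a reference clip, not both")
    if has_prompt:
        return "design"   # no audio steps; the voice can be ready immediately
    if has_reference_upload:
        return "clone"    # reference + consent + preflight apply
    # Assumed behaviour: a request with neither input is invalid.
    raise ValueError("neither voice_design_prompt nor a reference clip was supplied")
```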

Switching between modes

You can convert an existing cloned voice into a designed one by sending PATCH /voices/{id} with clear_reference: true and a fresh voice_design_prompt. The reference clip and consent recording are tombstoned; the voice ID stays stable. This is useful when a tenant decides they no longer want a real speaker's voice on file. See the Design a voice tutorial §6.

You can also replace the reference clip on a cloned voice (a better recording of the same speaker, or a re-recording with refreshed consent) via POST /voices/{id}/reference. A new consent capture is required because the recording itself is changing.

Preflight

Before any audio is stored, the reference clip runs through a cheap preflight (run_preflight):

Duration in milliseconds (must fall within the 5-30 second range required for reference clips).
Sample rate and channel count.
Peak dBFS (clipping check).
Estimated SNR (signal-to-noise sanity check).
Voice-activity ratio (rejects clips that are mostly silence).

When the preflight fails, the response carries the structured preflight block so the operator can see which threshold tripped without having to re-upload. Warn-not-block findings show up in warnings; blocking findings show up in fail_reasons and the request returns 400 with code SCAISPEAK_VOICE_PREFLIGHT_FAILED.
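To make the checks concrete, here is a simplified sketch of a preflight over mono float samples. It is not the service's run_preflight: the function name, the report fields, the finding tags, and every threshold except the 5-30 second duration range are assumptions chosen for illustration, and the SNR estimate is left out for brevity.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence

MIN_DURATION_MS = 5_000       # from the 5-30 second reference-clip range
MAX_DURATION_MS = 30_000
PEAK_WARN_DBFS = -1.0         # assumed: warn when the peak is this close to full scale
MIN_VOICE_ACTIVITY = 0.5      # assumed: block when under half the frames are active
ACTIVE_FRAME_RMS = 0.01       # assumed: RMS above this counts as an active frame


@dataclass
class PreflightReport:
    duration_ms: int
    sample_rate: int
    channels: int
    peak_dbfs: float
    voice_activity_ratio: float
    warnings: List[str] = field(default_factory=list)
    fail_reasons: List[str] = field(default_factory=list)


def run_preflight_sketch(samples: Sequence[float], sample_rate: int,
                         channels: int = 1, frame_ms: int = 20) -> PreflightReport:
    """samples: mono floats in [-1.0, 1.0]."""
    duration_ms = len(samples) * 1000 // sample_rate

    # Peak level relative to full scale (0 dBFS == amplitude 1.0).
    peak = max((abs(s) for s in samples), default=0.0)
    peak_dbfs = 20 * math.log10(peak) if peak > 0 else float("-inf")

    # Crude energy-based activity ratio over fixed-size frames.
    frame_len = max(1, sample_rate * frame_ms // 1000)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = sum(
        1 for f in frames
        if math.sqrt(sum(s * s for s in f) / len(f)) > ACTIVE_FRAME_RMS
    )
    ratio = active / len(frames) if frames else 0.0

    report = PreflightReport(duration_ms, sample_rate, channels, peak_dbfs, ratio)
    if not MIN_DURATION_MS <= duration_ms <= MAX_DURATION_MS:
        report.fail_reasons.append("duration_out_of_range")   # hypothetical tag
    if ratio < MIN_VOICE_ACTIVITY:
        report.fail_reasons.append("mostly_silence")          # hypothetical tag
    if peak_dbfs > PEAK_WARN_DBFS:
        report.warnings.append("peak_near_clipping")          # hypothetical tag
    return report
```

Returning the measured values alongside warnings and fail_reasons mirrors the idea that an operator should be able to see which threshold tripped from the response alone.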

Voice lifecycle

A new voice goes through these states (column embedding_status):

State	Meaning
`pending`	Created, intake started, blocked on preflight or consent recording. Brief — exists only while the upload completes.
`processing`	Legacy state from the pre-zero-shot era; not used by the current pipeline. Stuck rows here indicate the voice was created before the engine migration and hasn't been re-promoted.
`ready`	The voice is usable. Backend A (self-hosted) reuses the reference clip at synth time for zero-shot cloning; Backend B (managed relay) uses its own enrollment if applicable.
`failed`	Intake failed. `embedding_status_reason` carries a short tag (`reference_too_short`, `reference_unavailable`, etc.).
`evicted`	Soft-deleted by erasure. The row is kept for audit; cached artefacts are cleared; the reference audio is gone from object storage.

For voices stuck in processing (legacy), POST /voices/{id}/repromote re-runs intake processing. Idempotent — if the voice is already ready, it's a no-op.
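The states and the idempotent re-promotion can be sketched as follows. The enum values match the embedding_status column; the repromote function, the run_intake callback, and the choice to reject states other than processing and ready are assumptions made for illustration.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class EmbeddingStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"   # legacy, pre-zero-shot
    READY = "ready"
    FAILED = "failed"
    EVICTED = "evicted"


def repromote(voice: dict,
              run_intake: Callable[[dict], Tuple[bool, Optional[str]]]) -> dict:
    """Re-run intake for a legacy voice; a no-op when the voice is already ready."""
    status = EmbeddingStatus(voice["embedding_status"])
    if status is EmbeddingStatus.READY:
        return voice                      # idempotent: nothing to do
    if status is not EmbeddingStatus.PROCESSING:
        # Assumed: only legacy 'processing' rows are eligible.
        raise ValueError(f"cannot repromote a voice in state {status.value!r}")

    ok, reason = run_intake(voice)        # hypothetical hook into intake processing
    updated = dict(voice)
    updated["embedding_status"] = (
        EmbeddingStatus.READY.value if ok else EmbeddingStatus.FAILED.value
    )
    updated["embedding_status_reason"] = None if ok else reason
    return updated
```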

The audit trail differs by intake mode:

Cloned user / tenant voices carry a voice_consent row. It pins the speaker's full name, the stated purpose, the verbatim consent text, and a hash of the consent audio against that text. This is the GDPR-grade record that the human in the clip authorised use.
Cloned global voices carry a voice_platform_license row instead. It pins the licensor, license type (perpetual, time_bound, usage_bound), and (for non-perpetual) the bounds. Licensed acquisition is the audit equivalent for platform-wide voices — no individual end-user consent exists because the license is between ScaiLabs and the voice talent.
Designed voices carry neither consent nor license rows. There's no human voice involved — no person to consent, no licensor to credit. The audit record is the voice row itself with voice_design_prompt set and consent_record_id/platform_license_id both NULL.

Consent and license rows are immutable once written. Designed voices can have their prompt edited via PATCH /voices/{id} because there's no audit dependency on it.
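The three cases can be expressed as a small consistency check. This sketch assumes a voice is "designed" exactly when voice_design_prompt is set and that the row exposes a scope field; the function names are hypothetical, while voice_design_prompt, consent_record_id, and platform_license_id are the columns named above.

```python
from typing import Optional


def expected_audit_record(voice: dict) -> Optional[str]:
    """Which audit table should back this voice, or None for designed voices."""
    if voice.get("voice_design_prompt"):
        return None                        # designed: no consent, no license
    if voice["scope"] == "global":
        return "voice_platform_license"    # cloned global voice
    return "voice_consent"                 # cloned user or tenant voice


def audit_is_consistent(voice: dict) -> bool:
    expected = expected_audit_record(voice)
    has_consent = voice.get("consent_record_id") is not None
    has_license = voice.get("platform_license_id") is not None
    if expected is None:
        return not has_consent and not has_license
    if expected == "voice_consent":
        return has_consent and not has_license
    return has_license and not has_consent
```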

Erasure (right to be forgotten)

DELETE /voices/{id} is the user-facing erasure path. It's not a simple row delete — it fans out:

Tombstone the voice row (deleted_at set, embedding_status='evicted').
Send EvictVoice to every ScaiInfer node currently warm on this voice.
Clear the voice-warm Redis registry.
Delete the reference audio + consent audio blobs from object storage.
Write an immutable erasure_audit row capturing the trigger user, source, warm-replicas-evicted count, blob-bytes-deleted count, and any partial-failure error summary.

The response carries the audit_id so GDPR tooling can cross-reference. The tombstoned row gets hard-deleted later by the background tombstone worker; until then, listing endpoints filter it out.
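The fan-out can be sketched with in-memory stand-ins for each dependency. Everything here is illustrative: the function signature, the dict-based registry and blob store, the callables standing in for EvictVoice, and the audit field names are assumptions, not the service's actual interfaces.

```python
import uuid
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set


def erase_voice(voice: dict,
                warm_nodes: Dict[str, Callable[[str], None]],
                warm_registry: Dict[str, Set[str]],
                blob_store: Dict[str, bytes],
                audit_log: List[dict],
                trigger_user: str,
                source: str = "user_request") -> str:
    errors: List[str] = []
    voice_id = voice["id"]

    # 1. Tombstone the voice row.
    voice["deleted_at"] = datetime.now(timezone.utc).isoformat()
    voice["embedding_status"] = "evicted"

    # 2. Ask every node that is warm on this voice to evict it.
    evicted = 0
    for node_name in sorted(warm_registry.get(voice_id, ())):
        try:
            warm_nodes[node_name](voice_id)      # stand-in for EvictVoice
            evicted += 1
        except Exception as exc:                 # record and keep going
            errors.append(f"evict {node_name}: {exc}")

    # 3. Clear the warm registry entry.
    warm_registry.pop(voice_id, None)

    # 4. Delete reference and consent blobs.
    bytes_deleted = 0
    for key in (voice.get("reference_key"), voice.get("consent_key")):
        if key is None:
            continue
        blob = blob_store.pop(key, None)
        if blob is None:
            errors.append(f"blob missing: {key}")
        else:
            bytes_deleted += len(blob)

    # 5. Write the audit row, including any partial-failure summary.
    audit = {
        "audit_id": str(uuid.uuid4()),
        "voice_id": voice_id,
        "trigger_user": trigger_user,
        "source": source,
        "warm_replicas_evicted": evicted,
        "blob_bytes_deleted": bytes_deleted,
        "error_summary": "; ".join(errors) or None,
    }
    audit_log.append(audit)
    return audit["audit_id"]
```

In this sketch a failed eviction does not abort the pipeline; it is recorded in the audit row so the erasure still completes and the partial failure stays visible.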

Global voices have a parallel path at DELETE /admin/voices/global/{id} (SuperAdmin-only) with a required trigger: license_revoked, license_expired, or platform_decision. Same erasure pipeline; the license row's status is updated to match the trigger.

Per-voice TTS defaults

A voice can carry default synthesis parameters that apply every time it is used, without the caller having to repeat them on each request:

default_instructions: style, emotion, or delivery guidance (e.g. "warm and conversational").
default_speed: speaking speed (0.5 to 2.0).
default_cfg_value: cloning-fidelity tradeoff (0.5 to 5.0; cloned voices only).
default_warmup_trim_ms: milliseconds to trim from the start of generated audio (cloned voices only).

Set them via PATCH /voices/{id}. They flow as defaults into POST /speak and into every ScaiVoice session that uses the voice. Per-request fields on /speak and per-session fields on ScaiVoice POST /sessions override voice defaults when set. The full precedence chain is: engine default < voice default < per-request or per-session override.
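The precedence chain amounts to a layered merge. In this sketch the engine default values and the per-request field names (speed, instructions, and so on, without the default_ prefix) are assumptions; only the default_* column names and the ordering come from the text.

```python
# Hypothetical engine defaults; the real values are not documented here.
ENGINE_DEFAULTS = {
    "instructions": None,
    "speed": 1.0,
    "cfg_value": 2.0,
    "warmup_trim_ms": 0,
}


def resolve_params(voice_defaults: dict, overrides: dict) -> dict:
    """engine default < voice default < per-request or per-session override."""
    resolved = dict(ENGINE_DEFAULTS)
    for name in ENGINE_DEFAULTS:
        voice_value = voice_defaults.get(f"default_{name}")
        if voice_value is not None:
            resolved[name] = voice_value
        override = overrides.get(name)
        if override is not None:          # only fields that are set win
            resolved[name] = override
    return resolved
```

For example, a voice with default_speed 0.9 synthesises at 0.9 unless a request passes its own speed, in which case the request value is used for that call only.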

This lets a voice owner bake in tuning that works well for a particular voice (e.g. a slightly slower speed for a voice that sounds rushed at 1.0x, or a warmup trim that removes a click artefact) without forcing every consumer of the voice to know those details.

Per-voice pronunciation overrides

A voice can carry a list of pronunciation rules (pronunciation_overrides on the voice row) that get applied before language-specific text expansion whenever normalize_text is enabled. Use them for brand names, acronyms, or jargon the engine consistently mispronounces — e.g. {"pattern": "Kubernetes", "replacement": "koo-ber-net-eez"}. Rules are whole-word matches with optional case_sensitive (defaults to true).

Tenant admins can also ship a tenant-wide rule sheet via PUT /admin/policy → pronunciation_overrides. The pipeline applies tenant rules first, then per-voice rules on top — so a voice can refine or override a tenant default for its own use case. See Text normalisation and pronunciation overrides in the API reference for the wire format and supported expander languages.
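A sketch of how whole-word rules with layered tenant and voice sheets might be applied. The rule shape (pattern, replacement, case_sensitive) follows the example above; the function names are hypothetical, and the merge strategy, where a voice rule with the same pattern replaces the tenant rule, is one plausible reading of "on top" rather than documented behaviour.

```python
import re
from typing import Dict, List


def merge_rules(tenant_rules: List[dict], voice_rules: List[dict]) -> List[dict]:
    # Tenant rules first; a voice rule with the same pattern takes its place.
    merged: Dict[str, dict] = {}
    for rule in list(tenant_rules) + list(voice_rules):
        merged[rule["pattern"]] = rule
    return list(merged.values())


def apply_rules(text: str, rules: List[dict]) -> str:
    for rule in rules:
        flags = 0 if rule.get("case_sensitive", True) else re.IGNORECASE
        # Whole-word match: no word character directly before or after.
        regex = re.compile(r"(?<!\w)" + re.escape(rule["pattern"]) + r"(?!\w)", flags)
        replacement = rule["replacement"]
        # A callable avoids backslash escapes in the replacement being interpreted.
        text = regex.sub(lambda match: replacement, text)
    return text


def apply_overrides(text: str, tenant_rules: List[dict], voice_rules: List[dict]) -> str:
    return apply_rules(text, merge_rules(tenant_rules, voice_rules))
```

The lookaround-based word boundary keeps a rule for "SQL" from rewriting the tail of "MySQL".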

Backend portability

Voices are not tied to a single backend. The speaker identity is the reference audio itself, stored in object storage with the voice row. A voice created on Backend A (self-hosted, supports zero-shot cloning) keeps working if the deployment falls back to Backend B (managed relay, preset speakers only), although the relay can't reproduce a cloned identity: it serves the nearest preset match to the reference clip, if one exists.

POST /voices/{id}/repromote re-runs the intake pipeline for voices that were created before the current engine and are still in processing. Idempotent on already-ready voices.