---
summary: How voices are scoped (global, tenant, user), how cloning works, what consent
  vs license means, and how erasure flows through the system.
title: Voice library and consent
path: concepts/voice-library
status: published
---

A voice in ScaiSpeak is a row in the library with a reference clip, a consent or license trail, and an embedding state. When you call `/speak`, you pick a `voice_id` from a list filtered by what's visible to you.

## Three scopes, one list

Every list call (`GET /voices`) merges three pools:

- **Global** voices are platform-managed. SuperAdmins import them through `POST /admin/voices/global` against a commercial license — the licensor name, license type, and (for time- or usage-bound licenses) the expiry or character cap live on the voice row. No consent recording. Every tenant sees every global voice.
- **Tenant** voices are shared inside one tenant. They're created by a user, then promoted via `POST /voices/{id}/share` (which needs `scaispeak:voice.share`). All users in the tenant see them.
- **User** voices are private to the user who cloned them. Other users in the same tenant don't see them. This is the default scope for any cloned voice.

The visibility ACL is enforced server-side in `VoiceService.list_visible` and `get_visible`. You can't bypass it by guessing a `voice_id` — cross-scope reads return 404, not 403, so the existence of voices outside your scope isn't disclosed.
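The scope rules above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the `Voice` dataclass, its field names, and `get_status` are hypothetical, and the real `VoiceService.list_visible` and `get_visible` are not shown in this page. The point it demonstrates is that a missing voice and an out-of-scope voice produce the same 404.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Voice:
    # Hypothetical shape; the real voice row has more columns.
    voice_id: str
    scope: str  # "global", "tenant", or "user"
    tenant_id: Optional[str] = None
    owner_id: Optional[str] = None


def is_visible(voice: Voice, tenant_id: str, user_id: str) -> bool:
    """Apply the three-scope rule to one voice for one caller."""
    if voice.scope == "global":
        return True
    if voice.scope == "tenant":
        return voice.tenant_id == tenant_id
    if voice.scope == "user":
        return voice.tenant_id == tenant_id and voice.owner_id == user_id
    return False


def list_visible(voices: List[Voice], tenant_id: str, user_id: str) -> List[Voice]:
    """Merge the three pools into the single list a caller sees."""
    return [v for v in voices if is_visible(v, tenant_id, user_id)]


def get_status(voices: List[Voice], voice_id: str, tenant_id: str, user_id: str) -> int:
    """Return the HTTP status a read would produce in this sketch."""
    for v in voices:
        if v.voice_id == voice_id and is_visible(v, tenant_id, user_id):
            return 200
    # Same answer for "does not exist" and "exists but out of scope".
    return 404
```

Collapsing both cases into 404 means a caller cannot probe for the existence of another user's voice by guessing IDs.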

## Two ways to create a voice

ScaiSpeak supports two distinct intake modes:

1. **Clone from reference**: record a real speaker, capture their consent, and the engine reproduces that person's voice at synthesis time. Requires a 5-30 s reference clip and a recorded consent statement.
2. **Design from text**: describe the voice in natural language and the engine generates a synthetic speaker that matches. No reference clip, no consent recording, no real-person likeness. See the *Design a voice* tutorial.

Both paths land voices in the same library and use the same `/speak` API afterwards. Callers don't know (and don't need to know) which kind they're synthesising against.

### Cloned voices

Voice cloning takes two audio inputs:

- A **reference** clip: 5-30 seconds of the speaker's voice in a representative speaking style. This becomes the voice's identity at synthesis time.
- A **consent** clip: the same speaker reading a verbatim scripted statement (the `consent_text` field) that records what they're agreeing to. It authenticates that the reference voice belongs to the person consenting.

Both arrive through `POST /voices` as either multipart file uploads or ScaiDrive references (one of `{file_id, mcp_uri, share_url}`). The two files may use different sources (for example, reference inline and consent from ScaiDrive), but each file must come from exactly one source. Supplying both an inline upload and a ScaiDrive reference for the same audio fails fast with `SCAISPEAK_AMBIGUOUS_SOURCE`.
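A minimal sketch of the per-file source rule, assuming a server-side check that runs once for the reference and once for the consent clip. `resolve_source` and `IntakeError` are hypothetical names, and the two `HYPOTHETICAL_*` error codes are placeholders; only `SCAISPEAK_AMBIGUOUS_SOURCE` comes from the text above.

```python
from typing import Optional

SCAIDRIVE_KEYS = ("file_id", "mcp_uri", "share_url")


class IntakeError(ValueError):
    """Hypothetical error type carrying a machine-readable code."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code


def resolve_source(name: str, inline: Optional[bytes], drive_ref: Optional[dict]) -> str:
    """Decide where one audio file (reference or consent) comes from."""
    if inline is not None and drive_ref is not None:
        # Inline upload plus ScaiDrive reference for the same file.
        raise IntakeError("SCAISPEAK_AMBIGUOUS_SOURCE", f"{name}: two sources given")
    if inline is not None:
        return "inline"
    if drive_ref is not None:
        present = [k for k in SCAIDRIVE_KEYS if drive_ref.get(k)]
        if len(present) != 1:
            # Placeholder code: the real code for this case is not documented here.
            raise IntakeError("HYPOTHETICAL_INVALID_DRIVE_REF", f"{name}: need exactly one key")
        return f"scaidrive:{present[0]}"
    # Placeholder code: the real code for a missing file is not documented here.
    raise IntakeError("HYPOTHETICAL_MISSING_SOURCE", f"{name}: no source given")
```

Because the check is per file, a request can legitimately resolve the reference to `inline` and the consent clip to `scaidrive:file_id`.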

There's also a WebSocket alternative at `WS /voices/record` for live-recording in the browser. Same two-phase flow (reference, then consent), same validation, no file handling.

Cloning is **zero-shot**: the reference clip is consumed directly at synthesis time, with no separate training step. Voices become usable as soon as intake clears preflight and the consent record is committed.

### Designed voices

Designed voices skip every audio step. The caller sends `voice_design_prompt` (a natural-language description of the speaker) to the same `POST /voices` endpoint and the voice lands at `embedding_status: ready` immediately. No reference, no consent, no preflight — the engine generates a synthetic speaker from the prompt alone.

This is mutually exclusive with the cloned-voice intake — sending `voice_design_prompt` *and* a reference upload in the same request is rejected with `SCAISPEAK_AMBIGUOUS_INTAKE_MODE`. Pick a mode per voice.
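The mode selection can be sketched as a single decision function. This is an assumption-laden illustration: `pick_intake_mode`, `IntakeModeError`, and the `HYPOTHETICAL_NO_INTAKE_INPUT` code are invented for the example, while `SCAISPEAK_AMBIGUOUS_INTAKE_MODE` and the initial `embedding_status` values follow the text.

```python
from typing import Optional, Tuple


class IntakeModeError(ValueError):
    """Hypothetical error type carrying a machine-readable code."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def pick_intake_mode(voice_design_prompt: Optional[str], has_reference: bool) -> Tuple[str, str]:
    """Return (mode, initial embedding_status) for a POST /voices request."""
    if voice_design_prompt and has_reference:
        raise IntakeModeError("SCAISPEAK_AMBIGUOUS_INTAKE_MODE")
    if voice_design_prompt:
        # Designed voices skip preflight and consent, so they are ready at once.
        return ("design", "ready")
    if has_reference:
        # Cloned voices wait for preflight and the consent record.
        return ("clone", "pending")
    # Placeholder code: the real behaviour for an empty request is not documented here.
    raise IntakeModeError("HYPOTHETICAL_NO_INTAKE_INPUT")
```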

### Switching between modes

You can convert an existing cloned voice into a designed one: send `PATCH /voices/{id}` with `clear_reference: true` and a fresh `voice_design_prompt`. The reference clip and consent recording are tombstoned; the voice ID stays stable. This is useful when a tenant decides they no longer want a real speaker's voice on file. See the *Design a voice* tutorial §6.

You can also **replace** the reference clip on a cloned voice (a better recording of the same speaker, or a re-recording with refreshed consent) via `POST /voices/{id}/reference`. A new consent capture is required because the recording itself is changing.

## Preflight

Before any audio is stored, the reference clip runs through a cheap preflight (`run_preflight`):

- Duration in milliseconds (must be in spec range).
- Sample rate and channel count.
- Peak dBFS (clipping check).
- Estimated SNR (signal-to-noise sanity check).
- Voice-activity ratio (rejects clips that are mostly silence).

When the preflight fails, the response carries the structured `preflight` block so the operator can see which threshold tripped without having to re-upload. Warn-not-block findings show up in `warnings`; blocking findings show up in `fail_reasons` and the request returns 400 with code `SCAISPEAK_VOICE_PREFLIGHT_FAILED`.
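To make the checks concrete, here is a rough sketch of three of them (duration, peak dBFS, voice-activity ratio) over mono float samples in the range -1.0 to 1.0. It is not the real `run_preflight`: every threshold except the 5-30 s duration range, the frame-energy activity heuristic, and the reason tags are assumptions, and the SNR estimate is left out.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class PreflightResult:
    duration_ms: int
    peak_dbfs: float
    voice_activity_ratio: float
    warnings: List[str] = field(default_factory=list)
    fail_reasons: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.fail_reasons


def run_preflight_sketch(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 20,               # assumed analysis frame length
    silence_rms: float = 0.01,        # assumed RMS floor for an "active" frame
    min_activity: float = 0.5,        # assumed minimum active-frame ratio
    clip_warn_dbfs: float = -1.0,     # assumed warn-only clipping threshold
) -> PreflightResult:
    duration_ms = int(len(samples) * 1000 / sample_rate)

    # Peak level relative to full scale (1.0); silence maps to -inf.
    peak = max((abs(s) for s in samples), default=0.0)
    peak_dbfs = 20 * math.log10(peak) if peak > 0 else float("-inf")

    # Crude voice activity: share of frames whose RMS exceeds the floor.
    frame_len = max(1, sample_rate * frame_ms // 1000)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = sum(
        1 for f in frames
        if math.sqrt(sum(x * x for x in f) / len(f)) > silence_rms
    )
    ratio = active / len(frames) if frames else 0.0

    result = PreflightResult(duration_ms, peak_dbfs, ratio)
    if not 5_000 <= duration_ms <= 30_000:
        result.fail_reasons.append("duration_out_of_range")   # hypothetical tag
    if ratio < min_activity:
        result.fail_reasons.append("mostly_silence")          # hypothetical tag
    if peak_dbfs > clip_warn_dbfs:
        result.warnings.append("peak_near_clipping")          # hypothetical tag
    return result
```

Keeping `warnings` and `fail_reasons` as separate lists mirrors the split between warn-not-block and blocking findings in the response.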

## Voice lifecycle

A new voice goes through these states (column `embedding_status`):

| State | Meaning |
|---|---|
| `pending` | Created and intake started, but still blocked on preflight or consent recording. Short-lived: exists only while the upload completes. |
| `processing` | Legacy state from the pre-zero-shot era; not used by the current pipeline. Stuck rows here indicate the voice was created before the engine migration and hasn't been re-promoted. |
| `ready` | The voice is usable. Backend A (self-hosted) reuses the reference clip at synth time for zero-shot cloning; Backend B (managed relay) uses its own enrollment if applicable. |
| `failed` | Intake failed. `embedding_status_reason` carries a short tag (`reference_too_short`, `reference_unavailable`, etc.). |
| `evicted` | Soft-deleted by erasure. The row is kept for audit; cached artefacts are cleared; the reference audio is gone from object storage. |

For voices stuck in `processing` (legacy), `POST /voices/{id}/repromote` re-runs intake processing. Idempotent — if the voice is already ready, it's a no-op.
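A client that reads `embedding_status` might branch on it roughly as follows. The state values come from the table above; the `next_action` helper and its action labels are hypothetical and only illustrate one way to interpret each state.

```python
from enum import Enum


class EmbeddingStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"
    EVICTED = "evicted"


def next_action(status: str) -> str:
    """Map a voice's embedding_status to what a caller would do next."""
    state = EmbeddingStatus(status)  # raises ValueError on an unknown state
    if state is EmbeddingStatus.READY:
        return "synthesize"
    if state is EmbeddingStatus.PENDING:
        return "wait"                # upload or consent still in flight
    if state is EmbeddingStatus.PROCESSING:
        return "repromote"           # legacy row: POST /voices/{id}/repromote
    if state is EmbeddingStatus.FAILED:
        return "inspect_reason"      # read embedding_status_reason
    return "gone"                    # evicted by erasure
```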

## Consent vs license vs design

The audit trail differs by intake mode:

- **Cloned user / tenant voices** carry a `voice_consent` row. It pins the speaker's full name, the stated purpose, the verbatim consent text, and a hash of the consent audio against that text. This is the GDPR-grade record that the human in the clip authorised use.
- **Cloned global voices** carry a `voice_platform_license` row instead. It pins the licensor, license type (`perpetual`, `time_bound`, `usage_bound`), and (for non-perpetual) the bounds. Licensed acquisition is the audit equivalent for platform-wide voices — no individual end-user consent exists because the license is between ScaiLabs and the voice talent.
- **Designed voices** carry **neither** consent nor license rows. There's no human voice involved — no person to consent, no licensor to credit. The audit record is the voice row itself with `voice_design_prompt` set and `consent_record_id`/`platform_license_id` both NULL.

Consent and license rows are immutable once written. Designed voices can have their prompt edited via `PATCH /voices/{id}` because there's no audit dependency on it.
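The three audit shapes can be told apart from the voice row alone. The sketch below assumes the row exposes `consent_record_id`, `platform_license_id`, and `voice_design_prompt` as described above; the `audit_kind` helper and its error handling are hypothetical.

```python
from typing import Optional


def audit_kind(
    consent_record_id: Optional[str],
    platform_license_id: Optional[str],
    voice_design_prompt: Optional[str],
) -> str:
    """Classify which audit trail backs a voice row."""
    if consent_record_id is not None and platform_license_id is not None:
        # Assumed invariant: a voice is backed by consent or by a license, not both.
        raise ValueError("voice row has both a consent record and a license")
    if consent_record_id is not None:
        return "consent"   # cloned user or tenant voice
    if platform_license_id is not None:
        return "license"   # cloned global voice
    if voice_design_prompt:
        return "design"    # designed voice: both IDs are NULL
    raise ValueError("voice row has no audit trail")
```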

## Erasure (right to be forgotten)

`DELETE /voices/{id}` is the user-facing erasure path. It's not a simple row delete — it fans out:

1. Tombstone the voice row (`deleted_at` set, `embedding_status='evicted'`).
2. Send `EvictVoice` to every ScaiInfer node currently warm on this voice.
3. Clear the voice-warm Redis registry.
4. Delete the reference audio + consent audio blobs from object storage.
5. Write an immutable `erasure_audit` row capturing the trigger user, source, warm-replicas-evicted count, blob-bytes-deleted count, and any partial-failure error summary.

The response carries the `audit_id` so GDPR tooling can cross-reference. The tombstoned row gets hard-deleted later by the background tombstone worker; until then, listing endpoints filter it out.
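The fan-out can be sketched end to end with in-memory stand-ins. Everything here is illustrative: the dict-based voice row, the `warm_nodes` registry, the `blobs` store, the `evict` callback, and the `reference_key`/`consent_key` field names are assumptions, not the real storage or RPC interfaces. What the sketch shows is the ordering of the five steps and that a failed eviction is recorded in the audit rather than aborting the erasure.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)  # frozen to mirror the immutable audit row
class ErasureAudit:
    audit_id: str
    voice_id: str
    trigger_user: str
    source: str
    warm_replicas_evicted: int
    blob_bytes_deleted: int
    errors: tuple


def erase_voice(
    voice: dict,
    warm_nodes: Dict[str, Set[str]],
    blobs: Dict[str, bytes],
    trigger_user: str,
    source: str,
    evict: Callable[[str, str], None],
) -> ErasureAudit:
    voice_id = voice["id"]

    # 1. Tombstone the row.
    voice["deleted_at"] = datetime.now(timezone.utc)
    voice["embedding_status"] = "evicted"

    # 2. Ask every warm node to evict; keep going on failure.
    errors: List[str] = []
    evicted = 0
    for node in sorted(warm_nodes.get(voice_id, ())):
        try:
            evict(node, voice_id)
            evicted += 1
        except Exception as exc:
            errors.append(f"{node}: {exc}")

    # 3. Clear the warm registry entry.
    warm_nodes.pop(voice_id, None)

    # 4. Delete reference and consent blobs, counting bytes removed.
    deleted = 0
    for key in (voice.get("reference_key"), voice.get("consent_key")):
        if key in blobs:
            deleted += len(blobs.pop(key))

    # 5. Produce the audit record returned to the caller.
    return ErasureAudit(
        audit_id=str(uuid.uuid4()),
        voice_id=voice_id,
        trigger_user=trigger_user,
        source=source,
        warm_replicas_evicted=evicted,
        blob_bytes_deleted=deleted,
        errors=tuple(errors),
    )
```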

Global voices have a parallel path at `DELETE /admin/voices/global/{id}` (SuperAdmin-only) with a required `trigger`: `license_revoked`, `license_expired`, or `platform_decision`. Same erasure pipeline; the license row's `status` is updated to match the trigger.

## Per-voice pronunciation overrides

A voice can carry a list of pronunciation rules (`pronunciation_overrides` on the voice row) that are applied before language-specific text expansion whenever `normalize_text` is enabled. Use them for brand names, acronyms, or jargon the engine consistently mispronounces, for example `{"pattern": "Kubernetes", "replacement": "koo-ber-net-eez"}`. Rules match whole words; the optional `case_sensitive` flag defaults to true.

Tenant admins can also ship a tenant-wide rule sheet via `PUT /admin/policy → pronunciation_overrides`. The pipeline applies tenant rules first, then per-voice rules on top — so a voice can refine or override a tenant default for its own use case. See *Text normalisation and pronunciation overrides* in the [API reference](../reference/api) for the wire format and supported expander languages.
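A minimal sketch of how such rules might be applied, using Python's `re` module. It rests on one reading of "on top": a per-voice rule with the same `pattern` as a tenant rule replaces it before anything is applied. That merge strategy, the helper names, and the use of word-character lookarounds for whole-word matching are assumptions; the wire format in the API reference is authoritative.

```python
import re
from typing import Dict, List


def merge_rules(tenant_rules: List[dict], voice_rules: List[dict]) -> List[dict]:
    """Tenant rules first; a voice rule with the same pattern wins."""
    merged: Dict[str, dict] = {}
    for rule in list(tenant_rules) + list(voice_rules):
        merged[rule["pattern"]] = rule
    return list(merged.values())


def apply_overrides(text: str, rules: List[dict]) -> str:
    """Replace whole-word matches of each rule's pattern."""
    for rule in rules:
        flags = 0 if rule.get("case_sensitive", True) else re.IGNORECASE
        # Lookarounds enforce whole-word matching without consuming neighbours.
        pattern = r"(?<!\w)" + re.escape(rule["pattern"]) + r"(?!\w)"
        replacement = rule["replacement"]
        # A callable replacement keeps backslashes in the text literal.
        text = re.sub(pattern, lambda m, r=replacement: r, text, flags=flags)
    return text


def normalize(text: str, tenant_rules: List[dict], voice_rules: List[dict]) -> str:
    return apply_overrides(text, merge_rules(tenant_rules, voice_rules))
```

With a tenant rule mapping `SQL` to `sequel` and a voice rule mapping `SQL` to `ess-cue-ell`, the voice rule takes effect for that voice, and `NoSQL` is left alone because the match is not a whole word.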

## Backend portability

Voices are not tied to a single backend. The speaker identity is the reference audio itself, stored in object storage alongside the voice row. A voice created on Backend A (self-hosted, supports zero-shot cloning) stays addressable if the deployment falls back to Backend B (managed relay, preset speakers only). The relay can't reproduce a cloned identity, though: it serves the nearest preset match to the reference clip, if one exists.

`POST /voices/{id}/repromote` re-runs the intake pipeline for voices that were created before the current engine and are still in `processing`. Idempotent on already-ready voices.