Design a voice (no reference clip)
You can create a voice from text alone: describe the speaker in natural language and the engine generates a synthetic voice that matches. No reference clip, no consent recording, no risk of mimicking a real person. Useful when you want a fresh voice identity (brand mascots, IVR personalities, audiobook narrators that don't sound like anyone in particular).
This is the "Voice Design" mode. It contrasts with the cloned-voice path (see Clone a voice and synthesise) — same /v1/speak API surface afterwards, just a different intake.
When to use Voice Design
- You want a voice that sounds specific in character (calm, warm, energetic, broadcaster-style) without picking one of the platform's preset voices.
- You don't have rights to a particular person's voice clip but want a custom identity.
- You're building a brand voice that should be platform-owned, not tied to a real speaker.
- You want to iterate cheaply: each new design takes seconds, no recording session needed.
If you have a real speaker whose voice you want to reproduce, use the cloned-voice path instead. Voice Design generates a new voice from the description; it doesn't try to match a person you've heard before.
Prerequisites
- An API key with scaispeak:voice.write (tenant admins have it; otherwise grant explicitly).
- That's it. No audio recording, no consent capture, no legal review.
1. Write the description
The description shapes the voice — pace, tone, gender, accent, perceived age, emotional default. The engine reads it directly, so be specific about the qualities that matter to you.
Working examples:
- "warm professional female narrator, calm pace, mid-Atlantic English accent"
- "energetic male radio host, fast cadence, slightly raspy"
- "older gentleman, deliberate pace, slight Scottish lilt, reading-grandfather warmth"
- "neutral broadcaster voice, no strong regional accent, clear and crisp"
Don't describe the content. The prompt describes the speaker, not the script; the script comes later, at synthesis time.
Minimum length is a handful of words. Single words like "calm" don't give the engine enough to work with — aim for at least one phrase describing pitch / pace / tone.
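The guidance above can be checked client-side before any request is made. This is a minimal sketch, assuming a hypothetical helper named build_design_prompt and an assumed threshold of three words; the text only says "a handful of words", so the engine's real minimum may differ.

```python
def build_design_prompt(*traits: str, min_words: int = 3) -> str:
    """Join speaker traits into one comma-separated description.

    min_words is an assumed client-side threshold, not a documented engine limit.
    """
    cleaned = [t.strip() for t in traits if t.strip()]
    prompt = ", ".join(cleaned)
    if len(prompt.split()) < min_words:
        raise ValueError(
            f"description too short ({prompt!r}): describe pitch, pace, or tone"
        )
    return prompt


# Describe the speaker, not the script.
prompt = build_design_prompt(
    "warm professional female narrator",
    "calm pace",
    "mid-Atlantic English accent",
)
print(prompt)
```

Calling build_design_prompt("calm") raises ValueError, which catches the single-word case before it reaches the engine.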
2. Create the voice
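A minimal sketch of the create call in Python, using only the standard library. Only the voice_design_prompt field comes from this page. The host, the POST /voices path (inferred from the PATCH /voices/{voice_id} route used later), the bearer-token header, the JSON encoding, and the name field are all assumptions; check them against the API reference before relying on them.

```python
import json
import urllib.request


def build_create_voice_request(
    base_url: str, api_key: str, name: str, design_prompt: str
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a designed voice."""
    body = {
        "name": name,  # hypothetical field: a human-readable label
        "voice_design_prompt": design_prompt,  # field name from this page
        # Deliberately no "reference" part and no consent_* fields.
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/voices",  # assumed create endpoint
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",  # assumed encoding
        },
        method="POST",
    )


request = build_create_voice_request(
    "https://api.example.invalid",  # placeholder host
    "YOUR_API_KEY",
    "brand-narrator",
    "warm professional female narrator, calm pace, mid-Atlantic English accent",
)
# To send it: urllib.request.urlopen(request), then read the new voice's ID
# from the response (the response shape is not shown on this page).
```

The API key needs the scaispeak:voice.write permission listed under Prerequisites.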
Notable differences from the cloned-voice path:
- No reference file part. Don't include one: sending both voice_design_prompt and reference is rejected with SCAISPEAK_AMBIGUOUS_INTAKE_MODE.
- No consent_* fields. Designed voices don't represent a real person, so there's no consent to capture.
- No tokenisation wait. The voice lands at embedding_status: ready in the same request, so it is usable on the very next /v1/speak call.
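The either/or rule and the readiness check above can be mirrored on the client. This is a sketch under assumptions: the helper names intake_mode and is_usable are hypothetical, the voice record is assumed to be a JSON object with an embedding_status key, and the "neither field present" check is a client-side convenience rather than documented server behaviour.

```python
def intake_mode(fields: dict) -> str:
    """Return "design" or "clone", mirroring the server's either/or rule."""
    has_prompt = bool(fields.get("voice_design_prompt"))
    has_reference = fields.get("reference") is not None
    if has_prompt and has_reference:
        # The server rejects this combination with SCAISPEAK_AMBIGUOUS_INTAKE_MODE.
        raise ValueError("send voice_design_prompt or reference, not both")
    if not has_prompt and not has_reference:
        # Client-side assumption: a voice needs some identity source.
        raise ValueError("send one of voice_design_prompt or reference")
    return "design" if has_prompt else "clone"


def is_usable(voice: dict) -> bool:
    """True when the voice record reports embedding_status "ready"."""
    return voice.get("embedding_status") == "ready"


mode = intake_mode({"voice_design_prompt": "energetic male radio host, fast cadence"})
```

Failing early on the client avoids a round trip that the server would reject anyway.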
3. Synthesise
Use it the same way as any other voice:
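A minimal sketch of a synthesis call in Python. The /v1/speak path comes from this page; the host, the bearer-token header, the voice_id and text field names, and the idea that the response body is raw WAV audio are assumptions.

```python
import json
import urllib.request


def build_speak_request(
    base_url: str, api_key: str, voice_id: str, text: str
) -> urllib.request.Request:
    """Build (but do not send) a /v1/speak request for one utterance."""
    body = {"voice_id": voice_id, "text": text}  # assumed field names
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/speak",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",  # assumed encoding
        },
        method="POST",
    )


request = build_speak_request(
    "https://api.example.invalid",  # placeholder host
    "YOUR_API_KEY",
    "voice_123",  # placeholder: the ID returned when the voice was created
    "Hello, and thanks for calling.",
)
# To send it and save the audio (assuming the body is WAV bytes):
# with urllib.request.urlopen(request) as response, open("greeting.wav", "wb") as out:
#     out.write(response.read())
```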
Play greeting.wav. The voice will match the description — same character on every subsequent call, so a brand voice stays consistent across utterances.
4. Per-call delivery (optional)
The same per-call control fields work on designed voices — instructions, cfg_value, warmup_trim_ms from the clone tutorial all apply:
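A sketch of a request body that adds the per-call controls. The instructions, cfg_value, and warmup_trim_ms names come from this page; the voice_id and text field names and every value shown are illustrative assumptions, since valid ranges are not given here.

```python
from typing import Any, Optional


def build_speak_body(
    voice_id: str,
    text: str,
    instructions: Optional[str] = None,
    cfg_value: Optional[float] = None,
    warmup_trim_ms: Optional[int] = None,
) -> dict[str, Any]:
    """Build a /v1/speak body, including only the controls that were set."""
    body: dict[str, Any] = {"voice_id": voice_id, "text": text}  # assumed names
    optional = {
        "instructions": instructions,
        "cfg_value": cfg_value,
        "warmup_trim_ms": warmup_trim_ms,
    }
    # Leave unset controls out so the service applies its own defaults.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


body = build_speak_body(
    "voice_123",  # placeholder voice ID
    "Your order has shipped.",
    instructions="slow down slightly and sound reassuring",
    warmup_trim_ms=120,  # illustrative value only
)
```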
The voice keeps its identity from the design prompt; the instructions field changes the per-call delivery (pace, emotion, emphasis).
5. Iterate on the design
If the voice doesn't match what you wanted, edit the design prompt via PATCH /voices/{voice_id}:
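A minimal sketch of the update in Python. The PATCH /voices/{voice_id} route and the voice_design_prompt field come from this page; the host, the bearer-token header, and the JSON encoding are assumptions.

```python
import json
import urllib.parse
import urllib.request


def build_update_prompt_request(
    base_url: str, api_key: str, voice_id: str, new_prompt: str
) -> urllib.request.Request:
    """Build (but do not send) a PATCH that replaces the design prompt."""
    voice_path = urllib.parse.quote(voice_id, safe="")  # escape the ID for the URL
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/voices/{voice_path}",
        data=json.dumps({"voice_design_prompt": new_prompt}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",  # assumed encoding
        },
        method="PATCH",
    )


request = build_update_prompt_request(
    "https://api.example.invalid",  # placeholder host
    "YOUR_API_KEY",
    "voice_123",  # placeholder voice ID
    "older gentleman, deliberate pace, slight Scottish lilt, reading-grandfather warmth",
)
# To send it: urllib.request.urlopen(request)
```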
Every subsequent /v1/speak against the voice uses the new description. The voice ID stays stable, so any callers that store the ID don't need to change.
6. Switch a cloned voice to design-only
You can convert an existing cloned voice into a designed voice. The conversion drops the reference clip and consent record and replaces the identity with a text prompt. Useful when a tenant decides they no longer want a real person's voice on file:
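A sketch of the patch body for the conversion, with a client-side guard for the missing-prompt case. The clear_reference and voice_design_prompt names come from this page; the helper name build_convert_body is hypothetical, and the body is assumed to be sent as JSON to PATCH /voices/{voice_id}.

```python
import json


def build_convert_body(design_prompt: str) -> dict:
    """Build the PATCH body that turns a cloned voice into a designed one."""
    if not design_prompt.strip():
        # The server would reject this with SCAISPEAK_DESIGN_PROMPT_REQUIRED_ON_CLEAR.
        raise ValueError("clear_reference requires a voice_design_prompt")
    return {
        "clear_reference": True,
        "voice_design_prompt": design_prompt,
    }


body = build_convert_body(
    "neutral broadcaster voice, no strong regional accent, clear and crisp"
)
encoded = json.dumps(body).encode("utf-8")  # bytes to send as the PATCH payload
```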
Setting clear_reference: true tombstones the reference WAV and consent recording in storage. It requires voice_design_prompt in the same patch: clearing the reference without a fallback identity is rejected with SCAISPEAK_DESIGN_PROMPT_REQUIRED_ON_CLEAR. The voice ID does not change, so references to it in your own systems stay valid.
When you're done with it
Follow the same erasure flow as for cloned voices.
For designed voices there's no consent recording to erase (none was captured), but the voice row and any stored synthesis outputs are tombstoned the same way. The audit row carries the proof.
Done
You have a brand voice with no real-person likeness, no consent overhead, instantly usable, editable in place. The same /v1/speak API serves both designed and cloned voices — your callers don't need to know which kind they're talking to.