Design a voice (no reference clip)

You can create a voice from text alone: describe the speaker in natural language and the engine generates a synthetic voice that matches. No reference clip, no consent recording, no risk of mimicking a real person. Useful when you want a fresh voice identity (brand mascots, IVR personalities, audiobook narrators that don't sound like anyone in particular).

This is the "Voice Design" mode. It contrasts with the cloned-voice path (see Clone a voice and synthesise): the /v1/speak API surface is the same afterwards; only the intake differs.

When to use Voice Design

You want a voice that sounds specific in character (calm, warm, energetic, broadcaster-style) without picking one of the platform's preset voices.
You don't have rights to a particular person's voice clip but want a custom identity.
You're building a brand voice that should be platform-owned, not tied to a real speaker.
You want to iterate cheaply: each new design takes seconds, no recording session needed.

If you have a real speaker whose voice you want to reproduce, use the cloned-voice path instead. Voice Design generates a new voice from the description; it doesn't try to match a person you've heard before.

Prerequisites

An API key with scaispeak:voice.write (tenant admins have it; otherwise grant explicitly).
That's it. No audio recording, no consent capture, no legal review.

1. Write the description

The description shapes the voice — pace, tone, gender, accent, perceived age, emotional default. The engine reads it directly, so be specific about the qualities that matter to you.

Working examples:

"warm professional female narrator, calm pace, mid-Atlantic English accent"
"energetic male radio host, fast cadence, slightly raspy"
"older gentleman, deliberate pace, slight Scottish lilt, reading-grandfather warmth"
"neutral broadcaster voice, no strong regional accent, clear and crisp"

Don't describe the content. The prompt describes the speaker, not the script; the script comes later, at synthesis time.

Minimum length is a handful of words. Single words like "calm" don't give the engine enough to work with — aim for at least one phrase describing pitch / pace / tone.
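A client-side pre-check can catch prompts that are too thin before they reach the API. The sketch below is a hypothetical helper, not part of the platform: the name check_design_prompt and the three-word threshold are assumptions chosen for illustration, since the engine's actual minimum isn't stated here.

```python
def check_design_prompt(prompt: str, min_words: int = 3) -> list[str]:
    """Return warnings for a voice design prompt (an empty list means it looks fine).

    Hypothetical client-side check. min_words is an assumed threshold,
    not a documented platform limit.
    """
    warnings = []
    # Treat commas as separators so "calm, warm" counts as two words.
    words = prompt.replace(",", " ").split()
    if len(words) < min_words:
        warnings.append(
            f"prompt has {len(words)} word(s); aim for at least one full phrase"
        )
    return warnings


print(check_design_prompt("calm"))
print(check_design_prompt("energetic male radio host, fast cadence, slightly raspy"))
```

The first call returns one warning and the second returns none. A word count is only a rough proxy: it can't tell whether the phrase actually covers pitch, pace, or tone.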

2. Create the voice

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/voices" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -F "display_name=Acme Brand Voice" \
  -F "language_primary=en" \
  -F "language_supported_json=[\"en\"]" \
  -F "gender_hint=female" \
  -F "voice_design_prompt=warm professional female narrator, calm pace, mid-Atlantic English accent"

python
import httpx, os

r = httpx.post(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaispeak/voices",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    data={
        "display_name": "Acme Brand Voice",
        "language_primary": "en",
        "language_supported_json": '["en"]',
        "gender_hint": "female",
        "voice_design_prompt": (
            "warm professional female narrator, "
            "calm pace, mid-Atlantic English accent"
        ),
    },
)
r.raise_for_status()
voice = r.json()["data"]
print(voice["voice_id"], voice["embedding_status"])
# → embedding_status: ready (immediately)

javascript
const form = new FormData();
form.append("display_name", "Acme Brand Voice");
form.append("language_primary", "en");
form.append("language_supported_json", '["en"]');
form.append("gender_hint", "female");
form.append("voice_design_prompt",
  "warm professional female narrator, calm pace, mid-Atlantic English accent");

const res = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaispeak/voices`, {
  method: "POST",
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
  body: form,
});
if (!res.ok) throw new Error(`Voice creation failed: HTTP ${res.status}`);
const { data: voice } = await res.json();
console.log(voice.voice_id, voice.embedding_status);

Notable differences from the cloned-voice path:

No reference file part. Don't include one — sending both voice_design_prompt and reference is rejected with SCAISPEAK_AMBIGUOUS_INTAKE_MODE.
No consent_* fields. Designed voices don't represent a real person, so there's no consent to capture.
No tokenisation wait. The voice lands at embedding_status: ready in the same request — usable on the very next /v1/speak call.
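To avoid a round trip that ends in SCAISPEAK_AMBIGUOUS_INTAKE_MODE, the intake rule can be enforced before the request is built. This is a minimal sketch: the helper name designed_voice_fields and its reference_path parameter are hypothetical, and only the form field names come from the examples above.

```python
import json


def designed_voice_fields(display_name, design_prompt, language="en",
                          gender_hint=None, reference_path=None):
    """Build the form fields for a designed voice.

    Hypothetical helper. reference_path exists only so that a caller who
    passes one gets an early, local error instead of a server rejection.
    """
    if reference_path is not None:
        # Mirrors the server-side SCAISPEAK_AMBIGUOUS_INTAKE_MODE rejection.
        raise ValueError(
            "send either voice_design_prompt or a reference file, not both"
        )
    fields = {
        "display_name": display_name,
        "language_primary": language,
        "language_supported_json": json.dumps([language]),
        "voice_design_prompt": design_prompt,
    }
    if gender_hint:
        fields["gender_hint"] = gender_hint
    return fields


fields = designed_voice_fields(
    "Acme Brand Voice",
    "warm professional female narrator, calm pace, mid-Atlantic English accent",
    gender_hint="female",
)
print(fields["language_supported_json"])
```

The returned dict can be passed as the data argument of the httpx.post call shown earlier.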

3. Synthesise

Use it the same way as any other voice:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/speak" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_id": "'"$VOICE_ID"'",
    "text": "Welcome to Acme. Your call is being routed to the next available agent.",
    "response_format": "wav"
  }' \
  | python -c "import sys,json,base64; \
    b=json.load(sys.stdin)['data']['audio_base64']; \
    open('greeting.wav','wb').write(base64.b64decode(b))"

Play greeting.wav. The voice will match the description — same character on every subsequent call, so a brand voice stays consistent across utterances.
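The decoding step in the shell pipeline above can also be done in Python. The sketch below assumes the response shape the curl example relies on (data.audio_base64); the function name save_speak_audio is hypothetical, and the demo feeds it a stand-in body rather than a real API response.

```python
import base64
import json
import os
import tempfile


def save_speak_audio(response_body: str, path: str) -> int:
    """Decode data.audio_base64 from a speak response and write it to path.

    Returns the number of bytes written.
    """
    payload = json.loads(response_body)
    audio = base64.b64decode(payload["data"]["audio_base64"])
    with open(path, "wb") as fh:
        fh.write(audio)
    return len(audio)


# Stand-in response body; a real one would come from the speak call.
fake_audio = b"RIFF....WAVE"  # placeholder bytes, not a valid WAV file
fake_body = json.dumps(
    {"data": {"audio_base64": base64.b64encode(fake_audio).decode("ascii")}}
)
demo_path = os.path.join(tempfile.gettempdir(), "greeting_demo.wav")
print(save_speak_audio(fake_body, demo_path), "bytes written to", demo_path)
```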

4. Per-call delivery (optional)

The per-call control fields from the clone tutorial (instructions, cfg_value, warmup_trim_ms) all work on designed voices:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/speak" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_id": "'"$VOICE_ID"'",
    "text": "Press one for sales, two for support.",
    "response_format": "wav",
    "instructions": "slow and reassuring"
  }'

The voice keeps its identity from the design prompt; the instructions field changes the per-call delivery (pace, emotion, emphasis).
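When the control fields are optional, it helps to build the request body in one place so unset fields are omitted and typos in field names fail early. This is a sketch under assumptions: build_speak_payload is a hypothetical helper, "voice_123" is a placeholder ID, and the value types for cfg_value and warmup_trim_ms aren't specified here, so the example only sets instructions.

```python
import json

# Per-call control fields named in this guide.
ALLOWED_CONTROLS = {"instructions", "cfg_value", "warmup_trim_ms"}


def build_speak_payload(voice_id, text, response_format="wav", **controls):
    """Build the JSON body for a speak call, omitting controls set to None."""
    unknown = set(controls) - ALLOWED_CONTROLS
    if unknown:
        raise TypeError(f"unknown control field(s): {sorted(unknown)}")
    payload = {
        "voice_id": voice_id,
        "text": text,
        "response_format": response_format,
    }
    payload.update({k: v for k, v in controls.items() if v is not None})
    return payload


payload = build_speak_payload(
    "voice_123",  # placeholder voice ID
    "Press one for sales, two for support.",
    instructions="slow and reassuring",
)
print(json.dumps(payload, indent=2))
```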

5. Iterate on the design

If the voice doesn't match what you wanted, edit the design prompt via PATCH /voices/{voice_id}:

bash
curl -X PATCH "$SCAIGRID_HOST/v1/modules/scaispeak/voices/$VOICE_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"voice_design_prompt":"warm female narrator, slightly slower pace, hint of British accent"}'

Every subsequent /v1/speak call against the voice uses the new description. The voice ID stays stable, so any callers that store the ID don't need to change.
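The same PATCH can be prepared from Python using only the standard library. The sketch below builds the request without sending it; the host "https://scaigrid.example", the key "test-key", and the ID "voice_123" are placeholders, and design_patch_request is a hypothetical helper name.

```python
import json
import urllib.request


def design_patch_request(host, api_key, voice_id, new_prompt):
    """Build (but do not send) a PATCH that replaces the voice's design prompt."""
    url = f"{host.rstrip('/')}/v1/modules/scaispeak/voices/{voice_id}"
    body = json.dumps({"voice_design_prompt": new_prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = design_patch_request(
    "https://scaigrid.example",  # placeholder host
    "test-key",                  # placeholder API key
    "voice_123",                 # placeholder voice ID
    "warm female narrator, slightly slower pace, hint of British accent",
)
print(req.get_method(), req.full_url)
# To send it for real: urllib.request.urlopen(req)
```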

6. Switch a cloned voice to design-only

You can convert an existing cloned voice into a designed voice. The conversion drops the reference clip and consent record and replaces the identity with a text prompt. This is useful when a tenant decides they no longer want a real person's voice on file:

bash
curl -X PATCH "$SCAIGRID_HOST/v1/modules/scaispeak/voices/$VOICE_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_design_prompt": "warm professional voice, similar character to the previous reference",
    "clear_reference": true
  }'

clear_reference: true tombstones the reference WAV and consent recording in storage. It requires voice_design_prompt in the same patch: clearing the reference without a fallback identity is rejected (SCAISPEAK_DESIGN_PROMPT_REQUIRED_ON_CLEAR). The voice ID is unchanged, so voice_id references in your own systems stay valid.
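The pairing rule between clear_reference and voice_design_prompt can be checked locally before the PATCH is sent. This is a minimal sketch with a hypothetical helper name, conversion_patch_body; it mirrors the documented rejection but does not replace it, since the server remains the authority.

```python
def conversion_patch_body(design_prompt=None, clear_reference=False):
    """Build a PATCH body for a voice, enforcing the clear_reference rule.

    Hypothetical client-side guard for the condition the server rejects with
    SCAISPEAK_DESIGN_PROMPT_REQUIRED_ON_CLEAR.
    """
    if clear_reference and not design_prompt:
        raise ValueError(
            "clear_reference requires voice_design_prompt in the same patch"
        )
    body = {}
    if design_prompt:
        body["voice_design_prompt"] = design_prompt
    if clear_reference:
        body["clear_reference"] = True
    return body


body = conversion_patch_body(
    "warm professional voice, similar character to the previous reference",
    clear_reference=True,
)
print(body)
```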

When you're done with it

Same erasure flow as cloned voices:

bash
curl -X DELETE "$SCAIGRID_HOST/v1/modules/scaispeak/voices/$VOICE_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

For designed voices there's no consent recording to erase (none was captured), but the voice row and any stored synthesis outputs are tombstoned the same way. The audit row carries the proof.

Done

You have a brand voice with no real-person likeness and no consent overhead, instantly usable and editable in place. The same /v1/speak API serves both designed and cloned voices, so your callers don't need to know which kind they're talking to.