Text-to-speech (TTS)

Generate speech and play it through decibri's Speaker. TTS is the output side of decibri's audio layer. Where speech-to-text captures microphone audio and feeds it to a recognizer, text-to-speech runs a synthesis model and sends the generated samples to the speaker.

A TTS provider turns text into audio samples; decibri plays those samples on the output device. The model produces raw PCM (commonly float32 at its own sample rate), and decibri's Speaker accepts it directly, so there is no format plumbing between the model and the device. Construct the Speaker at the model's reported sample rate and write the generated samples to it.

Decision framework

Local vs cloud. The TTS integrations documented on this site are currently local-only. The Sherpa-ONNX integration runs the Kokoro model entirely on-device: no API key, no network dependency, and no per-minute cost, in exchange for a one-time model download and local compute. Cloud TTS providers may be added in the future; when they are, this is where the trade-offs between them will live.

Non-streaming synthesis. Sherpa-ONNX TTS uses OfflineTts, which synthesises a clip from the complete input text in a single call. It can optionally emit audio in chunks as it generates, so playback can start before the full clip is ready, but the synthesis itself is not streaming. The provider walkthrough uses the simpler path: generate the full clip, then play it.

Providers

Sherpa-ONNX

Offline text-to-speech with the Kokoro model. Download the model once, generate float32 speech locally, and play it through decibri's Speaker. No API key, no cloud, no network dependency. The Kokoro release bundles its own espeak-ng-data, so no separate espeak-ng install is required.

Troubleshooting

Model files not found. The paths passed to the engine config are resolved relative to your working directory. Confirm kokoro-en-v0_19/model.onnx, voices.bin, and tokens.txt exist at the paths you passed, or use absolute paths.
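
A pre-flight check can surface this failure before the engine loads. The sketch below uses only Node's built-in fs and path modules. The helper name findMissingModelFiles is hypothetical (it is not part of decibri or Sherpa-ONNX), and it assumes all three files sit directly inside the model directory.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper (not part of decibri or Sherpa-ONNX): returns the
// absolute paths of any expected model files that are missing.
// Assumption: all three files sit directly inside modelDir.
function findMissingModelFiles(
  modelDir,
  names = ["model.onnx", "voices.bin", "tokens.txt"]
) {
  // path.resolve anchors a relative modelDir to the current working
  // directory, so the printed paths show exactly where Node looked.
  const base = path.resolve(modelDir);
  return names
    .map((name) => path.join(base, name))
    .filter((file) => !fs.existsSync(file));
}

const missing = findMissingModelFiles("kokoro-en-v0_19");
if (missing.length > 0) {
  console.error("Missing model files:\n" + missing.join("\n"));
}
```

Printing the resolved absolute paths makes it clear when the process was started from a different working directory than expected.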

espeak-ng-data not found. The data_dir (Python) or dataDir (Node) option must point at the espeak-ng-data/ directory bundled inside the extracted model. The Kokoro release ships it, so no separate espeak-ng install is needed; if you relocated the model files, point this option at the bundled directory.

Silent or distorted playback. This is almost always a dtype or sample-rate mismatch. Construct the Speaker with dtype float32 to match Kokoro's float32 output, and at the sample rate the model reports (audio.sample_rate or audio.sampleRate, 24000 Hz for this model). A rate mismatch, such as constructing at 16000 Hz, plays the clip at the wrong speed and pitch; a dtype mismatch, such as int16, typically produces noise or silence.
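
The arithmetic behind both mismatches can be sketched without any audio hardware. The example below is illustrative only: the clip is a synthetic tone standing in for model output, playbackAt is a hypothetical helper, and it assumes the output device plays samples at its configured rate without resampling.

```javascript
// Kokoro's reported rate per the text above. The clip is a synthetic
// one-second 440 Hz tone standing in for model output.
const modelRate = 24000;
const samples = new Float32Array(modelRate);
for (let i = 0; i < samples.length; i++) {
  samples[i] = 0.5 * Math.sin((2 * Math.PI * 440 * i) / modelRate);
}

// Sample-rate mismatch: the device consumes samples at its own rate, so
// duration and pitch both scale with the ratio of the two rates.
// Hypothetical helper; assumes no resampling happens on the way out.
function playbackAt(deviceRate, sampleCount, sourceRate) {
  return {
    seconds: sampleCount / deviceRate,
    pitchRatio: deviceRate / sourceRate,
  };
}

const wrong = playbackAt(16000, samples.length, modelRate);
console.log(wrong.seconds); // 1.5 (the 1 s clip now lasts 1.5 s)
console.log(wrong.pitchRatio.toFixed(3)); // 0.667 (440 Hz is heard near 293 Hz)

// dtype mismatch: the same bytes viewed as int16 yield twice as many
// "samples", which no longer describe the original waveform.
const asInt16 = new Int16Array(
  samples.buffer,
  samples.byteOffset,
  samples.byteLength / 2
);
console.log(asInt16.length); // 48000
```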

No audio, or the wrong output device. Either no output device is available, or the system default is not the one you expect. List outputs with Speaker.devices() and pass an explicit device to the Speaker constructor.

Node: Buffer wrapping. The Node Speaker's writeAsync takes a Buffer. Wrap the Float32Array's underlying bytes with Buffer.from(audio.samples.buffer, audio.samples.byteOffset, audio.samples.byteLength). Passing the typed array directly, or a Buffer of the wrong length, plays noise.
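
The difference between the correct wrap and the two common mistakes is visible in the resulting Buffer lengths. The sketch below uses only Node's built-in Buffer; the clip array is a stand-in for audio.samples, built as a view into a larger ArrayBuffer because that is the case where byteOffset and byteLength matter.

```javascript
// Stand-in for audio.samples: a Float32Array that is a view into a larger
// ArrayBuffer. The 9s are unrelated data surrounding the clip.
const backing = new Float32Array([9, 9, 0.25, -0.5, 0.75, 9]);
const clip = backing.subarray(2, 5); // 3 samples, starting 8 bytes in

// Correct: a zero-copy Buffer over exactly the view's bytes.
const wrapped = Buffer.from(clip.buffer, clip.byteOffset, clip.byteLength);
console.log(wrapped.length); // 12 (3 samples x 4 bytes)

// Wrong: Buffer.from(typedArray) copies element values and truncates each
// float to a single byte, so the result is 3 bytes instead of 12.
const truncated = Buffer.from(clip);
console.log(truncated.length); // 3

// Wrong: omitting offset and length wraps the whole backing store,
// including the unrelated data around the clip.
const tooLong = Buffer.from(clip.buffer);
console.log(tooLong.length); // 24
```

Because the correct form shares memory with the typed array instead of copying it, the samples should not be modified until the write has completed.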
