Sherpa-ONNX (Kokoro)

Local text-to-speech using decibri and sherpa-onnx running the Kokoro model. Generates speech on your machine and plays it through decibri's Speaker. Runs entirely offline with no API key, no cloud service, and no network dependency.

What this does

This integration is the output side of the pipeline. Instead of capturing audio, it generates it: sherpa-onnx runs the Kokoro model to synthesise speech from text, and decibri plays the resulting audio through your system speaker.

Three roles, kept distinct: Kokoro is the text-to-speech model (created by hexgrad), sherpa-onnx is the runtime that loads and runs it, and decibri plays the generated audio through the system Speaker.

The Kokoro release from k2-fsa bundles its own espeak-ng-data directory, so no separate espeak-ng system install is required. Other local Kokoro setups need a manual espeak-ng install; this one does not.

Prerequisites

Install packages

$ pip install decibri sherpa-onnx

Install decibri and the sherpa-onnx runtime for your language. The command above is the Python install; the Node examples below require the decibri and sherpa-onnx-node packages instead. On Python, numpy is pulled in transitively by sherpa-onnx, so the numpy import in the code below needs no separate install. The package manager does not download the model; the model files are fetched in the next step.
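
One way to confirm the Python install before going further is to check that each module the code on this page imports can be found. This is a minimal sketch using only the standard library; the helper name missing_modules is illustrative and not part of decibri or sherpa-onnx.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names used by the Python code on this page.
    missing = missing_modules(["decibri", "sherpa_onnx", "numpy"])
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result means all three imports should resolve; anything listed needs installing before the walkthrough will run.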

Download the Kokoro model

sherpa-onnx loads a pre-trained model. Download the Kokoro English release from the sherpa-onnx TTS models and extract it:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/kokoro-en-v0_19.tar.bz2
tar xvf kokoro-en-v0_19.tar.bz2

This produces a kokoro-en-v0_19/ directory containing model.onnx, voices.bin, tokens.txt, and an espeak-ng-data/ directory. The model is about 345 MB. It covers English with 11 speakers, selected by speaker id (sid): 0 = af, 1 = af_bella, 2 = af_nicole, 3 = af_sarah, 4 = af_sky, 5 = am_adam, 6 = am_michael, 7 = bf_emma, 8 = bf_isabella, 9 = bm_george, 10 = bm_lewis.
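
Before building the engine, it can help to confirm that the extraction produced the four entries listed above. The sketch below checks for them with the standard library only; the helper name missing_entries is illustrative, and the directory path assumes you extracted the archive into the current working directory.

```python
from pathlib import Path

# Entries the extracted release is expected to contain, per the listing above.
EXPECTED_ENTRIES = ("model.onnx", "voices.bin", "tokens.txt", "espeak-ng-data")


def missing_entries(model_dir):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries("./kokoro-en-v0_19")
    if missing:
        print("Missing from kokoro-en-v0_19/:", ", ".join(missing))
    else:
        print("All expected model files are present.")
```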

Code walkthrough

Build the Kokoro TTS engine, generate speech from text, and play the result through decibri's Speaker. Kokoro outputs float32 samples in the range -1.0 to 1.0 at 24000 Hz mono. Decibri's Speaker accepts float32 directly when constructed with dtype float32, so the audio flows straight from Kokoro to playback with no int16 conversion and no rescaling. Construct the Speaker at the sample rate the model reports (read it from the generated audio rather than hardcoding it), so swapping in another model stays correct.
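
Because the output is float32 mono at a known sample rate, clip duration and buffer size are simple arithmetic: each sample occupies 4 bytes, and one second of audio holds sample_rate samples. The sketch below works through that; the helper names are illustrative, and the sample count in the usage lines is an arbitrary example rather than a measured Kokoro output.

```python
BYTES_PER_FLOAT32 = 4


def clip_duration_seconds(num_samples, sample_rate):
    """Duration of a mono clip, given its sample count and sample rate."""
    return num_samples / sample_rate


def clip_size_bytes(num_samples, channels=1):
    """Size in bytes of a float32 buffer holding num_samples frames."""
    return num_samples * channels * BYTES_PER_FLOAT32


if __name__ == "__main__":
    num_samples = 120_000  # arbitrary example count
    print(clip_duration_seconds(num_samples, 24000))  # 5.0
    print(clip_size_bytes(num_samples))               # 480000
```

The byte count is the size of the Buffer the Node tab would hand to the Speaker for a clip of the same length.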

1. Build the TTS engine

Construct the engine once and reuse it for every clip you synthesise. The config points sherpa-onnx at the four pieces of the Kokoro release: the ONNX model, the voices file, the tokens file, and the bundled espeak-ng-data directory. Python config keys are snake_case; the Node config uses the camelCase equivalents.

import numpy as np
import sherpa_onnx
import decibri

# Build the TTS engine (Kokoro). Config keys are snake_case.
config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        kokoro=sherpa_onnx.OfflineTtsKokoroModelConfig(
            model="./kokoro-en-v0_19/model.onnx",
            voices="./kokoro-en-v0_19/voices.bin",
            tokens="./kokoro-en-v0_19/tokens.txt",
            data_dir="./kokoro-en-v0_19/espeak-ng-data",
        ),
        num_threads=2,
    ),
)
tts = sherpa_onnx.OfflineTts(config)

const sherpa = require("sherpa-onnx-node");
const decibri = require("decibri");

// Build the TTS engine (Kokoro). Config keys are camelCase.
const tts = new sherpa.OfflineTts({
  model: {
    kokoro: {
      model: "./kokoro-en-v0_19/model.onnx",
      voices: "./kokoro-en-v0_19/voices.bin",
      tokens: "./kokoro-en-v0_19/tokens.txt",
      dataDir: "./kokoro-en-v0_19/espeak-ng-data",
    },
    numThreads: 2,
  },
});
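
Reusing one engine across clips can be sketched as a small helper that takes the already-built engine and a list of texts. It relies only on the generate(text=..., sid=..., speed=...) call shown on this page; the helper name synthesise_all is illustrative.

```python
def synthesise_all(tts, texts, sid=0, speed=1.0):
    """Generate one clip per text with a single, already-built engine.

    tts is expected to expose generate(text=..., sid=..., speed=...),
    as the OfflineTts instance built above does.
    """
    return [tts.generate(text=text, sid=sid, speed=speed) for text in texts]


# Usage with the engine from this step:
# clips = synthesise_all(tts, ["First sentence.", "Second sentence."], sid=0)
```

Building the engine inside the loop instead would reload the model for every clip, which is the cost this pattern avoids.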

2. Generate speech from text

Call generate() with the text and a speaker id (sid, 0 to 10). It returns the audio as a float32 sample buffer plus the sample rate it was produced at (24000 Hz for Kokoro). This example generates the full clip and then plays it. sherpa-onnx can also emit audio in chunks as it generates, which would let playback start sooner; that callback-based path is not covered here.

# Generate speech. Returns a GeneratedAudio with .samples (float32) and .sample_rate.
audio = tts.generate(text="A demonstration of decibri as the audio-output layer for text-to-speech: Kokoro generates the speech, and decibri plays it through the system speaker in real time.", sid=0, speed=1.0)
samples = np.asarray(audio.samples, dtype=np.float32)

// Generate speech. generate() takes an options object; returns { samples: Float32Array, sampleRate }.
const audio = tts.generate({
  text: "A demonstration of decibri as the audio-output layer for text-to-speech: Kokoro generates the speech, and decibri plays it through the system speaker in real time.",
  sid: 0,      // speaker id 0-10
  speed: 1.0,
});
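
Since sid is a bare integer, a lookup by voice name can make call sites easier to read. The sketch below encodes the speaker table from the model download step for kokoro-en-v0_19; the constant and function names are illustrative, and the ordering would need rechecking for any other Kokoro release.

```python
# Speaker ids for kokoro-en-v0_19: a voice's position in this tuple is its sid.
KOKORO_EN_V0_19_VOICES = (
    "af", "af_bella", "af_nicole", "af_sarah", "af_sky",
    "am_adam", "am_michael", "bf_emma", "bf_isabella",
    "bm_george", "bm_lewis",
)


def sid_for(voice):
    """Return the speaker id for a voice name, or raise ValueError if unknown."""
    try:
        return KOKORO_EN_V0_19_VOICES.index(voice)
    except ValueError:
        raise ValueError(
            f"unknown voice {voice!r}; expected one of {KOKORO_EN_V0_19_VOICES}"
        ) from None


if __name__ == "__main__":
    print(sid_for("bf_emma"))  # 7
```

The result can be passed directly as the sid argument to generate().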

3. Play through decibri's Speaker

Construct the Speaker at the sample rate the audio reports, with dtype float32 so the samples pass straight through with no conversion. The two tabs differ on the playback API, and that difference is deliberate. The Python Speaker exposes explicit lifecycle methods: start(), write(), drain(), and stop(). The Node Speaker is a Node.js Writable stream: construction opens the device, writeAsync() takes a Buffer over the float32 bytes, drainAsync() waits for playback to finish, and stop() releases the device. There is no start() on the Node Speaker. Each tab uses the idiom native to its ecosystem.

# Play through decibri's Speaker.
# Note: Python Speaker is a method object: start/write/drain/stop.
speaker = decibri.Speaker(sample_rate=audio.sample_rate, channels=1, dtype="float32")
try:
    speaker.start()
    speaker.write(samples)  # float32 ndarray accepted directly with dtype="float32"
    speaker.drain()         # blocks until playback finishes
finally:
    speaker.stop()

// Play through decibri's Speaker.
// Note: Node Speaker is a Writable stream. No start(); construction opens it.
// write takes bytes, so wrap the Float32Array in a Buffer over the same bytes.
// await is not valid at the top level of a CommonJS script, so playback runs in an async function.
async function play() {
  const speaker = new decibri.Speaker({
    sampleRate: audio.sampleRate, // 24000 for Kokoro
    channels: 1,
    dtype: "float32",
  });
  try {
    const buf = Buffer.from(audio.samples.buffer, audio.samples.byteOffset, audio.samples.byteLength);
    await speaker.writeAsync(buf); // not write()
    await speaker.drainAsync();    // waits for playback to finish; not drain()
  } finally {
    speaker.stop();                // sync; releases the device even if playback fails
  }
}

play().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
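
On the Python side, the start/stop pairing can be packaged as a context manager so that the device is released however the block exits. This is a minimal sketch that uses only the start() and stop() methods described above; the name playing is illustrative and is not part of decibri.

```python
from contextlib import contextmanager


@contextmanager
def playing(speaker):
    """Start speaker, yield it, and always call stop() on exit."""
    try:
        speaker.start()
        yield speaker
    finally:
        speaker.stop()


# Usage with the objects from the steps above:
# with playing(decibri.Speaker(sample_rate=audio.sample_rate,
#                              channels=1, dtype="float32")) as speaker:
#     speaker.write(samples)
#     speaker.drain()
```

This mirrors the try/finally in the Python tab: stop() runs whether write() and drain() complete or raise.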

Full example

import numpy as np
import sherpa_onnx
import decibri

# Build the TTS engine (Kokoro). Config keys are snake_case.
config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        kokoro=sherpa_onnx.OfflineTtsKokoroModelConfig(
            model="./kokoro-en-v0_19/model.onnx",
            voices="./kokoro-en-v0_19/voices.bin",
            tokens="./kokoro-en-v0_19/tokens.txt",
            data_dir="./kokoro-en-v0_19/espeak-ng-data",
        ),
        num_threads=2,
    ),
)
tts = sherpa_onnx.OfflineTts(config)

# Generate speech. Returns a GeneratedAudio with .samples (float32) and .sample_rate.
audio = tts.generate(text="A demonstration of decibri as the audio-output layer for text-to-speech: Kokoro generates the speech, and decibri plays it through the system speaker in real time.", sid=0, speed=1.0)
samples = np.asarray(audio.samples, dtype=np.float32)

# Play through decibri's Speaker.
# Note: Python Speaker is a method object: start/write/drain/stop.
speaker = decibri.Speaker(sample_rate=audio.sample_rate, channels=1, dtype="float32")
try:
    speaker.start()
    speaker.write(samples)  # float32 ndarray accepted directly with dtype="float32"
    speaker.drain()         # blocks until playback finishes
finally:
    speaker.stop()

const sherpa = require("sherpa-onnx-node");
const decibri = require("decibri");

// Build the TTS engine (Kokoro). Config keys are camelCase.
const tts = new sherpa.OfflineTts({
  model: {
    kokoro: {
      model: "./kokoro-en-v0_19/model.onnx",
      voices: "./kokoro-en-v0_19/voices.bin",
      tokens: "./kokoro-en-v0_19/tokens.txt",
      dataDir: "./kokoro-en-v0_19/espeak-ng-data",
    },
    numThreads: 2,
  },
});

// Generate speech. generate() takes an options object; returns { samples: Float32Array, sampleRate }.
const audio = tts.generate({
  text: "A demonstration of decibri as the audio-output layer for text-to-speech: Kokoro generates the speech, and decibri plays it through the system speaker in real time.",
  sid: 0,      // speaker id 0-10
  speed: 1.0,
});

// Play through decibri's Speaker.
// Note: Node Speaker is a Writable stream. No start(); construction opens it.
// write takes bytes, so wrap the Float32Array in a Buffer over the same bytes.
// await is not valid at the top level of a CommonJS script, so playback runs in an async function.
async function play() {
  const speaker = new decibri.Speaker({
    sampleRate: audio.sampleRate, // 24000 for Kokoro
    channels: 1,
    dtype: "float32",
  });
  try {
    const buf = Buffer.from(audio.samples.buffer, audio.samples.byteOffset, audio.samples.byteLength);
    await speaker.writeAsync(buf); // not write()
    await speaker.drainAsync();    // waits for playback to finish; not drain()
  } finally {
    speaker.stop();                // sync; releases the device even if playback fails
  }
}

play().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});

Attribution

Kokoro is the text-to-speech model, created by hexgrad and released under the Apache-2.0 licence. It has 82 million parameters and is built on the StyleTTS 2 architecture. sherpa-onnx, from the k2-fsa project, is the runtime that loads and runs the Kokoro model; it is not Kokoro itself, and it is not an official Kokoro package. Decibri's role is the final step: it plays the audio that Kokoro generates. See the Kokoro model page for model details, voices, and licence.