Speech-to-text (STT)

Turn microphone audio into text. Decibri captures the audio as a standard stream of 16-bit PCM chunks; an STT provider processes it into words. Both cloud and local providers are supported, covering the usual trade-offs among accuracy, latency, cost, and offline capability.
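To make "16-bit PCM chunks" concrete, here is a small sketch (not decibri's actual internals) of the conversion from the Float32 samples that Web Audio produces to the 16-bit PCM format providers expect, along with the resulting byte rate. The `floatTo16BitPCM` helper is hypothetical, for illustration only:

```typescript
// Convert Float32 samples (Web Audio's native format, range -1..1)
// into 16-bit signed PCM, the format STT providers expect.
// Hypothetical helper for illustration; decibri does this for you.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;        // scale to int16 range
  }
  return out;
}

// At 16 kHz mono, 16-bit PCM is 2 bytes per sample:
// one second of audio is 32,000 bytes.
const bytesPerSecond = 16_000 /* Hz */ * 2 /* bytes/sample */ * 1 /* channel */;
```

The byte-rate arithmetic is worth keeping in mind when sizing chunks or buffers: a 100 ms chunk at these defaults is 3,200 bytes.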

Decision framework

Cloud vs local. The local providers (Sherpa-ONNX and Whisper.cpp) keep audio on-device, require no API key, and cost nothing per minute, in exchange for a model download, local compute, and manual model management. The cloud providers (Deepgram, AssemblyAI, OpenAI, AWS Transcribe, Google Cloud Speech-to-Text, Azure AI Speech, Mistral Voxtral) handle scaling, model updates, and higher baseline accuracy, at the cost of an API key, network dependency, and per-minute fees.

Streaming vs buffered. Most providers stream results back progressively as you speak. Whisper.cpp is the exception: it buffers audio into short windows (~3 s by default) and transcribes each window in one shot, accepting added latency in exchange for higher accuracy and lower per-chunk overhead. Pick it when you can tolerate a small delay for better results; pick anything else when live feedback matters.
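The buffered pattern can be sketched as follows: accumulate incoming PCM chunks until roughly 3 s of audio has arrived, then hand one contiguous window to the transcriber. This is an illustration of the technique, not decibri's API; `makeWindower` and its callback are hypothetical names:

```typescript
// Buffered-transcription sketch: collect 16 kHz PCM chunks into ~3 s
// windows, emitting each full window to a callback. Illustrative only.
const SAMPLE_RATE = 16_000;                        // samples per second
const WINDOW_SECONDS = 3;
const WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS;

function makeWindower(onWindow: (window: Int16Array) => void) {
  let buffered: Int16Array[] = [];
  let bufferedSamples = 0;
  return (chunk: Int16Array) => {
    buffered.push(chunk);
    bufferedSamples += chunk.length;
    while (bufferedSamples >= WINDOW_SAMPLES) {
      // Flatten the buffered chunks into one contiguous array.
      const flat = new Int16Array(bufferedSamples);
      let offset = 0;
      for (const c of buffered) { flat.set(c, offset); offset += c.length; }
      onWindow(flat.subarray(0, WINDOW_SAMPLES));
      // Carry any overflow into the next window.
      const rest = flat.subarray(WINDOW_SAMPLES);
      buffered = rest.length ? [new Int16Array(rest)] : [];
      bufferedSamples = rest.length;
    }
  };
}
```

Streaming providers skip this step entirely and forward each chunk as it arrives, which is why they can show partial words while you are still speaking.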

Providers

| Provider | Mode | Deployment | Languages | Notable features |
| --- | --- | --- | --- | --- |
| AssemblyAI | Streaming | Cloud | English + multilingual | Universal-3 Pro, turn-based model, EU residency option, keyterm prompting |
| AWS Transcribe | Streaming | Cloud | 100+ | IAM auth, HTTP/2 streaming, medical transcription |
| Azure AI Speech | Streaming | Cloud | 100+ | Enterprise Azure integration, PushAudioInputStream pattern, event-callback API |
| Deepgram | Streaming | Cloud | 30+ | Nova-3 model, WebSocket streaming, diarization, smart formatting |
| Google Cloud Speech-to-Text | Streaming | Cloud | 125+ | Simplest integration (mic.pipe(recognizeStream)), 5-minute session limit |
| Mistral Voxtral | Streaming | Cloud | 13 | Open-weights (Apache 2.0), self-hostable via vLLM |
| OpenAI | Streaming | Cloud | 50+ | 24 kHz (not 16 kHz), raw WebSocket, word-level deltas |
| Sherpa-ONNX | Streaming | Local | Multilingual | Offline, no API key, streaming Zipformer model |
| Whisper.cpp | Buffered | Local | 99+ | Offline, no API key, highest local accuracy, ~3 s buffer |

Browser considerations

The cloud STT providers work from the browser: use decibri's browser runtime (the same npm package; conditional exports serve an AudioWorklet implementation), capture audio, and send chunks to the provider's WebSocket or HTTPS endpoint. CORS and the provider's auth flow still apply.

The local providers (Sherpa-ONNX and Whisper.cpp) are Node.js only. They rely on native bindings (ONNX Runtime, whisper.cpp via a native addon) that can't run in a browser. For local STT in the browser, run decibri's browser capture, ship chunks to a Node.js backend, and do transcription server-side.

Troubleshooting

Garbled transcripts. Almost always a sample-rate mismatch. Decibri defaults to 16 kHz mono 16-bit PCM, which matches what every provider here expects except OpenAI Realtime, which defaults to 24 kHz. If the transcript looks scrambled, verify that both sides agree on sample rate and that you haven't accidentally passed 48 kHz audio (the browser default) to a 16 kHz API.
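To see why the rates must agree: 48 kHz audio fed to a 16 kHz endpoint is interpreted as audio played at one-third speed, which scrambles recognition. A deliberately naive sketch of reducing 48 kHz to 16 kHz by keeping every third sample is below; a real resampler applies a low-pass filter first to avoid aliasing, so treat this as illustration, not production code (the function name is hypothetical):

```typescript
// Naive decimation from 48 kHz (browser default) to 16 kHz: keep
// every 3rd sample. Real resamplers low-pass filter before decimating
// to prevent aliasing; this only shows the rate relationship.
function decimate48kTo16k(samples: Int16Array): Int16Array {
  const out = new Int16Array(Math.floor(samples.length / 3));
  for (let i = 0; i < out.length; i++) {
    out[i] = samples[i * 3]; // 48,000 / 3 = 16,000 samples per second
  }
  return out;
}
```

In practice, prefer letting the capture layer or an actual resampling library produce 16 kHz directly rather than decimating by hand.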

Silent transcripts. Microphone permission not granted (browser), or the wrong input device is selected (Node.js). Check Decibri.devices() and confirm isDefault points at the mic you expect. Run decibri-cli capture -o test.wav -d 5 to verify the mic works outside of any STT pipeline.

API key errors. Each provider has its own env-var convention. See the provider's page for the exact variable name and where to put it.

Cloud-provider-specific errors. See the “Related” callout on each provider page. Every page links to the upstream provider's troubleshooting docs and console.