# Speech-To-Text Configuration
In a Conversational AI session, the STT module captures the user's voice stream in real time and converts it into text, which is then forwarded to the LLM for processing. Powered by TRTC's ultra-low-latency audio pipeline (end-to-end audio latency under 300 ms, conversation latency under 1 s globally) and advanced audio processing (AI noise suppression, echo cancellation, and customizable chat modes), the STT module delivers clear, accurate transcription even in noisy environments. You can plug in TRTC's built-in Tencent ASR or a third-party STT provider through the `STTConfig` object.

## Available Providers
| Provider | Models | Integration | Best for |
|---|---|---|---|
| Tencent ASR | `16k_zh_large`, `16k_zh`, `16k_en`, etc. | Built-in | Ultra-low latency, advanced audio processing, flexible engine framework |
| Azure Speech (default) | — | Third-party | 100+ languages, enterprise SLAs |
| Deepgram | `nova-3`, `nova-2`, etc. | Third-party | Speed & accuracy, cost-efficient English |
| — | `stt-rt-v4`, etc. | Third-party | Multilingual, code-switching |
**Built-in vs third-party:** all providers share the same `STTConfig` structure, with top-level fields (such as `Language` and `VadSilenceTime`) and provider-specific settings carried in `CustomParam`. See each provider's sub-page for the complete configuration.
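As a rough sketch of how that shared structure might be assembled: the field names `Language`, `VadSilenceTime`, and `CustomParam` come from this page, but the example values, the assumption that `CustomParam` is a JSON string, and the `Model` key inside it are hypothetical placeholders — consult your provider's sub-page for the real parameters.

```python
import json

# Sketch of an STTConfig payload, assuming JSON transport.
# Top-level fields are shared by all providers; provider-specific
# settings go into CustomParam (the keys used here are hypothetical).
stt_config = {
    "Language": "en-US",         # primary transcription language (example value)
    "VadSilenceTime": 600,       # silence in ms before end-of-utterance (example value)
    "CustomParam": json.dumps({  # provider-specific settings, serialized as a string (assumption)
        "Model": "nova-3",       # hypothetical key; see the provider's sub-page
    }),
}

print(json.dumps(stt_config, indent=2))
```

Keeping provider-specific knobs inside `CustomParam` means the top-level fields stay stable when you swap providers; only the nested settings change.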