Speech to Text

Use Cases

Tencent Real-Time Communication (TRTC) supports the speech-to-text feature, which converts the audio streams of specified users or all users in a room into corresponding Chinese text for effects such as real-time captions.

Prerequisites

Log in to the TRTC console, activate the TRTC service, and create an RTC-Engine application.
Go to the purchase page to buy an RTC-Engine package of any version to unlock the speech-to-text feature.
Note:
The speech-to-text feature incurs fees based on usage. See Fee Details for more information.

Feature Overview

After a task is initiated, TRTC AI Service uses an Automatic Speech Recognition (ASR) bot to enter a TRTC room and pull the streams of specified users or all users for speech-to-text recognition. It then relays the recognition results to the client and server in real time.


Integration Guide

Step 1: Receiving Speech-to-Text Results

Method 1: Receiving Text Messages via Client SDK

Use the custom message receiving feature of the TRTC SDK to listen to callbacks on the client and receive real-time speech-to-text result data.
The client callback message format is as follows, using the web client as an example:
trtc.on(TRTC.EVENT.CUSTOM_MESSAGE, event => {
  // Receive custom messages.
  // event.userId: The userId of the ASR robot.
  // event.cmdId: The message ID, which is fixed at 1 for transcriptions and captions.
  // event.seq: The sequence number of a message.
  // event.data: ArrayBuffer type. For content of transcriptions or captions, see the explanation of the data field below.
  const data = new TextDecoder().decode(event.data)
  // Explanation of the data field is as follows.
  console.log(`received custom msg from ${event.userId}, message: ${ data }`)
})
Data field explanation

Real-Time Captions

Field Name | Type | Meaning
type | Integer | 10000: The message type delivered for real-time captions and for complete sentences.
sender | String | The speaker's userId.
receiver | Array | List of recipient userIds. In practice, the message is broadcast to the whole room.
payload.text | String | Recognized text, Unicode encoded.
payload.start_time | String | Message start time, counted from when the task starts.
payload.end_time | String | Message end time, counted from when the task starts.
payload.end | Boolean | If true, this message is a complete sentence.
{
  "type": 10000,
  "sender": "user_a",
  "payload": {
    "text": "",
    "start_time": "00:00:02",
    "end_time": "00:00:05",
    "end": true
  }
}
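As a minimal sketch, the decoded data can be parsed and filtered by message type before it is used. This assumes the data is UTF-8 encoded JSON shaped like the example above. `parseCaptionMessage` and the returned field names are hypothetical and are not part of the TRTC SDK.

```javascript
// Hypothetical helper (not part of the TRTC SDK): decode and validate one
// custom message. Assumes the data is UTF-8 encoded JSON shaped like the
// example above.
const CAPTION_TYPE = 10000;

function parseCaptionMessage(data) {
  let message;
  try {
    // TextDecoder accepts both an ArrayBuffer and a typed array.
    message = JSON.parse(new TextDecoder().decode(data));
  } catch (err) {
    return null; // Not JSON, so not a caption message.
  }
  if (!message || message.type !== CAPTION_TYPE || !message.payload) {
    return null; // Some other custom message.
  }
  const { text, start_time, end_time, end } = message.payload;
  return {
    sender: message.sender,
    text,
    startTime: start_time,
    endTime: end_time,
    isFinal: end === true,
  };
}

// Usage with locally built bytes standing in for event.data.
const sample = new TextEncoder().encode(JSON.stringify({
  type: 10000,
  sender: "user_a",
  payload: {
    text: "How's the weather today?",
    start_time: "00:00:02",
    end_time: "00:00:05",
    end: true,
  },
}));
console.log(parseCaptionMessage(sample));
```

Returning null for anything that is not a type 10000 message lets the same CUSTOM_MESSAGE listener coexist with other custom messages sent in the room.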
Note:
Callback example explanation:
Transcription: Each complete sentence is transcribed and pushed as a single message.
"How's the weather today?"
Captions: A sentence is pushed in segments, with each subsequent segment containing the previous one to ensure real-time performance.
"Today"
"Today's weather"
"How's the weather today?"
Message sequence: Caption message > Caption message > ... > Caption message (end = true)
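The sequence above can be handled on the client roughly as follows. This is a sketch under stated assumptions: `CaptionBuffer` and its methods are illustrative names, and the input is an already parsed message with the `sender`, `payload.text`, and `payload.end` fields described earlier.

```javascript
// Hypothetical aggregator (names are illustrative): keeps the latest partial
// caption per speaker and moves it to the transcript once end is true.
class CaptionBuffer {
  constructor() {
    this.partial = new Map(); // sender -> latest in-progress caption text
    this.transcript = [];     // finished sentences, in arrival order
  }

  // message: { sender, payload: { text, end } }
  push(message) {
    const { sender, payload } = message;
    if (payload.end) {
      this.partial.delete(sender);
      this.transcript.push({ sender, text: payload.text });
    } else {
      // Each segment already contains the previous one, so replace it
      // instead of appending.
      this.partial.set(sender, payload.text);
    }
  }

  // Text to render right now for one speaker ("" when nothing is in progress).
  current(sender) {
    return this.partial.get(sender) || "";
  }
}

const captions = new CaptionBuffer();
captions.push({ sender: "user_a", payload: { text: "Today", end: false } });
captions.push({ sender: "user_a", payload: { text: "Today's weather", end: false } });
console.log(captions.current("user_a")); // prints "Today's weather"
captions.push({ sender: "user_a", payload: { text: "How's the weather today?", end: true } });
console.log(captions.transcript.length); // prints 1
```

Keying the partial text by sender keeps overlapping speakers from overwriting each other's in-progress captions.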

Method 2: Receiving via Server-side Callbacks

The speech-to-text service also provides server-side event callbacks so that your service can receive conversation messages in real time. See Detailed Callback Events.

Step 2: Initiating a Speech-to-Text Task

TRTC provides the following Tencent Cloud APIs for initiating and managing speech-to-text tasks:
Start a speech-to-text task: StartAITranscription
Query a speech-to-text task: DescribeAITranscription
Stop a speech-to-text task: StopAITranscription
Note:
The speech-to-text feature has a concurrency limit of 100 tasks per SDKAppId. Submit a ticket if you need to increase this limit.
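One way to keep a backend within the concurrency limit is to route the three APIs through a small wrapper that counts the tasks it has started. The sketch below is illustrative only: `callApi` is a hypothetical injected function that signs and sends a Tencent Cloud API request, and the `TaskId` and `RoomId` field names are assumptions to check against the API reference. The count is local to this process and does not reflect tasks started elsewhere.

```javascript
// Hypothetical wrapper around the three task APIs. callApi(action, params)
// is an injected function; the TaskId and RoomId names are assumptions.
const MAX_CONCURRENT_TASKS = 100; // limit per SDKAppId stated above

class TranscriptionTasks {
  constructor(callApi, limit = MAX_CONCURRENT_TASKS) {
    this.callApi = callApi;
    this.limit = limit;
    this.active = new Set(); // task IDs started through this wrapper
    this.inFlight = 0;       // start requests that have not returned yet
  }

  async start(params) {
    if (this.active.size + this.inFlight >= this.limit) {
      throw new Error(`concurrency limit of ${this.limit} tasks reached`);
    }
    this.inFlight += 1; // reserve a slot before awaiting
    try {
      const response = await this.callApi("StartAITranscription", params);
      this.active.add(response.TaskId);
      return response.TaskId;
    } finally {
      this.inFlight -= 1;
    }
  }

  describe(taskId) {
    return this.callApi("DescribeAITranscription", { TaskId: taskId });
  }

  async stop(taskId) {
    await this.callApi("StopAITranscription", { TaskId: taskId });
    this.active.delete(taskId);
  }
}

// Usage with a stub in place of a real signed request.
let nextId = 0;
const fakeCallApi = async (action, params) => ({ TaskId: `task-${++nextId}` });
const tasks = new TranscriptionTasks(fakeCallApi, 1);
tasks.start({ RoomId: "room-1" }).then((id) => console.log("started", id));
```

Injecting `callApi` keeps request signing and transport out of the wrapper, so the same class works with an official SDK client or a test stub.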