Speech to Text
Use Cases
Tencent Real-Time Communication (TRTC) supports speech to text: it converts the audio streams of specified users, or of all users in a room, into Chinese text for use cases such as real-time captions.
Prerequisites
Log in to the TRTC console, activate the TRTC service, and create an RTC-Engine application.
Go to the purchase page to buy an RTC-Engine package of any version to unlock the speech-to-text feature.
Feature Overview
After a task is initiated, the TRTC AI Service sends an Automatic Speech Recognition (ASR) bot into the TRTC room. The bot pulls the streams of specified users or all users, performs speech-to-text recognition, and relays the recognition results to the client and server in real time.
Integration Guide
Step 1: Receiving Speech-to-Text Results
Method 1: Receiving Text Messages via Client SDK
Use the custom message receiving feature of the TRTC SDK to listen to callbacks on the client and receive real-time speech-to-text result data.
The client callback message format is as follows, using the Web SDK as an example:
trtc.on(TRTC.EVENT.CUSTOM_MESSAGE, event => {
  // Receive custom messages.
  // event.userId: The userId of the ASR bot.
  // event.cmdId: The message ID, which is fixed at 1 for transcriptions and captions.
  // event.seq: The sequence number of the message.
  // event.data: An ArrayBuffer. For the content of transcriptions and captions,
  //             see the data field explanation below.
  const data = new TextDecoder().decode(event.data);
  console.log(`received custom msg from ${event.userId}, message: ${data}`);
});
Data field explanation
Real-Time Captions
| Field Name | Type | Meaning |
| --- | --- | --- |
| type | Integer | Message type. 10000 is delivered for real-time captions and for complete sentences. |
| sender | String | The speaker's userId. |
| receiver | Array | List of recipient userIds. In practice, the message is broadcast within the room. |
| payload.text | String | The recognized text, Unicode encoded. |
| payload.start_time | String | Message start time, measured from when the task starts (for example, 00:00:02). |
| payload.end_time | String | Message end time, measured from when the task starts (for example, 00:00:05). |
| payload.end | Boolean | If true, this message is a complete sentence. |
{"type": 10000,"sender": "user_a","payload": {"text":"","start_time":"00:00:02","end_time":"00:00:05","end": true}}
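As a minimal sketch of how a client might turn the raw event.data into the fields described above: the helper name parseCaptionMessage and the shape of the object it returns are hypothetical and not part of the TRTC SDK. Only the JSON field names and the type value 10000 come from the table.

```javascript
// Message type for captions and transcriptions, per the data field table.
const CAPTION_TYPE = 10000;

// Hypothetical helper (not part of the TRTC SDK): decode the bytes of a
// custom message and return a normalized caption object, or null if the
// message is not a caption/transcription message.
function parseCaptionMessage(data) {
  let message;
  try {
    // event.data is an ArrayBuffer containing UTF-8 encoded JSON.
    message = JSON.parse(new TextDecoder().decode(data));
  } catch (err) {
    return null; // Not valid JSON, so not a caption message.
  }
  if (!message || message.type !== CAPTION_TYPE || !message.payload) {
    return null;
  }
  const { text, start_time, end_time, end } = message.payload;
  return {
    sender: message.sender,
    text: typeof text === "string" ? text : "",
    startTime: start_time,
    endTime: end_time,
    isFinal: end === true, // true when the sentence is complete
  };
}
```

Inside the CUSTOM_MESSAGE handler, this could be called as parseCaptionMessage(event.data) after checking that event.cmdId is 1.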
Note:
Callback example explanation:
Transcription: A complete sentence will be transcribed and pushed.
"How's the weather today?"
Captions: A sentence is pushed in segments as it is recognized. Each subsequent segment contains the previous one, so captions can update in real time.
"Today"
"Today's weather"
"How's the weather today?"
Sequence explanation: Caption message > Caption message > ... > Caption message (end = true)
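The caption sequence above can be sketched as a small per-speaker state tracker. This is an illustrative assumption about how a client might consume the messages, not an SDK feature; the class name CaptionTracker and its methods are hypothetical. Because each caption segment already contains the previous one, the sketch replaces the in-progress text rather than appending to it.

```javascript
// Hypothetical helper (not part of the TRTC SDK) that tracks the
// in-progress caption line for each speaker.
class CaptionTracker {
  constructor() {
    this.partial = new Map(); // sender -> latest unfinished caption text
    this.finished = [];       // completed sentences, in arrival order
  }

  // `message` is the decoded JSON object described in the data field table.
  update(message) {
    const { sender, payload } = message;
    if (payload.end) {
      // end = true closes the sentence: store it and clear the partial line.
      this.partial.delete(sender);
      this.finished.push({ sender, text: payload.text });
    } else {
      // Each segment contains the previous one, so replace instead of append.
      this.partial.set(sender, payload.text);
    }
  }

  // Text to show on the live caption line for a speaker ("" if none).
  currentLine(sender) {
    return this.partial.get(sender) || "";
  }
}
```

A UI could render currentLine(sender) as the live caption and append entries from finished to a transcript view.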
Method 2: Receiving via Server-side Callbacks
The speech-to-text service also provides server-side event callbacks, so that your server can receive conversation messages in real time. See Detailed Callback Events.
Step 2: Initiating a Speech-to-Text Task
TRTC provides the following Tencent Cloud APIs for initiating and managing speech-to-text tasks:
Start a speech-to-text task: StartAITranscription
Query a speech-to-text task: DescribeAITranscription
Stop a speech-to-text task: StopAITranscription
Note:
The speech-to-text feature has a concurrency limit of 100 tasks per SDKAppId. Submit a ticket if you need to increase this limit.