This document describes how to implement real-time AI conversation based on Tencent RTC.
Overview
The solution uses the RTC SDK to access TRTC services and delivers ultra-low-latency real-time AI conversation through the AI real-time conversation interface. Integration is flexible: you can plug in a third-party large language model (LLM) and Text To Speech (TTS) service that fit your business needs. The solution also includes optimizations for real-time voice noise reduction, intelligent AI interruption, and context management to improve the user experience.
Call startLocalAudio to enable mic capture. The quality parameter sets the capture mode. Despite its name, a higher quality value is not always better; choose the value that best suits your business scenario (a more accurate name for this parameter would be scene).
The SPEECH mode is recommended for AI conversation scenarios. In this mode, the SDK's audio module focuses on extracting speech signals and filters out as much ambient noise as possible. Audio in this mode is also more resilient to poor network conditions, which makes it particularly suitable for video calls and online meetings that center on voice communication.
Android
iOS and macOS
// Enable mic capture and set `quality` to `SPEECH` (it has high noise suppression and strong resistance to poor network conditions)
It is recommended that the business backend call the following APIs. The client calls only the APIs exposed by the business backend to initiate an AI conversation.
TRTC provides the following TencentCloud APIs for initiating and managing conversation tasks:
Additionally, the following parameters are added to the HTTP header to help you support more complex logic:
X-Task-Id: <task_id_value> // ID of this task.
X-Request-Id: <request_id> // ID of this request. A retry carries the same requestId.
X-Sdk-App-Id: SdkAppId
X-User-Id: UserId
X-Room-Id: RoomId
X-Room-Id-Type: "0" // "0" represents a numeric room ID, and "1" represents a string room ID.
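A backend that receives these headers may want to collect them into one typed object before running its own logic. The following is a minimal sketch, not part of the TRTC SDK: `TaskContext` and `parse_task_headers` are hypothetical names, the request ID header is assumed to be spelled `X-Request-Id`, and the code tolerates the room ID type arriving either with or without surrounding quotes.

```python
from dataclasses import dataclass
from typing import Mapping, Union


@dataclass(frozen=True)
class TaskContext:
    """Hypothetical container for the task headers listed above."""
    task_id: str
    request_id: str
    sdk_app_id: str
    user_id: str
    room_id: Union[int, str]


def parse_task_headers(headers: Mapping[str, str]) -> TaskContext:
    """Extract the task headers from an incoming HTTP request."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    def require(name: str) -> str:
        value = lowered.get(name.lower())
        if value is None:
            raise ValueError(f"missing header: {name}")
        return value.strip()

    room_id_raw = require("X-Room-Id")
    # "0" means a numeric room ID, "1" means a string room ID.
    # Assumption: the value may or may not be wrapped in double quotes.
    room_id_type = require("X-Room-Id-Type").strip('"')
    if room_id_type == "0":
        room_id: Union[int, str] = int(room_id_raw)
    elif room_id_type == "1":
        room_id = room_id_raw
    else:
        raise ValueError(f"unexpected X-Room-Id-Type: {room_id_type}")

    return TaskContext(
        task_id=require("X-Task-Id"),
        request_id=require("X-Request-Id"),
        sdk_app_id=require("X-Sdk-App-Id"),
        user_id=require("X-User-Id"),
        room_id=room_id,
    )
```

Because a retry carries the same request ID, `request_id` can serve as a deduplication key if your handler is not idempotent.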
TTS Interaction
The TTS parameters are filled in with the credentials of your own TTS account.
Custom TTS
{
"TTSType":"custom",// Required: string
"APIKey":"ApiKey",// Required: string, for authentication
"APIUrl":"http://0.0.0.0:8080/stream-audio",// Required: string, TTS API URL
"AudioFormat":"wav",// Optional: string, desired audio format. The default value is wav. Of the possible formats (mp3, ogg_opus, pcm, and wav), only pcm and wav are currently supported.
"SampleRate":16000,// Optional: integer, audio sample rate. The default value is 16000 (16k), and the recommended value is 16000.
"AudioChannel":1// Optional: integer, number of audio channels. Value: 1 or 2, with a default value of 1
}
Tencent TTS
{
"TTSType":"tencent",// String, TTS type. "tencent" and "minimax" are currently supported, and support for other vendors is in progress.
"AppId":"Your application ID",// Required: string
"SecretId":"Your key ID",// Required: string
"SecretKey":"Your key",// Required: string
"VoiceType":101001,// Required: integer, voice ID, including standard voices and premium voices. Premium voices have higher realism and are priced differently from standard voices. Refer to TTS Pricing Overview. For a complete list of voice IDs, refer to the TTS Voice List.
"Speed":1.25,// Optional: number, playback speed. Value range: [-2, 6], corresponding to different speeds: -2: 0.6x, -1: 0.8x, 0: 1.0x (default), 1: 1.2x, 2: 1.5x, and 6: 2.5x. If you need more precise speeds, you can use up to 2 decimal places, for example, 0.5, 1.25, or 2.81. For the conversion between parameter values and actual speeds, refer to Speed Conversion.
"Volume":5,// Optional: integer, volume level. Value range: [0, 10], corresponding to 11 levels of volume. The default value is 0, representing normal volume.
"PrimaryLanguage":1,// Optional: integer, primary language. Valid values: 1 - Chinese (default), 2 - English, and 3 - Japanese
"FastVoiceType":"xxxx"// Optional: parameter for fast Voice Reproduce (VRS)
}
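Since several of the Tencent TTS fields have documented value ranges, it can help to validate them on your backend before starting a task. The sketch below is illustrative only: `build_tencent_tts_config` is a hypothetical helper, and serializing the configuration to a JSON string is an assumption about how your backend passes it on. The ranges checked are the ones stated in the field comments above.

```python
import json


def build_tencent_tts_config(
    app_id: str,
    secret_id: str,
    secret_key: str,
    voice_type: int,
    speed: float = 0,
    volume: int = 0,
    primary_language: int = 1,
) -> str:
    """Validate the documented ranges and return the config as a JSON string."""
    if not -2 <= speed <= 6:
        raise ValueError("Speed must be within [-2, 6]")
    if not isinstance(volume, int) or not 0 <= volume <= 10:
        raise ValueError("Volume must be an integer within [0, 10]")
    # 1 - Chinese (default), 2 - English, 3 - Japanese
    if primary_language not in (1, 2, 3):
        raise ValueError("PrimaryLanguage must be 1, 2, or 3")

    config = {
        "TTSType": "tencent",
        "AppId": app_id,
        "SecretId": secret_id,
        "SecretKey": secret_key,
        "VoiceType": voice_type,
        "Speed": speed,
        "Volume": volume,
        "PrimaryLanguage": primary_language,
    }
    # Assumption: the configuration is handed on as a JSON string.
    return json.dumps(config)
```

Rejecting out-of-range values early keeps configuration mistakes from surfacing only after the conversation task has started.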
MiniMax TTS
{
"TTSType":"minimax",// String, TTS type, fixed as "minimax"
"Model":"speech-01-turbo-240228",// String, model used. Valid values: speech-01-turbo, speech-01-turbo-240228, and speech-01-240228
"ApiUrl":"https://api.minimax.chat/v1/t2a_v2",// String, MiniMax TTS API URL
"GroupId":"181000000000000",// String, to be obtained from the MiniMax management backend: https://platform.minimaxi.com/user-center/basic-information
"ApiKey":"eyxxxx",// String, to be obtained from the MiniMax management backend: https://platform.minimaxi.com/user-center/basic-information/interface-key
"VoiceType":"audiobook_female_1",// String. For voice selection, you can refer to MiniMax documentation.
"Speed":1.2// Number. Value range: [0.5, 2]. The default value is 1.0.
}
On the client, use the RTC Engine SDK's Sending and Receiving Messages feature to listen for callbacks that carry real-time captions and AI status data. The cmdID is fixed as 1.
Stopping a conversation should likewise be done by the business backend. The client calls only the APIs exposed by the business backend to stop an AI conversation.
During the testing phase, the callback URL must be configured manually. Please contact our development team.
| Field Name | Event Type | Meaning |
|---|---|---|
| EVENT_TYPE_AI_SERVICE_START | 901 | AI task start. Generated when the start API is called and the task is initiated. |
| EVENT_TYPE_AI_SERVICE_STOP | 902 | AI task stop. Generated when the stop API is called and the task ends. |
| EVENT_TYPE_AI_SERVICE_MSG | 903 | Callback after a complete sentence is recognized, and also when the LLM generates a complete response. |
| EVENT_TYPE_AI_SERVICE_START_OF_SPEECH | 904 | Callback when it is recognized that a user starts speaking. |
Note:
It can be used with TRTC Event Callbacks to enrich features.
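A callback server typically routes each incoming event to a handler based on its event type. The following sketch shows one way to do that with the four event types above. `AIServiceEvent`, `dispatch_event`, and the handler signature are hypothetical names chosen for illustration; the event group ID 9 and the `EventGroupId`, `EventType`, and `EventInfo` fields are taken from the sample callback bodies in this document.

```python
from enum import IntEnum
from typing import Any, Callable, Dict, Mapping

EVENT_GROUP_AI_SERVICE = 9  # Event group ID, fixed as 9 for the AI service


class AIServiceEvent(IntEnum):
    START = 901            # EVENT_TYPE_AI_SERVICE_START
    STOP = 902             # EVENT_TYPE_AI_SERVICE_STOP
    MSG = 903              # EVENT_TYPE_AI_SERVICE_MSG
    START_OF_SPEECH = 904  # EVENT_TYPE_AI_SERVICE_START_OF_SPEECH


Handler = Callable[[Dict[str, Any]], None]


def dispatch_event(
    body: Mapping[str, Any],
    handlers: Mapping[AIServiceEvent, Handler],
) -> bool:
    """Route a decoded callback body to its handler.

    Returns True if a handler ran, and False if the event was ignored.
    """
    if body.get("EventGroupId") != EVENT_GROUP_AI_SERVICE:
        return False
    try:
        event = AIServiceEvent(body.get("EventType"))
    except ValueError:
        # Unknown or missing event type: ignore rather than fail.
        return False
    handler = handlers.get(event)
    if handler is None:
        return False
    handler(dict(body.get("EventInfo", {})))
    return True
```

Ignoring unknown event types instead of raising keeps the server tolerant of event types it does not yet handle.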
Starting Event 901
{
"EventGroupId":9,// Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
"EventType":901,// Event type, detailed below
"CallbackTs":1687770730166,// Unix timestamp in milliseconds when the event callback server sends a request to your server
"EventInfo":{
"EventMsTs":1622186275757,// Event trigger timestamp in milliseconds
"TaskId":"xx",// Task ID
"RoomId":"1234",
"RoomIdType":0,// 0 represents a numeric room ID, and 1 represents a string room ID.
"Payload":{
"Status":0
}
}
}
| Field | Type | Meaning |
|---|---|---|
| Status | Number | 0: The AI task started successfully. |
| | | 1: The AI task failed to start. |
Stopping Event 902
{
"EventGroupId":9,// Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
"EventType":902,// Event type, detailed below
"CallbackTs":1687770730166,// Unix timestamp in milliseconds when the event callback server sends a request to your server
"EventInfo":{
"EventMsTs":1622186275757,// Event trigger timestamp in milliseconds
"TaskId":"xx",// Task ID
"RoomId":"1234",
"RoomIdType":0,// 0 represents a numeric room ID, and 1 represents a string room ID.
"Payload":{
"LeaveCode":0
}
}
}
| Field | Type | Meaning |
|---|---|---|
| LeaveCode | Number | 0: The task exits after the stop API is normally called. |
| | | 1: The task exits after the business kicks out the transcription robot. |
| | | 2: The task exits after the business dissolves the room. |
| | | 3: The TRTC server kicks out the robot. |
| | | 4: The TRTC server dissolves the room. |
| | | 98: Internal error. The business is advised to retry. |
| | | 99: The task exits after a specified time if there are no other user streams in the room except the transcription robot. |
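When a stop event arrives, the backend usually needs to know why the task ended and whether to start it again. The sketch below maps each LeaveCode to a short reason and flags code 98 as retryable, following the advice in the table above. `interpret_stop_event` and the two constants are hypothetical names, and the reason strings are paraphrases for logging, not official messages.

```python
from typing import Any, Dict, Mapping, Tuple

LEAVE_CODE_MEANINGS: Dict[int, str] = {
    0: "stop API called normally",
    1: "business kicked out the transcription robot",
    2: "business dissolved the room",
    3: "TRTC server kicked out the robot",
    4: "TRTC server dissolved the room",
    98: "internal error",
    99: "no other user streams in the room for the specified time",
}

# Only code 98 is documented with a recommendation to retry.
RETRYABLE_LEAVE_CODES = frozenset({98})


def interpret_stop_event(body: Mapping[str, Any]) -> Tuple[str, bool]:
    """Return (reason, should_retry) for a 902 stop event body."""
    if body.get("EventType") != 902:
        raise ValueError("not a stop event")
    leave_code = body["EventInfo"]["Payload"]["LeaveCode"]
    reason = LEAVE_CODE_MEANINGS.get(
        leave_code, f"unknown leave code {leave_code}"
    )
    return reason, leave_code in RETRYABLE_LEAVE_CODES
```

For example, a body whose Payload is `{"LeaveCode": 98}` yields a retry flag of `True`, while every other documented code yields `False`.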
Callback After a Complete Sentence Is Recognized 903
{
"EventGroupId":9,// Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
"EventType":903,// Event type, detailed below
"CallbackTs":1687770730166,// Unix timestamp in milliseconds when the event callback server sends a request to your server
"EventInfo":{
"EventMsTs":1622186275757,// Event trigger timestamp in milliseconds
"TaskId":"xx",// Task ID
"RoomId":"1234",
"RoomIdType":0,// 0 represents a numeric room ID, and 1 represents a string room ID.
"Payload":{
"type":"subtitle",// subtitle represents a caption message, and transcription represents a transcription message.
"userid":"xxx",// The user corresponding to the message
"text":"xxxx",// Source language text
"translation_text":"xxx",// Translated text. It is an empty string if there is no translation.
"start_time":"00:30:00",// Start time
"end_time":"00:30:02",// End time
"roundid":"xxxxx",// Unique ID for a round of conversation
"start_ms_ts":123245678,// Start timestamp in milliseconds
"end_ms_ts":123245678// End timestamp in milliseconds (for STT, the time when recognition ends; for LLM, the time when the reply ends)
}
}
}
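To work with a 903 event in application code, it is often convenient to flatten the nested body into one record. The following is a minimal sketch under the assumption that the body has already been decoded from JSON into a dictionary with the fields shown in the sample above. `SentenceMessage` and `parse_sentence_event` are hypothetical names, and `duration_ms` is a derived value, not a field of the callback.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class SentenceMessage:
    """Hypothetical flattened view of a 903 callback."""
    task_id: str
    user_id: str
    round_id: str
    message_type: str
    text: str
    translation_text: str
    duration_ms: int


def parse_sentence_event(body: Mapping[str, Any]) -> SentenceMessage:
    """Convert a decoded 903 callback body into a SentenceMessage."""
    if body.get("EventType") != 903:
        raise ValueError("not a complete-sentence event")
    info = body["EventInfo"]
    payload = info["Payload"]
    return SentenceMessage(
        task_id=info["TaskId"],
        user_id=payload["userid"],
        round_id=payload["roundid"],
        message_type=payload["type"],
        text=payload["text"],
        # An empty string means there is no translation.
        translation_text=payload.get("translation_text", ""),
        # Derived from the millisecond timestamps in the payload.
        duration_ms=payload["end_ms_ts"] - payload["start_ms_ts"],
    )
```

Grouping records by `round_id` is one way to pair a recognized user sentence with the LLM reply from the same round of conversation.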
Callback When a User Starts Speaking 904
{
"EventGroupId":9,// Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
"EventType":904,// Event type. The first character begins to be recognized.
"CallbackTs":1687770730166,// Unix timestamp in milliseconds when the event callback server sends a request to your server
"EventInfo":{
"EventMsTs":1622186275757,// Event trigger timestamp in milliseconds
"TaskId":"xx",// Task ID
"RoomId":"1234",
"RoomIdType":0,// 0 represents a numeric room ID, and 1 represents a string room ID.
"Payload":{
"userid":"xxx",
"start_time":"00:30:00",// Start time
"roundid":"xxxxx"// Unique ID for a round of conversation