Integration Guide

This document describes how to implement real-time AI conversation based on Tencent RTC.

Overview

The solution uses the RTC SDK to access TRTC services and calls the AI real-time conversation interface to deliver AI conversation with ultra-low latency. The integration scheme is flexible: you can plug in a third-party large language model (LLM) and text-to-speech (TTS) service to suit your business needs. The overall solution also includes technical optimizations for real-time voice noise reduction, intelligent AI interruption, and context management to improve the user experience.

Solution Architecture Diagram





Business Flowchart





Integration Guide

Prerequisites

Note:
Contact our business personnel to enable the AI real-time conversation service.



2. Set up Tencent Cloud TTS (a third-party product can also be used).
3. Create an LLM application. You can register with any suitable large model provider, which will generally offer free tokens.

1. Integrating the RTC Engine SDK

Note:
Call startLocalAudio to enable mic capture, and use the quality parameter to set the capture mode. Despite its name, a higher quality value is not always better: each business scenario has a most suitable value, so scene would be a more accurate name for this parameter.
The SPEECH mode is recommended for AI conversation scenarios. In this mode, the SDK's audio module focuses on extracting speech signals and filters out as much ambient noise as possible. Audio data in this mode is also more resilient to poor network conditions, which makes the mode particularly suitable for voice-centric scenarios such as video calls and online meetings.
Android
// Enable mic capture and set `quality` to `SPEECH` (high noise suppression and strong resistance to poor network conditions)
mCloud.startLocalAudio(TRTCCloudDef.TRTC_AUDIO_QUALITY_SPEECH);
iOS and macOS
// Enable mic capture and set `quality` to `SPEECH` (high noise suppression and strong resistance to poor network conditions)
AppDelegate *appDelegate = (AppDelegate *)[[UIApplication sharedApplication] delegate];
[appDelegate.trtcCloud startLocalAudio:TRTCAudioQualitySpeech];

2. Initiating an AI Conversation

Start AI Conversation Tasks

It is recommended that the business backend call the following APIs. The client only calls the APIs exposed by the business backend to initiate an AI conversation.
TRTC provides the following TencentCloud APIs for initiating and managing conversation tasks:
The following TTS and LLM call methods are currently supported:

LLM Interaction

OpenAI:
"LLMConfig": {
    "LLMType": "openai",
    "Model": "gpt-4o",
    "APIKey": "api-key",
    "APIUrl": "https://api.openai.com/v1/chat/completions",
    "Streaming": true,
    "SystemPrompt": "You are a personal assistant",
    "Timeout": 3.0,
    "History": 5 // Up to 50 rounds of conversations are supported, and the default is 0.
}
MiniMax:
"LLMConfig": {
    "APIKey": "eyJhbGcixxxx",
    "LLMType": "minimax",
    "Model": "abab6.5s-chat",
    "Streaming": true,
    "SystemPrompt": "You are a personal assistant",
    "APIUrl": "https://api.minimax.chat/v1/text/chatcompletion_v2",
    "History": 5 // Up to 50 rounds of conversations are supported.
}
Hunyuan:
"LLMConfig": {
    "LLMType": "openai",
    "Model": "hunyuan-standard", // hunyuan-turbo or hunyuan-standard
    "APIKey": "hunyuan-apikey",
    "APIUrl": "https://hunyuan.cloud.tencent.com/openai/v1/chat/completions",
    "Streaming": true,
    "History": 10
}
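The three LLMConfig objects above share the same fields, so a backend can assemble them with one small helper. The following is a minimal Python sketch, not part of any TRTC SDK: the function name build_llm_config is hypothetical, and serializing the object to a JSON string is an assumption about how your backend passes the configuration on. The only rule it enforces is the documented limit of 50 history rounds.

```python
import json


def build_llm_config(llm_type, model, api_key, api_url,
                     system_prompt=None, history=0, streaming=True, timeout=None):
    """Assemble an LLMConfig object using the field names shown above."""
    if not 0 <= history <= 50:
        # The comments above state that up to 50 rounds are supported.
        raise ValueError("History must be between 0 and 50")
    config = {
        "LLMType": llm_type,
        "Model": model,
        "APIKey": api_key,
        "APIUrl": api_url,
        "Streaming": streaming,
        "History": history,
    }
    # Optional fields are only emitted when the caller sets them.
    if system_prompt is not None:
        config["SystemPrompt"] = system_prompt
    if timeout is not None:
        config["Timeout"] = timeout
    return json.dumps(config)


# Example: the Hunyuan configuration, which uses the "openai" LLMType.
hunyuan = build_llm_config(
    "openai", "hunyuan-standard", "hunyuan-apikey",
    "https://hunyuan.cloud.tencent.com/openai/v1/chat/completions",
    history=10,
)
print(hunyuan)
```

Keeping the range check in one place means an out-of-range History value fails in your backend before any request is sent.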
Additionally, we add several parameters to the HTTP headers of each request to help you support more complex logic:
X-Task-Id: <task_id_value> // ID of this task.
X-Rquest-Id: <request_id> // ID of this request. The same requestId is carried in case of a retry.
X-Sdk-App-Id: SdkAppId
X-User-Id: UserId
X-Room-Id: RoomId
X-Room-Id-Type: "0" // "0" represents a numeric room ID, and "1" represents a string room ID.
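If you host your own LLM endpoint, these headers let you tie each request back to a task, user, and room. The sketch below is a hedged illustration in Python: the function names extract_task_context and is_retry are hypothetical, the header values in the example are made up, and accepting the conventional X-Request-Id spelling as a fallback is an assumption, since the list above spells the header X-Rquest-Id.

```python
def extract_task_context(headers):
    """Collect the TRTC task headers from an incoming request."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    room_id_type = lowered.get("x-room-id-type", "0").strip('"')
    return {
        "task_id": lowered.get("x-task-id"),
        # Spelled "X-Rquest-Id" in the list above; the conventional
        # spelling is accepted as a fallback (assumption).
        "request_id": lowered.get("x-rquest-id") or lowered.get("x-request-id"),
        "sdk_app_id": lowered.get("x-sdk-app-id"),
        "user_id": lowered.get("x-user-id"),
        "room_id": lowered.get("x-room-id"),
        "room_id_is_string": room_id_type == "1",
    }


_seen_request_ids = set()


def is_retry(context):
    """Return True if this request ID was seen before, i.e. the request is a retry."""
    request_id = context["request_id"]
    if request_id in _seen_request_ids:
        return True
    _seen_request_ids.add(request_id)
    return False


# Example with made-up header values.
example = extract_task_context({
    "X-Task-Id": "task-1",
    "X-Rquest-Id": "req-1",
    "X-Sdk-App-Id": "1400000000",
    "X-User-Id": "user_a",
    "X-Room-Id": "1234",
    "X-Room-Id-Type": "0",
})
print(example["task_id"], is_retry(example))
```

Because a retry carries the same request ID, a check like is_retry lets the endpoint avoid processing the same request twice.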


TTS Interaction

The TTS parameters use the customer's own account.
Custom TTS:
{
    "TTSType": "custom", // Required: string
    "APIKey": "ApiKey", // Required: string, for authentication
    "APIUrl": "http://0.0.0.0:8080/stream-audio", // Required: string, TTS API URL
    "AudioFormat": "wav", // Optional: string, desired audio format, such as mp3, ogg_opus, pcm, or wav, with a default value of wav. Currently, only pcm and wav are supported.
    "SampleRate": 16000, // Optional: integer, audio sample rate. The default value is 16000 (16k), and the recommended value is 16000.
    "AudioChannel": 1 // Optional: integer, number of audio channels. Value: 1 or 2, with a default value of 1
}
For the protocol specification, see Custom TTS Protocol.
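The optional fields of the custom TTS configuration have documented defaults and value restrictions, which can be checked before the configuration is submitted. The following Python sketch assumes a hypothetical helper named build_custom_tts_config. It applies only the constraints stated in the comments above and does not cover the Custom TTS Protocol itself.

```python
SUPPORTED_FORMATS = {"pcm", "wav"}  # only these two are currently supported


def build_custom_tts_config(api_key, api_url, audio_format="wav",
                            sample_rate=16000, audio_channel=1):
    """Build a custom TTS configuration with the documented defaults."""
    if audio_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported AudioFormat: {audio_format}")
    if audio_channel not in (1, 2):
        raise ValueError("AudioChannel must be 1 or 2")
    return {
        "TTSType": "custom",
        "APIKey": api_key,
        "APIUrl": api_url,
        "AudioFormat": audio_format,
        "SampleRate": sample_rate,
        "AudioChannel": audio_channel,
    }


tts_config = build_custom_tts_config("ApiKey", "http://0.0.0.0:8080/stream-audio")
print(tts_config)
```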
Tencent TTS:
{
    "TTSType": "tencent", // String, TTS type. "tencent" and "minimax" are currently supported, and other vendors are being supported.
    "AppId": "Your application ID", // Required: string
    "SecretId": "Your key ID", // Required: string
    "SecretKey": "Your key", // Required: string
    "VoiceType": 101001, // Required: integer, voice ID, including standard voices and premium voices. Premium voices have higher realism and are priced differently from standard voices. Refer to TTS Pricing Overview. For a complete list of voice IDs, refer to the TTS Voice List.
    "Speed": 1.25, // Optional: number, playback speed. Value range: [-2, 6], corresponding to different speeds: -2: 0.6x, -1: 0.8x, 0: 1.0x (default), 1: 1.2x, 2: 1.5x, and 6: 2.5x. If you need more precise speeds, you can retain up to 2 decimal places. For example, 0.5/1.25/2.81. For the conversion between parameter values and actual speeds, refer to Speed Conversion.
    "Volume": 5, // Optional: integer, volume level. Value range: [0, 10], corresponding to 11 levels of volume. The default value is 0, representing normal volume.
    "PrimaryLanguage": 1, // Optional: integer, primary language. Valid values: 1 - Chinese (default), 2 - English, and 3 - Japanese
    "FastVoiceType": "xxxx" // Optional: parameter for fast Voice Reproduce (VRS)
}
MiniMax TTS:
{
    "TTSType": "minimax", // String, TTS type, fixed as "minimax"
    "Model": "speech-01-turbo-240228", // String, model used. Valid values: speech-01-turbo, speech-01-turbo-240228, and speech-01-240228
    "ApiUrl": "https://api.minimax.chat/v1/t2a_v2",
    "GroupId": "181000000000000", // String, to be obtained from the MiniMax management backend: https://platform.minimaxi.com/user-center/basic-information
    "ApiKey": "eyxxxx", // String, to be obtained from the MiniMax management backend: https://platform.minimaxi.com/user-center/basic-information/interface-key
    "VoiceType": "audiobook_female_1", // String. For voice selection, you can refer to MiniMax documentation.
    "Speed": 1.2 // Number. Value range: [0.5, 2]. The default value is 1.0.
}
For request frequency limits, refer to Rate Limits. Hitting these limits may delay responses.
| API Name | T2A v2 (speech generation) | T2A Pro (speech generation) | T2A (speech generation) | T2A Stream (streaming speech generation) | T2A Stream (streaming speech generation) |
| --- | --- | --- | --- | --- | --- |
| Model | speech-01-turbo, speech-01-240228, speech-01-turbo-240228 | speech-01, speech-02 | speech-01, speech-02 | speech-01 | speech-01 |
| Customer Type \ Limit Type | RPM | RPM | RPM | RPM | CONN (maximum number of parallel tasks) |
| Free Users | 3 | 3 | 3 | 3 | 1 |
| Paying Users | 20 | 20 | 20 | 20 | 3 |
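If your own service also calls the MiniMax TTS API directly, a client-side limiter can keep you within the RPM values above. The following is a minimal Python sketch of a sliding-window limiter. The class name RpmLimiter is hypothetical, and the approach is one common technique, not something prescribed by MiniMax or TRTC.

```python
import time
from collections import deque


class RpmLimiter:
    """Sliding-window limiter: at most `rpm` calls in any 60-second window."""

    def __init__(self, rpm, clock=time.monotonic):
        self.rpm = rpm
        self.clock = clock    # injectable so the limiter can be tested
        self.calls = deque()  # timestamps of calls inside the current window

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop calls that have left the 60-second window.
        while self.calls and now - self.calls[0] >= 60:
            self.calls.popleft()
        if len(self.calls) < self.rpm:
            return 0.0
        return 60 - (now - self.calls[0])

    def record(self):
        """Register that a call was just made."""
        self.calls.append(self.clock())


limiter = RpmLimiter(rpm=20)  # paying-user RPM from the table above
delay = limiter.wait_time()
if delay > 0:
    time.sleep(delay)
limiter.record()
# ... issue the TTS request here ...
```

The CONN limit (maximum number of parallel tasks) is a concurrency limit, not a rate, and would need a separate mechanism such as a semaphore.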
Azure TTS:
{
    "TTSType": "azure", // Required: string, TTS type
    "SubscriptionKey": "xxxxxxxx", // Required: string, subscription key
    "Region": "chinanorth3", // Required: string, subscription region
    "VoiceName": "zh-CN-XiaoxiaoNeural", // Required: string, voice name
    "Language": "zh-CN", // Required: string, language for TTS
    "Rate": 1 // Optional: float, playback speed. Value range: 0.5-2, with a default value of 1
}

Query AI Conversation Tasks

Stop AI Conversation Tasks

Control AI Conversation Tasks

3. Receiving AI Conversations and AI Status

Use the RTC Engine SDK's custom message feature (see Sending and Receiving Messages) to listen for callbacks on the client and receive real-time captions and AI status data. cmdID is fixed as 1.

Receiving Real-Time Captions

Message Format
{
    "type": 10000, // 10000 indicates real-time captions.
    "sender": "user_a", // The userID of the speaker
    "receiver": [], // The list of receiver userIDs. This message is actually broadcast within a room.
    "payload": {
        "text": "", // Text from STT
        "translation_text": "", // Translated text
        "start_time": "00:00:01", // Start time of this sentence
        "end_time": "00:00:02", // End time of this sentence
        "roundid": "xxxxx", // Unique ID for a round of conversation
        "end": true // If it is true, it indicates that this is a complete sentence.
    }
}
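A caption display typically shows in-progress text and commits it once the sentence is complete. The Python sketch below illustrates that handling for the message format above. The class name CaptionTracker is hypothetical, and treating messages with end set to false as interim text that later messages replace is an assumption drawn from the description of the end field.

```python
import json


class CaptionTracker:
    """Keeps the latest caption text per (sender, roundid)."""

    def __init__(self):
        self.partial = {}   # (sender, roundid) -> latest in-progress text
        self.finished = []  # (sender, text) for completed sentences

    def handle(self, raw):
        """Process one custom message; returns the caption text, or None if it is not a caption."""
        msg = json.loads(raw)
        if msg.get("type") != 10000:
            return None
        payload = msg["payload"]
        key = (msg["sender"], payload["roundid"])
        if payload["end"]:
            # Complete sentence: drop the in-progress entry and commit the text.
            self.partial.pop(key, None)
            self.finished.append((msg["sender"], payload["text"]))
        else:
            self.partial[key] = payload["text"]
        return payload["text"]


tracker = CaptionTracker()
tracker.handle(json.dumps({
    "type": 10000, "sender": "user_a", "receiver": [],
    "payload": {"text": "hello", "translation_text": "",
                "start_time": "00:00:01", "end_time": "00:00:02",
                "roundid": "r1", "end": True},
}))
print(tracker.finished)
```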

Receiving Robot Status

Message Format
{
    "type": 10001, // Robot status
    "sender": "user_a", // The userID of the sender. It is the ID of the robot here.
    "receiver": [], // The list of receiver userIDs. This message is actually broadcast within a room.
    "payload": {
        "roundid": "xxx", // Unique ID for a round of conversation
        "timestamp": 123,
        "state": 1 // 1: listening, 2: thinking, 3: speaking, and 4: interrupted
    }
}
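The numeric state values are easier to work with once mapped to names. The following Python sketch shows one way to do that. The names RobotState and parse_robot_state are hypothetical, and the four states are taken from the comment in the message format above.

```python
import json
from enum import IntEnum


class RobotState(IntEnum):
    LISTENING = 1
    THINKING = 2
    SPEAKING = 3
    INTERRUPTED = 4


def parse_robot_state(raw):
    """Return (roundid, RobotState) for a type 10001 message, else None."""
    msg = json.loads(raw)
    if msg.get("type") != 10001:
        return None
    payload = msg["payload"]
    # RobotState(...) raises ValueError for a state value not listed above.
    return payload["roundid"], RobotState(payload["state"])


status = parse_robot_state(json.dumps({
    "type": 10001, "sender": "user_bot", "receiver": [],
    "payload": {"roundid": "r1", "timestamp": 123, "state": 3},
}))
print(status[1].name)
```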


Sample Code

Android
// Requires android.util.Log, org.json.JSONObject, org.json.JSONException, and java.nio.charset.StandardCharsets.
@Override
public void onRecvCustomCmdMsg(String userId, int cmdID, int seq, byte[] message) {
    // Captions and AI status arrive with cmdID 1; ignore other custom messages.
    if (cmdID != 1) {
        return;
    }
    String data = new String(message, StandardCharsets.UTF_8);
    try {
        JSONObject jsonData = new JSONObject(data);
        Log.i(TAG, String.format("receive custom msg from %s cmdId: %d seq: %d data: %s", userId, cmdID, seq, data));
        // Dispatch on jsonData.getInt("type") here: 10000 for captions, 10001 for robot status.
    } catch (JSONException e) {
        // A malformed message should not crash the app; log it and continue.
        Log.e(TAG, "onRecvCustomCmdMsg err", e);
    }
}
iOS
func onRecvCustomCmdMsgUserId(_ userId: String, cmdID: Int, seq: UInt32, message: Data) {
    if cmdID == 1 {
        do {
            if let jsonObject = try JSONSerialization.jsonObject(with: message, options: []) as? [String: Any] {
                print("Dictionary: \(jsonObject)")
                // handleMessage(jsonObject)
            } else {
                print("The data is not a dictionary.")
            }
        } catch {
            print("Error parsing JSON: \(error)")
        }
    }
}

4. Sending Custom Messages

Custom messages are sent from the client as TRTC custom messages, with cmdID fixed as 2.
By sending custom text, you can skip the STT process and communicate directly with the AI service.
{
    "type": 20000, // Custom text messages are sent on the client side.
    "sender": "user_a", // The userID of the sender. The server will check whether this userID is valid.
    "receiver": ["user_bot"], // The list of receiver userIDs. Only the bot userID needs to be entered. The server will check whether this userID is valid.
    "payload": {
        "id": "uuid", // Message ID, which can be a UUID. It is used for debugging purposes.
        "message": "xxx", // Message content
        "timestamp": 123 // Timestamp. It is used for debugging purposes.
    }
}
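The custom text message is plain JSON serialized to bytes before it is handed to the SDK's custom message call. The Python sketch below only illustrates how the payload is laid out. The function name build_text_message is hypothetical, and the timestamp unit (milliseconds) is an assumption, since the format above does not state one.

```python
import json
import time
import uuid


def build_text_message(sender, bot_user_id, text):
    """Serialize a type 20000 custom text message to UTF-8 bytes."""
    message = {
        "type": 20000,
        "sender": sender,
        "receiver": [bot_user_id],  # only the bot userID is needed
        "payload": {
            "id": str(uuid.uuid4()),               # used for debugging
            "message": text,
            "timestamp": int(time.time() * 1000),  # unit is an assumption
        },
    }
    return json.dumps(message, ensure_ascii=False).encode("utf-8")


data = build_text_message("user_a", "user_bot", "What is the weather today?")
print(data.decode("utf-8"))
```

On the client, these bytes correspond to the data argument that is sent with cmdID 2.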

You can interrupt the AI by sending an interruption signal.
{
    "type": 20001, // An interruption signal is sent on the client side.
    "sender": "user_a", // The userID of the sender. The server will check whether this userID is valid.
    "receiver": ["user_bot"], // The list of receiver userIDs. Only the bot userID needs to be entered. The server will check whether this userID is valid.
    "payload": {
        "id": "uuid", // Message ID, which can be a UUID. It is used for debugging purposes.
        "timestamp": 123 // Timestamp. It is used for debugging purposes.
    }
}

Sample Code

Android
// Requires android.util.Log, org.json.JSONArray, org.json.JSONObject, org.json.JSONException, and java.nio.charset.StandardCharsets.
public void sendInterruptCode() {
    try {
        int cmdID = 0x2;

        // Timestamp in seconds, sent as a number to match the message format above.
        long timeStamp = System.currentTimeMillis() / 1000;
        JSONObject payLoadContent = new JSONObject();
        payLoadContent.put("timestamp", timeStamp);
        payLoadContent.put("id", String.valueOf(GenerateTestUserSig.SDKAPPID) + "_" + mRoomId);

        String[] receivers = new String[]{robotUserId};

        JSONObject interruptContent = new JSONObject();
        interruptContent.put("type", AICustomMsgType.AICustomMsgType_Send_Interrupt_CMD);
        interruptContent.put("sender", mUserId);
        interruptContent.put("receiver", new JSONArray(receivers));
        interruptContent.put("payload", payLoadContent);

        String interruptString = interruptContent.toString();
        // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException.
        byte[] data = interruptString.getBytes(StandardCharsets.UTF_8);

        Log.i(TAG, "sendInterruptCode :" + interruptString);

        mTRTCCloud.sendCustomCmdMsg(cmdID, data, true, true);
    } catch (JSONException e) {
        throw new RuntimeException(e);
    }
}

iOS
@objc func interruptAi() {
    print("interruptAi")
    let cmdId = 0x2
    let timestamp = Int(Date().timeIntervalSince1970 * 1000)
    let payload = [
        "id": userId + "_\(roomId)" + "_\(timestamp)", // Message ID, which can be a UUID. It is used for debugging purposes.
        "timestamp": timestamp // Timestamp. It is used for debugging purposes.
    ] as [String : Any]
    let dict = [
        "type": 20001,
        "sender": userId,
        "receiver": [botId],
        "payload": payload
    ] as [String : Any]
    do {
        let jsonData = try JSONSerialization.data(withJSONObject: dict, options: [])
        self.trtcCloud.sendCustomCmdMsg(cmdId, data: jsonData, reliable: true, ordered: true)
    } catch {
        print("Error serializing dictionary to JSON: \(error)")
    }
}

5. Stopping the AI Conversation and Exiting the TRTC Room

1. Stop AI conversation tasks: StopAIConversation
As with starting a task, it is recommended that the business backend make this call. The client only calls the API exposed by the business backend to stop the AI conversation.
2. For how to exit a TRTC room, refer to:

6. Using Advanced Features

1. Integrating the Callback API

Note:
The callback URL needs to be manually configured during the testing phase. Please contact our development team.
| Field Name | Event Type | Meaning |
| --- | --- | --- |
| EVENT_TYPE_AI_SERVICE_START | 901 | AI task start, for which the start API is called. It is generated when the task is initiated. |
| EVENT_TYPE_AI_SERVICE_STOP | 902 | AI task stop, for which the stop API is called. It is generated when the task ends. |
| EVENT_TYPE_AI_SERVICE_MSG | 903 | Callback after a complete sentence is recognized, and also when the LLM generates a complete response |
| EVENT_TYPE_AI_SERVICE_START_OF_SPEECH | 904 | Callback when it is recognized that a user starts speaking |
Note:
It can be used with TRTC Event Callbacks to enrich features.
Starting Event 901
{
    "EventGroupId": 9, // Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
    "EventType": 901, // Event type, detailed below
    "CallbackTs": 1687770730166, // Unix timestamp in milliseconds when the event callback server sends a request to your server
    "EventInfo": {
        "EventMsTs": 1622186275757, // Event trigger timestamp in milliseconds
        "TaskId": "xx", // Task ID
        "RoomId": "1234",
        "RoomIdType": 0, // 0 represents a numeric room ID, and 1 represents a string room ID.
        "Payload": {
            "Status": 0
        }
    }
}
| Field | Type | Meaning |
| --- | --- | --- |
| Status | Number | 0: The AI task started successfully. |
| | | 1: The AI task failed to start. |
Stopping Event 902
{
    "EventGroupId": 9, // Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
    "EventType": 902, // Event type, detailed below
    "CallbackTs": 1687770730166, // Unix timestamp in milliseconds when the event callback server sends a request to your server
    "EventInfo": {
        "EventMsTs": 1622186275757, // Event trigger timestamp in milliseconds
        "TaskId": "xx", // Task ID
        "RoomId": "1234",
        "RoomIdType": 0, // 0 represents a numeric room ID, and 1 represents a string room ID.
        "Payload": {
            "LeaveCode": 0
        }
    }
}

| Field | Type | Meaning |
| --- | --- | --- |
| LeaveCode | Number | 0: The task exits after the stop API is normally called. |
| | | 1: The task exits after the business kicks out the transcription robot. |
| | | 2: The task exits after the business dissolves the room. |
| | | 3: The TRTC server kicks out the robot. |
| | | 4: The TRTC server dissolves the room. |
| | | 98: Internal error. The business is advised to retry. |
| | | 99: The task exits after a specified time if there are no other user streams in the room except the transcription robot. |
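A backend that receives the 902 callback usually needs to decide whether the stop was expected. The Python sketch below maps the LeaveCode values listed above to short descriptions and flags the one code for which a retry is advised. The names LEAVE_CODES, describe_leave, and should_retry are hypothetical.

```python
LEAVE_CODES = {
    0: "stop API called normally",
    1: "business kicked out the transcription robot",
    2: "business dissolved the room",
    3: "TRTC server kicked out the robot",
    4: "TRTC server dissolved the room",
    98: "internal error",
    99: "no other user streams in the room",
}


def should_retry(leave_code):
    """Only code 98 (internal error) is documented as worth retrying."""
    return leave_code == 98


def describe_leave(event):
    """Return a readable reason for a 902 stop event."""
    code = event["EventInfo"]["Payload"]["LeaveCode"]
    return LEAVE_CODES.get(code, f"unknown leave code {code}")


stop_event = {
    "EventGroupId": 9,
    "EventType": 902,
    "EventInfo": {"TaskId": "xx", "Payload": {"LeaveCode": 98}},
}
print(describe_leave(stop_event), should_retry(98))
```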
Callback After a Complete Sentence Is Recognized 903
{
    "EventGroupId": 9, // Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
    "EventType": 903, // Event type, detailed below
    "CallbackTs": 1687770730166, // Unix timestamp in milliseconds when the event callback server sends a request to your server
    "EventInfo": {
        "EventMsTs": 1622186275757, // Event trigger timestamp in milliseconds
        "TaskId": "xx", // Task ID
        "RoomId": "1234",
        "RoomIdType": 0, // 0 represents a numeric room ID, and 1 represents a string room ID.
        "Payload": {
            "type": "subtitle", // subtitle represents a caption message, and transcription represents a transcription message.
            "userid": "xxx", // The user corresponding to the message
            "text": "xxxx", // Source language text
            "translation_text": "xxx", // Translated text. It is an empty string if there is no translation.
            "start_time": "00:30:00", // Start time
            "end_time": "00:30:02", // End time
            "roundid": "xxxxx", // Unique ID for a round of conversation
            "start_ms_ts": 123245678, // Start timestamp in milliseconds
            "end_ms_ts": 123245678 // End timestamp in milliseconds (for STT, the end of recognition; for the LLM, the end of the reply)
        }
    }
}

Callback When a User Starts Speaking 904
{
    "EventGroupId": 9, // Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
    "EventType": 904, // Event type. The first character begins to be recognized.
    "CallbackTs": 1687770730166, // Unix timestamp in milliseconds when the event callback server sends a request to your server
    "EventInfo": {
        "EventMsTs": 1622186275757, // Event trigger timestamp in milliseconds
        "TaskId": "xx", // Task ID
        "RoomId": "1234",
        "RoomIdType": 0, // 0 represents a numeric room ID, and 1 represents a string room ID.
        "Payload": {
            "userid": "xxx",
            "start_time": "00:30:00", // Start time
            "roundid": "xxxxx" // Unique ID for a round of conversation
        }
    }
}
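The four callback events share one envelope, so a single handler on your callback server can dispatch on EventType. The following Python sketch shows one possible shape for that handler. The function name handle_callback and the summary strings are hypothetical, and the sketch omits request verification and the HTTP layer.

```python
import json

EVENT_GROUP_AI_SERVICE = 9


def handle_callback(body):
    """Parse one callback request body and return a short summary string."""
    event = json.loads(body)
    if event.get("EventGroupId") != EVENT_GROUP_AI_SERVICE:
        return "ignored"  # not an AI service event
    event_type = event["EventType"]
    info = event["EventInfo"]
    payload = info["Payload"]
    task_id = info["TaskId"]
    if event_type == 901:
        ok = payload["Status"] == 0
        return f"task {task_id} " + ("started" if ok else "failed to start")
    if event_type == 902:
        return f"task {task_id} stopped with LeaveCode {payload['LeaveCode']}"
    if event_type == 903:
        return f"task {task_id} sentence from {payload['userid']}: {payload['text']}"
    if event_type == 904:
        return f"task {task_id} user {payload['userid']} started speaking"
    return f"task {task_id} unhandled event type {event_type}"


sample_body = json.dumps({
    "EventGroupId": 9,
    "EventType": 904,
    "CallbackTs": 1687770730166,
    "EventInfo": {
        "EventMsTs": 1622186275757,
        "TaskId": "xx",
        "RoomId": "1234",
        "RoomIdType": 0,
        "Payload": {"userid": "user_a", "start_time": "00:30:00", "roundid": "r1"},
    },
})
print(handle_callback(sample_body))
```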

2. Control and Update of AI Conversation:

3. Introduction to Other Advanced Features

Feature
Directions
Intelligent interruption
Context management
AI conversation monitoring
Real-time captions
Function call