Integration Guide
This document describes how to implement real-time AI conversation on top of Tencent RTC.
Overview
The solution uses the RTC SDK to access TRTC services and delivers real-time AI conversation with ultra-low latency through the AI real-time conversation API. The integration scheme is flexible: you can plug in a third-party large language model (LLM) and text-to-speech (TTS) service according to your business needs. The solution also includes technical optimizations for real-time voice noise reduction, intelligent AI interruption, and context management to improve the user experience.
Solution Architecture Diagram
Business Flowchart
Integration Guide
Prerequisites
Note:
2. Create Tencent Cloud TTS (a third-party product can be used).
3. Create an LLM application. You can register with any suitable large model provider; providers generally offer free tokens.
1. Integrating the RTC Engine SDK
Note:
You can call startLocalAudio to enable mic capture. You need to specify the quality parameter to set the capture mode. Although the parameter is called quality, a higher value is not always better; each business scenario calls for the most suitable setting (a more accurate name for this parameter would be scene). The SPEECH mode is recommended for AI conversation scenarios. In this mode, the SDK's audio module focuses on extracting speech signals and filters out as much ambient noise as possible. Audio data in this mode is also more resilient to poor network conditions, which makes the mode particularly suitable for video calls and online meetings that emphasize voice communication.
// Enable mic capture and set `quality` to `SPEECH` (it has high noise suppression and strong resistance to poor network conditions)
mCloud.startLocalAudio(TRTCCloudDef.TRTC_AUDIO_QUALITY_SPEECH);
// Enable mic capture and set the current scene to speech mode
// (high noise suppression and strong resistance to poor network conditions)
AppDelegate *appDelegate = (AppDelegate *)[[UIApplication sharedApplication] delegate];
[appDelegate.trtcCloud startLocalAudio:TRTCAudioQualitySpeech];
2. Initiating an AI Conversation
Start AI Conversation Tasks
It is recommended that the following APIs be called by the business backend. The client only calls the APIs provided by the business backend to initiate an AI conversation.
TRTC provides the following TencentCloud APIs for initiating and managing conversation tasks:
Currently supported TTS and LLM model call methods:
LLM Interaction
OpenAI:
"LLMConfig": {
  "LLMType": "openai",
  "Model": "gpt-4o",
  "APIKey": "api-key",
  "APIUrl": "https://api.openai.com/v1/chat/completions",
  "Streaming": true,
  "SystemPrompt": "You are a personal assistant",
  "Timeout": 3.0,
  "History": 5 // Up to 50 rounds of conversations are supported, and the default is 0.
}
MiniMax:
"LLMConfig": {
  "APIKey": "eyJhbGcixxxx",
  "LLMType": "minimax",
  "Model": "abab6.5s-chat",
  "Streaming": true,
  "SystemPrompt": "You are a personal assistant",
  "APIUrl": "https://api.minimax.chat/v1/text/chatcompletion_v2",
  "History": 5 // Up to 50 rounds of conversations are supported.
}
Hunyuan:
"LLMConfig": {
  "LLMType": "openai",
  "Model": "hunyuan-standard", // Valid values: hunyuan-turbo, hunyuan-standard
  "APIKey": "hunyuan-apikey",
  "APIUrl": "https://hunyuan.cloud.tencent.com/openai/v1/chat/completions",
  "Streaming": true,
  "History": 10
}
Additionally, TRTC adds the following parameters to the HTTP headers of each LLM request so that your service can support more complex logic:
X-Task-Id: <task_id_value> // ID of this task.
X-Rquest-Id: <request_id> // ID of this request. The same requestId is carried in case of a retry.
X-Sdk-App-Id: SdkAppId
X-User-Id: UserId
X-Room-Id: RoomId
X-Room-Id-Type: "0" // "0" represents a numeric room ID, and "1" represents a string room ID.
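If you front the LLM with your own service, the headers above can be used to recognize retries and to associate a request with a room. The following is a minimal sketch, not an official implementation: TrtcRequestContext and RetryDetector are hypothetical names, the headers are assumed to arrive as a plain Map, and the fallback to X-Request-Id is an assumption in case the header is spelled that way on the wire.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical holder for the context headers listed in this guide; not part of any TRTC SDK. */
final class TrtcRequestContext {
    final String taskId;
    final String requestId;
    final String roomId;
    final boolean stringRoomId;

    private TrtcRequestContext(String taskId, String requestId, String roomId, boolean stringRoomId) {
        this.taskId = taskId;
        this.requestId = requestId;
        this.roomId = roomId;
        this.stringRoomId = stringRoomId;
    }

    static TrtcRequestContext fromHeaders(Map<String, String> headers) {
        // HTTP header names are case-insensitive, so copy them into a case-insensitive map.
        Map<String, String> h = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        h.putAll(headers);
        // Header name as printed in the guide.
        String requestId = h.get("X-Rquest-Id");
        if (requestId == null) {
            requestId = h.get("X-Request-Id"); // assumption: alternative spelling
        }
        // The guide shows the value quoted ("0" or "1"), so strip quotes before comparing.
        String roomIdType = h.getOrDefault("X-Room-Id-Type", "0").replace("\"", "").trim();
        return new TrtcRequestContext(h.get("X-Task-Id"), requestId, h.get("X-Room-Id"),
                "1".equals(roomIdType));
    }
}

/** Remembers request IDs so that a retried request can be recognized. */
final class RetryDetector {
    // Unbounded for brevity; a real service would expire old entries.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true if this request ID has been seen before. */
    boolean isRetry(String requestId) {
        if (requestId == null) {
            return false;
        }
        return !seen.add(requestId);
    }
}
```

A handler could call TrtcRequestContext.fromHeaders(...) on each incoming request and skip duplicate side effects (such as logging or billing) when isRetry(ctx.requestId) returns true.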
TTS Interaction
The TTS parameters use your own account with the TTS provider.
Custom TTS
{
  "TTSType": "custom", // Required: string
  "APIKey": "ApiKey", // Required: string, for authentication
  "APIUrl": "http://0.0.0.0:8080/stream-audio", // Required: string, TTS API URL
  "AudioFormat": "wav", // Optional: string, desired audio format, such as mp3, ogg_opus, pcm, or wav, with a default value of wav. Currently, only pcm and wav are supported.
  "SampleRate": 16000, // Optional: integer, audio sample rate. The default value is 16000 (16k), and the recommended value is 16000.
  "AudioChannel": 1 // Optional: integer, number of audio channels. Value: 1 or 2, with a default value of 1
}
Tencent TTS:
{
  "TTSType": "tencent", // String, TTS type. "tencent" and "minimax" are currently supported, and support for other vendors is being added.
  "AppId": "Your application ID", // Required: string
  "SecretId": "Your key ID", // Required: string
  "SecretKey": "Your key", // Required: string
  "VoiceType": 101001, // Required: integer, voice ID, including standard voices and premium voices. Premium voices have higher realism and are priced differently from standard voices. Refer to TTS Pricing Overview. For a complete list of voice IDs, refer to the TTS Voice List.
  "Speed": 1.25, // Optional: number, playback speed. Value range: [-2, 6], corresponding to different speeds: -2: 0.6x, -1: 0.8x, 0: 1.0x (default), 1: 1.2x, 2: 1.5x, and 6: 2.5x. If you need more precise speeds, you can retain up to 2 decimal places, for example, 0.5, 1.25, or 2.81. For the conversion between parameter values and actual speeds, refer to Speed Conversion.
  "Volume": 5, // Optional: integer, volume level. Value range: [0, 10], corresponding to 11 levels of volume. The default value is 0, representing normal volume.
  "PrimaryLanguage": 1, // Optional: integer, primary language. Valid values: 1 - Chinese (default), 2 - English, and 3 - Japanese
  "FastVoiceType": "xxxx" // Optional: parameter for fast Voice Reproduce (VRS)
}
MiniMax TTS
{
  "TTSType": "minimax", // String, TTS type, fixed as "minimax"
  "Model": "speech-01-turbo-240228", // String, model used. Valid values: speech-01-turbo, speech-01-turbo-240228, and speech-01-240228
  "ApiUrl": "https://api.minimax.chat/v1/t2a_v2",
  "GroupId": "181000000000000", // String, to be obtained from the MiniMax management backend: https://platform.minimaxi.com/user-center/basic-information
  "ApiKey": "eyxxxx", // String, to be obtained from the MiniMax management backend: https://platform.minimaxi.com/user-center/basic-information/interface-key
  "VoiceType": "audiobook_female_1", // String. For voice selection, you can refer to MiniMax documentation.
  "Speed": 1.2 // Number. Value range: [0.5, 2]. The default value is 1.0.
}
Refer to: T2A v2 (Speech Generation)
API Name | T2A v2 (speech generation) | T2A Pro (speech generation) | T2A (speech generation) | T2A Stream (streaming speech generation) | T2A Stream (streaming speech generation) |
Model | speech-01-turbo, speech-01-240228, speech-01-turbo-240228 | speech-01, speech-02 | speech-01, speech-02 | speech-01 | speech-01 |
Customer Type\Limit Type | RPM | RPM | RPM | RPM | CONN (maximum number of parallel tasks) |
Free Users | 3 | 3 | 3 | 3 | 1 |
Paying Users | 20 | 20 | 20 | 20 | 3 |
Azure TTS
{
  "TTSType": "azure", // Required: string, TTS type
  "SubscriptionKey": "xxxxxxxx", // Required: string, subscription key
  "Region": "chinanorth3", // Required: string, subscription region
  "VoiceName": "zh-CN-XiaoxiaoNeural", // Required: string, voice name
  "Language": "zh-CN", // Required: string, language for TTS
  "Rate": 1 // Optional: float, playback speed. Value range: 0.5-2, with a default value of 1
}
Refer to: Using SSML to Customize Voice and Sound
Query AI Conversation Tasks
Stop AI Conversation Tasks
Control AI Conversation Tasks
3. Receiving AI Conversations and AI Status
Use the RTC Engine SDK's Sending and Receiving Messages feature to listen for callbacks on the client and receive real-time captions and AI status data. The cmdID is fixed as 1.
Receiving Real-Time Captions
Message Format
{
  "type": 10000, // 10000 indicates real-time captions.
  "sender": "user_a", // The userID of the speaker
  "receiver": [], // The list of receiver userIDs. This message is actually broadcast within a room.
  "payload": {
    "text": "", // Text from STT
    "translation_text": "", // Translated text
    "start_time": "00:00:01", // Start time of this sentence
    "end_time": "00:00:02", // End time of this sentence
    "roundid": "xxxxx", // Unique ID for a round of conversation
    "end": true // If it is true, it indicates that this is a complete sentence.
  }
}
Receiving Robot Status
Message Format
{
  "type": 10001, // Robot status
  "sender": "user_a", // The userID of the sender. It is the ID of the robot here.
  "receiver": [], // The list of receiver userIDs. This message is actually broadcast within a room.
  "payload": {
    "roundid": "xxx", // Unique ID for a round of conversation
    "timestamp": 123,
    "state": 1 // 1: listening, 2: thinking, 3: speaking, and 4: interrupted
  }
}
Sample Code
@Override
public void onRecvCustomCmdMsg(String userId, int cmdID, int seq, byte[] message) {
    String data = new String(message, StandardCharsets.UTF_8);
    try {
        JSONObject jsonData = new JSONObject(data);
        Log.i(TAG, String.format("receive custom msg from %s cmdId: %d seq: %d data: %s", userId, cmdID, seq, data));
    } catch (JSONException e) {
        Log.e(TAG, "onRecvCustomCmdMsg err");
        throw new RuntimeException(e);
    }
}
func onRecvCustomCmdMsgUserId(_ userId: String, cmdID: Int, seq: UInt32, message: Data) {
    if cmdID == 1 {
        do {
            if let jsonObject = try JSONSerialization.jsonObject(with: message, options: []) as? [String: Any] {
                print("Dictionary: \(jsonObject)")
                // handleMessage(jsonObject)
            } else {
                print("The data is not a dictionary.")
            }
        } catch {
            print("Error parsing JSON: \(error)")
        }
    }
}
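Once a message has been received, it still needs to be routed by its type field (10000 for captions, 10001 for robot status) and, for status messages, mapped from the numeric state to a meaningful value. The following is a minimal sketch under stated assumptions: AiMessageDispatcher, RobotState, and intField are hypothetical names, and the regex-based field lookup is a simplification that keeps the example free of dependencies; production code should use a real JSON parser such as the JSONObject class in the sample above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical dispatcher for messages received with cmdID 1; not part of the TRTC SDK. */
final class AiMessageDispatcher {
    static final int CMD_ID_AI = 1;           // cmdID used for captions and AI status
    static final int TYPE_CAPTION = 10000;    // real-time captions
    static final int TYPE_ROBOT_STATE = 10001; // robot status

    /** Robot states as documented: 1 listening, 2 thinking, 3 speaking, 4 interrupted. */
    enum RobotState {
        LISTENING, THINKING, SPEAKING, INTERRUPTED, UNKNOWN;

        static RobotState fromCode(int code) {
            switch (code) {
                case 1: return LISTENING;
                case 2: return THINKING;
                case 3: return SPEAKING;
                case 4: return INTERRUPTED;
                default: return UNKNOWN;
            }
        }
    }

    /** Simplified lookup of an integer field by name; returns null when the field is absent. */
    static Integer intField(String json, String name) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*(-?\\d+)").matcher(json);
        return m.find() ? Integer.valueOf(m.group(1)) : null;
    }

    /** Returns "caption", "state:<STATE>", or "ignored" for anything else. */
    static String describe(int cmdID, String json) {
        if (cmdID != CMD_ID_AI) {
            return "ignored";
        }
        Integer type = intField(json, "type");
        if (type == null) {
            return "ignored";
        }
        if (type == TYPE_CAPTION) {
            return "caption";
        }
        if (type == TYPE_ROBOT_STATE) {
            Integer state = intField(json, "state");
            return "state:" + RobotState.fromCode(state == null ? -1 : state);
        }
        return "ignored";
    }
}
```

In an app, the returned value would typically drive the UI, for example showing a "thinking" indicator while the state is THINKING.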
4. Sending Custom Messages
Custom messages are sent from the client through TRTC custom messages. The cmdID is fixed as 2.
By sending custom text, you can skip the STT process and communicate directly with the AI service.
{
  "type": 20000, // Custom text messages are sent on the client side.
  "sender": "user_a", // The userID of the sender. The server will check whether this userID is valid.
  "receiver": ["user_bot"], // The list of receiver userIDs. Only the bot userID needs to be entered. The server will check whether this userID is valid.
  "payload": {
    "id": "uuid", // Message ID, which can be a UUID. It is used for debugging purposes.
    "message": "xxx", // Message content
    "timestamp": 123 // Timestamp. It is used for debugging purposes.
  }
}
You can interrupt the AI by sending an interruption signal.
{
  "type": 20001, // An interruption signal is sent on the client side.
  "sender": "user_a", // The userID of the sender. The server will check whether this userID is valid.
  "receiver": ["user_bot"], // The list of receiver userIDs. Only the bot userID needs to be entered. The server will check whether this userID is valid.
  "payload": {
    "id": "uuid", // Message ID, which can be a UUID. It is used for debugging purposes.
    "timestamp": 123 // Timestamp. It is used for debugging purposes.
  }
}
Sample Code
public void sendInterruptCode() {
    try {
        int cmdID = 0x2;
        long time = System.currentTimeMillis();
        String timeStamp = String.valueOf(time/1000);
        JSONObject payLoadContent = new JSONObject();
        payLoadContent.put("timestamp", timeStamp);
        payLoadContent.put("id", String.valueOf(GenerateTestUserSig.SDKAPPID) + "_" + mRoomId);
        String[] receivers = new String[]{robotUserId};
        JSONObject interruptContent = new JSONObject();
        interruptContent.put("type", AICustomMsgType.AICustomMsgType_Send_Interrupt_CMD);
        interruptContent.put("sender", mUserId);
        interruptContent.put("receiver", new JSONArray(receivers));
        interruptContent.put("payload", payLoadContent);
        String interruptString = interruptContent.toString();
        byte[] data = interruptString.getBytes("UTF-8");
        Log.i(TAG, "sendInterruptCode :" + interruptString);
        mTRTCCloud.sendCustomCmdMsg(cmdID, data, true, true);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    } catch (JSONException e) {
        throw new RuntimeException(e);
    }
}
@objc func interruptAi() {
    print("interruptAi")
    let cmdId = 0x2
    let timestamp = Int(Date().timeIntervalSince1970 * 1000)
    let payload = [
        "id": userId + "_\(roomId)" + "_\(timestamp)", // Message ID, which can be a UUID. It is used for debugging purposes.
        "timestamp": timestamp // Timestamp. It is used for debugging purposes.
    ] as [String : Any]
    let dict = [
        "type": 20001,
        "sender": userId,
        "receiver": [botId],
        "payload": payload
    ] as [String : Any]
    do {
        let jsonData = try JSONSerialization.data(withJSONObject: dict, options: [])
        self.trtcCloud.sendCustomCmdMsg(cmdId, data: jsonData, reliable: true, ordered: true)
    } catch {
        print("Error serializing dictionary to JSON: \(error)")
    }
}
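The sample code above covers only the interruption signal (type 20001). A custom text message (type 20000) can be assembled the same way. The following is a minimal sketch, assuming the message format shown earlier in this section: CustomTextMessage and its methods are hypothetical names, the JSON is built by hand to keep the example dependency-free, and the timestamp is given in milliseconds purely as a debugging aid.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical builder for the type 20000 custom text message; not part of the TRTC SDK. */
final class CustomTextMessage {
    static final int TYPE_CUSTOM_TEXT = 20000;

    /** Wraps a value in quotes and escapes characters that are not allowed raw in JSON strings. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Builds the JSON document with the fields listed in the message format. */
    static String toJson(String sender, String botUserId, String text, String id, long timestampMs) {
        return "{\"type\":" + TYPE_CUSTOM_TEXT + ","
                + "\"sender\":" + quote(sender) + ","
                + "\"receiver\":[" + quote(botUserId) + "],"
                + "\"payload\":{"
                + "\"id\":" + quote(id) + ","
                + "\"message\":" + quote(text) + ","
                + "\"timestamp\":" + timestampMs
                + "}}";
    }

    /** Convenience overload that generates a UUID and uses the current time. */
    static byte[] toBytes(String sender, String botUserId, String text) {
        String json = toJson(sender, botUserId, text,
                UUID.randomUUID().toString(), System.currentTimeMillis());
        return json.getBytes(StandardCharsets.UTF_8);
    }
}
```

The resulting byte array would be passed to sendCustomCmdMsg with cmdID 2, in the same way as the interruption signal in the sample above.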
5. Stopping the AI Conversation and Exiting the TRTC Room
1. Stop AI conversation tasks: StopAIConversation
It is also recommended that this API be called by the business backend. The client only calls the APIs provided by the business backend to stop the AI conversation.
2. For how to exit a TRTC room, refer to:
6. Using Advanced Features
1. Integrating the Callback API
Note:
The callback URL needs to be manually configured during the testing phase. Please contact our development team.
Field Name | Type | Meaning |
EVENT_TYPE_AI_SERVICE_START | 901 | AI task start, for which the start API is called. It is generated when the task is initiated. |
EVENT_TYPE_AI_SERVICE_STOP | 902 | AI task stop, for which the stop API is called. It is generated when the task ends. |
EVENT_TYPE_AI_SERVICE_MSG | 903 | Callback after a complete sentence is recognized, and also when the LLM generates a complete response |
EVENT_TYPE_AI_SERVICE_START_OF_SPEECH | 904 | Callback when it is recognized that a user starts speaking |
Note:
Starting Event 901
{
  "EventGroupId": 9, // Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
  "EventType": 901, // Event type, detailed below
  "CallbackTs": 1687770730166, // Unix timestamp in milliseconds when the event callback server sends a request to your server
  "EventInfo": {
    "EventMsTs": 1622186275757, // Event trigger timestamp in milliseconds
    "TaskId": "xx", // Task ID
    "RoomId": "1234",
    "RoomIdType": 0, // 0 represents a numeric room ID, and 1 represents a string room ID.
    "Payload": {
      "Status": 0
    }
  }
}
Field | Type | Meaning |
Status | Number | 0: The AI task started successfully. 1: The AI task failed to start. |
Stopping Event 902
{
  "EventGroupId": 9, // Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
  "EventType": 902, // Event type, detailed below
  "CallbackTs": 1687770730166, // Unix timestamp in milliseconds when the event callback server sends a request to your server
  "EventInfo": {
    "EventMsTs": 1622186275757, // Event trigger timestamp in milliseconds
    "TaskId": "xx", // Task ID
    "RoomId": "1234",
    "RoomIdType": 0, // 0 represents a numeric room ID, and 1 represents a string room ID.
    "Payload": {
      "LeaveCode": 0
    }
  }
}
Field | Type | Meaning |
LeaveCode | Number | 0: The task exits after the stop API is normally called. 1: The task exits after the business kicks out the transcription robot. 2: The task exits after the business dissolves the room. 3: The TRTC server kicks out the robot. 4: The TRTC server dissolves the room. 98: Internal error. The business is advised to retry. 99: The task exits after a specified time if there are no other user streams in the room except the transcription robot. |
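A callback handler usually needs to turn the LeaveCode into a log message and decide whether to restart the task. The following is a minimal sketch based only on the table above: LeaveCodes is a hypothetical helper, and treating 98 as the only retryable code is an interpretation of "The business is advised to retry."

```java
/** Hypothetical helper that interprets the LeaveCode of a 902 callback; not part of any TRTC SDK. */
final class LeaveCodes {
    /** Returns a short description for each documented leave code. */
    static String describe(int code) {
        switch (code) {
            case 0: return "stopped through the stop API";
            case 1: return "business kicked out the transcription robot";
            case 2: return "business dissolved the room";
            case 3: return "TRTC server kicked out the robot";
            case 4: return "TRTC server dissolved the room";
            case 98: return "internal error";
            case 99: return "no other user streams in the room";
            default: return "unknown leave code " + code;
        }
    }

    /** Only the internal error code is documented as worth retrying. */
    static boolean shouldRetry(int code) {
        return code == 98;
    }
}
```

For example, a backend could start a new conversation task when shouldRetry(leaveCode) is true and simply log describe(leaveCode) otherwise.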
Callback After a Complete Sentence Is Recognized 903
{
  "EventGroupId": 9, // Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
  "EventType": 903, // Event type, detailed below
  "CallbackTs": 1687770730166, // Unix timestamp in milliseconds when the event callback server sends a request to your server
  "EventInfo": {
    "EventMsTs": 1622186275757, // Event trigger timestamp in milliseconds
    "TaskId": "xx", // Task ID
    "RoomId": "1234",
    "RoomIdType": 0, // 0 represents a numeric room ID, and 1 represents a string room ID.
    "Payload": {
      "type": "subtitle", // subtitle represents a caption message, and transcription represents a transcription message.
      "userid": "xxx", // The user corresponding to the message
      "text": "xxxx", // Source language text
      "translation_text": "xxx", // Translated text. It is an empty string if there is no translation.
      "start_time": "00:30:00", // Start time
      "end_time": "00:30:02", // End time
      "roundid": "xxxxx", // Unique ID for a round of conversation
      "start_ms_ts": 123245678, // Start timestamp in milliseconds
      "end_ms_ts": 123245678 // End timestamp in milliseconds (STT represents the end of recognition, and llm represents the end of reply.)
    }
  }
}
Callback When a User Starts Speaking 904
{
  "EventGroupId": 9, // Event group ID, fixed as 9 for AI service, EVENT_GROUP_AI_SERVICE
  "EventType": 904, // Event type. The first character begins to be recognized.
  "CallbackTs": 1687770730166, // Unix timestamp in milliseconds when the event callback server sends a request to your server
  "EventInfo": {
    "EventMsTs": 1622186275757, // Event trigger timestamp in milliseconds
    "TaskId": "xx", // Task ID
    "RoomId": "1234",
    "RoomIdType": 0, // 0 represents a numeric room ID, and 1 represents a string room ID.
    "Payload": {
      "userid": "xxx",
      "start_time": "00:30:00", // Start time
      "roundid": "xxxxx" // Unique ID for a round of conversation
    }
  }
}
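All four callbacks share the same envelope, so a callback server can first check EventGroupId and then branch on EventType. The following is a minimal sketch of that step: AiServiceEvent is a hypothetical enum whose constants mirror the event table in this section, and it assumes the two numeric fields have already been extracted from the request body by whatever JSON parser your server uses.

```java
import java.util.Optional;

/** Hypothetical mapping of the AI service callback event types; not part of any TRTC SDK. */
enum AiServiceEvent {
    START(901),           // EVENT_TYPE_AI_SERVICE_START
    STOP(902),            // EVENT_TYPE_AI_SERVICE_STOP
    MSG(903),             // EVENT_TYPE_AI_SERVICE_MSG
    START_OF_SPEECH(904); // EVENT_TYPE_AI_SERVICE_START_OF_SPEECH

    /** Event group ID, fixed as 9 for the AI service. */
    static final int EVENT_GROUP_AI_SERVICE = 9;

    final int code;

    AiServiceEvent(int code) {
        this.code = code;
    }

    /** Returns the matching event, or empty for another event group or an unknown type. */
    static Optional<AiServiceEvent> of(int eventGroupId, int eventType) {
        if (eventGroupId != EVENT_GROUP_AI_SERVICE) {
            return Optional.empty();
        }
        for (AiServiceEvent event : values()) {
            if (event.code == eventType) {
                return Optional.of(event);
            }
        }
        return Optional.empty();
    }
}
```

Returning an empty Optional for unrecognized values lets the server acknowledge and ignore events it does not handle instead of failing the request.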
2. Control and Update of AI Conversation:
3. Introduction to Other Advanced Features
Feature | Directions |
Intelligent interruption | |
Context management | |
AI conversation monitoring | |
Real-time captions | |
Function call |