
How to Build a Voice Chat Room App Like Clubhouse | Solution Architecture

Tencent RTC-Dev Team
Oct 8, 2024

Scenario Introduction

A voice chat room is a virtual space for audio-only social interaction. A room usually has several microphone positions. The host and the listeners who have joined on the microphone talk to each other, while other listeners can enter the room to listen. The number of microphones and the maximum number of listeners vary by room type. TRTC supports up to 50 people on the microphone at the same time, smooth switching on and off the microphone, and a voice chat delay of less than 300 ms. It also supports voice changing, ambient sound effects, reverberation, and other audio effects that enrich the voice chat experience. Combined with TRTC Chat, it supports public chat, private chat, group chat, likes, gifts, and other forms of message interaction.

Implementation

A complete voice chat room scenario usually involves several functional modules: room management, microphone management, audio stream management, and recording and review. The key actions and functional points of each module are shown in the following table:

| Functional module | Key actions and functional points |
| --- | --- |
| Room management | Room list, create room, join room, exit room, destroy room |
| Microphone management | Go on the microphone, bring a user onto the microphone, leave the microphone, remove a user from the microphone, mute the microphone, lock the microphone, move to another microphone |
| Audio stream management | Push-pull stream architecture solution, real-time stream subscription mode |
| Recording and review | TRTC cloud recording, Tianyu content security review |

The overall business architecture of the chat room scenario is shown in the figure below. The room owner creates a room, and users choose a room they are interested in and join it. After entering the room, users can go on the microphone to interact with the host. The voice content in the room is recorded and reviewed to meet compliance requirements.

Room Management

The room management module is mainly responsible for maintaining the room list. It includes the following functions:

  • Creating a room: After logging in to the business system, a user can create a room. The new room must then be added to the room list.
  • Joining a room: A user can join an existing room. The user must then be added to the room's member list.
  • Exiting a room: A user can exit the current room. The user must then be removed from the room's member list.
  • Destroying a room: After all users have exited, the room must be destroyed and removed from the room list.
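The four operations above can be sketched as a small in-memory registry. This is only an illustration of the bookkeeping, not TRTC's or any business system's actual API: the class name, method names, and the choice to have the owner join on creation are all assumptions.

```python
class RoomRegistry:
    """Hypothetical business-side store: a room list plus a member list per room."""

    def __init__(self):
        self.rooms = {}  # room_id -> set of user IDs currently in the room

    def create_room(self, room_id, owner_id):
        if room_id in self.rooms:
            raise ValueError(f"room {room_id} already exists")
        # Creating a room adds it to the room list.
        # Assumption: the owner joins the room as part of creating it.
        self.rooms[room_id] = {owner_id}

    def join_room(self, room_id, user_id):
        # Joining adds the user to the room's member list.
        self.rooms[room_id].add(user_id)

    def exit_room(self, room_id, user_id):
        members = self.rooms[room_id]
        members.discard(user_id)
        if not members:
            # The last user left: destroy the room and drop it from the room list.
            del self.rooms[room_id]

    def room_list(self):
        return sorted(self.rooms)
```

A production system would keep this state in a shared store rather than in process memory, but the invariants are the same: the room list and the member lists change together.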

Solution architecture

The room management architecture involves three modules:

  • Business-side room management: maintains and manages the room list, such as synchronizing the attributes and status of business rooms. Its functions include room list query, room entry and exit, and room creation and destruction.
  • Group management: handles the room member list, signaling, and message interaction, such as approving/rejecting microphone applications, bringing users on/off the microphone, muting/unmuting microphone audio, and locking/unlocking microphones. It works at the group level, including creating, joining, leaving, and destroying groups.
  • TRTC room management: handles audio stream interaction and transmission, such as sending and receiving the voice and music of hosts and listeners. It works at the room level, including entering and exiting TRTC rooms.

Specific implementation

In room management, different user roles have different functional permissions and implementation processes. There are two main roles in the voice chat room: room owner and listener. The role descriptions and their differences are detailed in the table below:

| Role | Description | Difference |
| --- | --- | --- |
| Room owner | The user with the highest authority in the room; can create or destroy the room. | The role must be host. Creates or destroys business rooms/RTC rooms. |
| Audience | Participants in the room; can also go on the microphone and become hosts. | The role can be audience or host. Enters and exits the room. |

Implementation process

Room owner

1. Get the room list.

2. Create the corresponding room through the business interface.

3. Create a room.

4. Enter the business room/RTC room and interact with others.

5. Exit the RTC room/business room.

6. Destroy the room.

Audience

1. Get the room list.

2. Enter the business room/RTC room and interact with others.

3. Exit the RTC room/business room.

Microphone management

The microphones in a voice chat room are generally ordered and limited in number. For example, listeners need the room owner's consent before they can go on the microphone, and they take microphones in order. A room generally has no more than 10 microphones. Microphone management defines the number of microphones in the room according to the business scenario and manages the status of every microphone in the current room.

The main functions of microphone management include going on the microphone, bringing a user onto the microphone, leaving the microphone, removing a user from the microphone, muting a microphone, locking a microphone, and moving between microphones.

  • After entering the room, a user can apply to go on the microphone only when a microphone is idle.
  • After the room owner approves the application, the microphone status must be changed to non-idle.
  • After the user stops pushing the stream and leaves the microphone, the microphone status must be reset.
  • The room owner has the right to lock a microphone, invite a user onto the microphone, remove a user from the microphone, mute a microphone, and so on.
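The status rules above amount to a small state model per microphone. The sketch below is one possible business-side model; the class, the state names, and the rule that only an idle microphone can be locked are assumptions made for illustration.

```python
from enum import Enum


class SeatState(Enum):
    IDLE = "idle"
    OCCUPIED = "occupied"
    LOCKED = "locked"


class SeatManager:
    """Hypothetical model of the microphone status for one room."""

    def __init__(self, seat_count):
        self.states = [SeatState.IDLE] * seat_count
        self.users = [None] * seat_count   # user ID on each microphone, or None
        self.muted = [False] * seat_count

    def has_idle_seat(self):
        # Users may apply only while at least one microphone is idle.
        return SeatState.IDLE in self.states

    def approve(self, user_id):
        """Owner approves an application: place the user on the first idle microphone."""
        if user_id in self.users:
            raise ValueError("user is already on a microphone")
        for i, state in enumerate(self.states):
            if state is SeatState.IDLE:
                self.states[i] = SeatState.OCCUPIED
                self.users[i] = user_id
                return i
        raise RuntimeError("no idle microphone")

    def leave(self, user_id):
        """User leaves or is removed from the microphone: reset its status."""
        i = self.users.index(user_id)
        self.states[i] = SeatState.IDLE
        self.users[i] = None
        self.muted[i] = False

    def lock(self, index):
        # Assumption: only an idle microphone can be locked.
        if self.states[index] is not SeatState.IDLE:
            raise RuntimeError("only an idle microphone can be locked")
        self.states[index] = SeatState.LOCKED

    def set_muted(self, index, muted):
        self.muted[index] = muted
```

Keeping all transitions in one place makes it easier to publish a single authoritative microphone list to every client after each change.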

Solution architecture

The following introduces TRTC's solution architecture for microphone management. In the room management structure, the room owner has the highest authority and can invite users onto the microphone, remove users from the microphone, mute or unmute microphone audio, and lock or unlock microphones. Listeners can also apply to go on the microphone and become hosts, interacting with other hosts in the room.

Specific implementation

In microphone management, different user roles have different functional permissions and implementation processes. There are mainly two roles: host and audience. The role descriptions and their differences are detailed in the table below:

| Role | Description | Difference |
| --- | --- | --- |
| Room owner | The user with the highest authority over the microphones, responsible for managing all of them. All microphones are automatically released after the room owner leaves the room. | Role must be host. Goes on the microphone when entering the room. Approves/rejects microphone applications. Brings users on/off the microphone. Mutes/unmutes microphone audio. Locks/unlocks microphones. |
| Audience | Participants in the room who can interact by going on and off the microphone. | Role can be audience or host. Applies to go on the microphone and leaves the microphone. |

Implementation process

Room owner

1. The host enters the room lobby and obtains the room list.

2. The host creates a room as the host and joins the room.

3. The host obtains the microphone list from the group attributes and goes on the microphone.

4. A listener goes on the microphone and can then interact with other users on the microphone. There are two ways for a listener to go on the microphone: the listener applies and the host agrees, or the host invites the listener and the listener agrees.

5. The listener leaves the microphone. There are two ways to leave: the listener leaves voluntarily, or the host removes the listener from the microphone.

6. The host exits and destroys the room (the room is disbanded, and all users are forced to leave the microphone and exit the room).

Audience

1. The listener enters the room lobby and obtains the room list.

2. The listener selects and enters the room.

3. The listener obtains the microphone list based on the group attribute.

4. The listener applies to go on the microphone. After the host agrees, the listener interacts with other users on the microphone.

5. The listener leaves the microphone and exits the room.

Audio stream management

Voice chat scenarios usually adopt an RTC stream access solution, which is simple and fast to integrate and delivers the low latency that real-time interaction needs. The figure below shows a classic push-pull stream architecture for real-time voice chat, with two roles: users on the microphone and audience members off the microphone.

For real-time streaming subscriptions in the room, TRTC has two subscription modes to choose from: automatic subscription and manual subscription.

  • Automatic subscription: After entering the room, the user immediately receives the audio and video streams in the room. Audio plays automatically and video starts decoding automatically.
  • Manual subscription: After entering the room, the user must call startRemoteView to subscribe to and decode a video stream, and call muteRemoteAudio to start audio playback.

TRTC uses the automatic subscription mode by default, which suits most scenarios. After entering the room, the user subscribes to the audio and video streams of all anchors in the room, so playback starts almost instantly (the "instant open" experience). The manual subscription mode is more flexible and customizable: users can selectively subscribe to audio and video streams.

Recording and Review

If you need to record and store media content in the cloud, or conduct real-time security review of online interactive content, the following services let you promptly handle chat rooms that violate the rules and keep the online social platform compliant.

TRTC Cloud Recording

TRTC's upgraded cloud recording does not rely on cloud live streaming and does not need to relay streams to it. It uses TRTC's internal real-time recording cluster to record audio and video, providing a more complete and unified recording experience.

 Single stream recording: With TRTC's cloud recording function, you can record the audio stream of each user in the room into a separate file.

 Mixed stream recording: Mix the audio media streams in the same room into one file.

Tianyu Content Security Audit

TRTC, together with T-Sec Tianyu, provides real-time audio and video content identification and alerting. Content audits can be started either automatically for the whole application or manually through custom tasks:

Global automatic audit

Customers specify the audit strategy and the stream types to audit. The TRTC cloud then automatically audits the audio and video content of all rooms under the application and sends violation information to the callback URL specified by the customer, with no need to start audits manually. This method is simple and saves integration work, but it is less flexible.

The figure below shows how TRTC works with the Tianyu content security audit platform. The content security service enters the designated TRTC room as a "dumb terminal", pulls the audio and video streams as an audience member, audits the pulled streams, and then sends violation information through a callback to the HTTP/HTTPS service specified by the user.

Manual custom audit

Customers call the Tianyu audio and video stream interface to detect illegal content in audio and video streams in real time. The audit service sends violation information to the callback URL specified by the customer. This method is more flexible and customizable, but the audit task must be started through a REST API call, which adds some integration complexity.

Ghost microphone processing solution

Ghost microphone, also known as fried microphone or black microphone, refers to a user who is not on the microphone but can still speak and be heard by other users. The root cause is that the microphone status on the business side is inconsistent with the user's role status in TRTC. There are several possible reasons:

  • The listener left the microphone and the microphone list was updated, but the microphone information callback did not arrive or was intercepted. The listener's client therefore did not switch to the TRTC audience role and turn off the microphone locally, so the listener can still speak while off the microphone.
  • The listener left the microphone and the microphone list was updated. The client received the microphone information callback, but the local call to switch to the TRTC audience role failed, so the listener can still speak while off the microphone.
  • The app was cracked and the UserSig was obtained by an attacker, who can then enter the TRTC room as a host and speak at will.

Ghost microphones can be actively detected and handled promptly. The following sections introduce detection and handling solutions on the client and on the server.

Client processing solution

Solution principle: Through the TRTC volume callback, compare the current uplink audio user list and the business microphone status list to identify the ghost microphone that is not on the microphone but has audio uplink. The process is shown in the figure below.

When a ghost microphone is detected, the client locally mutes that user's remote audio stream and reports the user to the business server. The business server can then decide whether to ban the user or remove the user from the room.
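The comparison described above can be sketched as a single function. It assumes the client has already collected the latest volume callback into a dictionary of user ID to volume; the function name and the silence threshold are hypothetical.

```python
def find_ghost_mics(uplink_volumes, seat_user_ids, threshold=0):
    """Return user IDs that have uplink audio but are not on the business microphone list.

    uplink_volumes: dict of user_id -> volume from the latest volume callback.
    seat_user_ids: iterable of user IDs currently on the business microphone list.
    threshold: volumes at or below this value are treated as silence (assumed).
    """
    on_mic = set(seat_user_ids)
    return sorted(
        user_id
        for user_id, volume in uplink_volumes.items()
        if volume > threshold and user_id not in on_mic
    )
```

For example, `find_ghost_mics({"a": 30, "b": 12, "c": 0}, ["a"])` returns `["b"]`: user `b` is audible but holds no microphone, while `c` is silent and `a` is legitimately on the microphone.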

Server-side processing solution

Solution principle: User roles in voice chat interaction scenarios are divided into anchors and audiences. Only anchor roles can upload local audio, so ghost microphones can be detected by comparing the business microphone list and the TRTC user role list.

TRTC provides room and media event callbacks on the server side. By listening for events such as entering the room, switching roles, and exiting the room, you can maintain a real-time anchor list for the current room. Comparing this TRTC anchor list with the full business microphone list makes it easy to identify ghost microphones, which can then be muted or removed from the room.

1. The TRTC console lets you configure the callback settings yourself. Once the configuration is complete, you can receive event callback notifications.

2. Receive and parse the callback event packages, watch for events 103/104/105, and maintain the list of users currently online in the anchor role for each room.

{
    "EventGroupId": 1, # Room event group
    "EventType": 103, # Room entry event
    "CallbackTs": 1687679847972, # Callback time, in milliseconds
    "EventInfo": {
        "RoomId": "123456", # Room number
        "EventTs": 1687679847, # Event occurrence time, in seconds
        "EventMsTs": 1687679847899, # Event occurrence time, in milliseconds
        "UserId": "1a99b0a9", # User ID
        "Role": 20, # User role. 20: host; 21: audience
        "TerminalType": 2, # Terminal type
        "UserType": 3, # User type
        "Reason": 1 # Specific reason
    }
}

3. Finally, identify ghost microphones by comparing each room's business microphone list with the TRTC real-time anchor list, either at specific moments (such as when the microphone list changes) or by periodic polling. Then ban the offending users or remove them from the room.
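The three steps above can be sketched as a small tracker that consumes parsed callback events. It relies on two assumptions that should be checked against the TRTC callback documentation: that 103, 104, and 105 are the room entry, room exit, and role switch events respectively (the sample only confirms 103), and that the `Role` field holds the user's role after the event. The class and method names are hypothetical.

```python
# Assumed event type mapping; only 103 (room entry) is confirmed by the sample above.
ENTER_ROOM, EXIT_ROOM, CHANGE_ROLE = 103, 104, 105
ROLE_ANCHOR = 20  # "Role": 20 is host and 21 is audience, per the sample above


class AnchorTracker:
    """Maintains the real-time anchor list per room from server-side callbacks."""

    def __init__(self):
        self.anchors = {}  # room_id -> set of user IDs currently in the anchor role

    def handle_event(self, event):
        info = event["EventInfo"]
        room = self.anchors.setdefault(info["RoomId"], set())
        user_id = info["UserId"]
        event_type = event["EventType"]
        if event_type == EXIT_ROOM:
            room.discard(user_id)
        elif event_type in (ENTER_ROOM, CHANGE_ROLE):
            # Assumption: "Role" carries the user's role after the event.
            if info.get("Role") == ROLE_ANCHOR:
                room.add(user_id)
            else:
                room.discard(user_id)

    def ghost_mics(self, room_id, seat_user_ids):
        """Anchors known to TRTC that are missing from the business microphone list."""
        return sorted(self.anchors.get(room_id, set()) - set(seat_user_ids))
```

Calling `ghost_mics` whenever the microphone list changes, and again on a timer, covers both trigger styles mentioned in step 3.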

Preventing audio freezes when going on and off the microphone

Problem description

Because mobile operating systems handle audio differently, going on and off the microphone in voice chat scenarios behaves differently on Android and iOS. On iOS, a brief audio freeze may occur when a user goes on or off the microphone.

Cause analysis

This is related to the iOS audio mechanism. The startLocalAudio and stopLocalAudio operations acquire and release the microphone device. When the SDK restarts audio capture, AVAudioSession restarts the audio driver, which causes a brief audio freeze when going on or off the microphone.

Solution

The sequence of the conventional TRTC solution for going on and off the microphone is shown in the figure below. When the role is switched, local audio capture is started or stopped accordingly. This solution works normally on Android.

On iOS, when leaving the microphone, you can stop publishing by switching to the audience role without calling stopLocalAudio, so audio capture keeps running and the microphone is not released. This avoids the freeze when going on and off the microphone.
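The platform difference can be sketched as follows. `RecordingClient` is a stand-in that only records which operations were requested; it is not the TRTC SDK, and its method names are hypothetical wrappers around the role switch and the startLocalAudio/stopLocalAudio calls mentioned above.

```python
class RecordingClient:
    """Stand-in for an SDK wrapper; it only records the requested operations."""

    def __init__(self):
        self.calls = []

    def switch_role(self, role):
        self.calls.append(f"switch_role:{role}")

    def start_local_audio(self):
        self.calls.append("start_local_audio")

    def stop_local_audio(self):
        self.calls.append("stop_local_audio")


def go_on_mic(client, capturing):
    """Switch to anchor; start capture only if it is not already running."""
    client.switch_role("anchor")
    if not capturing:
        client.start_local_audio()
    return True  # capture is running once the user is on the microphone


def leave_mic(client, platform):
    """Switch to audience; on iOS keep capture running to avoid the audio restart."""
    client.switch_role("audience")
    if platform == "ios":
        return True  # capture stays on; the role switch alone stops publishing
    client.stop_local_audio()
    return False
```

Both functions return whether capture is still running, so the caller can pass that flag into the next `go_on_mic` call and skip a redundant capture start on iOS.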

Best Practices for Audio Configuration

Audio quality and volume type are two different concepts in audio configuration. In TRTC, audio quality can be set when starting local audio capture and publishing with startLocalAudio, or separately through setAudioQuality(TRTCAudioQuality). The volume type is determined by a combination of factors such as the room entry scene and the audio quality setting. A specific volume type can also be forced through setSystemVolumeType(TRTCSystemVolumeType).

Best Practices for Audio Quality Configuration

The TRTC SDK currently provides three carefully tuned sound quality modes to meet the differentiated pursuit of sound quality in various vertical scenarios.

| Sound quality mode | Sound quality enumeration value | Sound quality parameters | Sound quality description |
| --- | --- | --- | --- |
| Vocal mode | TRTCAudioQualitySpeech | Sampling rate: 16 kHz; mono; encoding rate: 16 kbps | Strong resistance to poor networks and good fluency in weak network environments. Suitable for scenarios based mainly on voice communication, such as online meetings and voice calls. |
| Default mode | TRTCAudioQualityDefault | Sampling rate: 48 kHz; mono; encoding rate: 50 kbps | The SDK default. It reproduces music better than vocal mode while transmitting much less data than music mode, and adapts well to various scenarios. |
| Music mode | TRTCAudioQualityMusic | Sampling rate: 48 kHz; full-band stereo; encoding rate: 128 kbps | Transmits a very large amount of audio data so that music is reproduced with high fidelity across all frequency bands. Suitable for scenarios that require high-fidelity music transmission. |

As the table above shows, sound quality improves from vocal mode to music mode, but the amount of audio data transmitted also increases.

  • For pure voice communication in a voice chat room, vocal mode is recommended because it stays smoother under weak network conditions.
  • For voice chat rooms that play background music, the default mode or music mode is recommended for better audio detail.
  • To limit the downstream bandwidth pressure on the audience and ensure a good user experience, use music mode with caution in business scenarios with more than ten microphones.
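The bandwidth concern in the last recommendation can be made concrete with a rough estimate. The sketch below multiplies the encoding rates from the table above by the number of subscribed streams. It counts audio payload only and ignores packet headers, retransmission, and redundancy, so real traffic is higher. The function name and mode keys are hypothetical.

```python
# Encoding rates in kbps, taken from the sound quality table above.
BITRATE_KBPS = {"vocal": 16, "default": 50, "music": 128}


def downlink_audio_kbps(mode, mics_on, listener_is_on_mic=False):
    """Audio payload a user pulls when subscribing to every stream on the microphone."""
    # A user on the microphone does not pull their own stream.
    streams = mics_on - 1 if listener_is_on_mic else mics_on
    return BITRATE_KBPS[mode] * max(streams, 0)
```

With ten microphones in use, an audience member pulls about `downlink_audio_kbps("music", 10)` = 1280 kbps in music mode, compared with 160 kbps in vocal mode, which is why music mode deserves caution in large rooms.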

Best Practices for Volume Type Configuration

TRTC SDK currently provides three system volume type control modes to meet the differentiated needs for volume types in different scenarios.

| Volume type mode | Enumeration value | Description |
| --- | --- | --- |
| Full call volume | TRTCSystemVolumeTypeVOIP | The audio module does not need to switch working modes when the user goes on and off the microphone, so switching is seamless. Suitable for scenarios where users go on and off the microphone frequently. If the scene selected when entering the room is TRTCAppSceneVideoCall or TRTCAppSceneAudioCall, the SDK automatically uses this mode. |
| Automatic switching mode | TRTCSystemVolumeTypeAuto | Also known as "call on microphone, media off microphone": call volume is used while on the microphone, and media volume is used while off the microphone. Suitable for online live broadcast scenarios. If the scene selected when entering the room is TRTCAppSceneLIVE or TRTCAppSceneVoiceChatRoom, the SDK automatically uses this mode. |
| Full media volume | TRTCSystemVolumeTypeMedia | Media volume is used throughout, which suits music scenarios with demanding sound quality requirements. If most of your users use external devices (such as external sound cards), you can use this mode. |

  • In call scenarios, the default full call volume is recommended; the audio module does not need to switch modes.
  • In voice chat rooms with pure voice communication, the default automatic switching mode is recommended, that is, call volume while on the microphone and media volume while off it.
  • In voice chat rooms that play background music, consider using full media volume so that users do not perceive remote music freezes or sudden volume changes when going on and off the microphone.
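The defaults and recommendations above can be summarized as a lookup. The scene and volume type strings are the enumeration names from this section; the function itself is a hypothetical helper for illustration, not part of the SDK.

```python
# Default volume type per room entry scene, as described in this section.
DEFAULT_VOLUME_TYPE = {
    "TRTCAppSceneVideoCall": "TRTCSystemVolumeTypeVOIP",
    "TRTCAppSceneAudioCall": "TRTCSystemVolumeTypeVOIP",
    "TRTCAppSceneLIVE": "TRTCSystemVolumeTypeAuto",
    "TRTCAppSceneVoiceChatRoom": "TRTCSystemVolumeTypeAuto",
}


def recommended_volume_type(scene, plays_background_music=False):
    """Return the volume type this section recommends for a scene."""
    if scene == "TRTCAppSceneVoiceChatRoom" and plays_background_music:
        # Background music: full media volume avoids audible mode switches.
        return "TRTCSystemVolumeTypeMedia"
    return DEFAULT_VOLUME_TYPE[scene]
```

Only the background music case needs an explicit override; in the other cases the SDK default for the scene already matches the recommendation.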

Single-stream volume evaluation

In chat room scenarios, some customers have users on the microphone push and pull single RTC streams while the audience pulls the mixed stream of the room, which reduces bandwidth and saves costs. However, the chat room UI usually needs to show an indicator based on the volume of each user on the microphone, such as a sound wave animation or a volume bar. Per-user volume evaluation is easy to implement inside a TRTC room, but it takes some extra work with an audio-only mixed stream. Both solutions are described below.

Single-stream volume evaluation in the RTC room

Step 1: Enable volume callbacks

Enable the volume callback through the enableAudioVolumeEvaluation interface, and optionally enable local voice detection. Once enabled, the SDK reports the volume of the local user and of remote publishing users, the maximum volume value, and the local voice detection result in the onUserVoiceVolume callback.

Step 2: Listen for volume callbacks

Listen for the onUserVoiceVolume callback in TRTCCloudListener. The callback reports the volume of the local user and of remote publishing users, as well as the maximum volume among remote users. You can display the corresponding sound waves on the UI based on these values.
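Turning a reported volume into a UI indicator is a simple mapping. The sketch below assumes volume values range from 0 to 100, which should be checked against the SDK documentation; the function name and the five-bar layout are hypothetical.

```python
def volume_to_bars(volume, bar_count=5, max_volume=100):
    """Map a volume value to the number of lit bars in a volume indicator."""
    clamped = max(0, min(volume, max_volume))
    # Ceiling division, so any non-zero volume lights at least one bar.
    return -(-clamped * bar_count // max_volume)
```

Rounding up rather than down means a quiet speaker still shows one bar, which makes it clear who is talking even at low volume.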

Single-stream volume evaluation with an audio-only mixed stream

The figure above shows how single-stream volume is evaluated with an audio-only mixed stream. Each host on the microphone listens for the volume callback and inserts the local volume value and user information into the audio stream as an SEI message, which is passed through to the audience after stream mixing. Alternatively, the room owner can send the callback volume values of all hosts on the microphone through SEI. The following figure shows the sequence diagram of the whole process:

As shown in the figure below, the audience client parses the SEI messages from the mixed stream and displays the volume of the corresponding speaker.
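The SEI payload is application-defined, so sender and audience only need to agree on a format. The sketch below uses compact JSON with a `type` tag; this layout is an assumption for illustration, not a format defined by TRTC, and the function names are hypothetical.

```python
import json


def encode_volume_sei(volumes):
    """Serialize {user_id: volume} into bytes to carry as an SEI message payload."""
    message = {"type": "volume", "users": volumes}
    return json.dumps(message, separators=(",", ":")).encode("utf-8")


def decode_volume_sei(payload):
    """Parse an SEI payload on the audience side; ignore anything that is not ours."""
    try:
        message = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return {}
    if not isinstance(message, dict) or message.get("type") != "volume":
        return {}
    return message.get("users", {})
```

The `type` tag lets the same channel carry other application messages later, and the decoder returns an empty result for payloads it does not recognize instead of raising.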

Best Voice Chat Room Use Cases

As a provider of cloud-based real-time audio and video call services, Tencent RTC has enabled clients to launch innovative voice chat products. Here are some popular social audio use cases that have resonated globally:

a. Voice Chat Room + Mini-Games

Combining voice chat rooms with games strengthens a platform's social features. Interactive game elements bring new forms of play, help users break the ice, and increase the time users spend on the microphone.

Voice chat is used not only as a voice tool for large multiplayer games such as battle royales, but also in online mini-games such as murder mystery games, truth or dare, and table tennis. Users take turns speaking to take part in the game, which makes the games more fun and interactive.

b. Voice Radio Stations

Voice radio is a popular social audio feature. Hosts can broadcast live audio streams to an audience on voice radio stations and invite certain audience members (usually paid users or those who send virtual gifts) to engage in conversations. Hosts create content such as discussing current events, playing music, storytelling, or conducting interviews, and broadcast it live. Besides one-way broadcasting, hosts can invite the audience to participate in real-time discussions, enhancing the interactive and engaging nature of the programs.

c. Karaoke

Beyond basic voice chatting, users can also engage in karaoke-style interactions in voice chat rooms combined with online KTV. Hosts can play background music for interactive voice chats, commonly used in scenarios like music rooms, study rooms, or listening rooms. Additionally, users can sing solo by selecting songs to create a playlist, with the singer performing live while others engage in text chat and send gifts.

d. Voice Chat Room + Avatar

In traditional voice chat rooms where voice is the sole medium of communication, integrating avatars allows for the transmission of more information such as user images, expressions, and gestures, providing a face-to-face communication experience without showing real faces. This helps break the ice and deepen user relationships, further enriched by 3D gifts and virtual backgrounds.

Conclusion

The diversity and innovation of voice chat rooms have redefined digital communications. From intimate chats to lively group discussions, to voice radio and karaoke, they promote global connections and rich online experiences through innovative entertainment methods. With the continuous advancement of technology, we look forward to more creative use cases for voice chat rooms.
