GPT-4o & RTC: Leading a new era of real-time multi-modal interaction (Startup Enterprise Plan)

Tencent RTC - Dev Team


📅 On May 13th, OpenAI announced the GPT-4o model, which can perform real-time reasoning across audio, visual, and textual inputs. It accepts any combination of text, audio, and images as input and generates any combination of text, audio, and images as output.

🤔 GPT-4o is also the first OpenAI model to adopt RTC technology. It responds with extremely low latency when users interact with it, opening up possibilities for richer real-time applications built on large models.

👀 OpenAI has made it possible for GPT-4o to respond to audio inputs in as little as 232 milliseconds, with an average response time of 320 milliseconds, similar to human response times in conversation, and it can even be interrupted at any time. This is not a simple speech-to-text processing mode: the model can understand tone and intonation, act as an emotionally aware conversational assistant, and even serve as a real-time simultaneous interpreter. During a live demo, the presenter pretended to breathe rapidly, and GPT-4o recognized his breathing pattern and promptly offered suggestions to help him relax.


GPT-4o: Multimodal Model ⚙️

This highly responsive experience comes from GPT-4o unifying the different modalities into a single, complete multimodal base model.

Before GPT-4o, users interacted with ChatGPT through Voice Mode, a pipeline of three separate models that went through "speech-to-text, question-answering, and text-to-speech" (a minimal sketch of this cascaded pipeline follows the list):

1. Speech recognition or ASR: audio -> text1

2. LLM that plans what to say next: text1 -> text2

3. Speech synthesis or TTS: text2 -> audio
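
To make the latency and information loss of this cascade concrete, here is a minimal Python sketch of the three-stage pipeline. The function names (transcribe, generate_reply, synthesize, voice_mode_turn) are illustrative placeholders, not any specific API.

```python
# Minimal sketch of the pre-GPT-4o cascaded Voice Mode pipeline.
# The three stages run sequentially, so their latencies add up, and only the
# transcribed text (not tone, speakers, or background sound) reaches the LLM.
# All function names are illustrative placeholders, not a real API.

def transcribe(audio: bytes) -> str:
    """ASR stage: audio -> text1 (paralinguistic cues are discarded here)."""
    ...

def generate_reply(text1: str) -> str:
    """LLM stage: text1 -> text2 (the model never sees the raw audio)."""
    ...

def synthesize(text2: str) -> bytes:
    """TTS stage: text2 -> audio (expressiveness is limited to what text carries)."""
    ...

def voice_mode_turn(audio_in: bytes) -> bytes:
    # Total latency ≈ latency(ASR) + latency(LLM) + latency(TTS),
    # which is why the average reached ~2.8 s (GPT-3.5) / ~5.4 s (GPT-4).
    text1 = transcribe(audio_in)
    text2 = generate_reply(text1)
    return synthesize(text2)
```

Because the stages run strictly one after another, the user hears nothing until all three have finished, and anything that cannot be written down as text never reaches the language model.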

However, the average latency of this pipeline reached 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4), and a lot of information was lost along the way. For example, GPT-4 could not directly observe pitch, multiple speakers, or background noise, nor could it output laughter, singing, or emotional expression. GPT-4o instead feeds speech into the model in real time, greatly improving response time and achieving a pace similar to human conversation.

Compared to existing models, GPT-4o is particularly strong in visual and audio understanding. It has significantly improved speech recognition performance across all languages, especially lower-resource ones.


GPT-4o also achieves state-of-the-art performance in speech translation, outperforming Whisper-v3 on the MLS benchmark.


With GPT-4o, OpenAI has trained a single end-to-end model across text, vision, and audio, meaning that all inputs and outputs are handled by the same neural network. The key to this technological leap lies in two aspects: the evolution of large models and the application of RTC (Real-Time Communication) capabilities.

GPT-4o Leads the Trend of Real-Time Multimodal Development 📈

The release of GPT-4o signals that end-to-end real-time multimodal processing will be a new direction for the development of large models, and real-time text, audio, and video transmission is gradually becoming a standard capability of real-time large models. By integrating multiple data modalities and responding instantly, such models can understand and address user needs from multiple angles, providing a more natural and efficient experience. Other large-model vendors are likely to follow with end-to-end real-time multimodal products of their own. In all of this, RTC technology plays a crucial role in multimodal processing; a conceptual sketch of how an RTC transport feeds such a model appears below.
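
The following is a conceptual Python sketch of streaming captured audio frames over a low-latency RTC channel into an end-to-end speech model and playing back its audio response as it arrives. Every class and method name (RtcChannel, MultimodalModel, capture_frames, send_audio, and so on) is a hypothetical placeholder used only to show the data flow; it is not the Tencent RTC SDK or any vendor's actual API.

```python
# Conceptual sketch: RTC transport feeding an end-to-end multimodal model.
# Uplink forwards microphone frames as they are captured; downlink plays the
# model's synthesized audio as it is generated. Nothing here is a real SDK.
import asyncio


class RtcChannel:
    """Placeholder for a low-latency RTC transport delivering small audio frames."""

    async def capture_frames(self):
        while True:
            # One 20 ms frame of 16 kHz 16-bit mono PCM (hypothetical format).
            yield b"\x00" * 640

    async def play(self, frame: bytes):
        await asyncio.sleep(0)  # hand the frame to the audio device


class MultimodalModel:
    """Placeholder for an end-to-end speech-in / speech-out model session."""

    async def send_audio(self, frame: bytes): ...

    async def receive_audio(self):
        while True:
            yield b""  # synthesized audio, streamed back as it is generated


async def realtime_conversation(rtc: RtcChannel, model: MultimodalModel):
    async def uplink():
        # No buffering of a full utterance: each frame is sent as soon as it is
        # captured, which keeps end-to-end latency low and lets the user
        # interrupt ("barge in") at any moment.
        async for frame in rtc.capture_frames():
            await model.send_audio(frame)

    async def downlink():
        async for frame in model.receive_audio():
            await rtc.play(frame)

    await asyncio.gather(uplink(), downlink())
```

The design point is that the RTC layer moves audio frame by frame in both directions, so the model can start reasoning before the user finishes speaking and the user can hear (or interrupt) the reply while it is still being generated.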
