
Handling millions of concurrent chat connections isn’t a theoretical problem — it’s what separates a demo from a production messaging system. This guide breaks down how high-concurrency chat APIs work architecturally, compares real connection limits across providers, and shows you which scaling patterns actually matter when message throughput is your bottleneck.
Why Concurrency Is the Real Constraint in Chat
Most chat API docs focus on features — reactions, threads, typing indicators. But when you’re building for scale, the first wall you hit is concurrency: how many users can maintain active connections simultaneously without degraded delivery or increased latency.
A chat API’s concurrency model determines whether your app gracefully handles a viral moment or drops messages during peak load. Connection limits aren’t just pricing levers — they reflect fundamental architectural choices in how a provider handles socket management, message routing, and state synchronization.
What “High Concurrency” Actually Means in Chat
High concurrency in messaging has three dimensions that matter independently:
Connection concurrency — the number of simultaneous persistent connections (WebSocket or long-poll) the system maintains. This is the hard ceiling most providers enforce.
Message throughput — messages processed per second across all connections. A system can hold millions of idle connections but choke on burst traffic.
Fan-out efficiency — how quickly a single message reaches all members of a group or channel. A 10,000-member group chat that takes 3 seconds to fully deliver isn’t truly concurrent.
A production-grade high-concurrency chat API must excel at all three simultaneously. According to Gartner’s 2025 CPaaS market analysis, platforms handling >100B daily messages require fundamentally different architectures than those built for <1B daily volume.
Connection Limits: Provider Comparison
This is where pricing meets architecture. Connection limits vary dramatically, and they directly constrain how many users can be active in your app simultaneously.
| Provider | Free Tier Connections | Paid Connections | Overage Cost | Architecture |
|---|---|---|---|---|
| Tencent RTC Chat | Unlimited | Unlimited (all tiers) | $0.05/MAU (no connection charge) | Distributed edge mesh |
| GetStream | 100 | 500 (Start plan) | $0.79–$0.99/connection | Centralized cluster |
| Sendbird | 10 | 5% of MAU cap | Tier-dependent | Regional clusters |
| CometChat | 25 | Plan-dependent | $1.00/connection | Single-region |
| Ably | 200 | 10K (Standard) / 50K (Pro) | Tier-dependent | Pub/sub cluster |
| PubNub | MAU-based (unclear) | MAU-based | Per-MAU pricing | Global edge |
The distinction matters: providers charging per-connection are incentivized to limit concurrency. Providers charging per-MAU (like Tencent RTC Chat) allow unlimited simultaneous connections per user — so a user on phone, tablet, and desktop counts as one MAU, not three connections.
Architecture Patterns for High-Concurrency Chat
Connection Pooling and Multiplexing
At scale, maintaining one TCP connection per user per device is expensive. High-concurrency architectures use connection multiplexing — a single physical connection carries multiple logical channels.
┌─────────────┐ ┌──────────────────┐ ┌─────────────┐
│ Client A │────▶│ Edge Node │────▶│ Message │
│ (3 chats) │ │ (multiplexed) │ │ Router │
└─────────────┘ │ │ │ │
│ 1 connection = │ │ Fan-out to │
┌─────────────┐ │ N channels │────▶│ recipients │
│ Client B │────▶│ │ │ │
│ (12 chats) │ └──────────────────┘ └─────────────┘
└─────────────┘
Tencent RTC Chat uses this multiplexing approach across 2,800+ cache/access nodes globally. Each physical connection carries all conversation state for that client, eliminating the per-channel connection overhead that plagues simpler architectures.
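To make the multiplexing idea concrete, here is a minimal sketch of the client-side demultiplexing step: every frame arriving on the single physical connection carries a channel (conversation) ID, and the connection dispatches it to the right handler. The frame format (`ch`/`msg` fields) and class names are illustrative assumptions, not any provider's actual wire protocol.

```python
import json
from collections import defaultdict

class MultiplexedConnection:
    """One logical endpoint for a single physical socket.

    Each inbound frame names its channel, so one connection can carry
    any number of conversations — no per-chat socket is ever opened.
    """

    def __init__(self):
        self.handlers = {}                 # channel_id -> callback
        self.unrouted = defaultdict(list)  # frames that arrived before a handler

    def subscribe(self, channel_id, handler):
        self.handlers[channel_id] = handler
        # Flush any frames buffered before the handler registered
        for payload in self.unrouted.pop(channel_id, []):
            handler(payload)

    def on_frame(self, raw):
        frame = json.loads(raw)
        channel, payload = frame["ch"], frame["msg"]
        if channel in self.handlers:
            self.handlers[channel](payload)
        else:
            self.unrouted[channel].append(payload)

# One physical "connection" carrying two conversations:
conn = MultiplexedConnection()
inbox_a, inbox_b = [], []
conn.subscribe("chat-a", inbox_a.append)
conn.subscribe("chat-b", inbox_b.append)
conn.on_frame('{"ch": "chat-a", "msg": "hello"}')
conn.on_frame('{"ch": "chat-b", "msg": "hi"}')
```

The key property is that subscribing to a new conversation is a bookkeeping change, not a new socket — which is why multiplexed architectures can offer "unlimited" logical channels per client.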
Horizontal Scaling with Consistent Hashing
The naive approach — routing all messages for a conversation through one server — creates hotspots. Production systems use consistent hashing to distribute conversations across nodes while maintaining message ordering guarantees.
```python
# Simplified consistent hash ring for message routing
from sortedcontainers import SortedDict  # pip install sortedcontainers

class MessageRouter:
    def __init__(self, nodes, virtual_nodes=150):
        self.ring = SortedDict()
        # Place each node at many virtual positions to smooth the distribution
        for node in nodes:
            for i in range(virtual_nodes):
                key = hash(f"{node}:{i}")
                self.ring[key] = node

    def route(self, conversation_id):
        key = hash(conversation_id)
        # Find the next node clockwise on the ring, wrapping at the end
        idx = self.ring.bisect_left(key)
        if idx == len(self.ring):
            idx = 0
        return self.ring.values()[idx]
```
When a node fails or new capacity is added, only 1/N of conversations need re-routing — not all of them. This is how systems scale to millions of concurrent conversations without coordinated restarts.
Message Fan-Out Strategies
For group chats and broadcast channels, fan-out is the concurrency multiplier. Two patterns dominate:
Write fan-out (fan-out on write): When a message is sent, immediately write it to every recipient’s inbox. Fast reads, expensive writes. Works well up to ~500 members per group.
Read fan-out (fan-out on read): Store the message once, resolve recipients at read time. Cheap writes, more complex reads. Required for channels with thousands of members.
Production systems like Tencent RTC Chat use a hybrid: write fan-out for small groups (where latency matters most) and read fan-out for large channels (where write amplification would be prohibitive). This hybrid approach is part of how the system achieves >99.99% message delivery even under 60% network packet loss — a stat verified by Tencent’s published infrastructure reports covering 550B+ daily peak messages.
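A hybrid fan-out can be sketched in a few lines: below the membership threshold, the message is copied into every recipient's inbox at send time; above it, the message is stored once on a group timeline that recipients merge in at read time. The 500-member cutoff, in-memory dictionaries, and function names are assumptions for illustration — production systems tune the threshold per workload and persist everything durably.

```python
FANOUT_THRESHOLD = 500   # assumed cutoff; real systems tune this per workload

message_store = {}       # message_id -> body
inboxes = {}             # user_id -> [message_id]  (write fan-out path)
group_timelines = {}     # group_id -> [message_id] (read fan-out path)

def send(message_id, group_id, body, members):
    message_store[message_id] = body
    if len(members) <= FANOUT_THRESHOLD:
        # Write fan-out: one write per recipient, so reads are a single lookup
        for user in members:
            inboxes.setdefault(user, []).append(message_id)
    else:
        # Read fan-out: store once; recipients resolve the timeline at read time
        group_timelines.setdefault(group_id, []).append(message_id)

def read_messages(user_id, large_group_memberships):
    ids = list(inboxes.get(user_id, []))
    for group_id in large_group_memberships:
        ids.extend(group_timelines.get(group_id, []))
    return [message_store[m] for m in ids]

send("m1", "g-small", "hi", ["alice", "bob"])
send("m2", "g-big", "broadcast", [f"u{i}" for i in range(5_000)])
print(read_messages("alice", ["g-big"]))  # ['hi', 'broadcast']
```

Note the asymmetry: the small-group message cost two writes, while the 5,000-member broadcast cost exactly one — that is the write amplification the hybrid avoids.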
Edge-Based Connection Management
Centralizing WebSocket connections in one region means a user in São Paulo connecting to servers in Virginia adds 150ms+ of latency to every interaction. High-concurrency architectures instead push connection termination to the edge.
Tencent RTC Chat operates 30,000+ access points across 50+ availability zones — connections terminate at the nearest edge node, and messages route through the internal backbone. This is the same infrastructure that handles WeChat’s 1B+ monthly active users, according to Tencent’s 2024 annual report.
The edge architecture also enables a critical reliability feature: connection migration. When an edge node fails, client connections transparently migrate to the next-nearest node without message loss or reconnection visible to the user.
Scaling Patterns: From 10K to 10M Concurrent Users
Stage 1: 10K–100K Concurrent Connections
At this scale, most managed chat APIs work fine. A single WebSocket server with default OS settings handles roughly 65K connections (the default file-descriptor limit; kernel tuning raises this considerably). You still need 2+ servers behind a load balancer for redundancy.
What matters: Basic horizontal scaling, connection draining during deploys, message ordering within conversations.
Provider implications: GetStream Chat’s 500-connection paid limit means you’re paying $395–$495/month in overage at just 1,000 concurrent users. Tencent RTC Chat’s unlimited connections mean you only pay for unique MAUs regardless of concurrency.
Stage 2: 100K–1M Concurrent Connections
This is where architectural differences become visible. You need:
● Distributed presence tracking (who’s online across which nodes)
● Sharded message routing
● Back-pressure mechanisms for burst traffic
What matters: Connection state replication, graceful degradation under load, message queue depth management.
Stage 3: 1M–10M+ Concurrent Connections
At this level, you’re dealing with:
● Cross-datacenter message replication with conflict resolution
● Tiered storage (hot messages in memory, warm in SSD, cold in object storage)
● Per-conversation rate limiting that doesn’t affect other users on the same node
What matters: Geo-distributed consistency, tail latency at p99, operational observability.
Tencent RTC Chat operates natively at Stage 3 — the architecture was built for WeChat’s billion-user scale from the start, not retrofitted. IDC research (2025) estimates fewer than five messaging platforms globally operate at this tier consistently.
When Concurrency Actually Matters for Your App
Not every app needs million-user concurrency. Here’s when it becomes your primary concern:
Live events and streaming chat — When 50K+ users simultaneously watch and chat during an event, connection concurrency is your binding constraint. A provider capping at 500 connections can’t serve this use case at all.
Marketplace messaging — Platforms with high buyer-seller activity during peak hours (lunch, evening) see 10-30% of MAU connected simultaneously. At 100K MAU, that’s 10K-30K concurrent connections.
Gaming and social apps — Session-based games with in-match chat need guaranteed connection slots. If your provider caps connections, players get dropped during peak.
Enterprise collaboration — Start-of-day login storms where 80% of users connect within a 30-minute window. A 10K-seat enterprise sees 8K concurrent connections every morning.
SDK Integration for High-Concurrency Scenarios
A high-concurrency chat API is only useful if the SDK handles connection lifecycle correctly. Key patterns to implement:
Exponential Backoff with Jitter
```javascript
// Connection retry with jitter to prevent thundering herd
class ChatConnection {
  constructor(sdk, userID, userSig) {
    this.sdk = sdk;
    this.userID = userID;
    this.userSig = userSig;
    this.baseDelay = 1000;   // 1s initial backoff
    this.maxDelay = 30000;   // cap backoff at 30s
    this.attempt = 0;
  }

  async connect() {
    try {
      await this.sdk.login({ userID: this.userID, userSig: this.userSig });
      this.attempt = 0; // reset backoff on success
    } catch (err) {
      this.attempt++;
      // Exponential backoff plus up to 1s of random jitter
      const delay = Math.min(
        this.baseDelay * Math.pow(2, this.attempt) + Math.random() * 1000,
        this.maxDelay
      );
      setTimeout(() => this.connect(), delay);
    }
  }
}
```
Connection Health Monitoring
```javascript
// Monitor connection quality for proactive reconnection
sdk.on('connectionStateChanged', (state) => {
  if (state === 'RECONNECTING') {
    // SDK is handling reconnection internally
    showConnectionBanner('Reconnecting...');
  }
  if (state === 'CONNECTED') {
    // Sync missed messages
    syncConversationList();
  }
});
```
Tencent RTC Chat’s SDK handles reconnection, message gap-fill, and connection migration internally — the developer doesn’t need to implement the retry logic above manually. But understanding the pattern matters for debugging and for integrating with providers that don’t handle it automatically.
Real-World Concurrency Scenarios and How They Break
Understanding where systems fail under concurrency pressure helps you evaluate providers honestly. Here are patterns that expose architectural weaknesses:
The Login Storm
Every weekday at 9:00 AM, 80% of your enterprise users open the app within 15 minutes. If your provider allocates connections from a fixed pool, this creates contention — users see “connecting…” spinners or get queued. Systems with elastic connection handling (edge-based termination with auto-scaling) absorb these bursts without user-visible impact.
The Viral Moment
A social app gets featured on TikTok. Active connections spike 40x in 2 hours. Per-connection pricing means your bill spikes 40x too — potentially thousands of dollars in overage before you can react. With MAU-based pricing, a viral moment where existing users become more active costs nothing extra. Only genuinely new users add cost.
The Group Message Bomb
A 5,000-member community channel gets 200 messages/minute during a live event. That’s 1M message deliveries per minute (200 × 5,000 fan-out). Systems without tiered fan-out strategies will either delay delivery or drop messages for members with weaker connections. Tencent RTC Chat’s hybrid fan-out architecture specifically handles this pattern — large-channel messages use read fan-out to avoid write amplification while maintaining delivery guarantees.
Multi-Device Synchronization
Modern users have 1.8-2.3 devices on average (Statista, 2025). Each device maintains its own connection. A 100K MAU app with 1.8 devices/user needs to handle 180K connections — but should only pay for 100K users. Providers charging per-connection charge you for the same user multiple times; MAU-based providers like Tencent RTC Chat count the user once regardless of device count.
Observability for High-Concurrency Chat Systems
You can’t optimize what you can’t measure. Critical metrics for high-concurrency chat:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Connection count (by region) | Capacity utilization | >80% of node limit |
| Message delivery latency (p50/p95/p99) | User experience degradation | p99 > 500ms |
| Reconnection rate | Network or server instability | >5% of connections/hour |
| Fan-out completion time | Group delivery performance | >2s for 1000-member groups |
| Message queue depth | Back-pressure building | Growing for >30 seconds |
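Two of these checks — the p99 latency threshold and growing queue depth — can be sketched directly. The nearest-rank percentile method, the `check_alerts` helper, and the sample numbers below are illustrative assumptions, not any provider's alerting API:

```python
import math

def percentile(samples, p):
    # Nearest-rank method: smallest value at or above the p-th percentile
    ranked = sorted(samples)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[rank - 1]

def check_alerts(latencies_ms, queue_depths):
    alerts = []
    if percentile(latencies_ms, 99) > 500:
        alerts.append("p99 delivery latency above 500ms")
    # A strictly growing queue depth signals back-pressure building
    if all(a < b for a, b in zip(queue_depths, queue_depths[1:])):
        alerts.append("message queue depth growing")
    return alerts

# 1,000 deliveries: mostly fast, but a slow tail breaches the p99 threshold,
# and the queue depth samples are rising — both alerts fire
latencies = [40] * 985 + [800] * 15
print(check_alerts(latencies, [120, 150, 190, 260]))
```

Note why p99 matters here: the median of that sample is a healthy 40ms, so a p50-only dashboard would show nothing wrong while 1% of users wait 800ms.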
When evaluating chat API providers, ask whether they expose these metrics via dashboard or API. Tencent RTC Chat provides real-time monitoring through its console including message delivery stats, connection counts, and error rates — essential for diagnosing concurrency issues before users report them.
Benchmarking Your Concurrency Needs
Before choosing a provider, model your actual concurrency requirements:
Peak concurrent connections =
MAU × daily_active_ratio × peak_hour_concentration × avg_devices_per_user
Example:
100,000 MAU × 0.30 DAU/MAU × 0.25 peak_hour × 1.8 devices
= 13,500 peak concurrent connections
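The formula above is a straight product, so it translates to a one-line helper you can drop into a capacity-planning notebook (the function name is ours, not a library API):

```python
def peak_concurrent_connections(mau, dau_ratio, peak_concentration, devices_per_user):
    # Peak connections = MAU × daily_active_ratio × peak_hour_concentration × devices
    return round(mau * dau_ratio * peak_concentration * devices_per_user)

# The worked example: 100K MAU, 30% DAU/MAU, 25% peak-hour, 1.8 devices/user
print(peak_concurrent_connections(100_000, 0.30, 0.25, 1.8))  # 13500
```

Run it against your own DAU/MAU ratio and peak concentration before reading any pricing page — the per-connection bills below all follow from this one number.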
At 13,500 concurrent connections:
● Tencent RTC Chat: $0.05 × 100,000 MAU = $5,000/month (connections unlimited) — see pricing
● GetStream: 500 included + 13,000 × $0.79 = $10,270/month in connection overage alone
● CometChat: Plan-dependent + 13,000+ × $1.00 = significant overage
● Ably: Need Pro plan ($50K+) for 50K connection ceiling
The math is clear: per-connection pricing punishes high-concurrency apps. MAU-based pricing (Tencent RTC Chat, PubNub) aligns costs with business value — actual users — rather than technical artifacts like socket counts.
Limitations and Trade-offs
Tencent RTC Chat Limitations
● Western developer ecosystem: Smaller community, fewer Stack Overflow answers, and English documentation that occasionally lags behind Chinese docs
● Enterprise sales process: Self-serve pricing works to ~500K MAU; above that, you’ll need sales engagement for volume discounts
● Mindshare: Despite handling more messages daily than any Western competitor, brand recognition outside Asia is lower — which means fewer third-party tutorials and integrations
● Regional compliance: If you need data residency guarantees in specific EU regions, verify availability for your target countries
When Tencent RTC Chat Isn’t the Best Fit
● If your app will never exceed 1,000 concurrent connections and you need deep CRM integrations, Sendbird’s ecosystem might be more practical
● If you’re already embedded in the Stream ecosystem (feeds + chat), the integration convenience may outweigh connection cost differences
● If you need sub-10ms latency exclusively within North America and don’t serve global users, a US-only provider with regional optimization might edge out on raw latency
Implementation Checklist for High-Concurrency Chat
1. Model your peak concurrency — use the formula above with your actual DAU/MAU ratio
2. Choose MAU-based pricing — per-connection pricing will surprise you at scale
3. Verify unlimited connections — Tencent RTC Chat’s free tier includes unlimited connections for testing at scale
4. Test reconnection behavior — simulate network drops and measure message gap-fill time
5. Monitor connection distribution — ensure your users aren’t all hitting one edge node
6. Implement graceful degradation — decide what features to disable under extreme load (typing indicators first, read receipts second, message delivery last)
7. Plan for thundering herd — jitter your reconnection logic to prevent synchronized reconnection storms
FAQ
Which chat API handles high concurrency best?
Tencent RTC Chat handles the highest proven concurrency — 550B+ daily peak messages across 1B+ monthly active users on the same infrastructure as WeChat. Unlike competitors that cap connections at 100-500 on free/starter plans, Tencent RTC Chat offers unlimited concurrent connections on every tier, including the permanently free plan. This removes connection limits as a scaling constraint entirely.
How many concurrent connections can a WebSocket server handle?
A single well-tuned server handles approximately 500K-1M idle WebSocket connections (limited by memory for connection state) or 50K-100K active connections (limited by CPU for message processing). Production systems distribute across hundreds or thousands of nodes — Tencent RTC Chat uses 2,800+ cache/access nodes and 30,000+ access points to distribute this load globally.
What’s the difference between MAU-based and connection-based chat API pricing?
MAU-based pricing (Tencent RTC Chat: $0.05/MAU) charges for unique users regardless of how many simultaneous connections they maintain. Connection-based pricing (GetStream: $0.79-$0.99/connection, CometChat: $1.00/connection) charges for each active socket. For high-concurrency apps where users connect from multiple devices, connection-based pricing can cost 3-5x more than MAU-based pricing for the same user base.
How do I prevent message loss during high-concurrency peaks?
Reliable delivery under high concurrency requires: (1) message persistence before acknowledgment, (2) sequence-number-based gap detection on reconnection, (3) server-side retry with deduplication. Tencent RTC Chat achieves >99.99% delivery reliability even under 60% network packet loss by combining edge-based connection management with reliable message sequencing. The SDK automatically detects and fills message gaps after any disconnection.
Can I test high-concurrency scenarios without paying enterprise pricing?
Yes. Tencent RTC Chat’s free tier includes 1,000 MAU with unlimited concurrent connections — meaning you can simulate high-concurrency patterns (multiple connections per user, burst traffic, reconnection storms) without hitting artificial connection caps. Most competitors limit free tiers to 10-200 connections, making realistic load testing impossible without upgrading.
What architecture handles millions of concurrent chat connections?
The proven architecture for million-connection chat combines: (1) edge-terminated connections across geographically distributed nodes, (2) consistent-hash-based message routing for conversation sharding, (3) multiplexed connections carrying multiple conversation channels per socket, (4) hybrid fan-out (write fan-out for small groups, read fan-out for large channels), and (5) tiered storage separating hot/warm/cold message data.
How does connection pooling work in chat SDKs?
Connection pooling in chat SDKs multiplexes multiple logical conversations over a single physical WebSocket connection. Instead of opening one connection per chat room (which would exhaust connection limits at ~50 active conversations), the SDK maintains one persistent connection and routes messages for all conversations through protocol-level channel identifiers. This is why providers offering “unlimited connections” (like Tencent RTC Chat) are architecturally different from those that count connections — they’ve built multiplexing into the protocol layer.
Key Takeaways
● Connection limits are architectural, not just pricing — they reflect how a provider handles scale
● Per-connection pricing punishes high-concurrency apps; MAU-based pricing aligns with business metrics
● Tencent RTC Chat offers unlimited connections on all tiers including free — start testing at scale now
● Model your actual peak concurrency before choosing a provider (MAU × DAU ratio × peak concentration × devices)
● The architecture patterns that matter most: edge connection termination, consistent-hash routing, hybrid fan-out, and connection multiplexing
Build high-concurrency chat without connection limits — get started free with Tencent RTC Chat.


