What is WebRTC Insertable Stream: Comprehensive Guide to WebRTC Encoded Transform

Dec 11, 2024

What is WebRTC Insertable Stream

WebRTC Insertable Streams lets applications manipulate encoded WebRTC media frames. The API has since been renamed WebRTC Encoded Transform, and the latest specification is at https://w3c.github.io/webrtc-encoded-transform/. Let's first look at WebRTC's video processing flow.

Sending process:

 (S1) Get frame-by-frame data from media devices/other acquisition sources

 (S2) Encode the original data (VP8, H264, AV1)

 <- Insert logic here

 (S3) Pack the encoded video frames into RTP

 (S4) Encrypt

 (S5) Send

Receiving process:

 (R1) Receive network RTP packet

 (R2) Decrypt

 (R3) Depacketize the RTP packets into encoded frames

 <- Insert logic here

 (R4) Decode data

 (R5) Render data

WebRTC Insertable Streams lets us process the encoded data between S2 and S3 on the sending side and between R3 and R4 on the receiving side. It was originally designed for end-to-end encryption, but it can be applied to many other scenarios.

Basic use of WebRTC Insertable Streams

WebRTC Insertable Streams was introduced in Chrome M82 as an experimental feature and can be tried in Chrome Canary. The basic usage is as follows:

A special option must be set when creating the RTCPeerConnection:

const pc = new RTCPeerConnection({
    // Chrome-specific flag that enables createEncodedStreams() on senders and receivers.
    encodedInsertableStreams: true,
});

On the sending side, create the encoded streams from the RTCRtpSender:

        // addTransceiver() returns synchronously, so no await is needed.
        const transceiver = pc.addTransceiver(stream.getVideoTracks()[0], {
            direction: "sendonly",
            streams: [stream],
        });
        
        setupSenderTransform(transceiver.sender);


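        function setupSenderTransform(sender) {
            console.log('sender kind=%s', sender.track.kind);
            // Requires encodedInsertableStreams: true on the RTCPeerConnection.
            const senderStreams = sender.createEncodedStreams();
            // Newer Chrome versions expose readable/writable; older builds
            // used readableStream/writableStream, so accept either name.
            const readableStream = senderStreams.readable || senderStreams.readableStream;
            const writableStream = senderStreams.writable || senderStreams.writableStream;

            // Every encoded frame passes through encodeFunction before packetization.
            const transformStream = new TransformStream({
                transform: encodeFunction,
            });
            readableStream
                .pipeThrough(transformStream)
                .pipeTo(writableStream);
        }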


        // Running index of the frames sent so far.
        let frames = 0;

        function encodeFunction(chunk, controller) {
            const tmp = new DataView(chunk.data);
            // An H264 Annex B frame begins with the start code 0x00000001.
            if (chunk.data.byteLength >= 4 && tmp.getUint32(0) === 1) {
                console.log("h264 =======");
            }

            // Copy the frame and append the frame index as a 4-byte big-endian trailer.
            const frameIndex = frames++;
            const newData = new ArrayBuffer(chunk.data.byteLength + 4);
            const data = new Uint8Array(newData);
            data.set(new Uint8Array(chunk.data));
            new DataView(newData).setUint32(chunk.data.byteLength, frameIndex);
            chunk.data = newData;

            controller.enqueue(chunk);
            console.log("Send frame index ===", frameIndex);
        }

On the receiving side, create the encoded streams from the RTCRtpReceiver:

        // addTransceiver() returns synchronously, so no await is needed.
        const transceiver = pc.addTransceiver("video", {
            direction: "recvonly",
        });

        setupReceiverTransform(transceiver.receiver);


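        function setupReceiverTransform(receiver) {
            console.log('receiver kind=%s', receiver.track.kind);
            // Requires encodedInsertableStreams: true on the RTCPeerConnection.
            const receiverStreams = receiver.createEncodedStreams();
            // Newer Chrome versions expose readable/writable; older builds
            // used readableStream/writableStream, so accept either name.
            const readableStream = receiverStreams.readable || receiverStreams.readableStream;
            const writableStream = receiverStreams.writable || receiverStreams.writableStream;

            // Every received frame passes through decodeFunction before decoding.
            const transformStream = new TransformStream({
                transform: decodeFunction,
            });
            readableStream
                .pipeThrough(transformStream)
                .pipeTo(writableStream);
        }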

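        function decodeFunction(chunk, controller) {
            // Frames too short to carry the 4-byte trailer pass through unchanged.
            if (chunk.data.byteLength < 4) {
                controller.enqueue(chunk);
                return;
            }

            const view = new DataView(chunk.data);
            // The last 4 bytes hold the frame index appended by the sender.
            const count = view.getUint32(chunk.data.byteLength - 4);
            // Strip the trailer so the decoder sees the original frame.
            chunk.data = chunk.data.slice(0, chunk.data.byteLength - 4);
            controller.enqueue(chunk);

            console.log("Receive frame index ===", count);
        }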
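
The snippets above use createEncodedStreams(), the early Chrome form of the API. The renamed WebRTC Encoded Transform specification instead attaches an RTCRtpScriptTransform to the sender or receiver and delivers the streams to a worker through an rtctransform event. The following is only a rough sketch of the worker side; handleTransform and passThrough are hypothetical names chosen for illustration, and passThrough stands in for a real function such as encodeFunction or decodeFunction.

```javascript
// worker.js (sketch): intended to run inside a dedicated Worker.
// passThrough is a hypothetical placeholder for encodeFunction/decodeFunction.
function passThrough(chunk, controller) {
    controller.enqueue(chunk);
}

// Hypothetical helper: wires the transformer's readable side to its
// writable side through a TransformStream built from transformFn.
function handleTransform(transformer, transformFn = passThrough) {
    const { readable, writable } = transformer;
    return readable
        .pipeThrough(new TransformStream({ transform: transformFn }))
        .pipeTo(writable);
}

// The browser fires "rtctransform" in the worker after the main thread runs
//   sender.transform = new RTCRtpScriptTransform(worker, options);
globalThis.onrtctransform = (event) => handleTransform(event.transformer);
```

On the main thread, the counterpart would be roughly `sender.transform = new RTCRtpScriptTransform(new Worker("worker.js"))`, where the file name is an assumption; browser support for this form should be checked before relying on it.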

WebRTC "pipelining"

After trying WebRTC Insertable Streams, the word that comes to my mind is "pipelining". Audio and video capture, pre-processing, post-processing, encoding, decoding, and rendering no longer have to rely on WebRTC's default implementation. You can implement the capture logic yourself, use your own encoder, and then feed the encoded audio and video data to WebRTC. WebRTC then only has to handle network transmission, retransmission, FEC, the jitter buffer, and NetEQ before handing the remote audio and video data back to the application. In other words, the WebRTC protocol stack can serve purely as a transport channel, which greatly expands the scenarios in which WebRTC can be used.

WebRTC Insertable Streams Use Scenarios

1. End-to-end encryption

This is the scenario that WebRTC Insertable Streams was originally designed to support. However, end-to-end encryption makes server-side recording and interoperability with existing live-streaming infrastructure much harder. At present, domestic service providers are unlikely to adopt end-to-end encryption, but it is a baseline requirement in overseas scenarios.
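
To show where encryption would hook into the transform, here is a toy sketch that XORs the frame payload with a repeating key while leaving the first few bytes readable. XOR is not real encryption; a production system would use an authenticated cipher. The names xorFrame, makeXorTransform, and CLEAR_PREFIX_BYTES, and the prefix size of 4, are all assumptions made for illustration.

```javascript
// Toy illustration only: XOR is NOT secure encryption.
// CLEAR_PREFIX_BYTES, xorFrame and makeXorTransform are hypothetical names.
const CLEAR_PREFIX_BYTES = 4; // assumed number of leading bytes left readable

function xorFrame(frameData, keyBytes) {
    const input = new Uint8Array(frameData);
    const output = new Uint8Array(input.length);
    for (let i = 0; i < input.length; i++) {
        output[i] = i < CLEAR_PREFIX_BYTES
            ? input[i]
            : input[i] ^ keyBytes[(i - CLEAR_PREFIX_BYTES) % keyBytes.length];
    }
    return output.buffer;
}

// The same transform serves both directions, because XOR-ing twice
// with the same key restores the original bytes.
function makeXorTransform(keyBytes) {
    return (chunk, controller) => {
        chunk.data = xorFrame(chunk.data, keyBytes);
        controller.enqueue(chunk);
    };
}
```

The function returned by makeXorTransform could be passed as the transform callback of a TransformStream on both the sender and the receiver, provided both sides share the same key.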

2. Frame-level information synchronization

We can attach meta information to the encoded data, send it together with the audio and video frames, and extract it on the receiving end when the frames arrive.

 Whiteboard synchronization in education is a good fit, and it compensates for the fact that SEI cannot be used on the Web.

 In piano teaching scenarios, key-press information can be kept fully synchronized with the audio and video.

 In VR/AR scenarios, camera information, coordinate information, etc. need to be synchronized with the audio and video.

 In remote audio and video control scenarios, control signaling can also be packed into the audio and video frames.
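
The scenarios above all come down to attaching a small structured payload to each frame. A minimal sketch, assuming every frame carries the trailer and the metadata is JSON: the layout [encoded frame][UTF-8 JSON][uint16 JSON length] and the helper names attachMeta and detachMeta are assumptions made for illustration, not part of any API.

```javascript
// Hypothetical helpers. Assumed layout:
// [encoded frame][UTF-8 JSON][uint16 JSON length, big-endian]
const metaEncoder = new TextEncoder();
const metaDecoder = new TextDecoder();

function attachMeta(frameData, meta) {
    const json = metaEncoder.encode(JSON.stringify(meta));
    if (json.length > 0xffff) {
        throw new RangeError("metadata too large for a uint16 length field");
    }
    const out = new Uint8Array(frameData.byteLength + json.length + 2);
    out.set(new Uint8Array(frameData), 0);
    out.set(json, frameData.byteLength);
    new DataView(out.buffer).setUint16(out.length - 2, json.length);
    return out.buffer;
}

// Assumes the frame was produced by attachMeta; it does not detect
// frames that lack the trailer.
function detachMeta(frameData) {
    const view = new DataView(frameData);
    const jsonLength = view.getUint16(frameData.byteLength - 2);
    const jsonStart = frameData.byteLength - 2 - jsonLength;
    const json = new Uint8Array(frameData, jsonStart, jsonLength);
    return {
        frame: frameData.slice(0, jsonStart),
        meta: JSON.parse(metaDecoder.decode(json)),
    };
}
```

In the piano example, the sender might call attachMeta(chunk.data, { key: 60, pressed: true }) inside its transform, and the receiver would call detachMeta before enqueueing the frame for decoding.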

3. End-to-end delay statistics

In WebRTC call scenarios, especially those that pass through multiple server hops, end-to-end delay is difficult to measure, which complicates data reporting. We can pack an absolute timestamp into the frame on the sending end, carry it across the entire link, and read it back on the playback end to compute the delay of the whole link.
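
A minimal sketch of this idea follows. It assumes the sender and receiver clocks are synchronized (for example via NTP), which is not guaranteed in practice; without that, the computed value includes the clock offset. The helper names stampFrame and readDelay and the 8-byte trailer format are assumptions made for illustration.

```javascript
// Hypothetical helpers. The trailer is an 8-byte float64 holding
// milliseconds since the Unix epoch, as returned by Date.now().
const TIMESTAMP_BYTES = 8;

function stampFrame(frameData, nowMs = Date.now()) {
    const out = new Uint8Array(frameData.byteLength + TIMESTAMP_BYTES);
    out.set(new Uint8Array(frameData), 0);
    new DataView(out.buffer).setFloat64(frameData.byteLength, nowMs);
    return out.buffer;
}

// Assumes the frame was produced by stampFrame.
function readDelay(frameData, nowMs = Date.now()) {
    const payloadLength = frameData.byteLength - TIMESTAMP_BYTES;
    const sentMs = new DataView(frameData).getFloat64(payloadLength);
    return {
        frame: frameData.slice(0, payloadLength),
        delayMs: nowMs - sentMs,
    };
}
```

The sender would call stampFrame in its transform, and the receiver would call readDelay, report delayMs, and enqueue only the stripped frame.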

4. Custom input and rendering

WebRTC Insertable Streams allows us to customize capture and encoding. In this way, we can bypass the original limitations of WebRTC: capture audio with WebAudio, add our own noise reduction and echo cancellation algorithms, or even add voice-changing effects, and then hand the result to WebRTC for transmission. Similarly, video can use its own capture and encoding logic, such as applying beauty filters, using a custom optimized encoder, or adding region-based encoding. The rendering stage can also carry custom logic, such as video borders, video overlays, and other special effects.

5. Bypass the WebRTC audio processing module and transmit high-quality music audio

This fifth item is really an extension of the fourth. On the Web, we cannot turn off WebRTC's APM (audio processing) module, which means that all captured audio must pass through it. The APM module filters out non-voice content, which is very unfriendly to music. Since we can customize audio capture and encoding, we only need to encode high-quality music ourselves and feed it to WebRTC through WebRTC Insertable Streams. This bypasses the APM module and lets WebRTC transmit high-quality music. (This idea is theoretically feasible but has not been verified further. Interested readers are welcome to verify it.)

These are the scenarios I could think of right away. I expect the industry will find many more innovative uses, and I look forward to seeing them once WebRTC Insertable Streams becomes widely available.

What are the caveats?

WebRTC Insertable Streams allows us to modify the encoded audio and video data, but WebRTC packetizes that data into RTP when sending, and RTP packetization has requirements on the bitstream format. This means the encoded data cannot be modified arbitrarily. For example, H264 bitstream data needs to start with the start code 0x00000001 ("0001"). If you modify this start code, you break the RTP packetization logic and transmission fails. Meta information therefore cannot be added just anywhere; it must leave WebRTC's own RTP packetization logic intact. In the H264 case, for example, we can append custom data after the entire frame, and the playback side parses it out by applying the reverse logic.

The added meta information should also be kept small; otherwise it may affect the RTP packetization logic.

Adding custom meta information to frames also causes trouble for recording and restreaming systems. When recording or restreaming, the server side needs to filter out the added meta information.
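
One possible way to make that filtering practical is to end the trailer with a fixed marker, so that a recorder can recognize tagged frames and leave everything else untouched. This is only a sketch: the layout, the marker value, and the names tagFrame and stripIfTagged are assumptions made for illustration, and an untagged frame that happens to end with the same bytes would be misdetected.

```javascript
// Hypothetical layout:
// [encoded frame][metadata][uint16 metadata length][uint32 marker]
// Assumes the metadata is shorter than 64 KiB.
const MARKER = 0x4d455441; // ASCII "META", an arbitrary value for this sketch
const FOOTER_BYTES = 6;    // uint16 length + uint32 marker

function tagFrame(frameData, metaBytes) {
    const out = new Uint8Array(frameData.byteLength + metaBytes.length + FOOTER_BYTES);
    out.set(new Uint8Array(frameData), 0);
    out.set(metaBytes, frameData.byteLength);
    const view = new DataView(out.buffer);
    view.setUint16(out.length - 6, metaBytes.length);
    view.setUint32(out.length - 4, MARKER);
    return out.buffer;
}

// Returns the original frame when the footer is present,
// otherwise returns the input unchanged.
function stripIfTagged(frameData) {
    const size = frameData.byteLength;
    if (size < FOOTER_BYTES) return frameData;
    const view = new DataView(frameData);
    if (view.getUint32(size - 4) !== MARKER) return frameData;
    const metaLength = view.getUint16(size - 6);
    if (metaLength > size - FOOTER_BYTES) return frameData;
    return frameData.slice(0, size - FOOTER_BYTES - metaLength);
}
```

A recording or restreaming service could run stripIfTagged on each frame before writing it out, so frames from senders that do not add metadata pass through as they are.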

Show me the Code

I have implemented a WebRTC Insertable Streams demo. The server uses medooze-media-server. The publisher packs the index of the current video frame into each encoded frame; after the server forwards it, the subscriber parses the index of the current video frame and prints it to the console. Interested readers can try it out for themselves. The project is on GitHub: notedit/webrtc-insertable-stream-play.
