Aashish Gottipati

Problem & Motivation

Delivering consistently high-quality video calls is hard because network conditions vary widely across users and change constantly. Traditional heuristic-based algorithms, like those in WebRTC, are robust but rigid: manually tuning their many parameters for each user's environment is impractical. Modern Reinforcement Learning (RL) approaches can adapt, but they learn from scratch. This trial-and-error process is highly sample-inefficient, requiring tens of thousands of live calls to converge (see Fig. 3), which is too slow and too risky for real users.

Our Approach & Contribution

We introduce Merlin, a system that combines the robustness of heuristics with the adaptability of learning. Instead of learning a policy from scratch, we transform a battle-tested, expert-designed heuristic (an Unscented Kalman Filter from Microsoft Teams) into a neural network. We use Behavioral Cloning, which lets Merlin learn the heuristic's policy entirely from offline telemetry logs, with zero live network interaction during initial training. The cloned neural network captures years of domain expertise. The key innovation is that this policy can then be fine-tuned efficiently on a small amount of live data to personalize it for a user's specific network, avoiding manual parameter tuning.
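The cloning step can be illustrated with a minimal sketch. This is not Merlin's implementation: `expert_heuristic` is a hypothetical stand-in for the expert estimator (not the Teams Unscented Kalman Filter), the telemetry is synthetic, the feature ranges in `SCALE` are assumed, and a linear policy replaces the neural network so the example needs only the Python standard library. What it does show is the core idea of Behavioral Cloning: fit a policy to logged (observation, expert action) pairs with a supervised loss, without interacting with a live network.

```python
import random

# Assumed feature ranges (Mbps, ms, loss fraction) used only for normalization.
SCALE = (5.0, 200.0, 0.1)

def expert_heuristic(recv_rate_mbps, delay_ms, loss_frac):
    """Hypothetical expert estimator; a stand-in, NOT the real heuristic."""
    return 1.1 * recv_rate_mbps - 0.004 * delay_ms - 2.0 * loss_frac

def make_offline_logs(n, seed=0):
    """Synthesize telemetry logs: (observation, expert bandwidth estimate)."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        obs = (rng.uniform(0.5, 5.0),     # receive rate in Mbps
               rng.uniform(10.0, 200.0),  # delay in ms
               rng.uniform(0.0, 0.1))     # packet loss fraction
        logs.append((obs, expert_heuristic(*obs)))
    return logs

def featurize(obs):
    """Prepend a bias term and scale each feature to roughly [0, 1]."""
    return [1.0] + [x / s for x, s in zip(obs, SCALE)]

def predict(weights, obs):
    """Cloned policy: bandwidth estimate for one observation."""
    return sum(w * f for w, f in zip(weights, featurize(obs)))

def behavioral_cloning(logs, epochs=100, lr=0.05, seed=0):
    """Fit the policy to logged expert actions by SGD on squared error."""
    rng = random.Random(seed)
    weights = [0.0] * (len(SCALE) + 1)
    data = list(logs)
    for _ in range(epochs):
        rng.shuffle(data)
        for obs, expert_action in data:
            feats = featurize(obs)
            error = sum(w * f for w, f in zip(weights, feats)) - expert_action
            # Gradient of 0.5 * error**2 with respect to each weight is error * f.
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
    return weights

# Train purely from offline logs, then compare against the expert on unseen logs.
weights = behavioral_cloning(make_offline_logs(500))
held_out = make_offline_logs(100, seed=1)
max_gap = max(abs(predict(weights, obs) - action) for obs, action in held_out)
print(f"largest gap to expert on held-out logs: {max_gap:.6f} Mbps")
```

Because the stand-in expert is itself linear, the cloned policy can reproduce it almost exactly. A real heuristic is not linear, which is one reason to use a neural network as the policy. Fine-tuning would then continue from these cloned weights using a small amount of live-call data rather than starting from random initialization.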

Results & Impact

Merlin successfully generalized from offline simulations to the real world, matching the performance of the original heuristic in live, intercontinental video calls with no statistical difference in user Quality of Experience (QoE). By fine-tuning on just 75 calls, Merlin improved user QoE by up to 7.8% compared to the baseline heuristic, demonstrating effective personalization. The most significant result is its efficiency: Merlin converges to a high-quality policy using 80% fewer video calls than state-of-the-art online RL methods, making continuous end-user personalization a practical reality.

Chart showing QoE improvement after fine-tuning Merlin

Fig. 2: Personalization in action. After fine-tuning on specific network types (Low and High Bandwidth), Merlin improves QoE by up to 7.8% over the original heuristic policy.

Chart comparing data efficiency of online vs offline learning

Fig. 3: Merlin's approach is far more sample-efficient. Online RL models require ~50,000 calls to converge, whereas Merlin reaches a high-quality policy with about 80% fewer live calls, thanks to offline pretraining.

Offline to Online Learning for Real-Time Bandwidth Estimation
