Bandits and RL for adaptivity
Social posts described applying multi-armed bandits and an epsilon-greedy DQN approach to context-dependent strategy selection for adaptive systems. (x.com) A new paper, “eBandit: Kernel-Driven Reinforcement Learning for Adaptive Video Streaming,” was shared alongside threads on real-time systems that detect learner states and adjust next steps continuously.
Adaptive systems are being built to pick their next move on the fly, using trial-and-error methods that switch strategies as conditions change. (arxiv.org) (sciencedirect.com)

One common setup is the “multi-armed bandit,” a slot-machine analogy where software tests several options and keeps leaning toward the one paying off best. Another is reinforcement learning, where a model learns action values over time and still makes occasional exploratory choices through an epsilon-greedy rule. (arxiv.org) (proceedings.neurips.cc)

The new paper in this cluster, “eBandit: Kernel-Driven Reinforcement Learning for Adaptive Video Streaming,” was submitted to arXiv on April 9, 2026. It moves network monitoring and bitrate-policy selection into the Linux kernel with extended Berkeley Packet Filter, or eBPF, instead of leaving those decisions in user-space video players. (arxiv.org 1) (arxiv.org 2)

The paper says user-space adaptive bitrate systems cannot directly see transport-layer signals such as minimum round-trip time and instantaneous delivery rate, and often react only after a playback buffer has already taken a hit. Its kernel-resident controller runs an epsilon-greedy multi-armed bandit that scores three bitrate heuristics against live Transmission Control Protocol metrics. (arxiv.org 1) (arxiv.org 2)

On a synthetic adversarial trace, eBandit reported cumulative quality-of-experience of 416.3 plus or minus 4.9, beating the best fixed heuristic by 7.2 percent. On 42 real-world sessions, the paper reported the highest mean quality-of-experience per chunk, at 1.241. (arxiv.org) (alphaxiv.org)

The same basic logic is showing up outside video. Recent reviews of adaptive learning platforms describe systems that continuously collect learner interactions, estimate skill or engagement, and change the next lesson, hint, or assessment item in real time.
(sciencedirect.com) (springer.com)

A February 5, 2026 Scientific Reports paper applied a multi-armed bandit framework to computerized adaptive testing, combining deep learning and reinforcement learning to choose questions during an exam. That places education and streaming in the same engineering pattern: observe state, test actions, keep the better policy, and keep some room for exploration. (nature.com) (proceedings.neurips.cc)

The technical split matters because bandits usually choose among a small set of strategies based on immediate reward, while Deep Q-Network systems try to learn longer-term action values across sequences of states. Theory work presented at NeurIPS 2023 found that higher epsilon values widen the region where Deep Q-Network training converges, but slow that convergence. (proceedings.neurips.cc)

What ties these projects together is speed. In video, the reward signal is stalls, bitrate, and smoothness measured from live traffic; in learning software, it is correctness, pace, or engagement measured from each new interaction. (arxiv.org) (sciencedirect.com)

The result is a narrower claim than “artificial intelligence adapts everything.” These systems are being designed to pick among concrete next actions under uncertainty, one choice at a time, and April 2026’s eBandit paper gives that approach a fresh test in one of the internet’s busiest workloads. (arxiv.org)
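
For readers who want to see the epsilon-greedy bandit pattern described above in concrete form, here is a minimal user-space Python sketch. It is not the eBandit implementation, which the paper describes as running inside the Linux kernel via eBPF. The class name, the arm names (heuristic_a, heuristic_b, heuristic_c), the epsilon value, and the simulated reward numbers are all assumptions made up for illustration; a real controller would compute reward from measured signals such as stalls, bitrate, and smoothness.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit over a small, fixed set of named arms."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        if not self.arms:
            raise ValueError("need at least one arm")
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}  # running mean reward per arm
        self._rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon; otherwise exploit the best estimate.
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm, reward):
        # Incremental mean: Q <- Q + (reward - Q) / n
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(steps=2000, epsilon=0.1, seed=0):
    # Hypothetical arms with made-up mean rewards (illustration only,
    # not figures from the paper).
    true_means = {"heuristic_a": 0.8, "heuristic_b": 1.0, "heuristic_c": 1.2}
    bandit = EpsilonGreedyBandit(true_means, epsilon=epsilon, seed=seed)
    noise = random.Random(seed + 1)
    for _ in range(steps):
        arm = bandit.select()
        reward = noise.gauss(true_means[arm], 0.1)  # noisy simulated reward
        bandit.update(arm, reward)
    return bandit


if __name__ == "__main__":
    result = simulate()
    for name in result.arms:
        print(name, result.counts[name], round(result.values[name], 3))
```

In this sketch the controller scores each arm by its average immediate reward, which is the bandit side of the split discussed above. A Deep Q-Network would instead estimate longer-term values across sequences of states, which this example does not attempt.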