TL;DR#
Traditional contextual bandit algorithms often struggle with complex feedback structures. This paper addresses this limitation by studying contextual bandits with graph feedback, in which choosing an action also reveals the rewards of its neighboring actions in the feedback graph. Previous research primarily focused on multi-armed bandits with graph feedback, leaving the contextual case largely unexplored. A key challenge is understanding how the statistical complexity of learning changes with the number of contexts and the structure of the feedback graph.
The researchers introduce a new graph-theoretic quantity, β_M(G), which captures the fundamental limits of learning in this problem. As the number of contexts varies, this quantity interpolates smoothly between the known bounds based on the independence number and the maximum acyclic subgraph (MAS) number. They also provide algorithms that achieve nearly optimal performance for significant classes of context sequences and feedback graphs, and demonstrate that the MAS number is the right quantity for characterizing the statistical complexity when the number of contexts is large.
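The exact definition of β_M(G) is given in the paper; the sketch below only illustrates the two extremes it bridges, by brute-forcing the independence number and the MAS number of a small, hypothetical directed feedback graph. The graph, variable names, and brute-force approach are illustrative assumptions, not the paper's construction, and both quantities are NP-hard to compute in general.

```python
from itertools import combinations

# Hypothetical 4-action feedback graph: an edge (u, v) means that playing
# action u also reveals the reward of action v.  Self-loops are omitted.
nodes = [0, 1, 2, 3]
edges = {(0, 1), (1, 2), (2, 0), (2, 3)}


def is_independent(subset):
    """True if no edge (in either direction) links two distinct vertices of subset."""
    return all((u, v) not in edges for u in subset for v in subset if u != v)


def is_acyclic(subset):
    """True if the subgraph induced by subset contains no directed cycle."""
    sub_edges = {(u, v) for (u, v) in edges if u in subset and v in subset and u != v}
    # Kahn's algorithm: repeatedly remove vertices with no incoming edge.
    remaining = set(subset)
    while remaining:
        sources = [v for v in remaining
                   if not any((u, v) in sub_edges for u in remaining)]
        if not sources:
            return False  # every remaining vertex has an incoming edge -> cycle
        remaining -= set(sources)
    return True


def largest_subset(vertices, predicate):
    """Size of the largest vertex subset satisfying predicate (brute force)."""
    for k in range(len(vertices), 0, -1):
        if any(predicate(s) for s in combinations(vertices, k)):
            return k
    return 0


alpha = largest_subset(nodes, is_independent)  # independence number alpha(G)
mas = largest_subset(nodes, is_acyclic)        # MAS number; alpha(G) <= MAS(G)
print(f"alpha(G) = {alpha}, MAS(G) = {mas}")   # prints alpha(G) = 2, MAS(G) = 3
```

On this toy graph the two quantities differ (alpha = 2, MAS = 3), which is exactly the gap that β_M(G) is described as spanning as the number of contexts grows.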
Key Takeaways#
Why does it matter?#
This paper is important for researchers working on contextual bandits and graph feedback, as it provides tight theoretical bounds on the regret, the standard measure of an algorithm's performance. It also introduces novel graph-theoretic quantities that precisely characterize the statistical complexity of learning under different feedback structures and numbers of contexts. The results will guide the development of better-performing algorithms and provide a stronger theoretical foundation for the field. Moreover, the new algorithm proposed for the tabular setting may also inspire similar methods for non-tabular settings.