Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization

Feb. 23, 2021

$\underline{\text{Aim}}$

In this paper, an efficient bandit online linear optimization algorithm is proposed, which achieves the optimal $O^*(T^{\frac{1}{2}})$ regret. The existence of such an efficient algorithm had been posed as an open problem in several earlier papers. The paper exploits a self-concordant potential function to overcome the difficulties encountered in previous studies.

$\underline{\text{Background}}$

A sequential decision-making problem, termed the “multi-armed bandit problem”, derives from a model in which, on each round in a sequence, a gambler must pull the arm on one of several slot machines (“one-armed bandits”), each of which returns a reward drawn stochastically from a fixed distribution. The gambler does not know the best arm a priori; his goal is to maximize the reward of his strategy relative to the reward he would have received had he known the optimal arm.

Several authors have proposed a very natural generalization of the multi-armed bandit problem to the field of convex optimization, called “bandit linear optimization”. In this setting we imagine that, on each round $t$, an adversary chooses some linear function $f_t(\cdot)$ which is not revealed to the player. The player then chooses a point $\mathbf{x}_t$ within some given convex set $\mathcal{K} \subseteq \mathbb{R}^n$. The player then suffers $f_t(\mathbf{x}_t)$, and this quantity is revealed to him. This process continues for $T$ rounds, and at the end the learner’s payoff is his regret:
$$R_{T}=\sum_{t=1}^{T} f_{t}\left(\mathbf{x}_{t}\right)-\min _{\mathbf{x}^{*} \in \mathcal{K}} \sum_{t=1}^{T} f_{t}\left(\mathbf{x}^{*}\right)$$

In the full-information model, it has been known for some time that the optimal regret bound is $O(T^{\frac{1}{2}})$. It had been conjectured that this $O(T^{\frac{1}{2}})$ bound also holds for the bandit version. However, several of the algorithms proposed earlier achieve only $O(T^{\frac{3}{4}})$ or $O(T^{\frac{2}{3}})$. The one that achieves $O(\mathrm{poly}(n)T^{\frac{1}{2}})$ is, unfortunately, not efficient.

This paper proposes an algorithm that is efficient and achieves an $O(\mathrm{poly}(n)T^{\frac{1}{2}})$ regret bound. Moreover, the paper discovers a link between Bregman divergences and self-concordant barriers: divergence functions provide the right perspective for the problem of managing uncertainty given limited feedback.

$\underline{\text{Brief Project Description}}$

The terms “full-information version” and “bandit version” were mentioned above. They are explained here, after the definition of an online linear optimization problem. This problem is defined as the following repeated game between the learner (player) and the environment (adversary).

At each time step $t=1$ to $T$,

$\bullet$ Player chooses $\mathbf{x}_t\in\mathcal{K}$
$\bullet$ Adversary independently chooses $\mathbf{f}_t\in\mathbb{R}^n$
$\bullet$ Player suffers loss $\mathbf{f}_t^\top\mathbf{x}_t$ and observes feedback $\Im$.

In this game, the Player’s goal is to minimize his regret $R_T$, defined as

$$R_{T}:=\sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{x}_{t}-\min _{\mathbf{x}^{*} \in \mathcal{K}} \sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{x}^{*}$$
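The regret definition can be checked on a tiny instance. The sketch below is illustrative only: it assumes $\mathcal{K}$ is a polytope given by its vertices (a linear function attains its minimum over a polytope at a vertex), and the helper name `regret` and the example data are hypothetical.

```python
def regret(losses, plays, vertices):
    """Regret of `plays` against the best fixed vertex of a polytope K.

    losses   -- loss vectors f_1, ..., f_T
    plays    -- points x_1, ..., x_T chosen by the player
    vertices -- vertices of K (assumption: K is a polytope)
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    player_loss = sum(dot(f, x) for f, x in zip(losses, plays))
    # Best fixed comparator x*: minimise the cumulative linear loss over vertices.
    best_fixed = min(sum(dot(f, v) for f in losses) for v in vertices)
    return player_loss - best_fixed


if __name__ == "__main__":
    square = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # K = [-1, 1]^2
    losses = [(1, 0), (-1, 0), (1, 1)]
    plays = [(0, 0), (0, 0), (0, 0)]                # player always stays at the centre
    print(regret(losses, plays, square))            # prints 2
```

Here the cumulative loss vector is $(1, 1)$, so the best fixed point is the vertex $(-1,-1)$ with total loss $-2$, while the player's total loss is $0$.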

In the full-information version, the Player may observe the entire function $\mathbf{f}_t$ as his feedback $\Im$ and can exploit this in making his decisions. In comparison, in the bandit version the Player observes only the scalar feedback $\mathbf{f}_t^\top\mathbf{x}_t$ after he has made the decision $\mathbf{x}_t$ at that round.

Though the algorithm proposed in this paper deals with the bandit version of the problem, it is still reasonable to use a reduction to the full-information setting, since any algorithm that aims for low regret in the bandit setting would necessarily have to achieve low regret given full information. For example, the well-known Follow The Leader (FTL) strategy selects the best choice so far:
$$\mathbf{x}_{t+1}:=\arg \min _{\mathbf{x} \in \mathcal{K}} \sum_{s=1}^{t} \mathbf{f}_{s}^{\top} \mathbf{x}. \qquad (1)$$
And the Follow The Regularized Leader (FTRL) strategy:
$$\mathbf{x}_{t+1}:=\arg \min _{\mathbf{x} \in \mathcal{K}}\left[\sum_{s=1}^{t} \mathbf{f}_{s}^{\top} \mathbf{x}+\lambda \mathcal{R}(\mathbf{x})\right]. \qquad (2)$$
Given that $\mathcal{R}$ is convex and differentiable, the general form of the FTRL update is as follows:
$$\overline{\mathbf{x}}_{t+1}=\nabla \mathcal{R}^{*}\left(\nabla \mathcal{R}\left(\overline{\mathbf{x}}_{t}\right)-\eta \mathbf{f}_{t}\right), \qquad (3)$$
followed by a projection onto $\mathcal{K}$ with respect to the divergence $D_\mathcal{R}$:
$$\mathbf{x}_{t+1}=\arg \min _{\mathbf{u} \in \mathcal{K}} D_{\mathcal{R}}\left(\mathbf{u}, \overline{\mathbf{x}}_{t+1}\right).$$
Here $\mathcal{R}^*$ is the Fenchel dual function and $\eta$ is a parameter. This procedure is known as mirror descent.
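As a concrete instance of update (3) and the projection step, consider the simplest regularizer. The sketch below assumes $\mathcal{R}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}\|^2$, for which $\nabla\mathcal{R}$ and $\nabla\mathcal{R}^*$ are both the identity and $D_\mathcal{R}$ is half the squared Euclidean distance, and it assumes $\mathcal{K}$ is the unit ball so that the projection is a rescaling. Both choices, and the name `mirror_descent_step`, are illustrative and are not taken from the paper.

```python
import math


def mirror_descent_step(x, f, eta):
    """One mirror descent step for R(x) = 0.5 * ||x||^2 on the unit ball.

    With this R, update (3) reduces to a plain gradient step, and the
    Bregman projection reduces to the Euclidean projection onto the ball.
    """
    # Update (3): grad R and grad R* are the identity map.
    x_bar = [xi - eta * fi for xi, fi in zip(x, f)]
    # Projection onto K = {u : ||u|| <= 1}: rescale if outside the ball.
    norm = math.sqrt(sum(v * v for v in x_bar))
    if norm > 1.0:
        x_bar = [v / norm for v in x_bar]
    return x_bar


if __name__ == "__main__":
    x = [0.0, 0.0]
    for f in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
        x = mirror_descent_step(x, f, eta=0.5)
        print(x)
```

With other regularizers the gradient map and the projection change, which is exactly the freedom that the self-concordant barrier exploits later in the post.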

For an online learning algorithm $\mathcal{A}$, “explore or exploit” is a serious problem. A player first chooses some full-information online learning algorithm $\mathcal{A}$. $\mathcal{A}$ receives input vectors $\mathbf{f}_1,\cdots, \mathbf{f}_t$ corresponding to previously observed functions, and returns some point $\mathbf{x}_{t+1}\in\mathcal{K}$ to predict. It is assumed that $\mathbf{f}_1,\cdots, \mathbf{f}_t$ are just realizations of the random vector $\tilde{\mathbf{f}}_{t}$, so the prediction becomes more accurate as more “new” input vectors $\mathbf{f}$ arrive. Here comes the dilemma of “explore or exploit”: whether to follow the advice of $\mathcal{A}$ and predict $\mathbf{x}_t$, or to try to estimate $\mathbf{f}_t$ by sampling in a wide region around $\mathcal{K}$, possibly hurting performance on the given round. This exploration-exploitation trade-off is the primary source of difficulty in obtaining $O(T^{\frac{1}{2}})$ guarantees on the regret.

Approaches that perform both exploration and exploitation fall roughly into two categories: Alternating Explore/Exploit and Simultaneous Explore/Exploit. The first category fails to obtain the desired $O(\mathrm{poly}(n)T^{\frac{1}{2}})$ regret, so the second is the focus here. Two Simultaneous Explore/Exploit algorithms, proposed by Auer et al. [1] and Flaxman et al. [2] respectively, are reviewed. Both follow the same scheme: query $\mathcal{A}$ for $\mathbf{x}_t$ and construct a random vector $\bm{X}_t$ such that $\mathbb{E}(\bm{X}_t) = \mathbf{x}_t$; then construct $\tilde{\mathbf{f}}_t$ randomly based on the outcome of $\bm{X}_t$ and the learned value $\mathbf{f}_t^\top\bm{X}_t$.

It is pointed out in the paper that the estimates $\tilde{\mathbf{f}}_t$ in both methods are inversely proportional to the distance of $\mathbf{x}_t$ to the boundary, which implies high variance of the estimated functions. This matters because the regret of most full-information algorithms scales linearly with the magnitude of the functions played by the environment. Fortunately, if we restrict our search to a regularization algorithm of type (2), the expected regret can be shown to equal an expression involving $\mathbb{E} D_{\mathcal{R}}\left(\mathbf{x}_{t}, \mathbf{x}_{t+1}\right)$ terms. For $\mathcal{R}(\mathbf{x}) \propto\|\mathbf{x}\|^{2}$, the paper recovers the method of Flaxman et al. with its insurmountable hurdle of $\mathbb{E}\left\|\tilde{\mathbf{f}}_{t}\right\|^{2}$.

The main result of this paper is an algorithm for online linear optimization in the bandit setting for an arbitrary compact convex set $\mathcal{K}$, which is as follows:
[Figure: Algorithm 1 (image)]
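Algorithm 1 can also be sketched in code. What follows is a hedged reconstruction of the scheme described in [3], not the authors' own listing. At round $t$ it takes the eigenvectors $\mathbf{e}_i$ and eigenvalues $\lambda_i$ of $\nabla^2\mathcal{R}(\mathbf{x}_t)$, draws $i_t$ uniformly and a sign $\varepsilon_t=\pm 1$, plays $\mathbf{y}_t=\mathbf{x}_t+\varepsilon_t\lambda_{i_t}^{-1/2}\mathbf{e}_{i_t}$, forms the estimate $\tilde{\mathbf{f}}_t=n(\mathbf{f}_t^\top\mathbf{y}_t)\varepsilon_t\lambda_{i_t}^{1/2}\mathbf{e}_{i_t}$, and updates $\mathbf{x}_{t+1}$ by regularized minimization of $\eta\sum_{s\le t}\tilde{\mathbf{f}}_s^\top\mathbf{x}+\mathcal{R}(\mathbf{x})$. The concrete setting is an assumption made for illustration: $\mathcal{K}=[-1,1]^n$ with the logarithmic barrier, whose Hessian is diagonal, so the eigenvectors are the coordinate axes and the update has a closed form. All function names are hypothetical.

```python
import math
import random


def hessian_diag(x):
    """Diagonal Hessian of R(x) = -sum(log(1 - x_i) + log(1 + x_i))."""
    return [1.0 / (1.0 - xi) ** 2 + 1.0 / (1.0 + xi) ** 2 for xi in x]


def play_and_estimate(x, f, i, eps):
    """Play y = x + eps * lam_i^(-1/2) * e_i and build the one-point estimate."""
    n = len(x)
    lam_i = hessian_diag(x)[i]          # eigenvalue; its eigenvector is axis e_i
    y = list(x)
    y[i] += eps / math.sqrt(lam_i)      # a point on the Dikin ellipsoid around x
    observed = sum(fj * yj for fj, yj in zip(f, y))   # the only feedback used
    f_tilde = [0.0] * n
    f_tilde[i] = n * observed * eps * math.sqrt(lam_i)
    return y, observed, f_tilde


def ftrl_coordinate(c):
    """argmin over (-1, 1) of c*x - log(1 - x) - log(1 + x), in closed form."""
    return -c / (1.0 + math.sqrt(1.0 + c * c))


def run_bandit(loss_vectors, eta, rng):
    """Run the sketch on a sequence of loss vectors; return plays and total loss."""
    n = len(loss_vectors[0])
    x = [0.0] * n                       # x_1 = argmin R(x), the centre of the box
    cum = [0.0] * n                     # running sum of the estimates f~_s
    plays, total = [], 0.0
    for f in loss_vectors:
        i = rng.randrange(n)
        eps = rng.choice((-1.0, 1.0))
        y, observed, f_tilde = play_and_estimate(x, f, i, eps)
        cum = [c + g for c, g in zip(cum, f_tilde)]
        # Regularised update; separable per coordinate for this barrier.
        x = [ftrl_coordinate(eta * c) for c in cum]
        plays.append(y)
        total += observed
    return plays, total


if __name__ == "__main__":
    losses = [(0.3, -0.2, 0.1)] * 500   # illustrative, satisfies |f.x| <= 1 on K
    plays, total = run_bandit(losses, eta=0.02, rng=random.Random(0))
    print("total loss:", total)
```

Two properties are worth noting. Averaging $\tilde{\mathbf{f}}_t$ over the $2n$ equally likely choices of $(i_t,\varepsilon_t)$ gives back $\mathbf{f}_t$, so the estimate is unbiased. And because $\lambda_i^{-1/2}\le\min(1-x_i,\,1+x_i)$ for this barrier, the played point $\mathbf{y}_t$ never leaves the box.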
In Section 4 of the paper the regularization framework is discussed in detail, and it is shown how the regret can be computed in terms of Bregman divergences. The theory and main properties of self-concordant functions are presented in Section 5. In Section 6, several key elements of the proof of the regret bound of the proposed algorithm are given. In Section 7 the paper shows how this algorithm can be used for one interesting case, namely the bandit version of the Online Shortest Path problem. The precise analysis of the algorithm is given in Section 8. Finally, Section 9 covers the implementation of the algorithm.

The main result of the paper is as follows:

Theorem 1 Let $\mathcal{K}$ be a convex set and $\mathcal{R}$ be a $\vartheta$-self-concordant barrier on $\mathcal{K}$. Let $\mathbf{u}$ be any vector in $\mathcal{K}' = \mathcal{K}_{T^{-1/2}}$. Suppose we have the property that $\left|\mathbf{f}_{t}^{\top} \mathbf{x}\right| \leq 1$ for any $\mathbf{x}\in\mathcal{K}$. Setting $\eta=\frac{\sqrt{\vartheta \log T}}{4 n \sqrt{T}}$, the regret of Algorithm 1 is bounded as
$$\mathbb{E} \sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{y}_{t} \leq \min _{\mathbf{u} \in \mathcal{K}^{\prime}} \mathbb{E}\left(\sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{u}\right)+16 n \sqrt{\vartheta T \log T}$$
whenever $T>8 \vartheta \log T$.
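To get a feel for the quantities in Theorem 1, the step size $\eta$ and the additive term $16n\sqrt{\vartheta T\log T}$ can be evaluated directly. The helper below is a small sketch; the name `theorem1_parameters` and the example values of $n$, $\vartheta$ and $T$ are illustrative assumptions.

```python
import math


def theorem1_parameters(n, theta, T):
    """Return (eta, additive_bound) from Theorem 1 for dimension n,
    barrier parameter theta and horizon T."""
    # The theorem is only stated for horizons with T > 8 * theta * log T.
    if not T > 8 * theta * math.log(T):
        raise ValueError("Theorem 1 requires T > 8 * theta * log(T)")
    eta = math.sqrt(theta * math.log(T)) / (4 * n * math.sqrt(T))
    additive_bound = 16 * n * math.sqrt(theta * T * math.log(T))
    return eta, additive_bound


if __name__ == "__main__":
    # Illustrative values only: n = 2, theta = 4, one million rounds.
    eta, bound = theorem1_parameters(n=2, theta=4, T=10**6)
    print("eta =", eta)
    print("additive regret term =", bound)
    print("per-round average =", bound / 10**6)
```

A handy sanity check is that $\eta$ times the additive term equals $4\vartheta\log T$. Note also that the additive term only drops below the trivial bound of order $T$ once $T$ is large relative to $n^2\vartheta\log T$.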

Here the definitions of the scaled version of $\mathcal{K}$ and of a $\vartheta$-self-concordant function are used. The scaled version of $\mathcal{K}$ is defined as:
$$\mathcal{K}_{\delta}=\left\{\mathbf{u}: \pi_{\mathbf{x}_{1}}(\mathbf{u}) \leq(1+\delta)^{-1}\right\}$$
where $\pi_{\mathbf{x}_1}$ denotes the Minkowski function of $\mathcal{K}$ with pole at $\mathbf{x}_1$. To define a $\vartheta$-self-concordant function, we first give the definition of a self-concordant function:

Definition (self-concordant function) A self-concordant function $\mathcal{R}: \operatorname{int} \mathcal{K} \rightarrow\mathbb{R}$ is a $C^3$ convex function such that
$$\left|D^{3} \mathcal{R}(\mathbf{x})[\mathbf{h}, \mathbf{h}, \mathbf{h}]\right| \leq 2\left(D^{2} \mathcal{R}(\mathbf{x})[\mathbf{h}, \mathbf{h}]\right)^{3 / 2}$$
Here, the third-order differential is defined as

$$D^{3} \mathcal{R}(\mathbf{x})\left[\mathbf{h}_{1}, \mathbf{h}_{2}, \mathbf{h}_{3}\right] := \left.\frac{\partial^{3}}{\partial t_{1} \partial t_{2} \partial t_{3}}\right|_{t_{1}=t_{2}=t_{3}=0} \mathcal{R}\left(\mathbf{x}+t_{1} \mathbf{h}_{1}+t_{2} \mathbf{h}_{2}+t_{3} \mathbf{h}_{3}\right)$$

Now we can define the $\vartheta$-self-concordant barrier as follows:

Definition ($\vartheta$-self-concordant barrier) A $\vartheta$-self-concordant barrier $\mathcal{R}$ is a self-concordant function with
$$|D \mathcal{R}(\mathbf{x})[\mathbf{h}]| \leq \vartheta^{1 / 2}\left[D^{2} \mathcal{R}(\mathbf{x})[\mathbf{h}, \mathbf{h}]\right]^{1 / 2}.$$
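
Both definitions can be verified by hand on the simplest barrier. As an illustrative example (not taken from the paper), take $\mathcal{R}(x)=-\log x$ on $\mathcal{K}=[0,\infty)$. For $x>0$ and a direction $h\in\mathbb{R}$:

$$D\mathcal{R}(x)[h]=-\frac{h}{x},\qquad D^{2}\mathcal{R}(x)[h,h]=\frac{h^{2}}{x^{2}},\qquad D^{3}\mathcal{R}(x)[h,h,h]=-\frac{2h^{3}}{x^{3}}.$$

The self-concordance inequality holds with equality:

$$\left|D^{3}\mathcal{R}(x)[h,h,h]\right|=\frac{2|h|^{3}}{x^{3}}=2\left(\frac{h^{2}}{x^{2}}\right)^{3/2}=2\left(D^{2}\mathcal{R}(x)[h,h]\right)^{3/2},$$

and the barrier inequality holds with $\vartheta=1$:

$$\left|D\mathcal{R}(x)[h]\right|=\frac{|h|}{x}=1^{1/2}\left[D^{2}\mathcal{R}(x)[h,h]\right]^{1/2}.$$

So $-\log x$ is a $1$-self-concordant barrier. Summing one such term per linear constraint gives the logarithmic barrier of a polytope with $m$ constraints, whose parameter is $\vartheta=m$.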

$\underline{\text{Significance of Paper}}$

This is the first paper to achieve both efficiency and an $O(\mathrm{poly}(n) \sqrt{T})$ regret bound. The $O(\sqrt{T})$ bound was known for the full-information model, and it now holds for the bandit setting as well. This is surely a breakthrough, since what a player can observe at the end of each round in the bandit setting is far less than in the full-information setting. Also, as the paper reviews, quite a few previous papers obtained only bounds like $O(T^{3/4})$ or $O(T^{2/3})$. Now this “goal” bound is finally achieved efficiently.

$\text{\Large Reference}$

[1] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2003.

[2] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA ’05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.

[3] Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. 2009.

Source: https://blog.csdn.net/datou1596/article/details/113997788
