The goal of this chapter is to formally introduce stochastic bandits. The model introduced here provides the foundation for the remaining chapters that treat stochastic bandits.
Core Assumptions
A stochastic bandit is a collection of distributions $$\nu=\left(P_{a}: a \in \mathcal{A}\right),$$ where $\mathcal{A}$ is the set of available actions. The learner and the environment interact sequentially over $n$ rounds.
The interaction between learner and environment induces a probability measure on the sequence of outcomes $A_1, X_1, A_2, X_2, \ldots, A_n, X_n$.
The two main assumptions of stochastic bandits:
(a) The conditional distribution of the reward $X_t$ given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}, A_t$ is $P_{A_t}$, which captures the intuition that the environment samples $X_t$ from $P_{A_t}$ in round $t$.
That is, the reward $X_t$ depends only on the chosen action $A_t$.
(b) The conditional law of the action $A_t$ given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}$ is $\pi_{t}\left(\cdot \mid A_{1}, X_{1}, \ldots, A_{t-1}, X_{t-1}\right)$, where $\pi_1, \pi_2, \ldots$ is a sequence of probability kernels that characterise the learner. The most important element of this assumption is the intuitive fact that the learner cannot use future observations in current decisions.
In each round the learner's decision is based only on the history, never on the future; that is,
$$\pi_t : \mathcal{H}_{t-1} \rightarrow \mathcal{A}$$
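To make the protocol concrete, here is a minimal simulation sketch of the interaction loop, assuming Gaussian reward distributions and a naive follow-the-leader rule (both my own choices for illustration, not prescribed by the text): the policy maps the history to an action (assumption (b)), and the environment samples the reward from $P_{A_t}$ (assumption (a)).

```python
import numpy as np

rng = np.random.default_rng(0)

class FollowTheLeader:
    """Hypothetical policy: play each arm once, then always play the arm
    with the highest empirical mean. Decisions use only the history."""
    def act(self, history, k):
        rewards = [[x for (a, x) in history if a == i] for i in range(k)]
        for i, xs in enumerate(rewards):
            if not xs:                         # play every arm at least once
                return i
        return int(np.argmax([np.mean(xs) for xs in rewards]))

def interact(policy, means, n):
    """The bandit protocol: in round t the learner picks A_t from the history
    and the environment draws X_t ~ P_{A_t}; here P_a = N(mu_a, 1)."""
    history = []
    for t in range(n):
        a = policy.act(history, len(means))
        x = rng.normal(means[a], 1.0)          # X_t sampled from P_{A_t}
        history.append((a, x))
    return history

history = interact(FollowTheLeader(), means=[0.5, 0.8], n=100)
print("total reward S_n =", sum(x for _, x in history))
```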
Main learning objectives:
The learner's goal is to maximise the total reward $$S_{n}=\sum_{t=1}^{n} X_{t}$$
The learner does not know the value of $n$ in advance, i.e. how many rounds will be played.
The reward is a random quantity, so we need to define a measure of how good the collected reward is.
The learner does not know the true reward distribution of each arm.
Environment
Environment / Instance: $$\nu=\left(P_{1}, \ldots, P_{k}\right) \in \mathcal{E}$$
The set $\mathcal{E}$ is called the environment class.
The environment class describes the range of possible distributions behind each arm.
Example:
$$\mathcal{E}_{k}^{B}=\left\{\nu=\left(P_{1}, \ldots, P_{k}\right): P_{i}=\operatorname{Bernoulli}\left(\mu_{i}\right),\ \left(\mu_{1}, \ldots, \mu_{k}\right) \in[0,1]^{k}\right\}$$
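As a small illustration (my own encoding, not from the text), an instance $\nu \in \mathcal{E}_k^B$ is fully described by its mean vector, and sampling a reward from arm $a$ just means drawing a Bernoulli with mean $\mu_a$:

```python
import numpy as np

rng = np.random.default_rng(1)

def bernoulli_instance(mus):
    """An instance nu = (P_1, ..., P_k) in E_k^B is determined by
    the mean vector (mu_1, ..., mu_k) in [0, 1]^k."""
    mus = np.asarray(mus, dtype=float)
    assert np.all((0.0 <= mus) & (mus <= 1.0))
    return mus

def sample_reward(mus, a):
    """Draw X ~ Bernoulli(mu_a), i.e. one sample from P_a."""
    return rng.binomial(1, mus[a])

nu = bernoulli_instance([0.3, 0.5, 0.7])
print(sample_reward(nu, 2))
```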
Regret
$$R_{n}(\nu, \pi)=n \mu^{*}(\nu)-E\left[\sum_{t=1}^{n} \mu_{A_{t}}\right]$$
This is the expected reward of always playing the optimal arm, minus the expected reward actually collected in each round by following the policy $\pi$.
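To see the definition in action, here is a rough Monte Carlo sketch (assuming Bernoulli arms and the same naive follow-the-leader rule as above, both my own choices): since the second term is an expectation over the learner's random actions, we estimate it by averaging $\sum_t \mu_{A_t}$ over repeated runs.

```python
import numpy as np

rng = np.random.default_rng(2)

def run_policy(mus, n):
    """One run of a naive follow-the-leader policy on Bernoulli arms.
    Returns sum_t mu_{A_t}, the 'pseudo-reward' of the actions taken."""
    k = len(mus)
    counts, sums, pseudo = np.zeros(k), np.zeros(k), 0.0
    for t in range(n):
        a = t if t < k else int(np.argmax(sums / counts))
        pseudo += mus[a]
        counts[a] += 1
        sums[a] += rng.binomial(1, mus[a])
    return pseudo

def regret(mus, n, runs=2000):
    """Monte Carlo estimate of R_n(nu, pi) = n * mu^* - E[sum_t mu_{A_t}]."""
    mus = np.asarray(mus, dtype=float)
    return n * mus.max() - np.mean([run_policy(mus, n) for _ in range(runs)])

print(regret([0.5, 0.6], n=200))
```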
⭐ Notice that the regret is a function of the environment $\nu$ as well as of the policy $\pi$.
Why does it depend on $\nu$? The same policy can incur very different regret on different instances.
Minimax regret:
$$\min_{\pi} \max_{\nu \in \mathcal{E}} R_{n}(\nu, \pi)$$
Different environments $\nu$ lead to different regret for the same policy, so we aim for a policy that performs well in the worst case, i.e. a minimax policy.
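Reusing the `regret` estimator from the sketch above (again purely illustrative), the worst-case regret of that fixed policy over a small finite subset of $\mathcal{E}_k^B$ is just a maximum over instances; a minimax policy would be one minimising this quantity over all policies.

```python
# Hypothetical finite subset of the Bernoulli environment class.
instances = [[0.5, 0.6], [0.1, 0.9], [0.45, 0.55]]

# max over nu of the estimated R_n(nu, pi) for the fixed policy above.
worst_case = max(regret(mus, n=200) for mus in instances)
print("estimated worst-case regret:", worst_case)
```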
Concentration Inequality
Concentration inequalities are very important for proving regret bounds, so we introduce them here.
Theorem 1 Law of large numbers:
The LLN states that for i.i.d. $X_1, X_2, \ldots$ with mean $\mu$,
$$\frac{1}{n}\left(X_{1}+\cdots+X_{n}\right) \rightarrow \mu \quad \text{as } n \rightarrow \infty$$
This only establishes convergence; it says nothing about how fast the sample mean approaches $\mu$.
Theorem 2 Central limit theorem:
CLT: let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$; then
$$\sqrt{n}\, \frac{\bar{X}_{n}-\mu}{\sigma} \stackrel{d}{\rightarrow} N(0,1)$$
The CLT goes a step further and gives the rate of convergence, $\sqrt{n}$: the fluctuations of $\bar{X}_n$ around $\mu$ are of order $1/\sqrt{n}$.
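A quick numerical illustration of both theorems (my own sketch, not part of the text): the sample mean drifts toward $\mu$ as $n$ grows, and the standardised error $\sqrt{n}(\bar{X}_n-\mu)/\sigma$ has spread close to 1, as the standard normal limit predicts.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.0, 2.0

# LLN: the sample mean approaches mu as n grows.
for n in (10, 1000, 100000):
    print(n, "sample mean:", rng.normal(mu, sigma, size=n).mean())

# CLT: standardised sample means over many repetitions look like N(0, 1).
z = [np.sqrt(1000) * (rng.normal(mu, sigma, size=1000).mean() - mu) / sigma
     for _ in range(5000)]
print("std of standardised means:", np.std(z))   # close to 1
```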
Theorem 3 Concentration inequality:
The previous two theorems both describe convergence as $n$ tends to infinity. The tool we actually need answers, for finite $n$, "what is the probability that the sample mean lies in a given interval around its centre?". The tool that solves this is the concentration inequality, which takes the form:
$$P\left(\left|\bar{X}_{n}-\mu\right| \geqslant \varepsilon_{n}\right) \leqslant \delta_{n}$$
Thm 1 Markov Inequality
$$P(|X| \geq \varepsilon) \leqslant \frac{E[|X|]}{\varepsilon}$$
Proof:
$$\begin{aligned}
E[|X|] &=E\left[|X| \mathbb{1}_{|X| < \epsilon}\right]+E\left[|X| \mathbb{1}_{|X| \geq \epsilon}\right]\\
&\geq 0 + E\left[\epsilon\, \mathbb{1}_{|X| \geq \epsilon}\right]\\
&=\epsilon\, P(|X| \geq \epsilon)
\end{aligned}$$
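As a quick numerical check of the bound (a sketch using exponential samples, chosen only because $E[|X|]$ is easy to read off):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(1.0, size=200_000)     # X >= 0, so |X| = X and E[|X|] = 1

eps = 3.0
print("empirical P(|X| >= eps):", np.mean(x >= eps))   # about exp(-3) ~ 0.05
print("Markov bound E[|X|]/eps:", x.mean() / eps)      # 1/3, valid but loose
```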
Thm 2 Chebyshev Inequality
$$P(|X-E[X]| \geqslant \varepsilon) \leq \frac{\operatorname{Var}(X)}{\varepsilon^{2}}$$
However, the bound given by Chebyshev is often far too loose. Take the following case as an example:
Let $X=\bar{X}_n$, for which $\operatorname{Var}\left(\bar{X}_{n}\right)=\frac{1}{n} \operatorname{Var}\left(X_{1}\right)$.
Chebyshev then gives the bound:
$$P\left(\left|\bar{X}_{n}-\mu\right|>\varepsilon\right) \leq \frac{1}{n \varepsilon^{2}} \operatorname{Var}\left(X_{1}\right)$$
However, for $X_1 \sim N(0,1)$, integrating the tail directly gives:
$$\begin{aligned}P\left(X_{1} \geqslant \varepsilon\right) &=\int_{\varepsilon}^{+\infty} \frac{1}{\sqrt{2 \pi}} \exp \left(-\frac{x^{2}}{2}\right) d x \\& \leq \frac{1}{\varepsilon \sqrt{2 \pi}} \int_{\varepsilon}^{+\infty} x \exp \left(-\frac{x^{2}}{2}\right) d x \\&=\frac{1}{\varepsilon\sqrt{2 \pi}} \exp\left(-\frac{\varepsilon^{2}}{2}\right)\end{aligned}$$
Similarly, for $\bar{X}_n\sim N(0,1/n)$:
$$P(\bar{X}_n \geq \epsilon)\leq \frac{1}{\epsilon\sqrt{2 \pi n}} \exp\left(-\frac{n\varepsilon^{2}}{2}\right)\sim \exp(-cn)$$
This is an exponential rate in $n$, not the $1/n$ rate given by the Chebyshev bound.
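The gap is easy to see numerically (a sketch for standard normal samples): the Chebyshev bound on $P(|\bar{X}_n| \geq \varepsilon)$ decays like $1/n$, while the true tail probability decays exponentially in $n$.

```python
import math

eps = 0.5
for n in (10, 50, 100):
    chebyshev = 1.0 / (n * eps ** 2)             # Var(X_1) = 1
    exact = math.erfc(eps * math.sqrt(n / 2))    # P(|Xbar_n| >= eps), Xbar_n ~ N(0, 1/n)
    print(n, "Chebyshev:", chebyshev, "exact tail:", exact)
```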
Definition: $\sigma$-subgaussian
If for all $\lambda \in \mathbb{R}$ we have
$$E[\exp (\lambda X)] \leqslant \exp \left(\frac{\lambda^{2} \sigma^{2}}{2}\right),$$
then $X$ is called $\sigma$-subgaussian.
Equivalently, in terms of the moment generating function:
$$\psi_{X}(\lambda)=\log M_{X}(\lambda) \leq \frac{1}{2} \lambda^{2} \sigma^{2} \quad \text { for all } \lambda \in \mathbb{R}$$
This also explains the name "subgaussian": the MGF is dominated by that of a Gaussian.
A random variable $X$ is heavy tailed if $M_{X}(\lambda)=\infty$ for all $\lambda>0$; otherwise it is light tailed.
The tails of a σ-subgaussian random variable decay approximately as fast as those of a Gaussian with zero mean and the same variance.
In particular, $X \sim N(0,\sigma^{2})$ is $\sigma$-subgaussian.
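This follows from a direct computation of the Gaussian MGF (a standard step, spelled out here by completing the square): for $X \sim N(0,\sigma^2)$,

$$E[\exp(\lambda X)]=\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(\lambda x-\frac{x^{2}}{2\sigma^{2}}\right) dx =\exp\left(\frac{\lambda^{2}\sigma^{2}}{2}\right)\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\lambda\sigma^{2})^{2}}{2\sigma^{2}}\right) dx =\exp\left(\frac{\lambda^{2}\sigma^{2}}{2}\right),$$

so the defining inequality holds with equality.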
Why do we need subgaussianity? Because many concentration inequalities follow from the subgaussian property, and common bandit proofs rely heavily on it.
Thm 3 Chernoff bound (for subgaussian random variables)
If $X$ is $\sigma$-subgaussian, then for all $\varepsilon>0$,
\begin{align}
&P(X>\varepsilon) \leqslant \exp \left(-\frac{\varepsilon^{2}}{2 \sigma^{2}}\right) \\
&P(X<-\varepsilon) \leqslant \exp \left(-\frac{\varepsilon^{2}}{2 \sigma^{2}}\right)
\end{align}
Proof:
\begin{align}
P(X>\varepsilon)&\leq P(\exp (\lambda X) \geqslant \exp (\lambda \varepsilon)) \quad (\lambda>0)\\
&\leq \exp (-\lambda \varepsilon) E[\exp (\lambda X)]\text{ (By Markov)}\\
&\leqslant \exp \left(\frac{\lambda^{2} \sigma^{2}}{2}-\lambda \varepsilon\right)\text{ (By subgaussianity)}
\end{align}
Minimising the right-hand side over $\lambda$, take $\lambda=\epsilon / \sigma^{2}$.
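Plugging this choice of $\lambda$ back in (the last step, filled in for completeness):

$$P(X>\varepsilon) \leqslant \exp \left(\frac{\varepsilon^{2}}{2 \sigma^{2}}-\frac{\varepsilon^{2}}{\sigma^{2}}\right)=\exp \left(-\frac{\varepsilon^{2}}{2 \sigma^{2}}\right).$$

The bound for $P(X<-\varepsilon)$ follows by applying the same argument to $-X$.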
Lemma: subgaussian properties
Suppose $X_1$ and $X_2$ are $\sigma_1$- and $\sigma_2$-subgaussian respectively (and independent for the second claim). Then:
$cX_1$ is $|c|\sigma_1$-subgaussian;
$X_1+X_2$ is $\sqrt{\sigma_1^2+\sigma_2^2}$-subgaussian.
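Combining the two properties with the Chernoff bound yields the concentration inequality used constantly in bandit analyses (a standard corollary, stated here for completeness): if $X_1,\ldots,X_n$ are independent and each $X_i-\mu$ is $\sigma$-subgaussian, then $\bar{X}_n-\mu$ is $\sigma/\sqrt{n}$-subgaussian, and hence

$$P\left(\bar{X}_{n} \geqslant \mu+\varepsilon\right) \leqslant \exp \left(-\frac{n \varepsilon^{2}}{2 \sigma^{2}}\right), \qquad P\left(\bar{X}_{n} \leqslant \mu-\varepsilon\right) \leqslant \exp \left(-\frac{n \varepsilon^{2}}{2 \sigma^{2}}\right).$$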