Online Learning 2: Stochastic Bandits and Concentration Inequalities

The goal of this chapter is to formally introduce stochastic bandits. The model introduced here provides the foundation for the remaining chapters that treat stochastic bandits.

Core Assumptions

A stochastic bandit is a collection of distributions $$\nu=\left(P_{a}: a \in \mathcal{A}\right)$$ where $\mathcal{A}$ is the set of available actions. The learner and the environment interact sequentially over $n$ rounds.

The interactions between learner and environment induce a probability measure on the sequence of outcomes $A_1, X_1, A_2, X_2, \ldots, A_n, X_n$.

Two main assumptions of stochastic bandits:

(a) The conditional distribution of reward $X_t$ given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}, A_t$ is $P_{A_t}$, which captures the intuition that the environment samples $X_t$ from $P_{A_t}$ in round $t$.

The reward $X_t$ depends only on $A_t$.

(b) The conditional law of action $A_t$ given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}$ is $\pi_t\left(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}\right)$, where $\pi_1, \pi_2, \ldots$ is a sequence of probability kernels that characterise the learner. The most important element of this assumption is the intuitive fact that the learner cannot use future observations in current decisions.

The learner's decision in each round is based only on the history, never on the future; that is:

$$\pi_t : H_{t-1} \rightarrow \mathcal{A}$$
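To make the protocol concrete, here is a minimal Python sketch of the learner–environment loop under assumptions (a) and (b). The helper names (`interact`, `uniform_policy`) are illustrative, not from any library, and the policy is modelled as a (possibly random) function of the history:

```python
import numpy as np

rng = np.random.default_rng(0)

def interact(arm_means, policy, n):
    """Run the bandit protocol for n rounds and return the outcome sequence.

    arm_means: true means of the Bernoulli arms (unknown to the learner).
    policy: a function of the history [(A_1, X_1), ..., (A_{t-1}, X_{t-1})]
            returning the next action, playing the role of pi_t.
    """
    history = []
    for t in range(n):
        a = policy(history)                # (b): A_t depends only on the past
        x = rng.binomial(1, arm_means[a])  # (a): X_t ~ P_{A_t}
        history.append((a, x))
    return history

# A trivial learner that ignores the history and plays uniformly at random.
uniform_policy = lambda history: rng.integers(0, 2)

print(interact(arm_means=[0.3, 0.7], policy=uniform_policy, n=5))
```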

Main learning objectives:

The learner's goal is to maximize the total reward $$S_{n}=\sum_{t=1}^{n} X_{t}$$

  • The learner does not know the horizon $n$, i.e. how many rounds will be played, in advance.
  • The reward is a random quantity, so we need to define a measure of how good the rewards are.
  • The learner does not know the true reward distribution of each arm.

Environment

Environment / instance: $$\nu=\left(P_{1}, \ldots, P_{k}\right) \in \mathcal{E}$$

The set $\mathcal{E}$ is called the environment class.

The environment class describes the "range of choices" for the distribution behind each arm.

Example:

$$\mathcal{E}_{k}^{B}=\left\{\left(\mu_{1}, \ldots, \mu_{k}\right) \in[0,1]^{k} : P_{i}=\operatorname{Bernoulli}\left(\mu_{i}\right)\right\}$$
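For illustration, a tiny sketch of how an instance of this class is identified by its mean vector (the function name is ad hoc):

```python
import numpy as np

def sample_bernoulli_instance(k, rng):
    """An instance of E_k^B is identified by its mean vector mu in [0,1]^k."""
    return rng.uniform(0.0, 1.0, size=k)  # P_i = Bernoulli(mu[i])

print(sample_bernoulli_instance(k=3, rng=np.random.default_rng(1)))
```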

Regret

$$R_{n}(\nu, \pi)=n \mu^{*}(\nu)-E\left[\sum_{t=1}^{n} \mu_{A_{t}}\right]$$

where $\mu^{*}(\nu)=\max_{a \in \mathcal{A}} \mu_{a}$ is the largest mean of the instance $\nu$.

This is the expected reward of always making the optimal choice, minus the expected reward of the arms actually chosen under policy $\pi$ in each round.

⭐ Notice that the regret is also a function of the environment.

Why does it depend on $\nu$? Consider, for example, a policy that always plays arm 1: it has zero regret on instances where arm 1 is optimal, but linear regret on instances where it is not.
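As a sketch (reusing the same hypothetical protocol as above), one can estimate $R_n(\nu, \pi)$ by Monte Carlo and observe this dependence on the instance directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_regret(arm_means, policy, n, runs=500):
    """Monte Carlo estimate of R_n(nu, pi) = n * max_a mu_a - E[sum_t mu_{A_t}]."""
    mu_star = max(arm_means)
    total = 0.0
    for _ in range(runs):
        history = []
        for t in range(n):
            a = policy(history)
            history.append((a, rng.binomial(1, arm_means[a])))  # X_t ~ P_{A_t}
            total += arm_means[a]
    return n * mu_star - total / runs

# The same policy on two instances: zero regret on one, linear regret on the other.
always_first = lambda history: 0
print(estimate_regret([0.7, 0.3], always_first, n=100))  # arm 0 optimal: ~0
print(estimate_regret([0.3, 0.7], always_first, n=100))  # arm 0 suboptimal: ~40
```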

Minimax regret:

$$\min_{\pi} \max_{\nu \in \mathcal{E}} R_{n}(\nu, \pi)$$

Different environments $\nu$ lead to different regret, so we look for a minimax policy: one that minimizes the worst-case regret over the environment class.

Concentration Inequalities

Concentration inequalities are crucial for proving regret bounds, so we introduce them here.

Theorem 1 Law of large numbers:

The LLN states that:

$$\text{As } n \rightarrow \infty, \quad \frac{1}{n}\left(X_{1}+\cdots+X_{n}\right) \rightarrow \mu = E\left[X_{1}\right]$$

This only establishes convergence, saying nothing about the rate.

Theorem 2 Central limit theorem:

CLT: for $X_{1}, \ldots, X_{n}$ i.i.d. with mean $\mu$ and variance $\sigma^{2}$,

$$\sqrt{n}\, \frac{\bar{X}_{n}-\mu}{\sigma} \stackrel{d}{\rightarrow} N(0,1)$$

The CLT goes a step further and pins down the rate of convergence: the fluctuations of $\bar{X}_n$ around $\mu$ are of order $1/\sqrt{n}$.

Theorem 3 Concentration inequality:

The two theorems above only describe what happens as $n$ tends to infinity. The tool we actually need answers a finite-sample question: "what is the probability of landing in an interval around the center?" Concentration inequalities are exactly this tool. They take the form:

$$P\left(\left|\bar{X}_{n}-\mu\right| \geq \varepsilon_{n}\right) \leq \delta_{n}$$

Thm 1 Markov Inequality

$$P(|X| \geq \varepsilon) \leq \frac{E[|X|]}{\varepsilon}$$

Proof:

$$\begin{aligned} E[|X|] &=E\left[|X| \mathbb{1}_{|X| < \varepsilon}\right]+E\left[|X| \mathbb{1}_{|X| \geq \varepsilon}\right]\\ &\geq 0 + E\left[\varepsilon \mathbb{1}_{|X| \geq \varepsilon}\right]\\ &=\varepsilon P(|X| \geq \varepsilon) \end{aligned}$$
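A quick numerical sanity check of Markov's inequality, using exponential samples (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)  # X >= 0 with E[X] = 1

for eps in [1.0, 2.0, 4.0]:
    # Empirical tail P(X >= eps) vs the Markov bound E[|X|] / eps.
    print(eps, np.mean(x >= eps), x.mean() / eps)
```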

Thm 2 Chebyshev Inequality

$$P(|X-E[X]| \geq \varepsilon) \leq \frac{\operatorname{Var}(X)}{\varepsilon^{2}}$$

However, the bound Chebyshev gives is often far too loose. Take the following case as an example:

Let $X=\bar{X}_{n}$, noting that $\operatorname{Var}\left(\bar{X}_{n}\right)=\frac{1}{n} \operatorname{Var}\left(X_{1}\right)$.

Chebyshev then gives the bound:

$$P\left(\left|\bar{X}_{n}-\mu\right|>\varepsilon\right) \leq \frac{\operatorname{Var}\left(X_{1}\right)}{n \varepsilon^{2}}$$

In reality, however, for $X_{1} \sim N(0,1)$ we can integrate the tail directly (using $x / \varepsilon \geq 1$ on the integration domain for the inequality):

$$\begin{aligned}P\left(X_{1} \geq \varepsilon\right) &=\int_{\varepsilon}^{+\infty} \frac{1}{\sqrt{2 \pi}} \exp \left(-\frac{x^{2}}{2}\right) dx \\& \leq \frac{1}{\varepsilon \sqrt{2 \pi}} \int_{\varepsilon}^{+\infty} x \exp \left(-\frac{x^{2}}{2}\right) dx \\&=\frac{1}{\varepsilon \sqrt{2 \pi}} \exp \left(-\frac{\varepsilon^{2}}{2}\right)\end{aligned}$$

For $\bar{X}_{n} \sim N(0,1 / n)$:

$$P\left(\bar{X}_{n} \geq \varepsilon\right) \leq \frac{1}{\varepsilon \sqrt{2 \pi n}} \exp \left(-\frac{n \varepsilon^{2}}{2}\right) \sim \exp (-c n)$$

This decays at an exponential rate, rather than Chebyshev's $1/n$ bound.
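A numerical comparison of the two bounds (a sketch; $\bar{X}_n \sim N(0, 1/n)$ is sampled directly rather than by averaging):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 100, 0.5

# X_bar_n ~ N(0, 1/n) when X_i ~ N(0, 1), so sample it directly.
xbar = rng.standard_normal(1_000_000) / np.sqrt(n)
empirical = np.mean(np.abs(xbar) >= eps)

chebyshev = 1.0 / (n * eps**2)  # Var(X_1) / (n * eps^2)
gauss_tail = 2.0 / (eps * np.sqrt(2 * np.pi * n)) * np.exp(-n * eps**2 / 2)  # two-sided

print(empirical, chebyshev, gauss_tail)  # the true tail is far below Chebyshev's 0.04
```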

Definition: $\sigma$-subgaussian

If for all $\lambda \in \mathbb{R}$ we have:

$$E[\exp (\lambda X)] \leq \exp \left(\frac{\lambda^{2} \sigma^{2}}{2}\right)$$

then $X$ is called $\sigma$-subgaussian.

Equivalently, written in terms of the moment generating function:

$$\psi_{X}(\lambda)=\log M_{X}(\lambda) \leq \frac{1}{2} \lambda^{2} \sigma^{2} \quad \text{for all } \lambda \in \mathbb{R}$$

This also explains the name "subgaussian": the MGF is dominated by that of a Gaussian $N\left(0, \sigma^{2}\right)$.

A random variable $X$ is heavy tailed if $M_{X}(\lambda)=\infty$ for all $\lambda>0$. Otherwise it is light tailed.

The tails of a $\sigma$-subgaussian random variable decay approximately as fast as those of a Gaussian with zero mean and the same variance.

In particular, $X \sim N\left(0, \sigma^{2}\right)$ is $\sigma$-subgaussian (with equality in the definition).
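A sketch checking the definition numerically: for $N(0,1)$ the bound holds with equality, and a Rademacher ($\pm 1$) variable is also $1$-subgaussian since $E[\exp(\lambda X)]=\cosh\lambda \leq \exp(\lambda^2/2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(1_000_000)            # N(0, 1): equality in the bound
r = rng.choice([-1.0, 1.0], size=1_000_000)   # Rademacher: E[exp(lambda X)] = cosh(lambda)

for lam in [-2.0, -0.5, 0.5, 2.0]:
    bound = np.exp(lam**2 / 2)
    print(lam, np.exp(lam * g).mean(), np.exp(lam * r).mean(), bound)
```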

Why do we need subgaussianity? Because subgaussian properties yield many concentration inequalities, and the standard bandit proofs rely heavily on them.

Thm 3 Chernoff bound for subgaussian random variables

If $X$ is $\sigma$-subgaussian, then $\forall \varepsilon>0$,

\begin{align} &P(X>\varepsilon) \leqslant \exp \left(-\frac{\varepsilon^{2}}{2 \sigma^{2}}\right) \\ &P(X<-\varepsilon) \leqslant \exp \left(-\frac{\varepsilon^{2}}{2 \sigma^{2}}\right) \end{align}

Proof:

\begin{align} P(X>\varepsilon)&=P(\exp (\lambda X) \geq \exp (\lambda \varepsilon)) \quad \text{(for any } \lambda>0\text{)}\\ &\leq \exp (-\lambda \varepsilon) E[\exp (\lambda X)]\text{ (by Markov)}\\ &\leq \exp \left(\frac{\lambda^{2} \sigma^{2}}{2}-\lambda \varepsilon\right)\text{ (by subgaussianity)} \end{align}

Minimizing the exponent over $\lambda>0$ gives $\lambda=\varepsilon / \sigma^{2}$, which yields the stated bound $\exp \left(-\varepsilon^{2} / 2 \sigma^{2}\right)$; the bound for $P(X<-\varepsilon)$ follows by applying the same argument to $-X$.
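A quick check of the bound for $X \sim N(0, \sigma^2)$ (where the subgaussian condition holds with equality):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, eps = 2.0, 3.0

x = sigma * rng.standard_normal(2_000_000)  # N(0, sigma^2), sigma-subgaussian
empirical = np.mean(x > eps)
chernoff = np.exp(-eps**2 / (2 * sigma**2))

print(empirical, chernoff)  # empirical tail (~0.067) sits below the bound (~0.325)
```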

Lemma: subgaussian properties

Suppose $X_1, X_2$ are $\sigma_1$- and $\sigma_2$-subgaussian, respectively. Then:

  • $cX_1$ is $|c|\sigma_1$-subgaussian
  • if $X_1$ and $X_2$ are independent, $X_1+X_2$ is $\sqrt{\sigma_1^2+\sigma_2^2}$-subgaussian
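Combining the two bullets with Thm 3 gives the workhorse tail bound of bandit analysis: if $X_1, \ldots, X_n$ are independent with mean $\mu$ and each $X_t-\mu$ is $\sigma$-subgaussian, then $\hat{\mu}-\mu=\frac{1}{n} \sum_{t=1}^{n}\left(X_{t}-\mu\right)$ is $\sigma / \sqrt{n}$-subgaussian, so

$$P(\hat{\mu} \geq \mu+\varepsilon) \leq \exp \left(-\frac{n \varepsilon^{2}}{2 \sigma^{2}}\right), \quad P(\hat{\mu} \leq \mu-\varepsilon) \leq \exp \left(-\frac{n \varepsilon^{2}}{2 \sigma^{2}}\right)$$

This is exactly the exponential concentration of the empirical mean that the regret proofs in later chapters depend on.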