This post mainly covers the $\epsilon$-greedy algorithm and the UCB algorithm, together with their regret proofs. Most importantly, it presents a very general proof framework (UCB Regret Analysis (1)).
Algorithm Idea
The simple greedy algorithm:
for $t = 1, \cdots, k$, pull arm $t$
for $t = k+1, \ldots, n$, pull arm $\operatorname{argmax}_{i} \hat{\mu}_{i}(t-1)$
Try each arm once, then only pull the empirically best arm (updating its mean as you go). This is clearly problematic: the truly best arm may never be pulled again simply because its initially realized rewards happened to be poor.
So we want an algorithm that keeps some possibility of exploring even during the later "exploit" phase.
Algorithm Description
Input: $\varepsilon_{1}, \varepsilon_{2}, \ldots, \varepsilon_{n}$
if $t \in \{1, \ldots, k\}$, pull arm $t$
if $t \in \{k+1, \ldots, n\}$, pull arm
$$A_t= \begin{cases}\operatorname{argmax}_i \hat{\mu}_i(t-1) & \text{ w.p. } 1-\varepsilon_t \\ i & \text{ w.p. } \frac{\varepsilon_t}{k}\end{cases}$$
Even in later rounds, the algorithm keeps an $\epsilon_t$ probability of trying the other arms.
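A minimal Python sketch of this rule (the Bernoulli bandit environment, function name `epsilon_greedy`, and parameter names are illustrative assumptions, not from the notes):

```python
import numpy as np

def epsilon_greedy(true_means, n_rounds, seed=0):
    """epsilon_t-greedy with epsilon_t = 1/t on a Bernoulli bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)          # T_i(t-1): number of pulls of arm i so far
    mu_hat = np.zeros(k)          # empirical means \hat{mu}_i(t-1)
    regret = 0.0
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1                              # rounds 1..k: pull each arm once
        else:
            eps_t = 1.0 / t                          # decaying exploration rate
            if rng.random() < eps_t:
                arm = int(rng.integers(k))           # explore: uniformly random arm
            else:
                arm = int(np.argmax(mu_hat))         # exploit: empirically best arm
        reward = float(rng.random() < true_means[arm])        # Bernoulli reward
        counts[arm] += 1
        mu_hat[arm] += (reward - mu_hat[arm]) / counts[arm]   # running-mean update
        regret += max(true_means) - true_means[arm]           # pseudo-regret
    return regret

print(epsilon_greedy([0.3, 0.5, 0.7], n_rounds=10_000))
```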
Why $\epsilon_t$ rather than a constant $\epsilon$? One can show that with a constant $\epsilon$, the regret tends to infinity (it grows linearly in $n$).
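A quick sketch of that claim (assuming a unique best arm and a smallest gap $\Delta_{\min}>0$, which is not spelled out in the notes): the uniform-exploration branch alone already incurs linear regret,
$$R_{n} \;\geq\; \sum_{t=1}^{n} \epsilon \cdot \frac{k-1}{k} \cdot \Delta_{\min} \;=\; \Omega(\epsilon n) \;\longrightarrow\; \infty .$$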
Regret Analysis
If we take $\epsilon_t \sim \frac{1}{t}$,
then:
Instance Independent $\sim O(\sqrt{n})$ (we only know that $\mu_1,\mu_2\cdots\mu_k$ take values in $[0,1]$)
Instance Dependent $\sim O(\log n)$ (we know which values $\mu_1,\mu_2\cdots\mu_k$ take, but not which arm each corresponds to)
Upper Confidence Bound Algorithm
The celebrated Upper Confidence Bound (UCB) algorithm overcomes all the limitations of the ETC (explore-then-commit) strategy, including the need to know the horizon and the suboptimality gaps. The algorithm comes in many different forms, depending on the assumptions made about the noise distribution.
Algorithm Idea
The insight behind the UCB algorithm is to "always remain optimistic about the unknown" (optimism in the face of uncertainty).
The intuitive reason this works is that when acting optimistically, one of two things happens.
Either the optimism is justified, in which case the learner is acting optimally.
Or the optimism is not justified. In this case the learner takes an action believed to yield a large reward when in fact it is not optimal. If this happens often enough, the learner learns that the action's true reward is not optimal and will not choose it in the future.
Algorithm Description
Recall that if $X_{1}, X_{2}, \dots, X_{n}$ are independent and 1-subgaussian (which implies $\mathbb{E}[X_{i}]=0$ and $\mathrm{Var}(X_i) \leq 1$) and $\hat{\mu}=\sum_{t=1}^{n} X_{t} / n$, then by the Chernoff bound for subgaussian variables:
$$\mathbb{P}(\hat{\mu} \geq \varepsilon) \leq \exp (-n \varepsilon^{2} / 2)$$
Equating the right-hand side with $\delta$ and solving for $\varepsilon$ leads to an upper confidence bound estimate at round $t$:
$$\hat{\mu}_{i}^{UCB}(t) = \hat{\mu}_{i}(t-1)+\sqrt{\frac{2}{T_{i}(t-1)} \log \left(\frac{1}{\delta}\right)}$$
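Explicitly, the solve step (with $n$ replaced by the number of pulls $T_i(t-1)$ of arm $i$) is:
$$\exp \left(-\frac{T_{i}(t-1)\, \varepsilon^{2}}{2}\right)=\delta \quad \Longrightarrow \quad \varepsilon=\sqrt{\frac{2 \log (\frac{1}{\delta})}{T_{i}(t-1)}}.$$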
Then $\mu_i<\hat{\mu}_{i}^{UCB}(t)$ with probability at least $1-\delta$, and the algorithm pulls the arm with the largest index:
\begin{equation}
A_{t}= \begin{cases}\operatorname{argmax}_{i}(\hat{\mu}_{i}(t-1)+\sqrt{\frac{2 \log \frac{1}{\delta}}{T_{i}(t-1)}}), & \text { if } t>K \\ t, & \text { otherwise }\end{cases}
\end{equation}
Besides the slightly vague "optimism guarantees optimality or learning" intuition we gave before, it is worth exploring other intuitions for this choice of index. At a very basic level, we should explore arms more often if they are:
(a) promising (in that $\hat{\mu}_{i}(t-1)$ is large), or
(b) not well explored (in that $T_{i}(t-1)$ is small).
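A minimal Python sketch of the UCB($\delta$) rule above (the unit-variance Gaussian rewards, function name `ucb`, and parameter names are assumptions for illustration, not part of the notes):

```python
import numpy as np

def ucb(true_means, n_rounds, delta, seed=0):
    """UCB(delta) for unit-variance Gaussian (1-subgaussian) rewards -- illustrative sketch."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)          # T_i(t-1)
    mu_hat = np.zeros(k)          # \hat{mu}_i(t-1)
    regret = 0.0
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1                                           # pull each arm once first
        else:
            bonus = np.sqrt(2.0 * np.log(1.0 / delta) / counts)   # exploration bonus
            arm = int(np.argmax(mu_hat + bonus))                  # optimistic index
        reward = rng.normal(true_means[arm], 1.0)                 # unit-variance Gaussian reward
        counts[arm] += 1
        mu_hat[arm] += (reward - mu_hat[arm]) / counts[arm]       # running-mean update
        regret += max(true_means) - true_means[arm]               # pseudo-regret
    return regret

n = 10_000
print(ucb([0.3, 0.5, 0.7], n_rounds=n, delta=1.0 / n**2))  # delta ~ 1/n^2 as in the analysis below
```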
Before presenting the proof, here is a summary of the proof sketch:
Define a good event ($G$) and a bad event ($G^c$).
Decompose the regret into the regret under the good event and the regret under the bad event:
Regret = Regret | Good Event + Regret | Bad Event
\begin{equation}
\begin{aligned}
R_{n} &=\sum_{t=1}^{n} E[\mu_{1}-\mu_{A_{t}}] \\
&=\sum_{t=1}^{n} E[(\mu_{1}-\mu_{A_{t}}) 1_{G}]+\sum_{t=1}^{n} E[(\mu_{1}-\mu_{A_{t}}) 1_{G^{c}}]
\end{aligned}
\end{equation}
Show that the regret under the good event is small, and that the bad event occurs with small probability.
This proof strategy is very common; the key technique is how to construct the "good event".
Regret Analysis (1)
Define the good event as: every arm's true mean lies inside its confidence interval:
$$G=\left\{\hat{\mu}_{i}(t-1)-\sqrt{\frac{2 \log ({\frac{1}{\delta}})}{T_{i}(t-1)}} \leqslant \mu_{i} \leqslant \hat{\mu}_{i}(t-1)+\sqrt{\frac{2 \log ({\frac{1}{\delta}})}{T_{i}(t-1)}}, \quad \forall t, i\right\}$$
Regret under the good event
Under the good event, the gap in rewards cannot be too large:
$$\mu_{1}-\mu_{A_{t}} \leq 2 \sqrt{2 \log \left(\frac{1}{\delta}\right)} \sqrt{\frac{1}{T_{A_{t}}(t-1)}}$$
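To see this, note that for $t>K$, $A_t$ maximizes the UCB index, so under $G$:
$$\mu_{1} \;\leq\; \hat{\mu}_{1}(t-1)+\sqrt{\frac{2 \log (\frac{1}{\delta})}{T_{1}(t-1)}} \;\leq\; \hat{\mu}_{A_{t}}(t-1)+\sqrt{\frac{2 \log (\frac{1}{\delta})}{T_{A_{t}}(t-1)}} \;\leq\; \mu_{A_{t}}+2 \sqrt{\frac{2 \log (\frac{1}{\delta})}{T_{A_{t}}(t-1)}},$$
where the first and last inequalities use the good event $G$ (for arm $1$ and arm $A_t$ respectively) and the middle one uses the argmax choice of $A_t$; arm 1 is taken to be the optimal arm, matching the $\mu_1$ in the regret definition.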
We know $T_{i}(t-1) \leq t-1$, so
$$\sum_{t=1}^{n} E\left[\sqrt{\frac{1}{T_{A_{t}}(t-1)}}\right] \leqslant \sum_{t=1}^{n} \sum_{i=1}^{k} \sqrt{\frac{1}{t}} \sim k \sqrt{n}$$
This relaxation is quite loose: each arm $i$ is pulled fewer than $t$ times, yet we relax the count all the way up to $t$, as if every arm contributed a $\sqrt{1/t}$ term in every round.
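The final $\sim k\sqrt{n}$ step uses the standard integral comparison, applied once per arm:
$$\sum_{t=1}^{n} \sqrt{\frac{1}{t}} \;\leq\; 1+\int_{1}^{n} \frac{dt}{\sqrt{t}} \;\leq\; 2 \sqrt{n}.$$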
Regret | Good Event $\leq \sum_{t=1}^{n} 2 \sqrt{2 \log (\frac{1}{\delta})}\, E\left[\sqrt{\frac{1}{T_{A_{t}}(t-1)}}\right] \sim 2 \sqrt{2 \log \frac{1}{\delta}}\; k \sqrt{n}$
Probability of the bad event
For the bad event, we try to show that its probability is not too large.
Regret | Bad Event $= \sum_{t=1}^{n} E\left[(\mu_{1}-\mu_{A_{t}}) 1_{G^c}\right] \leq \sum_{t=1}^{n} P(G^{c})=n P(G^{c})$ (using $\mu_{1}-\mu_{A_{t}} \leq 1$ since the means lie in $[0,1]$)
For convenience, let us denote each arm's generated reward sequence (each arm considered over all $n$ rounds):
$$\begin{array}{llll}
\text{arm } 1: & X_{11}, & X_{21}, \ldots, & X_{n1} \\
\text{arm } 2: & X_{12}, & X_{22}, \ldots, & X_{n2} \\
& & \vdots & \\
\text{arm } k: & X_{1k}, & X_{2k}, \ldots, & X_{nk}
\end{array}$$
Let $E_{ti}=\left\{\left|\frac{\sum_{s=1}^{t} X_{si}}{t}-\mu_{i}\right| \leq \sqrt{\frac{2 \log (\frac{1}{\delta})}{t}}\right\}$
i.e., $E_{ti}$ defines a separate good event for each arm $i$ and each sample size $t$.
If $\bigcap_{t=1}^{n} \bigcap_{i=1}^{k} E_{ti}$ holds, then $G$ holds (the intersection of the $E_{ti}$ is a stronger requirement than $G$).
$$\Rightarrow \quad \bigcap_{t=1}^{n} \bigcap_{i=1}^{k} E_{t i} \subset G \quad\Rightarrow\quad G^{c} \subset \bigcup_{t=1}^{n} \bigcup_{i=1}^{k} E_{t i}^{c}$$
Moreover, by the definition of the confidence interval, $P(E_{t i}^{c})\leq \delta$, so $P(G^c) \leq nk\delta$.
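Spelling out the union bound behind the last step:
$$P(G^{c}) \;\leq\; P\Big(\bigcup_{t=1}^{n} \bigcup_{i=1}^{k} E_{t i}^{c}\Big) \;\leq\; \sum_{t=1}^{n} \sum_{i=1}^{k} P(E_{t i}^{c}) \;\leq\; n k \delta.$$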
Combining the good and bad events
Summing the good-event and bad-event regret: $R_n \leq 2 \sqrt{2 \log (\frac{1}{\delta})}\, k \sqrt{n}+n^{2} k \delta$
Taking $\delta \sim 1/n^2$: $R_{n} \leq 4 \sqrt{\log n}\, k \sqrt{n}+k$
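Plugging $\delta = 1/n^{2}$ into the previous bound makes the constants explicit:
$$R_{n} \;\leq\; 2 \sqrt{2 \log (n^{2})}\, k \sqrt{n}+n^{2} k \cdot \frac{1}{n^{2}} \;=\; 2\sqrt{4\log n}\, k\sqrt{n} + k \;=\; 4 \sqrt{\log n}\, k \sqrt{n}+k.$$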
Regret Analysis (2) (tightening the bound to $\sqrt{kn}$)
To be continued 🐦