Online Learning 3: ETC Algorithm

Explore-Then-Commit (ETC) Algorithm

Algorithm description:

The high-level idea: in an explore phase, try all of the arms; then, in a commit phase, pull only the best arm found during exploration.

Formally:

Input: $m$, the number of exploration pulls per arm ($k$ denotes the number of arms). In round $t$, pull

$$A_{t}= \begin{cases}(t \bmod k)+1 & \text{if } t \leqslant mk \\ \operatorname{argmax}_{i}\ \widehat{\mu}_{i}(mk) & \text{if } t>mk\end{cases}$$
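
A minimal simulation sketch of ETC (assuming unit-variance Gaussian rewards, which are 1-subgaussian; the function and variable names are illustrative, and arms are 0-indexed):

```python
import numpy as np

def etc(means, n, m, seed=None):
    """Explore-Then-Commit. `means` are the true arm means, used only
    to simulate rewards and to track pseudo-regret."""
    rng = np.random.default_rng(seed)
    k = len(means)
    sums = np.zeros(k)    # running reward sum per arm
    counts = np.zeros(k)  # T_i(t): number of pulls per arm
    best = None
    regret = 0.0
    for t in range(1, n + 1):
        if t <= m * k:
            arm = (t - 1) % k  # explore phase: round-robin over the arms
        else:
            if best is None:   # commit phase: fix argmax_i mu_hat_i(mk)
                best = int(np.argmax(sums / counts))
            arm = best
        x = rng.normal(means[arm], 1.0)  # observe a 1-subgaussian reward
        sums[arm] += x
        counts[arm] += 1
        regret += max(means) - means[arm]  # accumulate Delta_{A_t}
    return regret
```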

Regret Analysis

Define some notation:

$$\begin{aligned}\hat{\mu}_{i}(t)&=\frac{1}{T_{i}(t)} \sum_{s=1}^{t} \mathbb{1}(A_{s}=i)\, X_{s} \\ T_{i}(t)&=\sum_{s=1}^{t} \mathbb{1}(A_{s}=i)\\ \Delta_{i}&=\mu^{*}-\mu_{i} \quad \text{(assume arm 1 is optimal, i.e. } \mu^{*}=\mu_{1})\end{aligned}$$

Step 1: for a given $m$, find a bound on the regret.

Decompose the regret:

regret = (number of times each arm is pulled) × (per-pull regret of that arm), summed over arms; the last equality below groups the rounds by which arm was pulled.

$$R_{n}=n \mu^{*}-E\Big[\sum_{t=1}^{n} \mu_{A_{t}}\Big]=\sum_{t=1}^{n} E[\mu^{*}-\mu_{A_{t}}] = \sum_{i=1}^{k} \Delta_{i}\, E[T_{i}(n)]$$

Number of times each arm is pulled: during exploration each arm is pulled exactly $m$ times, and arm $i$ receives the remaining $n-mk$ pulls iff it is committed to.

$$\begin{aligned}E[T_{i}(n)] &=m+(n-m k)\, P(A_{m k+1}=i) \\&=m+(n-m k)\, P\big(\hat{\mu}_{i}(m k) \geqslant \max_{j \neq i} \hat{\mu}_{j}(m k)\big)\end{aligned}$$

$$\begin{aligned}P\big(\hat{\mu}_{i}(m k) \geqslant \max_{j \neq i} \hat{\mu}_{j}(m k)\big) &\leq P\big(\hat{\mu}_{i}(m k) \geqslant \hat{\mu}_{1}(m k)\big) \quad \text{(first relaxation: only compare against the optimal arm)}\\&= P\big((\hat{\mu}_{i}(m k)-\mu_{i})-(\hat{\mu}_{1}(m k)-\mu_{1}) \geqslant \mu_{1}-\mu_{i}\big)\end{aligned}$$

$\hat{\mu}_{i}(mk)$: since each reward $X_s$ is assumed 1-subgaussian, the average $\hat{\mu}_{i}(mk)$ of $m$ i.i.d. samples is $\frac{1}{\sqrt{m}}$-subgaussian (variance proxy $\frac{1}{m}$).

$(\hat{\mu}_{i}(m k)-\mu_{i})-(\hat{\mu}_{1}(m k)-\mu_{1})$: a difference of two independent such averages, hence $\sqrt{\frac{2}{m}}$-subgaussian (variance proxy $\frac{1}{m}+\frac{1}{m}=\frac{2}{m}$).

Therefore, by the subgaussian tail bound and since $\mu_{1}-\mu_{i}=\Delta_{i}$:

$$P\big((\hat{\mu}_{i}(m k)-\mu_{i})-(\hat{\mu}_{1}(m k)-\mu_{1}) \geqslant \Delta_{i}\big) \leq \exp\Big(-\frac{\Delta_{i}^{2}}{2 \cdot \frac{2}{m}}\Big)=\exp\Big(-\frac{m \Delta_{i}^{2}}{4}\Big)$$
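
This bound can be sanity-checked by simulation; a minimal Monte Carlo sketch for standard normal noise (one 1-subgaussian case), with $m$ and $\Delta$ chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta, trials = 50, 0.4, 200_000

# (mu_hat_i - mu_i) - (mu_hat_1 - mu_1): difference of two independent
# sample means of m standard normals, i.e. N(0, 2/m).
diff = rng.normal(0, 1 / np.sqrt(m), trials) - rng.normal(0, 1 / np.sqrt(m), trials)

empirical = np.mean(diff >= delta)   # estimated tail probability
bound = np.exp(-m * delta**2 / 4)    # subgaussian bound exp(-m*Delta^2/4)
print(f"empirical {empirical:.4f} <= bound {bound:.4f}")  # ~0.023 <= ~0.135
```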

$$\begin{aligned}R_{n} &\leqslant \sum_{i=1}^{k} \Delta_{i}\Big(m+(n-m k) \exp\big(-\frac{m \Delta_{i}^{2}}{4}\big)\Big) \\&\leqslant k \bar{\Delta}\Big(m+(n-m k) \exp\big(-\frac{m \bar{\Delta}^{2}}{4}\big)\Big) \end{aligned}$$

Finally, take the derivative with respect to $m$ and solve for the optimal exploration length.

For example:

$k=2$ (one suboptimal arm with gap $\Delta$): $\quad R_{n} \leqslant m \Delta+n \Delta \exp\big(-\frac{m \Delta^{2}}{4}\big)$

Given $\Delta$, choose $m$ to minimize $m \Delta+n \Delta \exp\big(-\frac{m \Delta^{2}}{4}\big)$.
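
Filling in the omitted step (treating $m$ as continuous), set the derivative to zero:

$$\frac{d}{dm}\Big(m \Delta+n \Delta e^{-m \Delta^{2}/4}\Big)=\Delta-\frac{n \Delta^{3}}{4}\, e^{-m \Delta^{2}/4}=0 \quad\Longrightarrow\quad e^{-m \Delta^{2}/4}=\frac{4}{n \Delta^{2}},$$

and take logs.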

This gives $m=\Big\lceil\frac{4}{\Delta^{2}} \log\big(\frac{n \Delta^{2}}{4}\big)\Big\rceil$,

and for this choice the regret is bounded by $R_{n} \leq \Delta+\frac{4}{\Delta}\Big(1+\log\big(\frac{n \Delta^{2}}{4}\big)\Big)$.
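
As a quick numerical check of these formulas (a minimal sketch; the values of $n$ and $\Delta$ are arbitrary illustrative choices, and the helper names are hypothetical):

```python
import math

def m_star(n, gap):
    """Tuned exploration length for k=2; requires n * gap**2 > 4."""
    return math.ceil(4 / gap**2 * math.log(n * gap**2 / 4))

def regret_bound(n, gap):
    """R_n <= gap + (4/gap) * (1 + log(n * gap^2 / 4))."""
    return gap + 4 / gap * (1 + math.log(n * gap**2 / 4))

print(m_star(10_000, 0.1), round(regret_bound(10_000, 0.1), 1))  # 1288 168.9
```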

A bound of this form is called instance-dependent, since it depends on the gap $\Delta$.

The worst case over instances occurs at $\Delta \sim \frac{1}{\sqrt{n}}$ (plugging this into the tuned bound, the $\frac{4}{\Delta}$ factor becomes order $\sqrt{n}$), giving $\max_{\Delta} \min_{m}\{R_{n}\} \sim O(\sqrt{n})$.

A bound of this form is called instance-independent, or worst-case, performance.

Note that both of the above require knowledge of $\Delta$ (to tune $m$). Without any information about $\Delta$, the achievable bound is:

$$\min_{m} \max_{\Delta}\Big\{m \Delta+n \Delta \exp\big(-\frac{m \Delta^{2}}{4}\big)\Big\}\sim O\big(n^{\frac{2}{3}}\big),$$ which is worse than the $O(\sqrt{n})$ rate from the adversarial setting.
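
A sketch of where the $n^{2/3}$ rate comes from (assuming bounded gaps, $\Delta \leqslant 1$): for fixed $m$, the term $\Delta e^{-m\Delta^{2}/4}$ is maximized at $\Delta=\sqrt{2/m}$, so

$$\max_{\Delta \leqslant 1}\Big\{m \Delta+n \Delta e^{-m \Delta^{2}/4}\Big\} \approx m+n \sqrt{\tfrac{2}{m}}\, e^{-1/2},$$

and balancing the two terms ($m \approx n/\sqrt{m}$, up to constants) gives $m \sim n^{2/3}$, hence regret of order $n^{2/3}$.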

Doubling Trick:

Sometimes there is a further problem: we do not know the horizon $n$ either. How do we handle this?

Doubling trick: guess the horizon, run the algorithm tuned for that guess, and once the guess is exceeded, double it and restart. For regret bounds that grow like $n^{\alpha}$ (e.g. $\sqrt{n}$ or $n^{2/3}$), this costs only a constant factor in $R_{n}$.

  • For t=1, run the algorithm with n=1
  • For t=2,3, restart and run with n=2
  • For t=4,5,6,7, restart and run with n=4
  • …
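
A minimal sketch of the wrapper (`run_for` is a hypothetical callback standing in for any horizon-tuned algorithm, e.g. ETC with $m$ chosen for that horizon):

```python
def doubling_trick(run_for, total_rounds):
    """Run a horizon-dependent algorithm without knowing n in advance.

    run_for(horizon, rounds): a fresh instance of the algorithm, tuned
    for `horizon`, played for `rounds` steps (rounds <= horizon).
    """
    t, horizon = 0, 1
    while t < total_rounds:
        rounds = min(horizon, total_rounds - t)  # final epoch may be truncated
        run_for(horizon, rounds)
        t += rounds
        horizon *= 2  # double the guess and restart
```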