Online Learning (1): Intro

Starting a series of notes on online learning and bandits. I have been meaning to study this topic systematically for a while, and I am aiming to work through it over the 2022 summer. 🤤🤤🤤
Basic Setting
A bandit problem is a sequential game between a learner and an environment.
The game is played over $n$ rounds, where $n$ is a positive natural number called the horizon. In each round $t \in \{1, \dots, n\}$, the learner first chooses an action $A_t$ from a given set $\mathcal{A}$, and the environment then reveals a reward $X_t \in \mathbb{R}$.
The learner cannot peek into the future when choosing their actions: the action $A_t$ should only depend on the history $H_{t-1} = (A_1, X_1, \dots, A_{t-1}, X_{t-1})$.
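To make the protocol concrete, here is a minimal sketch of the interaction loop. It is not from the text: the environment and the uniformly random learner are placeholders I introduce for illustration.

```python
import random

def run(choose_action, reward_of, n):
    """Play the bandit game: choose_action(history) -> action, reward_of(action) -> reward."""
    history = []          # (A_1, X_1, ..., A_{t-1}, X_{t-1})
    total = 0.0
    for t in range(1, n + 1):
        a = choose_action(history)   # A_t may only depend on the history
        x = reward_of(a)             # environment reveals X_t
        history.append((a, x))
        total += x
    return total

# Example: three actions with Bernoulli rewards, and a learner that ignores the history.
actions = [0, 1, 2]
means = {0: 0.3, 1: 0.5, 2: 0.7}
total = run(
    choose_action=lambda history: random.choice(actions),
    reward_of=lambda a: float(random.random() < means[a]),
    n=1000,
)
print(f"total reward over 1000 rounds: {total:.0f}")
```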
Regret
DEFINITION: The regret of the learner relative to a policy $\pi$ (not necessarily the one followed by the learner) is the difference between the total expected reward using policy $\pi$ for $n$ rounds and the total expected reward collected by the learner over $n$ rounds. The regret relative to a set of policies $\Pi$ is the maximum regret relative to any policy in the set.
The set $\Pi$ is often called the competitor class. (The learner competes against the strategies in $\Pi$.)
Regret measures the performance of the learner relative to the best policy in the competitor class.
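Written out, with $X_t$ denoting the learner's reward in round $t$ and $X_t^{\pi}$ the reward that would be collected in round $t$ by following policy $\pi$ (my notation, not fixed by the text), the definition reads

$$
R_n(\pi) = \mathbb{E}\left[\sum_{t=1}^{n} X_t^{\pi}\right] - \mathbb{E}\left[\sum_{t=1}^{n} X_t\right],
\qquad
R_n(\Pi) = \max_{\pi \in \Pi} R_n(\pi).
$$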
Example:
Suppose the action set is $\mathcal{A} = \{1, 2, \dots, k\}$. An environment is called a stochastic Bernoulli bandit if the reward $X_t \in \{0, 1\}$ is binary valued and there exists a vector $\mu \in [0, 1]^k$ such that the probability that $X_t = 1$ given the learner chose action $A_t = a$ is $\mu_a$. The class of stochastic Bernoulli bandits is the set of all such bandits, which are characterised by their mean vectors.
If you knew the mean vector $\mu$ associated with the environment, then the optimal policy is to play the fixed action $a^* = \operatorname{argmax}_{a \in \mathcal{A}} \mu_a$. This means that for this problem the natural competitor class is the set of constant policies $\Pi = \{\pi_1, \dots, \pi_k\}$, where $\pi_a$ chooses action $a$ in every round. The regret over $n$ rounds becomes

$$
R_n = n \max_{a \in \mathcal{A}} \mu_a - \mathbb{E}\left[\sum_{t=1}^{n} X_t\right].
$$
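A small Python sketch of this example (class and function names are mine): it builds a stochastic Bernoulli bandit from a mean vector and estimates, by simulation, the regret of a uniformly random learner and of the best constant policy.

```python
import random

class BernoulliBandit:
    """Stochastic Bernoulli bandit characterised by its mean vector mu."""

    def __init__(self, mu):
        self.mu = mu                      # mu[a] = P(X_t = 1 | A_t = a)
        self.k = len(mu)

    def pull(self, a):
        return 1.0 if random.random() < self.mu[a] else 0.0


def estimate_regret(bandit, policy, n):
    """Single-run estimate of R_n = n * max_a mu_a - E[sum_t X_t]."""
    total = sum(bandit.pull(policy(t)) for t in range(1, n + 1))
    return n * max(bandit.mu) - total


bandit = BernoulliBandit(mu=[0.3, 0.5, 0.7])
n = 10_000

# A uniformly random learner: its regret grows linearly in n.
random_policy = lambda t: random.randrange(bandit.k)
print("random policy regret ≈", estimate_regret(bandit, random_policy, n))

# The best constant policy pi_{a*} has (close to) zero regret by definition.
best_constant = lambda t: max(range(bandit.k), key=lambda a: bandit.mu[a])
print("best constant policy regret ≈", estimate_regret(bandit, best_constant, n))
```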
Asymptotic analysis of regret
A good learner achieves sublinear regret. This means that $R_n = o(n)$, or equivalently that $\lim_{n \to \infty} R_n / n = 0$. The interesting question is under what circumstances $R_{n} = O(\sqrt{n})$ or $R_{n} = O(\log(n))$ is achievable.
Different settings:
Stochastic Stationary Bandits
A simple problem setting is that of stochastic stationary bandits. In this case the environment is restricted to generate the reward in response to each action from a distribution that is specific to that action and independent of the previous action choices and rewards.
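Because each action's reward distribution is fixed over time, the learner can estimate the means from samples and concentrate on the empirically best action. As a purely illustrative learner (not an algorithm introduced in this section), here is an epsilon-greedy strategy with a decaying exploration rate on a stationary Bernoulli bandit; the schedule is a common heuristic, not a recipe with formal guarantees.

```python
import random

def epsilon_greedy(mu, n, seed=0):
    """Decaying-epsilon greedy learner on a stationary Bernoulli bandit.

    mu[a] is the fixed mean reward of action a; rewards are drawn i.i.d.
    per action, which is exactly the stochastic stationary assumption.
    Returns the realised regret against the best constant action.
    """
    rng = random.Random(seed)
    k = len(mu)
    counts = [0] * k          # number of times each action was played
    sums = [0.0] * k          # total reward collected from each action
    total = 0.0
    for t in range(1, n + 1):
        eps = min(1.0, k / t)                     # decaying exploration rate
        if rng.random() < eps:
            a = rng.randrange(k)                  # explore
        else:
            a = max(range(k), key=lambda i: sums[i] / max(counts[i], 1))  # exploit
        x = 1.0 if rng.random() < mu[a] else 0.0
        counts[a] += 1
        sums[a] += x
        total += x
    return n * max(mu) - total

for n in (1_000, 10_000, 100_000):
    print(n, round(epsilon_greedy([0.3, 0.5, 0.7], n), 1))
```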
Adversarial bandits
An extreme idea is to drop all assumptions on how the rewards are generated, except that they are chosen without knowledge of the learner’s actions and lie in a bounded set.
The learner is not expected to find the best sequence of actions, which may be like finding a needle in a haystack. Instead, we usually choose $\Pi$ to be the set of constant policies and demand that the learner is not much worse than any of these. By defining the regret in this way, the stationarity assumption is transported into the definition of regret rather than constraining the environment.
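With the adversary's rewards written as $x_{t,a} \in [0, 1]$, chosen before the game without knowledge of the learner's actions, the regret relative to the constant policies is (standard notation, my phrasing)

$$
R_n = \max_{a \in \mathcal{A}} \sum_{t=1}^{n} x_{t,a} - \mathbb{E}\left[\sum_{t=1}^{n} x_{t, A_t}\right].
$$

The best single action in hindsight plays the role that the best mean played in the stochastic case.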
Reinforcement Learning
In the settings above, the learner's available choices and rewards tomorrow are not affected by their decisions today. Reinforcement learning, by contrast, is concerned with long-term profit: actions taken now can change what is available later.
Partial Monitoring
The setting where the reward is not observed is called partial monitoring.
Application
The designers of a company website are trying to decide whether the ‘buy it now’ button should be placed at the top of the product page or at the bottom.
One way to apply bandits to this problem is to view the two versions of the site as actions. Each time $t$ a user makes a request, a bandit algorithm is used to choose an action $A_t$, and the reward is $X_t = 1$ if the user purchases the product and $X_t = 0$ otherwise.
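For illustration only (the text does not prescribe an algorithm), here is a Bernoulli Thompson sampling sketch for the two-action button-placement bandit; the purchase probabilities are made up.

```python
import random

# Hypothetical purchase probabilities for the two page layouts (made up).
TRUE_RATE = {"top": 0.042, "bottom": 0.031}

# Beta(1, 1) priors over the purchase probability of each layout.
alpha = {"top": 1.0, "bottom": 1.0}
beta = {"top": 1.0, "bottom": 1.0}

purchases = 0
for t in range(1, 100_001):
    # Thompson sampling: sample a rate for each layout, serve the argmax.
    sampled = {a: random.betavariate(alpha[a], beta[a]) for a in TRUE_RATE}
    action = max(sampled, key=sampled.get)

    # Simulated user: X_t = 1 if they purchase, 0 otherwise.
    reward = 1 if random.random() < TRUE_RATE[action] else 0
    purchases += reward

    # Posterior update for the served layout.
    alpha[action] += reward
    beta[action] += 1 - reward

print("purchases:", purchases, "out of", t, "visits")
```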
Advert Placement
In advert placement, each round corresponds to a user visiting a website, and the set of actions is the set of all available adverts. One could treat this as a standard multi-armed bandit problem, where in each round a policy chooses an advert $A_t \in \mathcal{A}$, and the reward is $X_t = 1$ if the user clicked on the advert and $X_t = 0$ otherwise.
Dynamic Pricing
Users arrive sequentially, and the learner sets the price.
- (a) the learner never actually observes the valuation of the product, only the binary signal that the price was too low/too high, and
- (b) there is a monotonicity structure in the pricing. If a user purchased an item priced at \$10, then they would surely purchase it for \$5, but whether or not it would sell when priced at \$11 is uncertain (see the sketch below).
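A sketch of my own construction showing how both observations can be used: the learner only sees a sell/no-sell signal at the posted price, and by monotonicity a sale at price $p$ is also evidence of a sale at every lower candidate price, while a no-sale is evidence of a no-sale at every higher one. The price grid, the optimistic index and the simulated valuations are all assumptions made for illustration.

```python
import random

# Candidate prices the learner may post (assumed, for illustration).
PRICES = [5, 6, 7, 8, 9, 10, 11, 12]

sells = {p: 0 for p in PRICES}      # counts of "would sell at p"
offers = {p: 0 for p in PRICES}     # counts of "p was (implicitly) tested"

def hidden_valuation():
    """Unknown to the learner: a user buys iff the price <= their valuation."""
    return random.uniform(4, 13)

def post_price():
    """Optimistic choice: pick the price with the best upper estimate of revenue."""
    def score(p):
        n = offers[p]
        if n == 0:
            return float("inf")                       # try untested prices first
        return p * (sells[p] / n + (2 / n) ** 0.5)    # revenue estimate + bonus
    return max(PRICES, key=score)

revenue = 0.0
for t in range(1, 20_001):
    p = post_price()
    sold = hidden_valuation() >= p       # only this binary signal is observed
    revenue += p if sold else 0.0

    # Monotonicity: a sale at p is also a sale at any cheaper price,
    # and a no-sale at p is also a no-sale at any dearer price.
    for q in PRICES:
        if sold and q <= p:
            offers[q] += 1
            sells[q] += 1
        elif not sold and q >= p:
            offers[q] += 1

print("total revenue:", round(revenue, 1))
```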
Network Routing
Another problem with an interesting structure is network routing, where the learner tries to direct internet traffic through the shortest path on a network. In each round the learner receives the start/end destinations for a packet of data. The set of actions is the set of all paths starting and ending at the appropriate points on some known graph.
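The distinguishing feature here is the combinatorial action set. A small sketch (graph and names are hypothetical) that simply enumerates this action set, i.e. all simple paths between the start and end nodes of a known graph:

```python
# A known graph, given as an adjacency list (hypothetical example).
GRAPH = {
    "s": ["a", "b"],
    "a": ["c", "t"],
    "b": ["c"],
    "c": ["t"],
    "t": [],
}

def all_simple_paths(graph, start, end, path=None):
    """Enumerate every simple path from start to end; each path is one action."""
    path = (path or []) + [start]
    if start == end:
        yield path
        return
    for nxt in graph[start]:
        if nxt not in path:                 # keep the path simple (no cycles)
            yield from all_simple_paths(graph, nxt, end, path)

actions = list(all_simple_paths(GRAPH, "s", "t"))
print(len(actions), "actions:", actions)
# In each round the learner picks one of these paths and observes, say, the
# negative transit time as the reward; the number of actions grows quickly
# with the size of the graph, which is what makes the structure interesting.
```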
Waiting Problems
The bus's arrival time (and hence the total travel time) is uncertain, but once you are on the bus the ride is short. Walking takes longer, but the time it takes is known. (The learner has to learn the CDF of an uncertain event.)
How long should you wait for the bus before deciding to walk?
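A sketch of my own formulation of the trade-off: from past observations, estimate the distribution of the bus arrival time, then compare thresholds by the average total travel time of "wait up to $\tau$ minutes, then walk". The walking time, bus trip time and the simulated arrival distribution are assumed values.

```python
import random

WALK_TIME = 30.0     # minutes, known and deterministic (assumed value)
BUS_TRIP = 5.0       # trip duration once on the bus (assumed value)

# Past observations of when the bus arrived (minutes after we showed up).
# In the online version these would be collected day by day; here they are simulated.
observed_arrivals = [random.expovariate(1 / 12.0) for _ in range(200)]

def expected_cost(wait_threshold, arrivals):
    """Average total time of: wait for the bus up to the threshold, else walk."""
    costs = []
    for arrival in arrivals:
        if arrival <= wait_threshold:
            costs.append(arrival + BUS_TRIP)              # caught the bus
        else:
            costs.append(wait_threshold + WALK_TIME)      # gave up and walked
    return sum(costs) / len(costs)

for tau in (0, 5, 10, 15, 20, 30):
    print(f"wait up to {tau:2d} min -> expected total {expected_cost(tau, observed_arrivals):5.1f} min")
```

The twist in the online version is that on days when you give up and walk, the arrival time is censored at the threshold, so the CDF is only partially observed.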
Resource Allocation
Recommendation (similar to the advert placement problem, but more complex)
Netflix has to decide which movies to recommend to customers.
The reward can be measured as some function of (a) whether or not you watched a movie and (b) whether or not you rated it positively.
Tree Search (?)