Online Learning (1): Intro

Starting a series of notes on online learning and bandits. I have been meaning to study this topic systematically for a while, and I am aiming to work through it over the 2022 summer. 🤤🤤🤤
Basic Setting
A bandit problem is a sequential game between a learner and an environment.
The game is played over $n$ rounds, where $n$ is a positive natural number called the horizon. In each round $t \in \{1, \dots, n\}$, the learner first chooses an action $A_t$ from a given set $\mathcal{A}$, and the environment then reveals a reward $X_t \in \mathbb{R}$.
The learner cannot peek into the future when choosing their actions: the action $A_t$ should only depend on the history $H_{t-1} = (A_1, X_1, \dots, A_{t-1}, X_{t-1})$.
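To make the protocol concrete, here is a minimal sketch of the interaction loop. It is not from the text: the environment and the uniformly random learner are placeholders I introduce for illustration.

```python
import random

def run(choose_action, reward_of, n):
    """Play the bandit game: choose_action(history) -> action, reward_of(action) -> reward."""
    history = []          # (A_1, X_1, ..., A_{t-1}, X_{t-1})
    total = 0.0
    for t in range(1, n + 1):
        a = choose_action(history)   # A_t may only depend on the history
        x = reward_of(a)             # environment reveals X_t
        history.append((a, x))
        total += x
    return total

# Example: three actions with Bernoulli rewards, and a learner that ignores the history.
actions = [0, 1, 2]
means = {0: 0.3, 1: 0.5, 2: 0.7}
total = run(
    choose_action=lambda history: random.choice(actions),
    reward_of=lambda a: float(random.random() < means[a]),
    n=1000,
)
print(f"total reward over 1000 rounds: {total:.0f}")
```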
Regret
DEFINITION: The regret of the learner relative to a policy $\pi$ (not necessarily the one followed by the learner) is the difference between the total expected reward using policy $\pi$ for $n$ rounds and the total expected reward collected by the learner over $n$ rounds. The regret relative to a set of policies $\Pi$ is the maximum regret relative to any policy in the set.
The set $\Pi$ is often called the competitor class. (The learner competes against the strategies in $\Pi$.)
Regret measures the performance of the learner relative to the best policy in the competitor class.
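Written out, with $X_t$ denoting the learner's reward in round $t$ and $X_t^{\pi}$ the reward that would be collected in round $t$ by following policy $\pi$ (my notation, not fixed by the text), the definition reads

$$
R_n(\pi) = \mathbb{E}\left[\sum_{t=1}^{n} X_t^{\pi}\right] - \mathbb{E}\left[\sum_{t=1}^{n} X_t\right],
\qquad
R_n(\Pi) = \max_{\pi \in \Pi} R_n(\pi).
$$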
Example:
Suppose the action set is $\mathcal{A} = \{1, 2, \dots, k\}$. An environment is called a stochastic Bernoulli bandit if the reward $X_t \in \{0, 1\}$ is binary valued and there exists a vector $\mu \in [0, 1]^k$ such that the probability that $X_t = 1$ given the learner chose action $A_t = a$ is $\mu_a$. The class of stochastic Bernoulli bandits is the set of all such bandits, which are characterised by their mean vectors.
If you knew the mean vector $\mu$ associated with the environment, then the optimal policy is to play the fixed action $a^* = \operatorname{argmax}_{a \in \mathcal{A}} \mu_a$. This means that for this problem the natural competitor class is the set of constant policies $\Pi = \{\pi_1, \dots, \pi_k\}$, where $\pi_a$ chooses action $a$ in every round. The regret over $n$ rounds becomes

$$
R_n = n \max_{a \in \mathcal{A}} \mu_a - \mathbb{E}\left[\sum_{t=1}^{n} X_t\right].
$$
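A small Python sketch of this example (class and function names are mine): it builds a stochastic Bernoulli bandit from a mean vector and estimates, by simulation, the regret of a uniformly random learner and of the best constant policy.

```python
import random

class BernoulliBandit:
    """Stochastic Bernoulli bandit characterised by its mean vector mu."""

    def __init__(self, mu):
        self.mu = mu                      # mu[a] = P(X_t = 1 | A_t = a)
        self.k = len(mu)

    def pull(self, a):
        return 1.0 if random.random() < self.mu[a] else 0.0


def estimate_regret(bandit, policy, n):
    """Single-run estimate of R_n = n * max_a mu_a - E[sum_t X_t]."""
    total = sum(bandit.pull(policy(t)) for t in range(1, n + 1))
    return n * max(bandit.mu) - total


bandit = BernoulliBandit(mu=[0.3, 0.5, 0.7])
n = 10_000

# A uniformly random learner: its regret grows linearly in n.
random_policy = lambda t: random.randrange(bandit.k)
print("random policy regret ≈", estimate_regret(bandit, random_policy, n))

# The best constant policy pi_{a*} has (close to) zero regret by definition.
best_constant = lambda t: max(range(bandit.k), key=lambda a: bandit.mu[a])
print("best constant policy regret ≈", estimate_regret(bandit, best_constant, n))
```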
Asymptotic analysis of regret
A good learner achieves sublinear regret. This means that $R_n = o(n)$, or equivalently that $\lim_{n \to \infty} R_n / n = 0$. The interesting question is under what circumstances $R_{n} = O(\sqrt{n})$ or $R_{n} = O(\log(n))$ is achievable.
Different settings:
Stochastic Stationary Bandits
A simple problem setting is that of stochastic stationary bandits. In this case the environment is restricted to generate the reward in response to each action from a distribution that is specific to that action and independent of the previous action choices and rewards.
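Because each action's reward distribution is fixed over time, the learner can estimate the means from samples and concentrate on the empirically best action. As a purely illustrative learner (not an algorithm introduced in this section), here is an epsilon-greedy strategy with a decaying exploration rate on a stationary Bernoulli bandit; the schedule is a common heuristic, not a recipe with formal guarantees.

```python
import random

def epsilon_greedy(mu, n, seed=0):
    """Decaying-epsilon greedy learner on a stationary Bernoulli bandit.

    mu[a] is the fixed mean reward of action a; rewards are drawn i.i.d.
    per action, which is exactly the stochastic stationary assumption.
    Returns the realised regret against the best constant action.
    """
    rng = random.Random(seed)
    k = len(mu)
    counts = [0] * k          # number of times each action was played
    sums = [0.0] * k          # total reward collected from each action
    total = 0.0
    for t in range(1, n + 1):
        eps = min(1.0, k / t)                     # decaying exploration rate
        if rng.random() < eps:
            a = rng.randrange(k)                  # explore
        else:
            a = max(range(k), key=lambda i: sums[i] / max(counts[i], 1))  # exploit
        x = 1.0 if rng.random() < mu[a] else 0.0
        counts[a] += 1
        sums[a] += x
        total += x
    return n * max(mu) - total

for n in (1_000, 10_000, 100_000):
    print(n, round(epsilon_greedy([0.3, 0.5, 0.7], n), 1))
```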
Adversarial bandits
An extreme idea is to drop all assumptions on how the rewards are generated, except that they are chosen without knowledge of the learner’s actions and lie in a bounded set.
The learner is not expected to find the best sequence of actions, which may be like finding a needle in a haystack. Instead, we usually choose $\Pi$ to be the set of constant policies and demand that the learner is not much worse than any of these. By defining the regret in this way, the stationarity assumption is transported into the definition of regret rather than constraining the environment.
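With the adversary's rewards written as $x_{t,a} \in [0, 1]$, chosen before the game without knowledge of the learner's actions, the regret relative to the constant policies is (standard notation, my phrasing)

$$
R_n = \max_{a \in \mathcal{A}} \sum_{t=1}^{n} x_{t,a} - \mathbb{E}\left[\sum_{t=1}^{n} x_{t, A_t}\right].
$$

The best single action in hindsight plays the role that the best mean played in the stochastic case.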
Reinforcement Learning
In the settings above, the learner's available choices and rewards tomorrow are not affected by their decisions today. Reinforcement learning, by contrast, is concerned with long-term profit: actions taken now can change what is available later.
Partial Monitoring
The setting where the reward is not observed is called partial monitoring.
Application
The designers of a company website are trying to decide whether the ‘buy it now’ button should be placed at the top of the product page or at the bottom.
One way to apply bandits to this problem is to view the two versions of the site as actions. Each time $t$ a user makes a request, a bandit algorithm is used to choose an action $A_t$, and the reward is $X_t = 1$ if the user purchases the product and $X_t = 0$ otherwise.
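For illustration only (the text does not prescribe an algorithm), here is a Bernoulli Thompson sampling sketch for the two-action button-placement bandit; the purchase probabilities are made up.

```python
import random

# Hypothetical purchase probabilities for the two page layouts (made up).
TRUE_RATE = {"top": 0.042, "bottom": 0.031}

# Beta(1, 1) priors over the purchase probability of each layout.
alpha = {"top": 1.0, "bottom": 1.0}
beta = {"top": 1.0, "bottom": 1.0}

purchases = 0
for t in range(1, 100_001):
    # Thompson sampling: sample a rate for each layout, serve the argmax.
    sampled = {a: random.betavariate(alpha[a], beta[a]) for a in TRUE_RATE}
    action = max(sampled, key=sampled.get)

    # Simulated user: X_t = 1 if they purchase, 0 otherwise.
    reward = 1 if random.random() < TRUE_RATE[action] else 0
    purchases += reward

    # Posterior update for the served layout.
    alpha[action] += reward
    beta[action] += 1 - reward

print("purchases:", purchases, "out of", t, "visits")
```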
Advert Placement
In advert placement, each round corresponds to a user visiting a website, and the set of actions is the set of all available adverts. One could treat this as a standard multi-armed bandit problem, where in each round a policy chooses an advert $A_t \in \mathcal{A}$, and the reward is $X_t = 1$ if the user clicked on the advert and $X_t = 0$ otherwise.
Dynamic Pricing
Users arrive sequentially, and the learner sets the price.
- (a) the learner never actually observes the valuation of the product, only the binary signal that the price was too low/too high, and
- (b) there is a monotonicity structure in the pricing. If a user purchased an item priced at \$10, then they would surely purchase it for \$5, but whether or not it would sell when priced at \$11 is uncertain (see the sketch below).
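A sketch of my own construction showing how both observations can be used: the learner only sees a sell/no-sell signal at the posted price, and by monotonicity a sale at price $p$ is also evidence of a sale at every lower candidate price, while a no-sale is evidence of a no-sale at every higher one. The price grid, the optimistic index and the simulated valuations are all assumptions made for illustration.

```python
import random

# Candidate prices the learner may post (assumed, for illustration).
PRICES = [5, 6, 7, 8, 9, 10, 11, 12]

sells = {p: 0 for p in PRICES}      # counts of "would sell at p"
offers = {p: 0 for p in PRICES}     # counts of "p was (implicitly) tested"

def hidden_valuation():
    """Unknown to the learner: a user buys iff the price <= their valuation."""
    return random.uniform(4, 13)

def post_price():
    """Optimistic choice: pick the price with the best upper estimate of revenue."""
    def score(p):
        n = offers[p]
        if n == 0:
            return float("inf")                       # try untested prices first
        return p * (sells[p] / n + (2 / n) ** 0.5)    # revenue estimate + bonus
    return max(PRICES, key=score)

revenue = 0.0
for t in range(1, 20_001):
    p = post_price()
    sold = hidden_valuation() >= p       # only this binary signal is observed
    revenue += p if sold else 0.0

    # Monotonicity: a sale at p is also a sale at any cheaper price,
    # and a no-sale at p is also a no-sale at any dearer price.
    for q in PRICES:
        if sold and q <= p:
            offers[q] += 1
            sells[q] += 1
        elif not sold and q >= p:
            offers[q] += 1

print("total revenue:", round(revenue, 1))
```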
Network Routing
Another problem with an interesting structure is network routing, where the learner tries to direct internet traffic through the shortest path on a network. In each round the learner receives the start/end destinations for a packet of data. The set of actions is the set of all paths starting and ending at the appropriate points on some known graph.
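The distinguishing feature here is the combinatorial action set. A small sketch (graph and names are hypothetical) that simply enumerates this action set, i.e. all simple paths between the start and end nodes of a known graph:

```python
# A known graph, given as an adjacency list (hypothetical example).
GRAPH = {
    "s": ["a", "b"],
    "a": ["c", "t"],
    "b": ["c"],
    "c": ["t"],
    "t": [],
}

def all_simple_paths(graph, start, end, path=None):
    """Enumerate every simple path from start to end; each path is one action."""
    path = (path or []) + [start]
    if start == end:
        yield path
        return
    for nxt in graph[start]:
        if nxt not in path:                 # keep the path simple (no cycles)
            yield from all_simple_paths(graph, nxt, end, path)

actions = list(all_simple_paths(GRAPH, "s", "t"))
print(len(actions), "actions:", actions)
# In each round the learner picks one of these paths and observes, say, the
# negative transit time as the reward; the number of actions grows quickly
# with the size of the graph, which is what makes the structure interesting.
```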
Waiting Problems
The bus's arrival time (and hence the total travel time) is uncertain, but once you are on the bus the ride is short. Walking takes longer, but the time it takes is known. (The learner has to learn the CDF of an uncertain event.)
How long should you wait for the bus before deciding to walk?
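A sketch of my own formulation of the trade-off: from past observations, estimate the distribution of the bus arrival time, then compare thresholds by the average total travel time of "wait up to $\tau$ minutes, then walk". The walking time, bus trip time and the simulated arrival distribution are assumed values.

```python
import random

WALK_TIME = 30.0     # minutes, known and deterministic (assumed value)
BUS_TRIP = 5.0       # trip duration once on the bus (assumed value)

# Past observations of when the bus arrived (minutes after we showed up).
# In the online version these would be collected day by day; here they are simulated.
observed_arrivals = [random.expovariate(1 / 12.0) for _ in range(200)]

def expected_cost(wait_threshold, arrivals):
    """Average total time of: wait for the bus up to the threshold, else walk."""
    costs = []
    for arrival in arrivals:
        if arrival <= wait_threshold:
            costs.append(arrival + BUS_TRIP)              # caught the bus
        else:
            costs.append(wait_threshold + WALK_TIME)      # gave up and walked
    return sum(costs) / len(costs)

for tau in (0, 5, 10, 15, 20, 30):
    print(f"wait up to {tau:2d} min -> expected total {expected_cost(tau, observed_arrivals):5.1f} min")
```

The twist in the online version is that on days when you give up and walk, the arrival time is censored at the threshold, so the CDF is only partially observed.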
Resource Allocation
Recommendation (similar to the advert placement problem, but more complex)
Netflix has to decide which movies to recommend to customers.
The reward can be measured as some function of (a) whether or not you watched a movie and (b) whether or not you rated it positively.
Tree Search (?)