First dive into online learning

Oliver Burkhard
Jun 18, 2021

TL;DR: For real-valued, time-varying regression over Euclidean space, use River’s AdaptiveRandomForestRegressor.

Code to reproduce can be found here.

Goal

The goal is to make good predictions for a real-valued, time-varying function over Euclidean space. That is, a regression where the input-output relationship changes over time, with no indication of which observations from the past are still useful and which are obsolete.

Problem statement

The problem is posed in a bandit setting: there are two arms, each with a hidden value function over the (2d) input space, and whenever an arm is pulled, it returns that value plus some random noise. The goal is to pull the arm with the higher expected value, given a point in the input space.

The functions underlying the arms are scaled density functions of normal mixtures (two humps each in 2d space). Those functions are then reversed (multiplied by -1), either gradually or abruptly. The regressors have to detect this concept drift and adapt to it. A minimal sketch of the interaction loop is shown below.
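To make the bandit setting concrete, here is a hedged sketch of the interaction loop, assuming one online regressor per arm, River’s learn_one/predict_one API, and a hypothetical arm_value helper standing in for the hidden value functions; the input range, noise level, and exploration rate are illustrative assumptions, not the experiment’s exact settings.

```python
import random

def run_bandit(models, arm_value, n_examples=8000, noise_sd=0.1, epsilon=0.1):
    """models: one online regressor per arm; arm_value(arm, x, t): hidden truth."""
    n_correct = 0
    for t in range(n_examples):
        # Sample a query point from the 2d input space (range is an assumption).
        x = {"x0": random.uniform(-2, 2), "x1": random.uniform(-2, 2)}
        preds = [m.predict_one(x) for m in models]
        # Epsilon-greedy: mostly pull the arm with the higher predicted value.
        if random.random() < epsilon:
            arm = random.randrange(len(models))
        else:
            arm = max(range(len(models)), key=lambda a: preds[a])
        # Observe the hidden value plus noise, and update only the pulled arm.
        reward = arm_value(arm, x, t) + random.gauss(0.0, noise_sd)
        models[arm].learn_one(x, reward)
        n_correct += arm == max(range(len(models)), key=lambda a: arm_value(a, x, t))
    return n_correct / n_examples
```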

Contestants

The contestants are drawn from the River package. The choices were:

  • AdaptiveRandomForestRegressor, discussed here, where every tree of the ensemble has a drift detector that, once a first threshold is surpassed, starts training a backup tree, and, once a second threshold is surpassed, kills off the current tree and starts using the backup tree instead.
  • BaggingRegressor, using bagged linear regressors.
  • EpsilonGreedyRegressor, also using an ensemble of linear regressors (this time with varying learning rates). The idea is to define a metric (here I use RMSE) and pose the regression problem itself as a bandit problem: use the regressor assumed to be best (lowest RMSE) with probability 1-epsilon. Note that, as far as I can tell from the code, no forgetting takes place, so one would expect this regressor to react worse to shifts in the input-output relationship over time. Sutton & Barto, which is referenced as the source here, describe extensions that explicitly deal with this issue, so should this become relevant, be sure to check those out.

Since linear regressors are notoriously bad at capturing non-linear value functions, I gave the two contestants based on them the advantage of having polynomial features (degree 4) of the two original input dimensions. Note that this becomes infeasible as the number of input dimensions grows. As a side note, the polynomial features alone led to very bad results, so I had to throw in a StandardScaler from River to avoid exploding values resulting from the higher-order terms. A sketch of the resulting pipelines is shown below.
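The following is a minimal sketch of how the three contestants might be instantiated. It assumes the module layout of River around version 0.7 (these classes have since moved, e.g. to forest and model_selection in newer releases), and the ensemble sizes, learning rates, epsilon, and seed are illustrative guesses rather than the exact settings used.

```python
from river import ensemble, expert, feature_extraction, linear_model, metrics, optim, preprocessing

# Adaptive random forest: each tree carries its own drift detector.
arf = ensemble.AdaptiveRandomForestRegressor(n_models=10, seed=1)

def poly_pipeline(model):
    # Degree-4 polynomial features, scaled so the higher-order terms don't explode.
    return (feature_extraction.PolynomialExtender(degree=4)
            | preprocessing.StandardScaler()
            | model)

# Bagged linear regressors on the polynomial features.
bagging = poly_pipeline(
    ensemble.BaggingRegressor(model=linear_model.LinearRegression(), n_models=10, seed=1)
)

# Epsilon-greedy selection over linear regressors with varying learning rates,
# scored by RMSE (parameter names as in River ~0.7; check your version's docs).
eps_greedy = expert.EpsilonGreedyRegressor(
    models=[poly_pipeline(linear_model.LinearRegression(optimizer=optim.SGD(lr)))
            for lr in (0.001, 0.01, 0.1)],
    metric=metrics.RMSE(),
    epsilon=0.1,
    seed=1,
)
```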

Task

Ground truth to learn

Initially, the second arm is to be preferred for values near (-1, 1) and (1, -1) and along the diagonal between those two points. In the abrupt scenario, the relationship switches (values become negative) after half the examples have been seen. In the gradual scenario, the same transition happens, but smoothly, spread between one third and two thirds of the total number of examples shown.
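As a hedged sketch of such a ground truth (the mixture means, covariances, and scaling are my assumptions, as is the exact interpolation used for the gradual drift), one arm’s value function could look like this:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two-hump normal mixture; the means/covariances are illustrative guesses.
_HUMPS = [multivariate_normal(mean=[-1.0, 1.0], cov=0.5),
          multivariate_normal(mean=[1.0, -1.0], cov=0.5)]

def arm2_value(x, t, n_total, scale=10.0, gradual=False):
    """Expected value of the second arm at point x after t of n_total examples."""
    base = scale * sum(h.pdf(x) for h in _HUMPS)
    if gradual:
        # One reading of "smooth" drift: linearly interpolate the sign from +1
        # to -1 between one third and two thirds of the stream.
        frac = float(np.clip((t / n_total - 1 / 3) * 3, 0.0, 1.0))
        sign = 1.0 - 2.0 * frac
    else:
        # Abrupt flip after half the examples.
        sign = 1.0 if t < n_total / 2 else -1.0
    return sign * base
```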

Results

I present two ways of looking at the results: first, the fraction of correct choices (exponentially smoothed; a sketch of the smoothing follows the figure captions below), and second, maps of where each method would choose which arm, compared against the ground truth of which arm would actually be best.

  • Figure: Fraction of correct choices (exponentially smoothed), sudden change after 4000 examples
  • Figure: Fraction of correct choices (exponentially smoothed), gradual change between 2666 and 5333 examples
  • Figure: Choice maps of the different estimators after varying numbers of examples, sudden change after 4000 examples
  • Figure: Choice maps of the different estimators after varying numbers of examples, gradual change between 2666 and 5333 examples
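For reference, the exponential smoothing of the correct-choice indicator can be computed as below; the smoothing factor is an assumption, as the value actually used is not stated above.

```python
def smoothed_accuracy(correct_flags, alpha=0.01):
    """Exponentially smoothed fraction of correct arm choices.

    correct_flags: iterable of booleans, one per example.
    alpha: smoothing factor (an assumed value; not stated above).
    """
    smoothed, trace = 0.0, []
    for flag in correct_flags:
        smoothed = (1.0 - alpha) * smoothed + alpha * float(flag)
        trace.append(smoothed)
    return trace
```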

Discussion

  • The adaptive random forest not only has the highest fraction of correct choices, it also recovers the fastest after the drift.
  • While the epsilon-greedy experts ensemble does learn, it does so more slowly and needs a lot more time to recover.
  • The bagged linear regressors never really seem to learn much of anything.

Open questions and outlook

  • How well does the AdaptiveRandomForestRegressor fare under shifts in the input distribution with a constant input-output relationship? If the shift in the input distribution triggers the drift detectors, the model is constantly re-trained, which could lead to a deterioration in performance.
  • Can we detect the expected quality degradation of the EpsilonGreedyRegressor when changes happen late? The epsilon decay, together with the sample-average approach, can be expected to make re-learning harder as time goes on; see the sketch after this list for the standard remedy.
  • The setup is, of course, an obvious entry point to active learning, where exploration is not done uniformly at random but depends on the model’s uncertainty at the point in question.
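On the second point, the extension discussed in Sutton & Barto for non-stationary bandits replaces the sample average with a constant step-size, recency-weighted update; a minimal sketch:

```python
def update_estimate(q, reward, alpha=0.1):
    """Recency-weighted reward estimate: Q <- Q + alpha * (R - Q).

    With a constant alpha, old rewards decay geometrically, so the estimate
    can track a non-stationary arm; a sample average (alpha = 1/n) cannot.
    """
    return q + alpha * (reward - q)
```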
