Ewen Gallic: ewen.gallic[at]univ-amu.fr
Pierre Michel: pierre.michel[at]univ-amu.fr
Multi-armed bandit algorithms are increasingly popular in the digital world, where collecting data adaptively is technically feasible. It has been demonstrated that balancing exploration and exploitation can achieve higher average outcomes than the "experiment first, exploit later" approach of the traditional treatment choice literature. However, there is little work on how data arising from bandits can be used to estimate treatment effects (rather than merely identifying the arm with the highest outcome). This paper contributes to this growing literature with a systematic simulation exercise that characterizes the behavior of the standard average treatment effect estimator on adaptively collected data. I show that the treatment effect estimate, which results from the difference of two negatively biased arm means, is biased away from zero, and I illustrate how this bias depends on the magnitude of the treatment effect. I also provide intuitive explanations for these phenomena. I show that propensity score weighting can even exacerbate the bias. Finally, I suggest an easy-to-implement modification of propensity score weighting that improves the estimator.