Hacker News

> If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.

It can, but the time it takes grows like exp(# of samples already collected).

You can improve this by using a non-stationary Bayesian model (i.e. one that assumes conversion rates change over time) but this usually involves solving PDEs or something equally difficult.
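A cruder but common alternative to a full non-stationary model is to exponentially discount old observations so the posterior can track drifting conversion rates. A minimal sketch of my own (the class name, `gamma`, and counts are illustrative, not any vendor's implementation):

```python
import random

class DiscountedThompson:
    """Beta-Bernoulli Thompson sampling with exponential forgetting.

    gamma < 1 decays old evidence each round, so the posterior can
    track conversion rates that drift over time."""

    def __init__(self, n_arms, gamma=0.999):
        self.gamma = gamma
        self.s = [0.0] * n_arms  # discounted success counts
        self.f = [0.0] * n_arms  # discounted failure counts

    def choose(self):
        # Sample from each arm's Beta posterior; play the best draw.
        draws = [random.betavariate(s + 1, f + 1)
                 for s, f in zip(self.s, self.f)]
        return max(range(len(draws)), key=lambda i: draws[i])

    def update(self, arm, converted):
        # Decay all evidence, then record the new observation.
        for i in range(len(self.s)):
            self.s[i] *= self.gamma
            self.f[i] *= self.gamma
        if converted:
            self.s[arm] += 1
        else:
            self.f[arm] += 1
```

The effective memory is roughly 1/(1-gamma) samples, so choosing gamma is itself a modeling decision you have to get right.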

For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-bandit tests are harder because of these design decisions, despite the fact they affect any experiment process.

The point the author (me) is trying to make is not that bandits are fundamentally flawed. The point is that for A/B tests, all these problems have simple fixes: make sure to run the A/B test for long enough.

For bandits, the fixes are not nearly as simple. It usually involves non-simple math, or at the very least non-intuitive things (for instance not actually running a bandit until 1 week has passed).
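For concreteness, the "don't actually run the bandit until a week has passed" fix might look like this (a hypothetical sketch; `WARMUP_SECONDS`, the function name, and the Thompson-sampling fallback are my own illustration):

```python
import random
import time

WARMUP_SECONDS = 7 * 24 * 3600  # hold a fixed uniform split for one full week

def choose_arm(start_time, successes, failures, now=None):
    """Uniform random split during the warm-up week, then
    Beta-Bernoulli Thompson sampling on the per-arm counts."""
    now = time.time() if now is None else now
    if now - start_time < WARMUP_SECONDS:
        return random.randrange(len(successes))
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])
```

Even this simple version embeds non-obvious decisions: how long to warm up, and whether conversions observed during warm-up should seed the posteriors.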

At VWO we realized that most of our customers are not sophisticated enough to get all this stuff right, which is why we didn't switch to bandits.



> The point is that for A/B tests, all these problems have simple fixes: make sure to run the A/B test for long enough.

Multi-bandit has the same fix: make sure the test has run for long enough before adjusting sampling proportions.


So what I'm proposing to do is run A/B with a 50/50 split for a full week, then when B wins shift to 0/100 in favor of B.

You seem to be proposing to run A/B with a 50/50 split for a full week, then when B does a lot better shift to 10/90 in favor of B and maybe a few weeks later shift to 1/99.

What practical benefit do you see to this approach? From my perspective this just slows down the experimental process and keeps losing variations (and associated code complexity) around for a lot longer.


First, Google Analytics (for example) runs content experiments for a minimum of two weeks regardless of results. It's hardly an unrealistic timeframe for reliable conclusions.

> What practical benefit do you see to this approach?

Statistically rigorous results, with minimal regret.

In your example, you reach the end of the week, and your 50/50 split has one-sided p=0.10, double the usual p<0.05 criterion. What do you do?

(a) Call it in favor of B, despite being uncertain about the outcome.

(b) Keep running the test. This compromises the statistical rigor of your test.

(c) Keep running the test, but use sequential hypothesis testing, e.g. http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigo... This significantly increases the time to reach a conclusion, and costs you conversions in the meantime.

(a) and (b) are the most popular choices, despite being statistically unjustifiable. http://www.evanmiller.org/how-not-to-run-an-ab-test.html
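To see why (b) is unjustifiable, here is a quick simulation of my own (not from the linked article): run A/A tests where both arms share the same true rate, so every "significant" result is a false positive, and compare checking a fixed p<0.05 threshold at every interim look versus only at the end.

```python
import math
import random

def one_sided_p(c1, n1, c2, n2):
    """One-sided two-proportion z-test (normal approximation):
    p-value for the alternative 'arm 2 converts better than arm 1'."""
    pool = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (c2 / n2 - c1 / n1) / se
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

def simulate(n_sims=500, n_per_arm=2000, looks=10, p=0.1, alpha=0.05):
    """A/A test: both arms have the same true conversion rate p.
    Returns (false-positive rate with peeking, rate at the final look only)."""
    random.seed(42)
    step = n_per_arm // looks
    peek_hits = final_hits = 0
    for _ in range(n_sims):
        c1 = c2 = 0
        peeked = False
        for look in range(1, looks + 1):
            # Accrue one batch of visitors per arm, then "peek".
            c1 += sum(random.random() < p for _ in range(step))
            c2 += sum(random.random() < p for _ in range(step))
            n = look * step
            if one_sided_p(c1, n, c2, n) < alpha:
                peeked = True
        peek_hits += peeked
        final_hits += one_sided_p(c1, n_per_arm, c2, n_per_arm) < alpha
    return peek_hits / n_sims, final_hits / n_sims
```

Calling `simulate()` shows the peeking false-positive rate running well above the nominal 5%, while the final-look-only rate stays near it; the specific parameters here are illustrative.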

---

The essential difference when choosing the approach is that 50/50 split optimizes for shortest time to conclusion, and multi-bandit optimizes for fewest failures.

In web A/B testing, the latter is usually the most applicable, and for that, you cannot beat Thompson sampling on the average, no matter how diabolically clever your scheme. https://www.lucidchart.com/blog/2016/10/20/the-fatal-flaw-of...
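For reference, Beta-Bernoulli Thompson sampling itself is only a few lines (a sketch assuming Bernoulli conversions and uniform Beta(1,1) priors; the function name and simulation harness are my own):

```python
import random

def thompson(rates, n_rounds=5000, seed=1):
    """Thompson sampling over arms with true conversion rates `rates`:
    draw from each arm's Beta posterior, play the arm with the best
    draw, update its counts. Returns pulls per arm."""
    random.seed(seed)
    s = [0] * len(rates)      # successes per arm
    f = [0] * len(rates)      # failures per arm
    pulls = [0] * len(rates)
    for _ in range(n_rounds):
        draws = [random.betavariate(si + 1, fi + 1) for si, fi in zip(s, f)]
        arm = max(range(len(rates)), key=lambda i: draws[i])
        pulls[arm] += 1
        if random.random() < rates[arm]:
            s[arm] += 1
        else:
            f[arm] += 1
    return pulls
```

Run against two arms with a clear gap, the better arm ends up with the large majority of the traffic, which is exactly the "fewest failures" behavior described above.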

There are times when the former is more important, e.g. marketing wants to know how to brand a product that is being released next month. These are the clinical-like experiments that frequentist approaches were formulated for.


> Statistically rigorous results, with minimal regret.

The results are only statistically rigorous provided your bandit obeys relatively strong assumptions.

As another example, suppose you ran a 2-week test. Suppose that from week 1 to week 2, both conversion rates changed, but the delta between them remained roughly the same. A 50/50 A/B split doesn't mind this, and in fact still returns the right answer. Bandits do mind.
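A back-of-the-envelope version of this failure mode, with made-up rates and an illustrative traffic allocation (the specific numbers are mine, chosen only to show the mechanism):

```python
# Week-to-week rates: both arms drop, but B stays two points ahead of A.
week1 = {"A": 0.10, "B": 0.12}
week2 = {"A": 0.04, "B": 0.06}

# 50/50 A/B test: each arm gets 500 visitors per week, so both pooled
# estimates mix the two weeks identically and the true ordering survives.
ab_A = (week1["A"] * 500 + week2["A"] * 500) / 1000   # 0.07
ab_B = (week1["B"] * 500 + week2["B"] * 500) / 1000   # 0.09

# Bandit (illustrative allocation): B wins week 1, so it gets 90% of the
# week-2 traffic, leaving A's pooled estimate dominated by stale week-1 data.
bandit_A = (week1["A"] * 500 + week2["A"] * 100) / 600    # 0.09
bandit_B = (week1["B"] * 500 + week2["B"] * 900) / 1400   # ~0.081

# The naive pooled estimates now rank A above B -- the wrong answer.
```

The A/B comparison is immune because both arms always see the same mix of time periods; the bandit's unequal allocation confounds arm with time.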

I don't do p-values. I do Bayesian testing, same as you. I just recognize that in the real world, weaker assumptions are more robust to experimenter or model error, both of which are generally the dominant error mode.

> In web A/B testing, the latter is usually the most applicable, and for that, you cannot beat Thompson sampling on the average, no matter how clever your scheme.

This is simply not true. The Gittins Index beats Thompson sampling, subject again to the same strong assumptions.

Look, I know the theoretical advantages of bandits and I advocate their use under some limited circumstances. I just find that the stronger assumptions they require (or, alternatively, the much heavier math) mean they aren't a great replacement for A/B tests, which are much simpler and easier to get right.



