Hacker News

> If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.

It can, but the time it takes grows like exp(# of samples already collected).

You can improve this by using a non-stationary Bayesian model (i.e. one that assumes conversion rates change over time) but this usually involves solving PDEs or something equally difficult.
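A cruder but common alternative to a full non-stationary model is to exponentially discount old observations so the posterior can track drifting conversion rates. A minimal sketch of my own (the class name, `gamma`, and counts are illustrative, not any vendor's implementation):

```python
import random

class DiscountedThompson:
    """Beta-Bernoulli Thompson sampling with exponential forgetting.

    gamma < 1 decays old evidence each round, so the posterior can
    track conversion rates that drift over time."""

    def __init__(self, n_arms, gamma=0.999):
        self.gamma = gamma
        self.s = [0.0] * n_arms  # discounted success counts
        self.f = [0.0] * n_arms  # discounted failure counts

    def choose(self):
        # Sample from each arm's Beta posterior; play the best draw.
        draws = [random.betavariate(s + 1, f + 1)
                 for s, f in zip(self.s, self.f)]
        return max(range(len(draws)), key=lambda i: draws[i])

    def update(self, arm, converted):
        # Decay all evidence, then record the new observation.
        for i in range(len(self.s)):
            self.s[i] *= self.gamma
            self.f[i] *= self.gamma
        if converted:
            self.s[arm] += 1
        else:
            self.f[arm] += 1
```

The effective memory is roughly 1/(1-gamma) samples, so choosing gamma is itself a modeling decision you have to get right.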

For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-bandit tests are harder because of these design decisions, despite the fact they affect any experiment process.

The point the author (me) is trying to make is not that bandits are fundamentally flawed. The point is that for A/B tests, all these problems have simple fixes: make sure to run the A/B test for long enough.

For bandits, the fixes are not nearly as simple. It usually involves non-simple math, or at the very least non-intuitive things (for instance not actually running a bandit until 1 week has passed).
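For concreteness, the "don't actually run the bandit until a week has passed" fix might look like this (a hypothetical sketch; `WARMUP_SECONDS`, the function name, and the Thompson-sampling fallback are my own illustration):

```python
import random
import time

WARMUP_SECONDS = 7 * 24 * 3600  # hold a fixed uniform split for one full week

def choose_arm(start_time, successes, failures, now=None):
    """Uniform random split during the warm-up week, then
    Beta-Bernoulli Thompson sampling on the per-arm counts."""
    now = time.time() if now is None else now
    if now - start_time < WARMUP_SECONDS:
        return random.randrange(len(successes))
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])
```

Even this simple version embeds non-obvious decisions: how long to warm up, and whether conversions observed during warm-up should seed the posteriors.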

At VWO we realized that most of our customers are not sophisticated enough to get all this stuff right, which is why we didn't switch to bandits.



> The point is that for A/B tests, all these problems have simple fixes: make sure to run the A/B test for long enough.

Multi-bandit has the same fix: make sure the test has run for long enough before adjusting sampling proportions.


So what I'm proposing to do is run A/B with a 50/50 split for a full week, then when B wins shift to 0/100 in favor of B.

You seem to be proposing to run A/B with a 50/50 split for a full week, then when B does a lot better shift to 10/90 in favor of B and maybe a few weeks later shift to 1/99.

What practical benefit do you see to this approach? From my perspective this just slows down the experimental process and keeps losing variations (and associated code complexity) around for a lot longer.


First, Google Analytics (for example) runs content experiments for a minimum of two weeks regardless of results. It's hardly an unrealistic timeframe for reliable conclusions.

> What practical benefit do you see to this approach?

Statistically rigorous results, with minimal regret.

In your example, you reach the end of the week, and your 50/50 split has one-sided p=0.10, double the usual p<0.05 criterion. What do you do?

(a) Call it in favor of B, despite being uncertain about the outcome.

(b) Keep running the test. This compromises the statistical rigor of your test.

(c) Keep running the test, but use sequential hypothesis testing, e.g. http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigo... This significantly increases the time to reach a conclusion, and costs you conversions in the meantime.

(a) and (b) are the most popular choices, despite being statistically unjustifiable. http://www.evanmiller.org/how-not-to-run-an-ab-test.html
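To see why (b) is unjustifiable, here is a quick simulation of my own (not from the linked article): run A/A tests where both arms share the same true rate, so every "significant" result is a false positive, and compare checking a fixed p<0.05 threshold at every interim look versus only at the end.

```python
import math
import random

def one_sided_p(c1, n1, c2, n2):
    """One-sided two-proportion z-test (normal approximation):
    p-value for the alternative 'arm 2 converts better than arm 1'."""
    pool = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (c2 / n2 - c1 / n1) / se
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

def simulate(n_sims=500, n_per_arm=2000, looks=10, p=0.1, alpha=0.05):
    """A/A test: both arms have the same true conversion rate p.
    Returns (false-positive rate with peeking, rate at the final look only)."""
    random.seed(42)
    step = n_per_arm // looks
    peek_hits = final_hits = 0
    for _ in range(n_sims):
        c1 = c2 = 0
        peeked = False
        for look in range(1, looks + 1):
            # Accrue one batch of visitors per arm, then "peek".
            c1 += sum(random.random() < p for _ in range(step))
            c2 += sum(random.random() < p for _ in range(step))
            n = look * step
            if one_sided_p(c1, n, c2, n) < alpha:
                peeked = True
        peek_hits += peeked
        final_hits += one_sided_p(c1, n_per_arm, c2, n_per_arm) < alpha
    return peek_hits / n_sims, final_hits / n_sims
```

Calling `simulate()` shows the peeking false-positive rate running well above the nominal 5%, while the final-look-only rate stays near it; the specific parameters here are illustrative.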

---

The essential difference when choosing the approach is that 50/50 split optimizes for shortest time to conclusion, and multi-bandit optimizes for fewest failures.

In web A/B testing, the latter is usually the most applicable, and for that, you cannot beat Thompson sampling on the average, no matter how diabolically clever your scheme. https://www.lucidchart.com/blog/2016/10/20/the-fatal-flaw-of...
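For reference, Beta-Bernoulli Thompson sampling itself is only a few lines (a sketch assuming Bernoulli conversions and uniform Beta(1,1) priors; the function name and simulation harness are my own):

```python
import random

def thompson(rates, n_rounds=5000, seed=1):
    """Thompson sampling over arms with true conversion rates `rates`:
    draw from each arm's Beta posterior, play the arm with the best
    draw, update its counts. Returns pulls per arm."""
    random.seed(seed)
    s = [0] * len(rates)      # successes per arm
    f = [0] * len(rates)      # failures per arm
    pulls = [0] * len(rates)
    for _ in range(n_rounds):
        draws = [random.betavariate(si + 1, fi + 1) for si, fi in zip(s, f)]
        arm = max(range(len(rates)), key=lambda i: draws[i])
        pulls[arm] += 1
        if random.random() < rates[arm]:
            s[arm] += 1
        else:
            f[arm] += 1
    return pulls
```

Run against two arms with a clear gap, the better arm ends up with the large majority of the traffic, which is exactly the "fewest failures" behavior described above.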

There are times when the former is more important, e.g. marketing wants to know how to brand a product that is being released next month. These are the clinical-like experiments that frequentist approaches were formulated for.


> Statistically rigorous results, with minimal regret.

The results are only statistically rigorous provided your bandit obeys relatively strong assumptions.

As another example, suppose you ran a 2-week test. Suppose that from week 1 to week 2, both conversion rates changed, but the delta between them remained roughly the same. A 50/50 A/B split doesn't mind this, and in fact still returns the right answer. Bandits do mind.
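A back-of-the-envelope version of this failure mode, with made-up rates and an illustrative traffic allocation (the specific numbers are mine, chosen only to show the mechanism):

```python
# Week-to-week rates: both arms drop, but B stays two points ahead of A.
week1 = {"A": 0.10, "B": 0.12}
week2 = {"A": 0.04, "B": 0.06}

# 50/50 A/B test: each arm gets 500 visitors per week, so both pooled
# estimates mix the two weeks identically and the true ordering survives.
ab_A = (week1["A"] * 500 + week2["A"] * 500) / 1000   # 0.07
ab_B = (week1["B"] * 500 + week2["B"] * 500) / 1000   # 0.09

# Bandit (illustrative allocation): B wins week 1, so it gets 90% of the
# week-2 traffic, leaving A's pooled estimate dominated by stale week-1 data.
bandit_A = (week1["A"] * 500 + week2["A"] * 100) / 600    # 0.09
bandit_B = (week1["B"] * 500 + week2["B"] * 900) / 1400   # ~0.081

# The naive pooled estimates now rank A above B -- the wrong answer.
```

The A/B comparison is immune because both arms always see the same mix of time periods; the bandit's unequal allocation confounds arm with time.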

I don't do p-values. I do Bayesian testing, same as you. I just recognize that in the real world, weaker assumptions are more robust to experimenter or model error, both of which are generally the dominant error mode.

> In web A/B testing, the latter is usually the most applicable, and for that, you cannot beat Thompson sampling on the average, no matter how clever your scheme.

This is simply not true. The Gittins Index beats Thompson sampling, subject again to the same strong assumptions.

Look, I know the theoretical advantages of bandits and I advocate their use under some limited circumstances. I just find that the stronger assumptions they require (or, alternatively, the much heavier math) mean they aren't a great replacement for A/B tests, which are much simpler and easier to get right.



