
He also is using 95% confidence as a cut-off. Don't do that. You don't need much more data to massively increase the confidence level, and so if the cost of collecting it is not prohibitive you absolutely should go ahead and do that.

Statistical significance grows roughly like the square root of the number of samples. Moving from 2-sigma (95%) to the physics gold-standard of 5-sigma requires drastically more data in almost all cases.

Selecting a measurement's uncertainty is something which should be carefully considered. Sometimes you only care about something to 10%, sometimes a 1-in-a-million part failure kills someone's Mom.

If you're doing lots of A/B testing, where trials penalties add up, it might be worth looking into the way that LIGO handles False Alarm Rates. They have to contend with a lot of non-Gaussian noise/glitches.
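To make the trials penalty concrete, here is a minimal sketch of the textbook version. This is not how LIGO computes false alarm rates. It assumes every test is independent and run at the same per-test level, and the function names are made up for illustration.

```python
def family_wise_false_alarm(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent
    tests, each run at false-positive rate alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def per_test_alpha(target: float, n_tests: int) -> float:
    """Per-test rate that keeps the overall false-alarm chance at `target`
    (the Sidak correction, which also assumes independent tests)."""
    return 1.0 - (1.0 - target) ** (1.0 / n_tests)


for n in (1, 5, 20, 100):
    overall = family_wise_false_alarm(0.05, n)
    needed = per_test_alpha(0.05, n)
    print(f"{n:>3} tests at 95%: {overall:.1%} chance of a false alarm; "
          f"per-test alpha for 5% overall: {needed:.5f}")
```

Real A/B tests often share users or metrics, so the independence assumption is only a starting point.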



> Statistical significance grows roughly like the square root of the number of samples.

No, no, no. You are confusing the growth of the standard deviation (which does grow like the square root of the number of samples) with the increase in certainty as you add standard deviations. The chance of error falls off like e^(-O(t^2)), where t is the number of standard deviations. This literally falls off faster than exponential.

What does this mean in the real world? In a standard 2-tailed test you get to 95% confidence at 1.96 standard deviations, 99% confidence at 2.58 standard deviations, and 99.9% confidence at 3.29 standard deviations. These numbers are all a long ways away from 5 standard deviations.
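Those thresholds can be checked with a short sketch, assuming a normal sampling distribution and using Python's standard-library `statistics.NormalDist`. The helper names are illustrative only.

```python
from statistics import NormalDist


def two_tailed_z(confidence: float) -> float:
    """Standard deviations needed for a two-tailed test at `confidence`."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def two_tailed_p(z: float) -> float:
    """Two-tailed chance of landing at least z standard deviations out."""
    return 2.0 * (1.0 - NormalDist().cdf(z))


for level in (0.95, 0.99, 0.999):
    print(f"{level:.1%} confidence -> {two_tailed_z(level):.2f} standard deviations")

# For comparison, the two-tailed error rate at 5 standard deviations.
print(f"5 standard deviations -> p = {two_tailed_p(5.0):.2e}")
```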

Let's flip that around and take 95% confidence as your base. If you are measuring a real difference, then on average 99% confidence requires a test to get 32% more data, and 99.9% confidence requires a test to get 68% more data. Depending on your business, the number of samples that you get is often proportional to the time it takes to run the test. If making errors with x% of your company involves significant dollar figures, the cost of running all of your tests to higher confidence tends to be much, much less than the cost of one mistake.

That is why I say that if the cost of collecting more data is not prohibitive, you shouldn't be satisfied with 95% confidence.


Assume a random variable is barely resolved at 1-sigma off zero with N samples. If I wish to increase my confidence that it really is off zero (and the mean with N samples is actually the mean of the distribution), then I'll need 4N samples to halve my uncertainty and double the significance of the observation (as measured in sigma-units). It is in that sense that the significance of a measurement increases like \sqrt(N).

Viewed from my perspective, if you'd like to go from 2-sigma (95%) to 3.29-sigma, you'd need (3.29^2)/(2^2)=2.7 times the amount of data used to get the 2-sigma result, or 170% more samples.

It looks like you've reached your conclusion that I'd need 68% more data to reach 99.9% by taking 3.29/1.96=1.68. I believe that this is in error. Uncertainty (in standard deviation) decreases like 1/\sqrt(N), not 1/N.
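The two scalings can be put side by side in a small sketch. It assumes significance grows like \sqrt(N) and takes 1.96 as the two-tailed 95% threshold for the linear case; `data_ratio` is a made-up helper name.

```python
def data_ratio(z_from: float, z_to: float) -> float:
    """Factor by which the sample count must grow to move from z_from to
    z_to sigma, assuming significance scales like sqrt(N)."""
    return (z_to / z_from) ** 2


# Quadratic scaling (what sqrt(N) implies) versus the linear shortcut.
quadratic = data_ratio(2.0, 3.29)
linear = 3.29 / 1.96

print(f"quadratic: {quadratic:.2f}x the data ({100 * (quadratic - 1):.0f}% more)")
print(f"linear:    {linear:.2f}x the data ({100 * (linear - 1):.0f}% more)")
print(f"2-sigma to 5-sigma: {data_ratio(2.0, 5.0):.2f}x the data")
```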

\sqrt(N) has driven me to depression more than once.


You are right that I used linear where I should have used quadratic.

However consider this. To go from 95% to 99% confidence takes 73% more data collection. So for 73% more data, you get 5x fewer mistakes.

To go from 95% to 99.9% confidence takes 182% more data. So for less than 3x the data, you get 50x fewer mistakes.
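Here is a sketch of that trade-off, assuming normal statistics, a two-tailed test, and significance that scales like \sqrt(N). The function names are hypothetical, and "mistakes" means false positives only.

```python
from statistics import NormalDist


def z_two_tailed(confidence: float) -> float:
    """Two-tailed threshold, in standard deviations, for `confidence`."""
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)


def extra_data_percent(base: float, target: float) -> float:
    """Percent more samples needed to move from `base` to `target`
    confidence, using the quadratic (sqrt(N)) scaling."""
    ratio = (z_two_tailed(target) / z_two_tailed(base)) ** 2
    return 100.0 * (ratio - 1.0)


def error_reduction(base: float, target: float) -> float:
    """How many times rarer a false positive becomes."""
    return (1.0 - base) / (1.0 - target)


for target in (0.99, 0.999):
    print(f"95% -> {target:.1%}: {extra_data_percent(0.95, target):.0f}% more data, "
          f"{error_reduction(0.95, target):.0f}x fewer false positives")
```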

My point remains. Confidence improves very, very rapidly.


Neat to see a different side of a coin. In our lab, individual measurements can take as long as a year. \sqrt(N), when constrained by human realities, presents a wall beyond which we cannot pass without experimental innovation.

As the derivative of \sqrt(N) is 1/(2\sqrt(N)), your first measurement teaches you the most. Every measurement teaches you less than the last. In general, we measure as much as we must, double the size of the dataset as a consistency check, and move on. The allocation of time is one of the most important decisions of an experimenter.
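The diminishing returns are easy to tabulate. This is only a sketch that treats \sqrt(N) as a stand-in for the information gained from N equally good measurements; `marginal_gain` is an illustrative name.

```python
import math


def marginal_gain(n: int) -> float:
    """Increase in sqrt(N) from adding one measurement to n existing ones."""
    return math.sqrt(n + 1) - math.sqrt(n)


for n in (0, 1, 4, 25, 100):
    print(f"measurement {n + 1:>3}: gain {marginal_gain(n):.3f}")

# Doubling a dataset of any size improves sqrt(N) by the same factor.
print(f"doubling the dataset: {math.sqrt(2) - 1:.1%} improvement")
```

For large N the gain is close to the derivative 1/(2\sqrt(N)), which is why doubling the dataset buys only about 41%.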


Ah. Well I talk about the cost of data acquisition for a reason.

I've seen a number of businesses who have a current body of active users, and this does not change that fast. So when they run an A/B test, before long their active users are all in it, and before too much longer those of their active users who would have done X will have done X, and data stops piling up. In that case there is a natural amount of data to collect, and you've got to stop at that point and do the best you can.

Businesses are as alike as snowflakes - I am happy to talk about generalities but in the end you have to know what your business looks like and customize to that.


If memory serves, though, the further out you push your sigmas, the greater the likelihood of a Type II error.
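A rough sketch of that trade-off at a fixed sample size: it assumes the observed z-score is normally distributed around the true effect (in units of the standard error) with unit variance, and that the test is two-tailed. The function name and the 3-sigma example effect are made up for illustration.

```python
from statistics import NormalDist


def type2_rate(true_effect_sigma: float, threshold_sigma: float) -> float:
    """Chance of missing a real effect: the observed z-score lands inside
    +/- threshold_sigma even though the true effect is nonzero."""
    nd = NormalDist()
    upper = nd.cdf(threshold_sigma - true_effect_sigma)
    lower = nd.cdf(-threshold_sigma - true_effect_sigma)
    return upper - lower


# Hypothetical true effect worth 3 standard errors at the current sample size.
for threshold in (1.96, 2.58, 3.29, 5.0):
    print(f"threshold {threshold:.2f} sigma: "
          f"miss rate {type2_rate(3.0, threshold):.1%}")
```

Collecting more data raises the true effect in standard-error units, which is how the miss rate comes back down at a stricter threshold.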

There is no easy answer in statistics!



