While this method is scientifically valid, it has a major drawback: if you only implement significant results, you will leave a lot of money on the table.
In this blogpost, I will argue why a post-hoc Bayesian test evaluation is a better evaluation method than a frequentist one for growing your business. If it sounds complicated, don’t worry – by the end of the post, you’ll easily be able to do your own Bayesian analyses.
The Challenges of a Successful A/B Testing Program
The sad truth is that we see a lot of testing programs die a quiet death.
There is a real challenge in keeping such programs alive. If not everyone in your organization believes in experimentation, you will have a hard time proving its worth.
Of course, you can learn from losing tests, but too many losses can kill a budding testing program.
This belief in experimentation is highly dependent upon the number of winning tests. If your win ratio is very low (say, lower than 20%, which isn’t far from the industry average, depending on who you ask), your website isn’t changing much over time. This will drain the energy right out of your testing team.
Team members have put a lot of time and energy into finding the insights, developing test variations, and analyzing them. If these efforts aren’t rewarded, their energy and motivation will drop (not to mention that stakeholder enthusiasm tends to fade quickly without ROI).
Another, more important consequence is that too many losing tests will lower your testing team’s visibility within the organization.
If you only deliver a winning variation once in a blue moon, you will not be perceived as very important to the business. Consequently, your program will be deprioritized or even discontinued.
We Need More Winners!
The solution to this problem is to get more winners out of your A/B tests.
But that’s easier said than done!
You may be able to accomplish this by improving your conversion research or testing bolder changes, but another approach would be to redefine what you perceive as a winner by changing the statistics.
OK, that may sound a bit sketchy. But there are a couple of challenges with the frequentist statistics we have traditionally used to evaluate our A/B tests.

Say What? I Don’t Understand!
The foremost problem with using frequentist statistics is the difficulty of interpreting the test outcome correctly. A t-test (which is used in frequentist statistics) checks whether the averages of two independent groups differ significantly from each other. The basic assumption of this test is that there is no difference in conversion rate between groups A and B. This is the so-called null hypothesis.
With a frequentist test evaluation you try to reject this hypothesis, because you want to prove that your test variation (B) outperforms the original (A). With a confidence level set in advance of the test (usually 90% or 95%), you judge whether the p-value of the test falls below the significance threshold (1 – confidence level). If the result is very unlikely under the null hypothesis – say, a p-value of 0.02 against a threshold of 0.05 – then you reject the null hypothesis and conclude that the conversion rate of B differs from that of A.
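To make this concrete, here is a minimal sketch of such a frequentist evaluation in Python. The visitor and conversion counts are made up for illustration, and because we are comparing conversion rates (proportions) rather than averages, the sketch uses a two-proportion z-test, a common stand-in for the t-test with this kind of data; the decision logic against the threshold is the same as described above.

# Minimal sketch of a frequentist A/B test evaluation (hypothetical numbers).
from scipy.stats import norm

# Hypothetical example data
visitors_a, conversions_a = 10_000, 520   # original (A)
visitors_b, conversions_b = 10_000, 575   # variation (B)

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

# Pooled conversion rate under the null hypothesis (A and B perform the same)
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5

z = (rate_b - rate_a) / se
p_value = 2 * norm.sf(abs(z))   # two-sided p-value

alpha = 0.05                    # 1 - the 95% confidence level chosen in advance
print(f"conversion rate A: {rate_a:.2%}, B: {rate_b:.2%}")
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
print("reject the null hypothesis" if p_value < alpha else "fail to reject the null hypothesis")

Note that the only question this answers is how surprising the observed data would be if A and B truly performed the same – it says nothing directly about how likely B is to beat A.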
Innocent Until Proven Guilty
You could compare using frequentist statistics to the process of a US trial.
The null hypothesis in a trial states that the defendant is innocent. This is the starting point of the trial: a defendant is innocent until proven guilty beyond reasonable doubt. The alternative hypothesis thus states that the defendant is guilty. The prosecutor bears the burden of proving that the defendant is not innocent by presenting incriminating evidence.
Then, this evidence is judged. The jury asks themselves the question, “could the data plausibly have happened by chance if the defendant is actually innocent?” In other words, could the null hypothesis still be true?
If the data were likely to have occurred under the assumption that the null hypothesis were true, then we would fail to reject the null hypothesis, and state that the evidence is not sufficient to suggest that the defendant is guilty.
If the data were very unlikely to have occurred under that assumption, then the evidence raises more than reasonable doubt about the null hypothesis, and hence we reject it.
In conclusion, a t-test only tells you how surprising the results are under the hypothesis that A and B perform exactly the same. I don’t know about you, but that is not the most intuitive way to read a test result.