We live in exciting times for sure. “Big data” (enormous databases and the methods for analysing them) is creating all kinds of new knowledge. So I’m not saying it’s all hype, and I did, for example, enjoy reading Kenneth Cukier and Viktor Mayer-Schönberger’s book Big Data.
But there sure is a lot of hype around as well. One particular meme I’m not so keen on is the claim that we now live in a whole new “N = all” world where statistics is no longer needed, since we can just check from the data exactly how many x are y (e.g. people who live in London and bought something online last month, or something else that in the past we would have had to estimate from a sample). Yes, a lot of information like this is now easily available, and the big data advocates have many cool anecdotes to tell. And Google probably knows more about us than we do ourselves.
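To make the contrast concrete, here is a minimal sketch of the two routes, with entirely made-up numbers: in the “N = all” world you simply count over the full database, while the classical route draws a sample and attaches a confidence interval to the estimate. The population, the 30% share, and the sample size are all hypothetical.

```python
import math
import random

def estimate_proportion(sample, z=1.96):
    """Estimate a population proportion from a 0/1 sample,
    with a normal-approximation 95% confidence interval."""
    n = len(sample)
    p_hat = sum(sample) / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - z * se, p_hat + z * se)

# Hypothetical population: 1 = a Londoner who bought something online
# last month, 0 = anyone else. The 30% share is invented for illustration.
random.seed(0)
population = [1 if random.random() < 0.30 else 0 for _ in range(1_000_000)]

# The "N = all" route: just count over the whole database.
true_share = sum(population) / len(population)

# The classical route: a sample of 1,000 and an interval estimate.
sample = random.sample(population, 1_000)
p_hat, (lo, hi) = estimate_proportion(sample)
print(f"true share: {true_share:.3f}")
print(f"sample estimate: {p_hat:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

When the full database really is at hand, the count settles the question and the interval is redundant. The point of what follows is that in many settings that database simply cannot exist.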
One obvious situation where old-fashioned statistical inference will still be needed for some time is medical research. Say you’re developing a new drug. You will need to run your phase 1, 2, and 3 trials just as before, and convince people at each stage that it’s safe to carry on. Unless, that is, you can somehow feed your prototype drug to everyone in the world, record the outcomes in your data lake, and do your data mining? There are surely many other situations like it, outside of academia as well. One of my previous jobs was in bank stress testing, which requires econometric modelling with very limited data sets and, yes, plenty of statistical inference.
I suppose that in terms of the hype cycle, we are still at the peak of inflated expectations. Eventually all of these new methods will find their place in the great toolbox of data analytics. Right next to the boring old regression models, and the slightly less old and never boring decision trees and neural networks.