Analytics, lean and fat

I’ve been doing a bit of reading on lean analytics this week. It has reminded me of the many MI packs I’ve seen, updated, and developed in risk management. (Note that I’ve not read the Lean Analytics book by Ben Yoskovitz and Alistair Croll though, which is where most of this stuff comes from.)

There is lots that makes sense (and isn’t anything new) in lean analytics. Using proportions and rates rather than absolute values, making comparisons across time, and so on, are all still good advice. The notion of a “Vanity Metric” is also a nice one to keep in mind: something that makes you feel good and looks impressive, but isn’t actually either very informative or actionable, is of little value.
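As a minimal illustration of the rates-over-absolutes point (with entirely made-up numbers), a growing customer count can happily hide a falling conversion rate:

```python
# Illustrative sketch with hypothetical numbers: the absolute count of new
# customers rises each month, but the conversion rate from visitors falls.
visitors = [10_000, 20_000, 40_000]    # site visitors per month (hypothetical)
customers = [500, 800, 1_000]          # new customers per month (hypothetical)

for month, (v, c) in enumerate(zip(visitors, customers), start=1):
    print(f"Month {month}: {c} new customers, conversion rate {c / v:.1%}")

# Month 1: 500 new customers, conversion rate 5.0%
# Month 2: 800 new customers, conversion rate 4.0%
# Month 3: 1000 new customers, conversion rate 2.5%
```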

The one thing I do have a problem with is the lean emphasis on the “One Metric That Matters”. This is meant to be a single piece of information that drives everything in your venture. I can see how concentrating on the key MI would be advisable, especially for an early-stage business, but I think the idea is taken a bit too far.

You can absolutely go too far in the other direction as well. I was once tasked with completely rebuilding a portfolio credit risk MI pack. The old one had grown over the years, and was now so big that it took two weeks to produce (each month!), and it was suspected that nobody even looked at half the pages. So I started scoping out the replacement: what were the best parts of the old pack to keep (everyone had their favourites), and what new information should be added (everyone had lots of ideas). And of course, very quickly the planned new pack became even bigger than the old one.

So by all means have your one main metric, but do keep an eye on other relevant data as well. You won’t get swamped by a well thought-out set of supporting metrics. If, for example, you are looking at active users two months after registration as your main figure, then make sure you also know your early indicators: page views, registrations, actives at one month, and so on.
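
To make that concrete, here is a small sketch (metric names and values are hypothetical) of keeping the main metric front and centre while still tracking its early indicators:

```python
# Hypothetical metric names and values, for illustration only.
metrics = {
    "active_at_2_months": 1_200,    # the "one metric that matters"
    # early indicators feeding into it
    "page_views": 250_000,
    "registrations": 8_000,
    "active_at_1_month": 2_100,
}

main = "active_at_2_months"
print(f"Main metric - {main}: {metrics[main]:,}")
for name, value in metrics.items():
    if name != main:
        print(f"  early indicator - {name}: {value:,}")
```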


Big open data. For good too!

Nesta’s Big and open data for the common good event this week coincided with the publication of their new report, Data for good. The event brought together many of the contributors to the report, with some interesting projects applying “big data” methods to third sector data and issues. The report is well worth reading.

The most interesting discussions at the event were not about statistical methods, though. One question was about the infrastructure needed to use all the big and open data becoming available: who is going to fund and build it? And what format will it be in? If the data is too complex to obtain and process, it won’t really be available “for all”. Very good points.

There are also still unsolved issues with anonymisation, especially with health data, which is of special interest to me at the moment. A proper debate needs to take place about the ethics of sharing and using public data, often for commercial purposes. Something else I have been pondering is the use of Twitter data. Yes, it is an interesting data source, but there must be a danger that it will be used too much, simply because it is so easy to get hold of. For more on this topic, Emma Uprichard has interesting things to say from a social science perspective, in a paper with the great title Big data, little questions?

Big data and small

We live in exciting times for sure. “Big data” (enormous databases and methods of analysing them) is creating all kinds of new knowledge. So I’m not saying that it’s all hype, and I did for example enjoy reading Kenneth Cukier and Viktor Mayer-Schönberger’s book Big Data.

But there sure is a lot of hype around as well. One particular meme I’m not so keen on is the claim that we now live in a whole new “N = all” world, where statistics is no longer needed, since we can simply check from the data exactly how many x are y (e.g. how many people who live in London bought something online last month, or some other figure that in the past we would have had to estimate from a sample). Yes, a lot of information like this is now easily available, and the big data advocates have many cool anecdotes to tell. And Google probably knows more about us than we do ourselves.

One obvious situation where old-fashioned statistical inference will be needed for some time still is medical research. Say you’re developing a new drug. You will need to do your phase 1, 2, and 3 trials just as before, and convince people at each stage that it’s safe to carry on. Unless you can somehow feed your new prototype drug to everyone in the world, record the outcomes in your data lake, and do your data mining? And there are surely many other situations like it, outside of academia as well. One of my previous jobs was on bank stress testing, which requires econometric modelling using very limited data sets and, yes, plenty of statistical inference.
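
For a flavour of what that inference looks like in practice, here is a minimal sketch (with made-up trial numbers) of estimating a response rate and its uncertainty from a limited sample, which is exactly the step an “N = all” mindset skips over:

```python
import math

# Made-up trial numbers for illustration: we only ever observe a sample,
# so we estimate the response rate and quantify the uncertainty around it.
n_patients = 200     # hypothetical phase 2 trial size
n_responded = 124    # hypothetical number of patients who responded

p_hat = n_responded / n_patients
se = math.sqrt(p_hat * (1 - p_hat) / n_patients)         # standard error
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se   # ~95% normal-approx CI

print(f"Estimated response rate: {p_hat:.1%} (95% CI {ci_low:.1%} to {ci_high:.1%})")
```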

I suppose in terms of the hype cycle, we are still at the initial peak of inflated expectations. And eventually all of these new methods will find their place in the great toolbox of data analytics. Right next to the boring old regression models, and the slightly less old and never boring decision trees and neural networks.

Big health data

Last week I attended the Operational Research Society’s Data Science: The Final Frontier – Health Analytics event (hashtag: #bighealth) at Westminster Uni. Two of the six presentations were worth noting.

Cono Ariti from The Nuffield Trust spoke about predictive risk modelling in health care. He mentioned the “Kaiser pyramid”, which is essentially the old 20/80 rule, slightly refined: 3% of patients account for 45% of health care costs, and the next 13% are responsible for a further 33%. Added up, that is approximately 20/80!
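
Just to make the arithmetic explicit, here is a tiny sketch adding up the two tiers he quoted:

```python
# The two top tiers of the "Kaiser pyramid" as quoted in the talk.
tiers = [
    (0.03, 0.45),   # top 3% of patients -> 45% of costs
    (0.13, 0.33),   # next 13% of patients -> a further 33% of costs
]

patients = sum(share for share, _ in tiers)
costs = sum(cost for _, cost in tiers)
print(f"{patients:.0%} of patients account for {costs:.0%} of costs")
# 16% of patients account for 78% of costs -- roughly the 20/80 rule
```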

And he made two important points to keep in mind with health analytics. First, just building a model is useless without corresponding interventions in place: if you identify patient segments, say, you also need to have suitable treatments available for them. Second, regression to the mean is a major issue in this area: many people get better by themselves, without any treatment at all. This complicates comparisons between treatment and control (or no-treatment) groups, since a large proportion of patients in every group may improve significantly, leaving any true differences between the groups small and difficult to detect.
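
A quick simulation makes the regression-to-the-mean point vivid. In this sketch (entirely simulated data, with no treatment effect anywhere) patients are enrolled because a noisy baseline measurement is high, and the follow-up measurement looks like an improvement all by itself:

```python
import random

random.seed(42)

results = []
for _ in range(10_000):
    true_severity = random.gauss(50, 10)            # patient's underlying state
    baseline = true_severity + random.gauss(0, 10)  # noisy first measurement
    followup = true_severity + random.gauss(0, 10)  # noisy second measurement, no treatment
    if baseline > 65:                               # enrolled because the baseline looks bad
        results.append((baseline, followup))

mean_baseline = sum(b for b, _ in results) / len(results)
mean_followup = sum(f for _, f in results) / len(results)
print(f"Mean score at baseline:  {mean_baseline:.1f}")
print(f"Mean score at follow-up: {mean_followup:.1f}  (apparent improvement, no treatment)")
```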

The second interesting talk was more of a blue-sky horizon scan, from Rob Smith at IBM. He talked about the future of health analytics, noting the differences between people of different ages when it comes to tech, gadgets, and privacy, and the consequent differences in health behaviours. He also talked a bit about the data issues around genomics, and more about what IBM is doing with Watson. For example, Watson gets fed as much medical literature as possible, so that it can not only propose treatments to match symptoms, but also suggest new research avenues. Very impressive stuff, and potentially useful in areas like cancer treatment, which is getting very complex. So much so, in fact, that my conclusion was to ask whether artificial intelligence is now the only thing clever enough to handle modern medicine?