Data Presentation

May 08, 2008

Scatterplots—they're really not that hard!

The intermittently good blog Junk Charts has another post highlighting the danger of thinking that you can analyse the relationship between two variables simply by plotting them next to each other on a line chart.

The problem is that line charts really don't give you the ability to tell if there is a relationship or not, but your brain may well con you into believing that they do because of our tendency to see patterns (even if they're not there). Take this chart:

[Image: line chart of the two variables]

Is there a relationship between the two lines? When you've made up your mind click here to see a scatter plot of the same data. It's a lot easier to tell, isn't it?

Why are people so afraid of scatterplots? They're one of the most useful basic tools in any analyst's toolkit, and they're really a very simple idea—essentially a scatterplot is just a map. If you don't feel comfortable with them perhaps it's time you learnt to be!
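To make the "scatterplot is just a map" idea concrete, here is a minimal Python sketch. The two series are invented for illustration (they are not the data behind the chart above), and `pearson_r` is a hypothetical helper name. Each pair of values becomes one point on the map, and the correlation coefficient summarises how strongly the points line up.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical monthly figures, invented for illustration only.
series_a = [52, 55, 51, 58, 60, 57, 63, 61]
series_b = [31, 30, 34, 29, 33, 35, 30, 32]

# A scatterplot is just these (x, y) pairs drawn as points on a map.
points = list(zip(series_a, series_b))
r = pearson_r(series_a, series_b)

print(points)
print(f"correlation: {r:.2f}")
```

Two lines that both drift upwards on a line chart can still produce a shapeless cloud of points here, which is exactly what the line chart hides.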

April 16, 2008

Show me the data

Nicholas Bissantz has another great post over at his blog. He returns to a favourite theme of his (and mine)—simplifying data for greater clarity is all very well, but sometimes it's best to show the actual numbers.

Averages can hide a wide range of actual performance. With large data sets, like surveys, we address that with statistical tools such as variance, standard deviation and confidence intervals. But what do you do when dealing with data that isn't statistical?

Bissantz suggests graphing the data points as well as the average, a technique that I've found very useful with small samples:

[Image: chart showing the individual values scattered around the average]

His example makes the point nicely—three very different real-world situations that would produce the same average, and therefore the same conclusion for an unwary analyst.
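A quick Python sketch of the same point, using three small made-up samples (not Bissantz's data) that share an average of 70 but look nothing alike once you list the actual values:

```python
from statistics import mean, pstdev

# Three invented samples, chosen so that each one averages exactly 70.
groups = {
    "tight":  [69, 70, 70, 71, 70],
    "spread": [50, 60, 70, 80, 90],
    "split":  [40, 40, 70, 100, 100],
}

# Summarise each group: average, range and spread of the actual values.
summary = {
    name: {
        "mean": mean(values),
        "low": min(values),
        "high": max(values),
        "spread": pstdev(values),
    }
    for name, values in groups.items()
}

for name, stats in summary.items():
    print(
        f"{name:<7} mean={stats['mean']:.1f} "
        f"range={stats['low']}-{stats['high']} "
        f"spread={stats['spread']:.1f} values={groups[name]}"
    )
```

Printing the values next to the mean is the text equivalent of plotting the points alongside the average line.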

March 28, 2008

Significance testing is soooo 20th century

I'm just in the middle of an excellent book called Beyond Significance Testing. It's one of an increasing number of works to point out the flaws in both the fundamental theory and conventional usage of the "classic" significance test.

Let's have a quick look at how significance testing works. Imagine we want to see if there is a significant difference between the Satisfaction Index for men and women in the situation below:

[Image: Satisfaction Index for men and women]

How do we know if it's significant? The first thing is to work out what question we're asking, which is "does this difference exist in the population as well as in the sample?". Then we create a null hypothesis, which is that there is no difference between men and women. Finally we conduct a hypothesis test, a t-test in this case, which tells us the chances of finding a difference at least as big as this in the sample, assuming the null hypothesis is true (that there's no difference in the population). This value, the p-value, conventionally needs to be less than 5% for us to reject the null hypothesis...telling us that a difference this big would be unlikely to turn up in the sample if there were no difference between men and women in the population (note that it doesn't tell us how big the difference is!). Clear as mud? I thought so.
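For anyone who wants to see the mechanics, here is a rough Python sketch of the test described above. The means, standard deviations and sample sizes are invented, `two_sample_test` is a hypothetical helper name, and the sketch swaps the exact t distribution for a large-sample normal approximation so that it runs on the standard library alone.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (z, two-sided p-value) for the difference between two means.

    Assumption: samples are large enough for a normal approximation,
    so this is not the exact t-test a stats package would run.
    """
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / standard_error
    # Chance of a difference at least this big if the null hypothesis is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented summary statistics, for illustration only.
z, p_value = two_sample_test(82.4, 14.5, 500, 79.3, 15.0, 520)

print(f"z = {z:.2f}, p = {p_value:.4f}")
print("reject the null at 5%" if p_value < 0.05 else "cannot reject the null at 5%")
```

Notice what comes out: a yes/no verdict and a probability, but nothing about how big the difference is likely to be in the population.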

So if this approach has flaws, which it does, what should we do instead? The alternative is to look at effect size and margin of error. In other words, instead of asking our stats package for an obscure p-value, let's ask it how big the difference is likely to be in the population...which is what we're really interested in.

One quick way to do this is to plot the margin of error, or confidence interval, of the Satisfaction Indexes straight onto the chart. This has a number of advantages: it is much easier to explain and understand, it illustrates the uncertainty of the measurement, and it captures the size of the difference rather than a simple "significant" or "not significant".

[Image: Satisfaction Index chart with confidence intervals]

A better test of a specific question like this is to work out the margin of error of the difference, which in this case is 3.1 ± 1.8. In other words, we can conclude with 95% confidence that the difference between men and women in the population is between 1.3 and 4.9.
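A margin of error for a difference can be sketched in a few lines of Python. The summary statistics below are invented (the post does not give the underlying standard deviations or sample sizes) and were simply picked so the result lands close to the figure quoted above. `difference_interval` is a hypothetical helper name, and the calculation assumes two independent, reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def difference_interval(mean_a, sd_a, n_a, mean_b, sd_b, n_b, confidence=0.95):
    """Return (difference, margin of error) for two independent sample means."""
    difference = mean_a - mean_b
    # Standard error of a difference between two independent means.
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    # About 1.96 for 95% confidence (normal approximation).
    z_critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    return difference, z_critical * standard_error

# Invented summary statistics, for illustration only.
difference, margin = difference_interval(82.4, 14.5, 500, 79.3, 15.0, 520)

print(f"difference: {difference:.1f} +/- {margin:.1f}")
print(f"95% interval: {difference - margin:.1f} to {difference + margin:.1f}")
```

If the interval excludes zero, a conventional significance test at the same level would also reject the null, so nothing is lost by reporting the interval instead.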

Much more powerful, and much easier to understand, than null hypothesis significance testing, isn't it?

February 29, 2008

Customer Retention at Netflix

There's a great article on the Wired website about the effort that Netflix (an online movie rental company in the US) has put into improving its Cinematch algorithm. This is the bit of their website that says "if you liked this movie we suggest these movies as well". A tiny percentage increase in this algorithm's accuracy means fewer angry customers (because they haven't had what they perceive as a 'stupid' suggestion), and that in turn means improved customer retention and loyalty. Read the full article here.

December 20, 2007

Classic charts...and beer

Interesting article in the Economist looking at three of "history's best" charts.

Ever since Tufte published The Visual Display of Quantitative Information*, Charles Minard's graphic of Napoleon's march on Moscow has been the standard example of a thoughtful and compelling "infographic". I even use it in my course...despite its somewhat tangential relationship to Customer Satisfaction Measurement!

Playfair is a fair choice too. He invented, for better or worse, many of the graphic forms that we are familiar with today, including the bar chart, line chart and even the dreaded pie chart. The Economist shows an early attempt to make political mileage out of charts.

The third person featured is Florence Nightingale, who is sometimes thought to have invented the pie chart, but didn't. She did come up with the "Nightingale Rose", or polar area chart, which is the one covered by the Economist. Frankly this is an odd choice, as it is not one of history's best charts by any means. Nonetheless the outcome of the analysis was of great importance, forcing a review of the sanitary conditions of army barracks and hospitals at a time when disease killed far more soldiers than enemy action.

For similar reasons, my preferred third choice would have been John Snow's map of cholera deaths in Soho.

[Image: John Snow's map of cholera deaths in Soho]

A map that forced the closure of a lethally infected water pump (the Broad Street pump) and finally began to convince people that cholera was water-borne and not spread by smell.

Next time you're in Soho or Carnaby Street, find your way to Broadwick Street, as Broad Street is now known, and have a pint in the John Snow pub. It seems an ideal way to commemorate the closing of an era in which drinking beer was safer than drinking water.

* You can read our review of VDQI here [PDF]