Pages

Monday, March 25, 2013

GeoSpatial Analytics, Microsoft BI & John Snow’s Cholera Map

Throughout my childhood, I have always been fascinated by the idea of maps. It could be attributed to the number of pirate and treasure hunter stories that I used to voraciously devour. My dreams were filled with the protagonists in those books and their endless struggles to decipher the coded information in the maps. What used to strike me was that the location of the treasure was always present in the maps but still countless men were misled by them. Most of the times, it took the right person or the right context to get to the bottom of the map. As I grew older, my association with pirate / treasure hunter stories decreased (apart from watching the occasional Pirates of the Caribbean releases) but my bond with maps just went on getting stronger. Though my friends might joke that the reason for this is because I am so spatially disoriented (I still use a navigation system to travel from my home to the office which is just a mile away for fear of getting lost), the real rationale is because I see a lot of value in geospatial analytics as a BI consultant. In today’s world, when more and more product companies are bringing forth tools to easily analyse location based data, the excuses for not trying to analyse and explore geographic patterns in your business data are running dry. The Microsoft BI platform has a very able set of tools for geospatial analysis which includes SQL Server Reporting Services, Power View and Geo Flow (this is without including some of the map apps that are present in Office 2013) and helps me a lot in evangelizing geospatial analytics amongst my clients. While I was reading the news, I came to know that the last week was the 200th birth anniversary of one of the pioneers of geospatial analysis as well as the father of modern epidemiology – John Snow. I decided to pay my homage to him by following his adventure and reproducing his famous 1854 Broad Street Cholera Outbreak map using Microsoft BI.

GeoSpatial Analytics, Microsoft BI & John Snow’s Cholera Map

Before we start, it might be interesting to set the context to the 19th century London. London was evolving into a big city with a large influx of people, but without the modern day facilities and sanitary services. Houses used to have cesspools below their basements and human as well as animal wastes were accumulated there. Initially, people seemed to forget about the cesspools below and somehow thought that the wastes would get disposed magically, till the cesspools started overrunning. That is when the London government decided to dispose the waste into River Thames (which was also the source for water supply). You can imagine London at that time – stinking with all the human and animal wastes and it was no wonder that diseases were rampant at that time. The reason for the diseases were all attributed to the pollution and foul air emanating from those wastes. John Snow was a big sceptic of this theory though he had no evidence to prove otherwise. That is around when there was a major outbreak of Cholera in the Soho district of London and he decided to investigate it further. I am not 100% sure of how he actually did his investigation, but I am pretty sure that he must have at least gone through some variations (if not exact) of the four main sections in any treasure-hunting plot :-

1) Collecting the Data (or acquiring the Map!)
The first section of most treasure hunter stories weave through the part where the protagonist goes through shady alleys and dilapidated shops to find the treasure map from some ancient hawker. Likewise, John Snow might have gone through the disease laden streets to pick up the statistics on the cholera deaths and their locations. Luckily for me, I just had to search the internet and got the data (thanks to Robin Wilson). You can download a copy of the data from here. I did some simple changes to the source data to extract the longitude and the latitude columns from the geometry column as shown below (pro tip: use the Flash Fill feature of Excel 2013).

Cholera death stats

2) Exploring the data patterns (or deciphering the Map!)
The plot continues where the protagonist tries to make sense out of the map. Treasure maps are never straightforward and he takes a lot of time to see whether there are any hidden clues or markers in the map. Similarly, John Snow must have spent a lot of time studying the statistics as well as exploring the data visually for geographic patterns. I decided to visualize the data using Power View initially to check for any patterns.

Cholera death visualization using Power View

Just as the protagonist tries different approaches, I decided to also use GeoFlow to visualize the same data.

Cholera Deaths layer 

The heatmap visualization in GeoFlow indicated something that Power View didn’t show that explicitly - unusually large number of deaths around the red area.

3) Investigating the Outliers / Patterns (or Validating the potential Treasure locations)
This is one of the most crucial part of the story, where the protagonist tries to cross check and re-validate the patterns or code that he has found within the map. This is what is going to distinguish him from the countless men that have lost their lives in search of the treasure. He makes sure that the clues that he has got from the map is not a red-herring. This is what John Snow did too, as is his evident from his writings – “On proceeding to the spot, I found that nearly all the deaths had taken place within a short distance of the [Broad Street] pump. There were only ten deaths in houses situated decidedly nearer to another street-pump. In five of these cases the families of the deceased persons informed me that they always sent to the pump in Broad Street, as they preferred the water to that of the pumps which were nearer. In three other cases, the deceased were children who went to school near the pump in Broad Street...

With regard to the deaths occurring in the locality belonging to the pump, there were 61 instances in which I was informed that the deceased persons used to drink the pump water from Broad Street, either constantly or occasionally...”. As for all of us, we already know the cause and hence all I had to do currently was to display the pumps also as another layer in GeoFlow and to see the correlation of deaths and proximity of pump.

Cholera Deaths layer with the Pumps layer

Isn’t it amazing to see how accurately GeoFlow has plotted the problem causing Broad Street pump right in the area of the maximum deaths!

4) Visualizing and Sharing the Results (or finding the Treasure!)
This is the final part of the story where the protagonist undertakes the arduous journey to find (and hopefully claims) the treasure. The journey is not easy and there are every chance that he might be misled like the countless many before him. Similarly, in geospatial analysis, it is very easy to get carried away and share wrong results.

xkcd

However, John Snow’s map (shown below) was very instrumental in convincing the London authorities to shut down the problem causing Broad Street Pump and history says that this helped in containing the outbreak.

File:Snow-cholera-map-1.jpg

If done correctly, geospatial analysis is extremely powerful and with technologies like Power View and GeoFlow, this can be done pretty easily and quickly. If you are interested in geospatial analysis or Business Analytics in general, it’s still not late to register for the PASS BA Conference happening on Apr 10-12, 2013. And if you do register, don’t forget to attend my session on GeoSpatial Analytics using Microsoft BI.

GeoSpatial Analytics using Microsoft BI

You can download the completed Excel 2013 file with the Power View and GeoFlow visualizations from here.

8 comments:

  1. Great article (as usual) - Maybe if you moved up a level geospatially (Use zip code/post code instead of co-ordinates) then you could use bubble size to predict the affected areas a little better.
    The advantage to using Power View over GeoFlow is the fact we can put in in SharePoint, although hopefully GeoFlow will get there as well.

    ReplyDelete
    Replies
    1. An issue with using aggregators like Zip Code for detailed analysis like this is that it can be highly misleading. Consider a scenario where there are 5 zip codes (with very small areas) within the prime red spot, and just a couple of zip codes everywhere else. So it might look like there are more deaths in the exterior zip codes though the reason is just that the area is much larger for the 2 zip codes than the 5 zip codes put together. The moment we start aggregating it, we need to add further information like population density, area coverage, etc so that it will not be skewed. But on a high level analysis, it is always a great idea to add aggregators. And yes, hopefully GeoFlow gets to SharePoint soon!

      Delete
  2. A good post, Jason. Loved it. Thanks for writing such posts, makes me feel educated better way.

    Regards.

    ReplyDelete
  3. Very nice post, Jason.
    I find it highly useful to explain to my business users what you can do with BI.
    Thank you for sharing this with us.
    Regards,
    Anders Friis

    ReplyDelete
  4. What a cool way to display this data. Thanks for a great post. I have been fascinated with this map since seeing it in Tufte's Visual Display of Quantitative Information. I found another recent review of the data here http://www.guardian.co.uk/news/datablog/2013/mar/15/john-snow-cholera-map
    Looking forward to your GeoFlow presentation!
    -Katherine Fraser

    ReplyDelete
    Replies
    1. I made the visualizations much before the Guardian post, as I was preparing it for my BA Conference session. But I must admit that the Guardian post is what made me realize that it was the 200th birth anniversary of Dr John Snow, which in turn prompted this post! :)

      Delete
  5. Hi,

    Great post. I have a question. Do you require connection to Bing maps to be able to display the map, or can you use say a static map image that you have?

    ReplyDelete
    Replies
    1. You would need internet connection for the Bing Maps to be displayed in both Power Map (formerly called GeoFlow) as well as Power View.

      Delete