Pages

Monday, March 25, 2013

GeoSpatial Analytics, Microsoft BI & John Snow’s Cholera Map

Throughout my childhood, I have always been fascinated by the idea of maps. It could be attributed to the number of pirate and treasure hunter stories that I used to voraciously devour. My dreams were filled with the protagonists in those books and their endless struggles to decipher the coded information in the maps. What used to strike me was that the location of the treasure was always present in the maps but still countless men were misled by them. Most of the times, it took the right person or the right context to get to the bottom of the map. As I grew older, my association with pirate / treasure hunter stories decreased (apart from watching the occasional Pirates of the Caribbean releases) but my bond with maps just went on getting stronger. Though my friends might joke that the reason for this is because I am so spatially disoriented (I still use a navigation system to travel from my home to the office which is just a mile away for fear of getting lost), the real rationale is because I see a lot of value in geospatial analytics as a BI consultant. In today’s world, when more and more product companies are bringing forth tools to easily analyse location based data, the excuses for not trying to analyse and explore geographic patterns in your business data are running dry. The Microsoft BI platform has a very able set of tools for geospatial analysis which includes SQL Server Reporting Services, Power View and Geo Flow (this is without including some of the map apps that are present in Office 2013) and helps me a lot in evangelizing geospatial analytics amongst my clients. While I was reading the news, I came to know that the last week was the 200th birth anniversary of one of the pioneers of geospatial analysis as well as the father of modern epidemiology – John Snow. I decided to pay my homage to him by following his adventure and reproducing his famous 1854 Broad Street Cholera Outbreak map using Microsoft BI.

GeoSpatial Analytics, Microsoft BI & John Snow’s Cholera Map

Before we start, it might be interesting to set the context to the 19th century London. London was evolving into a big city with a large influx of people, but without the modern day facilities and sanitary services. Houses used to have cesspools below their basements and human as well as animal wastes were accumulated there. Initially, people seemed to forget about the cesspools below and somehow thought that the wastes would get disposed magically, till the cesspools started overrunning. That is when the London government decided to dispose the waste into River Thames (which was also the source for water supply). You can imagine London at that time – stinking with all the human and animal wastes and it was no wonder that diseases were rampant at that time. The reason for the diseases were all attributed to the pollution and foul air emanating from those wastes. John Snow was a big sceptic of this theory though he had no evidence to prove otherwise. That is around when there was a major outbreak of Cholera in the Soho district of London and he decided to investigate it further. I am not 100% sure of how he actually did his investigation, but I am pretty sure that he must have at least gone through some variations (if not exact) of the four main sections in any treasure-hunting plot :-

1) Collecting the Data (or acquiring the Map!)
The first section of most treasure hunter stories weave through the part where the protagonist goes through shady alleys and dilapidated shops to find the treasure map from some ancient hawker. Likewise, John Snow might have gone through the disease laden streets to pick up the statistics on the cholera deaths and their locations. Luckily for me, I just had to search the internet and got the data (thanks to Robin Wilson). You can download a copy of the data from here. I did some simple changes to the source data to extract the longitude and the latitude columns from the geometry column as shown below (pro tip: use the Flash Fill feature of Excel 2013).

Cholera death stats

2) Exploring the data patterns (or deciphering the Map!)
The plot continues where the protagonist tries to make sense out of the map. Treasure maps are never straightforward and he takes a lot of time to see whether there are any hidden clues or markers in the map. Similarly, John Snow must have spent a lot of time studying the statistics as well as exploring the data visually for geographic patterns. I decided to visualize the data using Power View initially to check for any patterns.

Cholera death visualization using Power View

Just as the protagonist tries different approaches, I decided to also use GeoFlow to visualize the same data.

Cholera Deaths layer 

The heatmap visualization in GeoFlow indicated something that Power View didn’t show that explicitly - unusually large number of deaths around the red area.

3) Investigating the Outliers / Patterns (or Validating the potential Treasure locations)
This is one of the most crucial part of the story, where the protagonist tries to cross check and re-validate the patterns or code that he has found within the map. This is what is going to distinguish him from the countless men that have lost their lives in search of the treasure. He makes sure that the clues that he has got from the map is not a red-herring. This is what John Snow did too, as is his evident from his writings – “On proceeding to the spot, I found that nearly all the deaths had taken place within a short distance of the [Broad Street] pump. There were only ten deaths in houses situated decidedly nearer to another street-pump. In five of these cases the families of the deceased persons informed me that they always sent to the pump in Broad Street, as they preferred the water to that of the pumps which were nearer. In three other cases, the deceased were children who went to school near the pump in Broad Street...

With regard to the deaths occurring in the locality belonging to the pump, there were 61 instances in which I was informed that the deceased persons used to drink the pump water from Broad Street, either constantly or occasionally...”. As for all of us, we already know the cause and hence all I had to do currently was to display the pumps also as another layer in GeoFlow and to see the correlation of deaths and proximity of pump.

Cholera Deaths layer with the Pumps layer

Isn’t it amazing to see how accurately GeoFlow has plotted the problem causing Broad Street pump right in the area of the maximum deaths!

4) Visualizing and Sharing the Results (or finding the Treasure!)
This is the final part of the story where the protagonist undertakes the arduous journey to find (and hopefully claims) the treasure. The journey is not easy and there are every chance that he might be misled like the countless many before him. Similarly, in geospatial analysis, it is very easy to get carried away and share wrong results.

xkcd

However, John Snow’s map (shown below) was very instrumental in convincing the London authorities to shut down the problem causing Broad Street Pump and history says that this helped in containing the outbreak.

File:Snow-cholera-map-1.jpg

If done correctly, geospatial analysis is extremely powerful and with technologies like Power View and GeoFlow, this can be done pretty easily and quickly. If you are interested in geospatial analysis or Business Analytics in general, it’s still not late to register for the PASS BA Conference happening on Apr 10-12, 2013. And if you do register, don’t forget to attend my session on GeoSpatial Analytics using Microsoft BI.

GeoSpatial Analytics using Microsoft BI

You can download the completed Excel 2013 file with the Power View and GeoFlow visualizations from here.

Tuesday, March 12, 2013

Unpivoting Data in Data Explorer

Data Explorer is generating a lot of positive buzz within the community and everyone I speak to seems to be really excited about this. (If you are wondering what Data Explorer is, you might want to quickly catch up on my previous post - Introduction to Data Explorer Preview for Excel). And just today, the Data Explorer team announced A new build of “Data Explorer”, and an Auto Update feature (must say I really like the idea of Update button). This was followed by a post from Jamie Thomson aka SSIS Junkie (blog | twitter) on the query language ( M ) in Data Explorer and you can read more about that here. All this made me really excited and I also decided to contribute something through this post.

Unpivot rows in data explorer

It is a pretty common requirement to unpivot data, especially when you are scourging the net for open data. Data Explorer currently does not have any feature to unpivot data but the query language looked pretty solid and I decided to give it a try (actually, this exercise started as a way to test the new query editing functionality in Data Explorer). Well, I did succeed though it might not look that easy (if you do find an easier way, please share it!). Read on for the solution:-

1) The source for my post is given below

source data

I just made a simple table in the excel spreadsheet which gives the Sales by State for the years 2010 to 2012. I then click on the From Table option in Data Explorer tab to get the query window as shown below.

Import data from excel

2) I add a new column called JCol with a value of 1 using the expression below

= Table.AddColumn(Excel.CurrentWorkbook(){[Name="Table37"]}[Content], "JCol", each 1)

Add join column to first table 

Now this is my first table, and I store the expression with me.

3) My next objective is to make a transpose of the original table. This can be done with the expression below

= Table.PromoteHeaders(Table.Transpose(Table.DemoteHeaders(Excel.CurrentWorkbook(){[Name="Table37"]}[Content])))

Transpose table

4) Now we need to rename the State Column to Year.

= Table.RenameColumns(Table.PromoteHeaders(Table.Transpose(Table.DemoteHeaders(Excel.CurrentWorkbook(){[Name="Table37"]}[Content]))), {"State","Year"})

Rename column

5) Now, we need to add a column to this resultant table called JCol (just as we did to the first table in Step 2).

= Table.AddColumn(Table.RenameColumns(Table.PromoteHeaders(Table.Transpose(Table.DemoteHeaders(Excel.CurrentWorkbook(){[Name="Table37"]}[Content]))), {"State","Year"}), "JCol", each 1)

Add join column to second table

6) You might be wondering at this stage why we created the JCol column in both the tables. The reason is that we need to cross join both of these tables so that we get extra rows, and the join column is going to be JCol. Since the values for the join column are all the same, we get a cross join. To do the join of both the tables, delete the existing expression and enter the expression below

= Table.Join(Table.AddColumn(Excel.CurrentWorkbook(){[Name="Table37"]}[Content], "JCol", each 1), "JCol", Table.AddColumn(Table.RenameColumns(Table.PromoteHeaders(Table.Transpose(Table.DemoteHeaders(Excel.CurrentWorkbook(){[Name="Table37"]}[Content]))), {"State","Year"}), "JCol", each 1), "JCol")

Join both the table expressions

7) Now add a new calculated column which will give you the value in NC column if State is NC, SC column if State is SC and FL column if State is FL.

Add custom column for Sales

The expression for the calculated column is given below

if [State] = "NC" then [NC] else if [State]= "SC" then [SC] else [FL]

custom column expression

Now your result should look like shown below

end result of custom calculation

8) Now all you have to do is to rename the Custom column to Sales and hide the unnecessary columns. Then you will be able to see the unpivoted data as shown below

rename and hide unnecessary columns

9) Also, have a look at how the actual query looks like in the advanced query editor

Advanced query editor

I would like to see a word-wrap option for the advanced query editor so that I don’t have to scroll over to the right to see long formulas. Apart from that, looks great! Don’t forget to share your comments as well as your posts / experiences with Data Explorer and it’s query language.

Update

I got quite some requests asking for the actual query. So here it is

let
    Source = Table.Join(Table.AddColumn(Excel.CurrentWorkbook(){[Name="Table37"]}[Content], "JCol", each 1), "JCol", Table.AddColumn(Table.RenameColumns(Table.PromoteHeaders(Table.Transpose(Table.DemoteHeaders(Excel.CurrentWorkbook(){[Name="Table37"]}[Content]))), {"State","Year"}), "JCol", each 1), "JCol"),
   InsertedCustom = Table.AddColumn(Source, "Custom", each if [State] = "NC" then [NC] else if [State]= "SC" then [SC] else [FL]),
    RenamedColumns = Table.RenameColumns(InsertedCustom,{{"Custom", "Sales"}}),
    HiddenColumns = Table.RemoveColumns(RenamedColumns,{"2010", "2011", "2012", "JCol", "NC", "SC", "FL"})
in
    HiddenColumns

Remember to change the name of the table name when you use it.

Monday, March 4, 2013

Introduction to Data Explorer Preview for Excel

It’s just been a couple of days since this add-in has been released and I can’t seem to stop using it. I guess I haven’t been this excited about a Microsoft offering since SSRS 2008 R2 (well, I am really passionate about Tabular and PowerPivot now but I wasn’t that hooked onto those technologies when they were released). As a BI consultant, blogger and speaker, I traverse through lots and loads of open data - sometimes trying to get that extra piece of open data which will add value to my clients’ existing data or to get some interesting stuff for my blogs and presentations. Data Explorer might not be the perfect solution, but it sure does a splendid job of reducing my efforts in searching, shaping and preparing my data. I thought of sharing my experiences in the form of a quick introduction to Data Explorer Preview for Excel.

Introduction to Data Explorer Preview for Excel

1) What is Data Explorer Preview for Excel and where can I get it from?

Microsoft “Data Explorer” Preview for Excel is an add-in for Excel 2013 which provides an intuitive user interface for data discovery, data transformation and enrichment. You can download the add-in from here.

2) Basic Tutorial for Data Explorer

You can import data into excel from a wide variety of sources using Data Explorer and the complete list of sources is given here. For the purpose of this demo, I am going to be connecting to my favourite source which is the ‘web page’ source.

a) After installing the add-in, you should be able to see the Data Explorer tab in Excel (if not, go to File—>Options—>Add-Ins—>Com Add-Ins and enable the Data Explorer add-in). Click on the From Web option and enter the url - http://www.nuforc.org/webreports/ndxloc.html as shown below.

Import From Web

b) Now click on Table 0 in the Navigator pane and you should be able to see the UFO sightings by states. The table has some non-US data and so to filter them, click on the dropdown in the reports column and unselect them. Click on OK when you are done.

Unselect unnecessary data

c) Rename the columns as States and UFO Sightings respectively. Notice that the Steps pane on the right shows the modifications you are making and you can expand the pane.

rename columns

d) Click on Done and now you can see the data in Excel. This data can be used as a source for your charts / tables and in the image below, I have used the “Geographic Heat Map” Office app to visualize the same. Looks like the aliens have taken a liking for California!

Data visualized using Heat Map

3) Interesting Features

a) Online Search:- I spend a lot of time trying to search for open datasets and this feature is surely going to reduce that time. You can search for data right from Excel by clicking on the Online Search button. For eg, if I search for ‘richest states’, the results as shown below.

Online Search option

Just scroll your mouse over the results and click on Use to add the data.

b) Filter & Shape Data:- The ability to filter and transform your data is what makes this so useful. You can just click on the Filter & Shape button to start the process as shown below.

Filter & shape data

For eg, if you just want to show the income for 2011, you can hide the rest of the columns as shown below.

Hide unnecessary columns

You can also do a lot of other operations like Splitting columns, removing duplicates, replacing values, changing types, group-by, etc.

Types of transforms available

You can read the complete list from here. Hopefully, there will be more additions to this in the future (I could definitely use a Unpivot / Pivot option).

c) Merging & Appending from Multiple Sources:- This is another killer feature in Data Explorer. You can merge or append from multiple sources. For eg, if we have to merge the two sheets that we just created, click on the Merge button and then select the primary and secondary tables from the dropdown.

Merge data from two sources

Then select the matching columns from both the tables and click on Apply.

select matching columns

Now, to display the additional columns, click on the expand icon on the New Column and select the UFO sightings. Click on OK.

Expand columns

Now you have combined both the sources and can use it for your visualization. I have visualized the same data using GeoFlow below.

Data visualized using GeoFlow

4) Further Reading

There’s lot of interesting stuff you can do with Data Explorer and if this post caught your interest, make sure to check the below ones too

1) Data Explorer Team - Announcing Microsoft “Data Explorer” Preview for Excel

2) Chris Webb - Importing Data From Multiple Log Files Using Data Explorer

3) Chris Webb - Calling A Web Service From Data Explorer, Part 1

4) Jamie Thomson - Traversing the Facebook Graph using Data Explorer

5) Matt Masson - Access the Windows Azure Marketplace from Data Explorer

6) Jake Smillie - Best Oscar winning Film? My first Data Explorer adventure…

7) Ian Morrish - SharePoint OData and the Excel Data Explorer

8) Data Explorer Help

Updates

9) Dan English - Installing Data Explorer Preview & Demo with IMDB Data

10) Alan Koo - Introduction to Microsoft Data Explorer Preview for Excel 2013 - Part 1