This document contains monthly price and sales data for four Indian cities from January 2010 to November 2010. It also contains meter reading data from April 2010 to March 2011 for 10 sections of an energy utility. Additionally, it includes information on the social network of Indian programmers based on their follower connections on Github and a network graph of top Indian ODI batsmen. The document explores using data to detect patterns and relationships.
7. DETECTING FRAUD
ENERGY UTILITY
“We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation. Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”
11. FINDING PATTERNS
SECURITIES
Which securities move together?
How should I diversify?
What should I sell to reduce risk?
What’s a reliable predictor of a security?
12. 68% correlation
between AUD & EUR
Plot of 6 month daily
AUD - EUR values
13.
14. PREDICTING MARKS
EDUCATION
What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does the first letter of their name matter?
Does their sun sign matter?
15. Based on the results of the 20 lakh students taking the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200.
June borns score the lowest
The marks shoot up for Aug borns
… and peak for Sep-borns
120 marks out of 1200 explainable by month of birth
An identical pattern was observed in 2009 and 2010…
“It’s simply that in Canada the eligibility
cutoff for age-class hockey is January 1. A
boy who turns ten on January 2, then,
could be playing alongside someone who
doesn’t turn ten until the end of the year—
and at that age, in preadolescence, a
twelve-month gap in age represents an
enormous difference in physical maturity.”
-- Malcolm Gladwell, Outliers
… and across districts, gender, subjects, and class X & XII.
16. EXPLORING RELATIONS
NETWORKS
This is the social network of programmers across various Indian cities, using the follower network at Github.com – a Facebook for developers.
Each circle represents a coder. The size shows their number of followers. The colour shows the language they develop in. The lines show whom they follow.
Data visualisation is about telling picture stories using numbers. In the next few minutes, we’ll cover the history of data visualisations, and some examples I’ve been working on. My name is Anand. More about me at www.s-anand.net
This is a data-generated map of London. Red spots indicate where photos on Flickr were taken. Blue spots indicate Twitter messages. Just with this, you can already see the streets, the river Thames, the tourist spots and the business districts.
One of the best-known early data visualisations was shown by Florence Nightingale to Queen Victoria after the Crimean War. The red shows deaths from wounds. The blue shows deaths from preventable illnesses. The Queen got the point, and the diagram helped make the case for better sanitation in military hospitals.
When cholera struck London, water wasn’t known as the carrier. Dr Snow plotted a map showing cholera incidents along with distance to water pumps. He identified a contaminated pump as the source of the disease, and saved many lives.
That was how data visualisation began. But WHY do we need it? Aren’t numbers obvious? Take a look at the price and sales by city on this table. Every column has the same average, and the same variance. So are they similar?
No. Each city has a very distinct pattern. But it’s not easy to spot this pattern with just the numbers and the averages. If there’s ONE rule you want to remember from this talk, it is: Do NOT trust averages. Always plot the data.
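The table from the slide is not reproduced in these notes. As a stand-in, the sketch below uses Anscombe's quartet, a well-known set of four small datasets whose means, variances and correlations are nearly identical even though they look completely different when plotted. The labels `city_1` to `city_4` are made up for illustration and are not the cities on the slide.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet, used here as a stand-in for the slide's table.
# The "city" labels are hypothetical.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "city_1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "city_2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "city_3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "city_4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
               [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


summaries = {
    name: {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The printed summaries agree to about two decimal places, yet a scatter plot of each dataset shows a line, a curve, a line with one outlier, and a vertical cluster with one outlier. That is the point of the rule: the summary numbers cannot tell these apart.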
One day, a senior electricity board official said to us, “We know our meter readings have a lot of fraud in them. But when we go to the Union, they ask for proof. It’s sure to be there somewhere in our data. But it’s too large, and we don’t know how to analyse it. Can YOU help us?”
We plotted each of the 200 billion readings, and got what looks like a smooth lognormal curve, but with spikes at exactly the slab boundaries. People with a reading of 100 pay bills at a lower rate than those with a reading of 101, so there is an incentive to record a reading right at the boundary.
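A minimal sketch of how such spikes could be picked out, not the analysis we actually ran. It assumes slab boundaries at 100 and 200 units, generates made-up lognormal readings, deliberately moves some readings that sit just above a boundary down onto it, and then compares the count at each boundary with the average count of its neighbours. The function name `spike_ratio` and every parameter value are illustrative.

```python
import random
from collections import Counter

SLABS = [100, 200]  # assumed slab boundaries, for illustration only


def spike_ratio(counts, value, window=5):
    """Count at `value` divided by the mean count of its neighbours."""
    neighbours = [counts.get(value + d, 0)
                  for d in range(-window, window + 1) if d != 0]
    baseline = sum(neighbours) / len(neighbours)
    return counts.get(value, 0) / baseline if baseline else float("inf")


rng = random.Random(0)

# Made-up readings: a smooth lognormal distribution of monthly units.
honest = [round(rng.lognormvariate(4.8, 0.5)) for _ in range(50_000)]

# Simulated tampering: some readings just above a boundary are
# recorded as exactly the boundary value.
recorded = []
for reading in honest:
    for slab in SLABS:
        if slab < reading <= slab + 20 and rng.random() < 0.3:
            reading = slab
            break
    recorded.append(reading)

counts = Counter(recorded)
ratios = {slab: spike_ratio(counts, slab) for slab in SLABS}
control = spike_ratio(counts, 150)  # a value that is not a boundary

for slab, ratio in ratios.items():
    print(f"reading {slab}: {ratio:.1f}x its neighbours")
print(f"reading 150: {control:.1f}x its neighbours")
```

On a smooth curve the ratio stays close to 1, so a boundary value that is several times more common than its neighbours is worth a closer look. The threshold for "suspicious" is a judgment call that depends on the data.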
And this isn’t randomly spread out. There are SPECIFIC people whose meter reading is consistently at the slab boundaries. The first row, for instance, is a famous personality, and her reading shows 200, 200, 200, 200… So do a lot of others’.
We also showed the degree of fraud by geographic sections. Section 1 has very high fraud. Section 5’s fraud fell dramatically in June, and shot back up in September. That happens to be EXACTLY when a particular section manager was transferred in, then out.
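Both views, by person and by section over time, come down to the same measure: the share of readings that land exactly on a slab boundary. The sketch below is one way that could be computed. The records, customer IDs, months and slab values are all invented for illustration.

```python
from collections import defaultdict

SLABS = {100, 200}  # assumed slab boundaries

# Hypothetical records: (customer, section, month, reading)
records = [
    ("C1", 1, "2010-04", 200), ("C1", 1, "2010-05", 200), ("C1", 1, "2010-06", 200),
    ("C2", 1, "2010-04", 100), ("C2", 1, "2010-05", 143), ("C2", 1, "2010-06", 100),
    ("C3", 5, "2010-04", 100), ("C3", 5, "2010-05", 200), ("C3", 5, "2010-06", 187),
    ("C4", 5, "2010-04", 200), ("C4", 5, "2010-05", 100), ("C4", 5, "2010-06", 96),
]


def boundary_share(rows, key):
    """Share of readings exactly on a slab boundary, grouped by `key`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        hits[group] += row[3] in SLABS  # True counts as 1
    return {group: hits[group] / totals[group] for group in totals}


by_customer = boundary_share(records, key=lambda r: r[0])
by_section_month = boundary_share(records, key=lambda r: (r[1], r[2]))

print(by_customer)
print(by_section_month)
```

A customer whose share stays near 1.0 month after month is the "200, 200, 200, 200" pattern. A section whose share drops and later recovers is the kind of change that can be lined up against staff transfers.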
In another example I worked on, a bank approached us and said, “We want to find patterns in currency, stock and commodity prices. Specifically, how do they move with each other? Are there blocks of securities that are related? Can you show it in a visually obvious way?”
This has 19 securities and their correlations with each other. The Australian Dollar and the Euro have a correlation of 68%, and that’s the plot of their values over time. The green indicates a positive correlation. The red indicates a negative one.
You can now see two big blocks of securities. One has the S&P, the FTSE, the BSE, and for some reason, the Pakistani Rupee. The other has the Singapore Dollar, the Japanese Yen, Gold, the Swiss Franc and the Chinese Yuan. Securities within a block move together, but when one block goes up, the other block tends to go down.
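A minimal sketch of the idea behind the grid, assuming made-up data: four hypothetical securities (`A1`, `A2`, `B1`, `B2`) whose daily changes are driven by one shared factor, with the B pair moving against it. It computes pairwise Pearson correlations and then splits the securities into two blocks by the sign of their correlation with a reference security. That split is a deliberately simple stand-in that only works when there are two opposing blocks; it is not the method used to order the grid on the slide.

```python
import random
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(1)

# Made-up daily changes: a shared factor plus independent noise.
factor = [rng.gauss(0, 1) for _ in range(250)]
changes = {
    "A1": [f + rng.gauss(0, 0.5) for f in factor],
    "A2": [f + rng.gauss(0, 0.5) for f in factor],
    "B1": [-f + rng.gauss(0, 0.5) for f in factor],
    "B2": [-f + rng.gauss(0, 0.5) for f in factor],
}

names = sorted(changes)
corr = {a: {b: pearson(changes[a], changes[b]) for b in names} for a in names}

# Split into two blocks by the sign of correlation with a reference.
reference = names[0]
block_with = [n for n in names if corr[reference][n] > 0]
block_against = [n for n in names if corr[reference][n] <= 0]

for a in names:
    print(a, " ".join(f"{corr[a][b]:+.2f}" for b in names))
print("moves with", reference, ":", block_with)
print("moves against", reference, ":", block_against)
```

Colouring each cell of `corr` green for positive and red for negative gives the kind of grid shown on the slide. With more than two blocks, a proper clustering of the correlation matrix would be needed instead of the sign split.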
The Tamil Nadu education department shared with us the marks of every single student over the last 3 years. I tried to see if I could predict their marks. Does gender make a difference? Does subject matter? Also silly things like whether the first letter of the name matters, and whether the sun sign matters.
The sun sign matters a LOT. August borns score a good 10% more than June borns, and this is statistically extremely significant. You can see the same pattern every year, in every district, in every class, in every subject. The reason was clear in retrospect, but I’ll let you guess it.
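The comparison itself is a simple group-by: average the total marks for each birth month and look at the gap between the best and worst months. The sketch below uses a handful of invented records, not the department's data, so the numbers it prints mean nothing beyond showing the calculation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (birth_month, total_marks_out_of_1200)
records = [
    (1, 760), (1, 780),
    (6, 700), (6, 720),
    (8, 800), (8, 820),
    (9, 830), (9, 810),
]

marks_by_month = defaultdict(list)
for month, marks in records:
    marks_by_month[month].append(marks)

# Average marks for each birth month.
average = {month: mean(values) for month, values in marks_by_month.items()}

lowest = min(average, key=average.get)
highest = max(average, key=average.get)
gap = average[highest] - average[lowest]

for month in sorted(average):
    print(f"month {month:2d}: {average[month]:.0f}")
print(f"gap between month {highest} and month {lowest}: {gap:.0f} marks")
```

Repeating the same grouping within each year, district, class and subject is how one would check that the pattern holds everywhere rather than in one slice of the data.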
We plotted the Github follower network across a number of cities. Bangalore has a dense network with a central connected component. Chennai and Pune are not so well connected, but not too bad either. The other cities barely have a network. And you can FEEL this if you visit the cities and talk to geeks.
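One way to put a number on "dense" versus "barely a network" is the share of a city's coders who belong to its largest connected component. The sketch below treats each follow as an undirected link and uses a breadth-first search. The two toy cities and their follow lists are invented, and the function name `largest_component_share` is only illustrative.

```python
from collections import defaultdict, deque


def largest_component_share(coders, follows):
    """Fraction of coders in the largest connected component.

    `follows` is a list of (follower, followed) pairs; direction is
    ignored, so a follow in either direction links two coders.
    """
    neighbours = defaultdict(set)
    for a, b in follows:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    best = 0
    for start in coders:
        if start in seen:
            continue
        # Breadth-first search from an unvisited coder.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            coder = queue.popleft()
            size += 1
            for other in neighbours[coder]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        best = max(best, size)
    return best / len(coders) if coders else 0.0


# Hypothetical cities, for illustration only.
cities = {
    "dense_city": (["a", "b", "c", "d", "e"],
                   [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]),
    "sparse_city": (["p", "q", "r", "s", "t"],
                    [("p", "q")]),
}

shares = {city: largest_component_share(coders, follows)
          for city, (coders, follows) in cities.items()}
print(shares)
```

A share near 1.0 means almost everyone is reachable from everyone else through follows, while a low share means the city's coders mostly sit in isolated pairs or alone.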
Who’s the best Indian one-day batsman? The size represents every run ever scored. The colour represents speed. Red is slow, green is fast. Sehwag’s very fast – but so was Kapil, especially for his time.
This is a drilldown, showing every single match they played. With this, you’ll be able to see who the consistent players are, and where exactly their runs came from. You can also click to see that particular match’s statistics.
All of these visualisations, and the picture stories they told, were generated purely using numbers, and only using programs, with not one bit of manual adjustment. For more such examples, and more details about these, here are the links.