O'Reilly Where 2.0 2011
As a result of cheap storage and computing power, society is measuring and storing increasing amounts of information.
It is now possible to efficiently crunch petabytes of data with tools like Hadoop.
In this O'Reilly Where 2.0 tutorial, Pete Skomoroch, Sr. Data Scientist at LinkedIn, gives an overview of spatial analytics and how you can use tools like Hadoop, Python, and Mechanical Turk to process location data and derive insights about cities and people.
Topics:
* Data Science & Geo Analytics
* Useful Geo tools and Datasets
* Hadoop, Pig, and Big Data
* Cleaning Location Data with Mechanical Turk
* Spatial Tweet Analytics with Hadoop & Python
* Using Social Data to Understand Cities
Geo Analytics Tutorial - Where 2.0 2011
1. Geo Analytics Tutorial
Pete Skomoroch
Sr. Data Scientist - LinkedIn (@peteskomoroch)
#geoanalytics
** Hadoop Intro slides from Kevin Weil, Twitter
2. Topics
‣ Data Science & Geo Analytics
‣ Useful Geo tools and Datasets
‣ Hadoop, Pig, and Big Data
‣ Cleaning Location Data with Mechanical Turk
‣ Spatial Tweet Analytics with Hadoop & Python
‣ Using Social Data to Understand Cities
‣ Q&A
13. Spatial Analysis
Map by Dr. John Snow of London, showing clusters of cholera cases in the 1854 Broad Street cholera outbreak. This was one of the first uses of map-based spatial analysis.
14. Spatial Analysis
• Spatial regression - estimate dependencies between variables
• Gravity models - estimate the flow of people, material, or information between locations
• Spatial interpolation - estimate variables at unobserved locations based on other measured values (see the sketch after this list)
• Simulation - use models and data to predict spatial phenomena
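A minimal sketch of the spatial interpolation bullet above, using inverse distance weighting in Python; the sample points and values are made up for illustration and are not from the tutorial:

    import math

    def idw_estimate(point, samples, power=2):
        # Inverse distance weighting: estimate a value at an unobserved
        # location as a distance-weighted average of measured values.
        num, den = 0.0, 0.0
        for (lat, lon), value in samples:
            d = math.hypot(point[0] - lat, point[1] - lon)
            if d == 0:
                return value  # query point coincides with a measurement
            w = 1.0 / d ** power
            num += w * value
            den += w
        return num / den

    # Hypothetical measurements (e.g., rainfall) at three coordinates
    samples = [((37.77, -122.42), 10.0),
               ((37.80, -122.40), 14.0),
               ((37.75, -122.45), 9.0)]
    print(idw_estimate((37.78, -122.41), samples))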
15. Life Span & Food by Zip Code
* http://zev.lacounty.gov/news/health/death-by-zip-code
* http://www.verysmallarray.com/?p=975
16. Where Americans Are Moving (IRS Data)
‣ (Jon Bruner) http://jebruner.com/2010/06/the-migration-map/
18. Topics
‣ Data Science & Geo Analytics
‣ Useful Geo tools and Datasets
‣ Hadoop, Pig, and Big Data
‣ Cleaning Location Data with Mechanical Turk
‣ Spatial Tweet Analytics with Hadoop & Python
‣ Using Social Data to Understand Cities
‣ Q&A
19. Useful Geo Tools
• R, Matlab, SciPy, Commercial Geo Software
• R Spatial Packages: http://cran.r-project.org/web/views/Spatial.html
• Hadoop, Amazon EC2, Mechanical Turk
• Data Science Toolkit: http://www.datasciencetoolkit.org/
• 80% of effort is often in cleaning and processing data
20. DataScienceToolkit.org
• Runs on a VM or Amazon EC2
• Street Address to Coordinates (example call below)
• Coordinates to Political Areas
• Geodict (text extraction)
• IP Address to Coordinates
• New UK release on GitHub
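A hedged sketch of calling one of those endpoints from Python. The /street2coordinates path and the shape of the JSON response are assumptions based on the Data Science Toolkit documentation; verify them against your own VM or EC2 instance:

    import json
    import urllib.parse
    import urllib.request

    DSTK = "http://www.datasciencetoolkit.org"  # or the address of your own VM / EC2 instance

    def street2coordinates(address):
        # Assumed endpoint: GET /street2coordinates/<address> returns JSON
        # keyed by the input address, with latitude/longitude fields.
        url = DSTK + "/street2coordinates/" + urllib.parse.quote(address)
        with urllib.request.urlopen(url) as resp:
            data = json.loads(resp.read().decode("utf-8"))
        match = data.get(address) or {}
        return match.get("latitude"), match.get("longitude")

    print(street2coordinates("1005 Gravenstein Hwy North, Sebastopol, CA"))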
21. Resources for location data
• SimpleGeo
• Factual
• Geonames
• Infochimps
• Data.gov
• DataWrangling.com
22. Topics
‣ Data Science & Geo Analytics
‣ Useful Geo tools and Datasets
‣ Hadoop, Pig, and Big Data
‣ Cleaning Location Data with Mechanical Turk
‣ Spatial Tweet Analytics with Hadoop & Python
‣ Using Social Data to Understand Cities
‣ Q&A
23. Hadoop: Motivation
• We want to crunch 1 TB of Twitter stream data and understand spatial patterns in tweets
• Data collected from the Twitter “Garden Hose” API last spring
24. Data is Getting Big
‣ NYSE: 1 TB/day
‣ Facebook: 20+ TB compressed/day
‣ CERN/LHC: 40 TB/day (15 PB/year!)
‣ And growth is accelerating
‣ Need multiple machines, horizontal scalability
25. Hadoop
‣ Distributed file system (hard to store a PB)
‣ Fault-tolerant, handles replication, node failure, etc.
‣ MapReduce-based parallel computation (even harder to process a PB)
‣ Generic key-value based computation interface allows for wide applicability
‣ Open source, top-level Apache project
‣ Scalable: Y! has a 4000-node cluster
‣ Powerful: sorted a TB of random integers in 62 seconds
26. MapReduce?
cat file | grep geo | sort | uniq -c > output
‣ Challenge: how many tweets per county, given tweets table?
‣ Input: key=row, value=tweet info
‣ Map: output key=county, value=1
‣ Shuffle: sort by county
‣ Reduce: for each county, sum
‣ Output: county, tweet count
‣ With 2x machines, runs close to 2x faster (see the Python streaming sketch below).
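The per-county count above maps naturally onto Hadoop Streaming with two small Python scripts. This is a rough sketch; the tab-separated field layout of the tweet records (county in the third column) is an assumption, not the actual garden hose format:

    #!/usr/bin/env python
    # mapper.py -- emit "county<TAB>1" for every input tweet record
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")   # assumed layout: tweet id, text, county, ...
        if len(fields) > 2 and fields[2]:
            print("%s\t1" % fields[2])

    #!/usr/bin/env python
    # reducer.py -- input arrives grouped and sorted by county; sum the counts
    import sys

    current, total = None, 0
    for line in sys.stdin:
        county, count = line.rstrip("\n").split("\t")
        if county != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = county, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

The two scripts would be wired together with the hadoop-streaming jar (-mapper mapper.py -reducer reducer.py), with the shuffle phase doing the sort-by-county in between; the exact invocation depends on your cluster setup.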
33. But...
‣ Analysis typically done in Java
‣ Single-input, two-stage data flow is rigid
‣ Projections, filters: custom code
‣ Joins: lengthy, error-prone
‣ n-stage jobs: Hard to manage
‣ Prototyping/exploration requires compilation
‣ Analytics in Eclipse? ur doin it wrong...
34. Enter Pig
‣ High level language
‣ Transformations on sets of records
‣ Process data one step at a time
‣ Easier than SQL?
35. Why Pig?
‣ Because I bet you can read the following script.
36. A Real Pig Script
‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
38. Pig Simplifies Analysis
‣ The Pig version is:
‣ 5% of the code, 5% of the time
‣ Within 50% of the execution time.
‣ Pig Geo:
‣ Programmable: fuzzy matching, custom filtering
‣ Easily link multiple datasets, regardless of size/structure
‣ Iterative, quick
39. A Real Example
‣ Fire up your Elastic MapReduce Cluster.
‣ ... or follow along at http://bit.ly/whereanalytics
‣ I used Twitter’s streaming API to store some tweets
‣ Simplest thing: group by location and count with Pig
‣ http://bit.ly/where20pig
‣ Here comes some code!
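The actual Pig script is at the bit.ly link above. As a rough, single-machine approximation of the same group-by-location-and-count step (assuming one lowercased location string per input line, which is an assumption about the extracted data, not the real schema):

    import sys
    from collections import Counter

    # Count tweets per location string -- the same aggregation the Pig
    # script performs, minus the distributed execution on Hadoop.
    counts = Counter(line.strip().lower() for line in sys.stdin)
    for location, n in counts.most_common(30):
        print("%s\t%d" % (location, n))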
49. hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30
brasil 37985
indonesia 33777
brazil 22432
london 17294
usa 14564
são paulo 14238
new york 13420
tokyo 10967
singapore 10225
rio de janeiro 10135
los angeles 9934
california 9386
chicago 9155
uk 9095
jakarta 9086
germany 8741
canada 8201
7696
7121
jakarta, indonesia 6480
nyc 6456
new york, ny 6331
50. Neat, but...
‣ Wow, that data is messy!
‣ brasil, brazil at #1 and #3
‣ new york, nyc, and new york ny all in the top 30
‣ Mechanical Turk to the rescue...
51. Topics
‣ Data Science & Geo Analytics
‣ Useful Geo tools and Datasets
‣ Hadoop, Pig, and Big Data
‣ Cleaning Location Data with Mechanical Turk
‣ Spatial Tweet Analytics with Hadoop & Python
‣ Using Social Data to Understand Cities
‣ Q&A
65. Topics
‣ Data Science & Geo Analytics
‣ Useful Geo tools and Datasets
‣ Hadoop, Pig, and Big Data
‣ Cleaning Location Data with Mechanical Turk
‣ Spatial Tweet Analytics with Hadoop & Python
‣ Using Social Data to Understand Cities
‣ Q&A
66. Tokenizing and Cleaning Tweet Text
‣ Extract Tweet topics with Hadoop + Python + NLTK + Wikipedia
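A minimal sketch of the tokenize-and-clean step in Python with NLTK; the stopword filtering and the idea of building n-gram candidates to match against Wikipedia titles are assumptions about the pipeline described on the slide, not its exact code:

    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    STOP = set(stopwords.words("english"))

    def tweet_tokens(text):
        # Lowercase, strip URLs and @mentions, keep word-like tokens, drop stopwords
        text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
        tokens = re.findall(r"[a-z][a-z']+", text)
        return [t for t in tokens if t not in STOP]

    def candidate_topics(tokens, n=2):
        # Unigrams plus bigrams; a later step could keep only those that
        # match a dictionary of Wikipedia article titles (not shown here)
        grams = list(tokens)
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
        return grams

    print(candidate_topics(tweet_tokens("Watching the Giants game in San Francisco http://t.co/xyz")))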
77. Topics
‣ Data Science & Geo Analytics
‣ Useful Geo tools and Datasets
‣ Hadoop, Pig, and Big Data
‣ Cleaning Location Data with Mechanical Turk
‣ Spatial Tweet Analytics with Hadoop & Python
‣ Using Social Data to Understand Cities
‣ Q&A
88. Topics
‣ Data Science & Geo Analytics
‣ Useful Geo tools and Datasets
‣ Hadoop, Pig, and Big Data
‣ Cleaning Location Data with Mechanical Turk
‣ Spatial Tweet Analytics with Hadoop & Python
‣ Using Social Data to Understand Cities
‣ Q&A
89. Questions? Follow me at
twitter.com/peteskomoroch
datawrangling.com