1. Exploring Big and not so Big Data:
Opportunities and Challenges
Juliana Freire
juliana.freire@nyu.edu
Visualization and Data Analysis (ViDA) Center
http://bigdata.poly.edu
NYU Poly
2. Big Data: What is the Big deal?
http://www.google.com/trends/explore#q=%22big%20data%22!
ViDA Center Juliana Freire 2
3. Big Data: What is the Big deal?
Many success stories
– Google: many billions of pages indexed, products,
structured data
– Facebook: 1.1 billion users using the site each month
– Twitter: 517 million accounts, 250 million tweets/day
This is changing society!
4. Big Data: What is the Big deal?
Smart Cities: 50% of the world population lives in
cities
– Census, crime, emergency visits, cabs, public transportation,
real estate, noise, energy, …
– Make cities more efficient and sustainable, and improve the
lives of their citizens
http://www.nyu.edu/about/university-initiatives/center-for-urban-science-progress.html
Enable scientific discoveries: science is now data rich
– Petabytes of data generated each day, e.g., Australian radio
telescopes, Large Hadron Collider
– Social data, e.g., Facebook, Twitter (2,380,000 and 2,880,000
results in Google Scholar, respectively!)
Data is currency
6. Big Data: What is the Big deal?
Big data is not new: financial transactions, call
detail records, astronomy, …
What is new is that there are many more data
enthusiasts
More data are widely available, e.g., Web, data.gov,
scientific data
Computing is cheap and easy to access
– Server with 64 cores, 512GB RAM ~$11k
– Cluster with 1000 cores ~$150k
– Pay as you go: Amazon EC2
[Figure: plot from Howe and Halperin, DEB 2012. Axis text: data volumes, % IT investment, rank; years 2010 and 2020. Fields shown: Astronomy, Physics, Medicine, Geosciences, Microbiology, Chemistry, Social Sciences]
7. Big Data: What is the Big deal?
Big data is not new: financial transactions, call
detail records, astronomy, …
What is new is that there are many more data
enthusiasts
More data are widely available, e.g., Web, data.gov,
scientific data, social and urban data
Computing is cheap and easy to access
– Server with 64 cores, 512GB RAM ~$11k
– Cluster with 1000 cores ~$150k
– Pay as you go: Amazon EC2
8. Big Data: What is hard?
Scalability is not the problem…
Usability is the Big issue
[Diagram terms: algorithms, data, visual encodings, technology, user interfaces, statistics, provenance, interaction modes, math, machine learning, data management, knowledge]
10. Exploring data is hard, regardless of whether the data is big or small
11. Case Study: Studying Cab Trips in NYC
Prepare data for analysis
Raw data for 2011: 63 GB
– 24 CSV files, 2 for each month: one for trip data
and another for fare data
– ~170M trips
Cleaning
– ~60,000 fare records do not have trip records
– ~200 duplicates per month
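The two cleaning checks above (fare records without a matching trip record, and duplicate records) can be sketched in a few lines of Python. This is only an illustration, not the pipeline used for the taxi data: the column names `medallion` and `pickup_datetime` and the idea of joining the two files on that pair are assumptions, and the sample rows are made up.

```python
import csv
import io

# Hypothetical join key; the real 2011 files may use different columns.
KEY_FIELDS = ("medallion", "pickup_datetime")


def read_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def key_of(row):
    """Build the (assumed) join key shared by trip and fare records."""
    return tuple(row[field] for field in KEY_FIELDS)


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def orphan_fares(trips, fares):
    """Return fare records whose key has no matching trip record."""
    trip_keys = {key_of(trip) for trip in trips}
    return [fare for fare in fares if key_of(fare) not in trip_keys]


# Made-up sample data: one duplicated trip and one fare without a trip.
trip_csv = """medallion,pickup_datetime,passengers
A1,2011-01-01 00:05:00,1
A1,2011-01-01 00:05:00,1
B2,2011-01-01 00:10:00,2
"""
fare_csv = """medallion,pickup_datetime,fare
A1,2011-01-01 00:05:00,7.50
C3,2011-01-01 00:20:00,12.00
"""

trips = drop_duplicates(read_rows(trip_csv))
orphans = orphan_fares(trips, read_rows(fare_csv))
print(len(trips), [fare["medallion"] for fare in orphans])
```

At the scale on this slide (~170M trips) the key set would probably not fit comfortably in memory on a small machine, so a real run would likely do the same comparison month by month or inside a database.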
13. Storage Solutions: Spatial-Temporal
All trips for a week in a given region
All trips in a week for a given taxi
All trips in a week for a given taxi in a
given region
Needs a complex indexing scheme that
combines spatial, temporal, and taxi id searches
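To make the three query types concrete, here is a minimal linear-scan sketch in Python. It is a baseline only, not the indexing scheme the slide calls for: every query touches every trip, which is exactly what an index is meant to avoid. The `Trip` fields, the bounding-box form of a region, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Trip:
    # Hypothetical record layout, not the schema of the real data.
    taxi_id: str
    pickup_time: datetime
    lon: float
    lat: float


def in_week(trip, week_start):
    """True if the trip starts within the 7 days beginning at week_start."""
    return week_start <= trip.pickup_time < week_start + timedelta(days=7)


def in_region(trip, region):
    """Region is an assumed bounding box: (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = region
    return min_lon <= trip.lon <= max_lon and min_lat <= trip.lat <= max_lat


def query(trips, week_start, region=None, taxi_id=None):
    """Linear scan; each optional filter narrows the result."""
    return [
        t for t in trips
        if in_week(t, week_start)
        and (region is None or in_region(t, region))
        and (taxi_id is None or t.taxi_id == taxi_id)
    ]


# Made-up sample trips.
trips = [
    Trip("T1", datetime(2011, 5, 2, 9, 0), -73.98, 40.75),
    Trip("T1", datetime(2011, 5, 3, 9, 0), -73.80, 40.65),
    Trip("T2", datetime(2011, 5, 4, 9, 0), -73.98, 40.76),
    Trip("T1", datetime(2011, 5, 9, 9, 0), -73.98, 40.75),  # following week
]
week = datetime(2011, 5, 1)
region = (-74.00, 40.70, -73.95, 40.80)

by_region = query(trips, week, region=region)          # week + region
by_taxi = query(trips, week, taxi_id="T1")             # week + taxi
combined = query(trips, week, region=region, taxi_id="T1")  # all three
print(len(by_region), len(by_taxi), len(combined))
```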
14. Storage Solutions: Spatial-Temporal
SQLite
– 20+10 GB of storage (index on time and id, r-tree for coordinates)
– Creating indexes: 52hrs
– Range queries: 2.1s
– Combined queries: 15.3s
– Cross-table queries: 57s
Custom storage (ours)
– 12+4 GB of storage (using 4D kd-tree on time, id and coordinates)
– Building kd-tree: 8 mins
– Range queries: 0.2s
– Combined queries: 0.2s
– Cross-table queries: 2s
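The idea of a 4D kd-tree over time, id and coordinates can be illustrated with a small in-memory sketch. This is a textbook kd-tree written for readability, not the custom storage engine measured on this slide; the numeric encoding of the four dimensions and the sample points are assumptions.

```python
# Each point is (time, taxi_id, x, y), all encoded as numbers (an assumption).
DIMS = 4


class Node:
    __slots__ = ("point", "axis", "left", "right")

    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build(points, depth=0):
    """Build a kd-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % DIMS
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        axis,
        build(ordered[:mid], depth + 1),
        build(ordered[mid + 1:], depth + 1),
    )


def range_query(node, lo, hi, out=None):
    """Collect all points p with lo[i] <= p[i] <= hi[i] on every axis."""
    if out is None:
        out = []
    if node is None:
        return out
    point, axis = node.point, node.axis
    if all(lo[i] <= point[i] <= hi[i] for i in range(DIMS)):
        out.append(point)
    # Left subtree holds values <= the split value, right holds values >= it,
    # so a side is skipped only when the query range cannot reach it.
    if lo[axis] <= point[axis]:
        range_query(node.left, lo, hi, out)
    if point[axis] <= hi[axis]:
        range_query(node.right, lo, hi, out)
    return out


# Made-up points: 50 trips spread over 3 taxis and a small grid.
points = [(float(t), float(t % 3), float(t % 5), float(t % 7)) for t in range(50)]
tree = build(points)

# "All trips for taxi 1 between time 10 and 30, anywhere in the grid."
lo = (10.0, 1.0, 0.0, 0.0)
hi = (30.0, 1.0, 4.0, 6.0)
hits = range_query(tree, lo, hi)
print(len(hits))
```

A query that does not constrain one dimension (for example, any taxi) can pass `float("-inf")` and `float("inf")` as the bounds on that axis, which is one way a single structure can serve the range, combined, and per-taxi queries.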
15. Summary Statistics
Analysis/Modeling
13,237 Medallion Cabs
42,000 Taxi Drivers
Average Number of Rides: 485k/day
Average Number of Passengers: 660k/day
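Per-day averages like the two above come from grouping trips by day and dividing the totals by the number of days. The sketch below shows that computation in Python on made-up records; the `(pickup_date, passenger_count)` layout is an assumption, not the format of the real files.

```python
from collections import defaultdict
from datetime import date


def daily_averages(trips):
    """Return (average rides per day, average passengers per day).

    trips: iterable of (pickup_date, passenger_count) pairs (assumed layout).
    Only days that have at least one trip are counted.
    """
    rides = defaultdict(int)
    passengers = defaultdict(int)
    for day, count in trips:
        rides[day] += 1
        passengers[day] += count
    n_days = len(rides)
    return sum(rides.values()) / n_days, sum(passengers.values()) / n_days


# Made-up sample: three rides on Jan 1, one ride on Jan 2.
sample = [
    (date(2011, 1, 1), 1),
    (date(2011, 1, 1), 2),
    (date(2011, 1, 1), 1),
    (date(2011, 1, 2), 3),
]
avg_rides, avg_passengers = daily_averages(sample)
print(avg_rides, avg_passengers)
```

Taken together, the two averages on this slide imply roughly 660k / 485k ≈ 1.36 passengers per ride.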
[Plot: Rides in 2011, January through December; axis labels 29k and 590k; annotated dates: Apr 2, Apr 3, Aug 28 (Irene), Dec 25]
16. Weekly Patterns
Analysis/Modeling
Rides per Hour, June 2011
– Between 5k and 35k rides/hour
– Night life! Rides at midnight
[Plot: rides per hour across the days of the week, with 0h marked for each day]
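A rides-per-hour series of this kind can be produced by truncating each pickup time to the hour and counting. The Python sketch below uses made-up timestamps and is only one possible way to compute it; the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def rides_per_hour(pickups):
    """Count rides per clock hour; keys are datetimes truncated to the hour."""
    return Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in pickups
    )


def midnight_counts(hourly):
    """Pick out the 0h bucket of each day, i.e. rides starting 00:00-00:59."""
    return {hour.date(): n for hour, n in hourly.items() if hour.hour == 0}


# Made-up pickup times for two days in June 2011.
pickups = [
    datetime(2011, 6, 4, 0, 5),
    datetime(2011, 6, 4, 0, 40),
    datetime(2011, 6, 4, 8, 15),
    datetime(2011, 6, 5, 0, 1),
]
hourly = rides_per_hour(pickups)
midnight = midnight_counts(hourly)
print(hourly[datetime(2011, 6, 4, 0)], sorted(midnight.values()))
```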
18. Drop-offs vs. Pickups
[Map legend: Drop-off, Pickup]
Most of the drop-offs occur on the avenues, while
most of the pickups occur on the streets
19. Studying Anomalies
Sunday, May 1st 2011
4:00AM-4:30AM 6:00AM-6:30AM 8:00AM-8:30AM
21. Studying Anomalies
Sunday, May 1st 2011
8:00AM-8:30AM 9:30AM-10:00AM
22. Studying Anomalies: Interpretation
Sunday, May 1st 2011
8:00AM-8:30AM 9:30AM-10:00AM
Five Borough Bike Tour
23. Studying Anomalies
Sunday, May 1st 2011
07:00AM-08:00AM
24. Studying Anomalies
Sunday, May 1st 2011
08:00AM-10:00AM
25. Studying Anomalies
Sunday, May 1st 2011
10:00AM-11:00AM
26. Studying Patterns
May 1st – May 7th 2011
3.6 Million Trips
Compare movement in the airports against the large train stations
27. Studying Patterns
Train Stations
Airports
May 1st – May 7th 2011
3.6 Million Trips
30. Uses of Clean Data: FindMeACab App
31. Take Away
Data exploration is challenging for both small and
big data
It is hard to prepare data for exploration
For many tasks, existing tools are too
cumbersome or not scalable
Need better, usable tools
– Tools for data enthusiasts who are not computer scientists!
Visualization is essential for exploring large volumes
of data: "A picture is worth a thousand words"
Pictures help us think [Tamara Munzner]
– Substitute perception for cognition
– Free up limited cognitive/memory resources for
higher-level problems
32. Masters in Big Data
New degree at NYU Poly – Spring 2014
Courses:
– Machine learning
– Massive data analysis
– Visualization
– Visual Analytics
– Database Systems
– Algorithms
– …