1. Unleashing Data Science Innovations:
Sparking Big Data
linkedin.com/in/sureshsood
@soody
http://www.slideshare.net/ssood/spark-47741029
6 May, 2015
2. Topic Areas for Discussion
1. Statistics/Data mining or Data Science?
2. What is big data and the challenge today?
3. Data types
4. Hadoop File Storage System and Spark
5. Data Science innovation
6. Data Science discoveries and workflow
7. New Sources of Information (Big data) Data Innovations
8. Internet of Things
9. Data Science Innovations
10. Apache Spark
3. Statistics, Data Mining or Data Science?
• Statistics
– precise deterministic causal analysis over precisely collected data
• Data Mining
– deterministic causal analysis over carefully sampled, re-purposed data
• Data Science
– trending/correlation analysis over existing data using the bulk of the population, i.e. big data
Adapted from:
NIST Big Data taxonomy draft report (see http://bigdatawg.nist.gov/show_InputDoc.php)
4. Big Data Challenge Today: Moving from Transactions Alone to Relationships and Empathy
Current State
= Transactions $$$
We do this stuff well
e.g. Collect payments …
Future State
= Human Empathy (relationships)
We don’t really do this, e.g. user-generated content, ratings, reviews,
1:1 dialogue, distress signals, geolocation
5. What is Big Data?
Unknown relationships
Unstructured data
95% of data not collected
Social-Psychological-Local-Mobile-GPS-M2M
Beyond Transactions including interactions and observations
6. Data Types
• Astronomical
• Documents
• Earthquake
• Email
• Environmental sensors
• Fingerprints
• Health (personal)
• Images
• Graph data (social network)
• Location
• Marine
• Particle accelerator
• Satellite
• Scanned survey data
• Sound
• Text
• Transactions
• Video
7. Hadoop Configurations (Single and Multi-Rack)
Adapted from: http://stackiq.com/
Cluster manager e.g. Apache Ambari, Apache Mesos, or Rocks
With 3 TB drives and 18 data nodes, the configuration represents 648 TB of raw storage
(which implies 12 drives per node).
With the HDFS standard replication factor of 3, this gives 216 TB of usable storage.
Name/secondary/data nodes – 6 core 96 GB
Management node – 4 core 16 GB
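The storage figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, not a sizing tool: the function name is made up for illustration, and the value of 12 drives per node is not stated on the slide but inferred from 648 TB / (18 nodes × 3 TB). It also ignores filesystem and operating system overhead.

```python
def hdfs_capacity_tb(data_nodes, drives_per_node, drive_tb, replication=3):
    """Return (raw_tb, usable_tb) for a simple HDFS cluster estimate."""
    # Raw capacity: every drive on every data node.
    raw_tb = data_nodes * drives_per_node * drive_tb
    # HDFS stores each block `replication` times, so usable space shrinks accordingly.
    usable_tb = raw_tb / replication
    return raw_tb, usable_tb


# drives_per_node=12 is an assumption inferred from the slide's totals.
raw, usable = hdfs_capacity_tb(data_nodes=18, drives_per_node=12, drive_tb=3)
print(f"raw: {raw} TB, usable: {usable:.0f} TB")
```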
10. Data Science Innovation
Data science innovation is something an
organization or individual has not done
before using data. The innovation focuses
on discovery using new or
nontraditional data sources solving new
problems.
Adapted from:
Franks, B. (2012) Taming the Big Data Tidal Wave, p. 255, John Wiley & Sons
11. Data Science Discoveries
1. Outlier / Anomaly / Novelty / Surprise detection
2. Clustering (= New Class discovery, Segmentation)
3. Correlation & Association discovery
4. Classification, Diagnosis, Prediction
Source: Borne, Kirk (2012) LSST All Hands Meeting, 13-17 August
14. Internet of Things (IoT)
“trillion sensors”
Source: www.tsensorssummit.org
15. Data Science Innovations
ID | Analytics | Innovative Info source | Innovation | Platform/Library
1 | Graph Analytics | Multiple | Reduce suspect list from 18 million to 230/32 | Spark GraphX
2 | ANZ Truckometer | NZ transport authority real time traffic data | GDP forecast 6 months in advance | N/A; potential for combining with GDELT
3 | Driving (Usage Based Insurance) | Black box (telematics); unstructured data | Pay as you drive policy; pay how you drive | Hadoop Map Reduce
4a | Deception (veracity) | Found stories, online blogs | Flag fake stories: text, images and short video | MongoDB/Spark; Python dictionary
4b | Psychological State | Twitter and Instagram | Junk words | MongoDB/Spark; Python dictionary
4c | Thematic Apperception Technique | Mobile phone screen customisation | Automated informant testing | Sparkling Water (H2O/Spark); Deep Learning
5 | Brand | Brand stories “found” online | Brand user profile | SparkR
6 | Supermarket shopper behavior | CCTV / beacon transmitters | “My store” product placement based on time of day; predictive shopping behaviour | MongoDB; Hadoop 2 Cluster; Spark GraphX; Spark MLlib
7 | Sandbag exercise | Sandbag sensors | Virtual trainer | Spark GraphX; Spark MLlib
8 | Oil reserves shipment monitoring | Skybox (Google) satellite images | Improved oil forecast | “Busboy” – C/Hadoop
Suresh Sood 2015
16. 1. Graph Analytics
• In the 1990s, Ivan Milat killed 7 backpackers, making him Australia's most notorious serial killer
• Everyone in Australia was a suspect
• Large volumes of data from multiple sources
– RTA vehicle records
– Gym memberships
– Gun licensing records
– Internal police records
• Police applied node link analysis techniques (NetMap) to the data
• Harness power of the human mind
• Analyst can spot indirect links, patterns , structure, relationships and anomalies
• A bottom-up approach with process of discovery to uncover structure
• Reduced the suspect list from 18 million to 230
• Further analysis with the use of additional satellite information reduced this to 32
Data Information Knowledge
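The idea behind node link analysis, that entities connected to several independent data sources stand out, can be sketched in a few lines. This is only an illustration and not NetMap or the method the police used: the source names, the placeholder identifiers and the `min_sources` threshold are all hypothetical.

```python
from collections import defaultdict


def link_counts(sources):
    """Map each person to the set of source names they appear in."""
    links = defaultdict(set)
    for source_name, people in sources.items():
        for person in people:
            # Each (person, source) pair is one link in the node-link graph.
            links[person].add(source_name)
    return links


def shortlist(sources, min_sources):
    """Keep only people linked to at least `min_sources` different sources."""
    links = link_counts(sources)
    return sorted(p for p, linked in links.items() if len(linked) >= min_sources)


# Placeholder records; identifiers and source names are invented for illustration.
records = {
    "vehicle_records": {"person_a", "person_b", "person_c"},
    "gym_memberships": {"person_b", "person_d"},
    "gun_licences": {"person_b", "person_c"},
    "police_records": {"person_b", "person_c"},
}

suspects = shortlist(records, min_sources=3)
print(suspects)
```

In practice an analyst explores the links visually rather than applying a single threshold; the filter here only stands in for that narrowing step.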
17. 2. ANZ Truckometer
The ANZ Heavy Traffic Index comprises flows of vehicles weighing more than 3.5 tonnes
(primarily trucks) on 11 selected roads around NZ. It is contemporaneous with GDP growth.
The ANZ Light Traffic Index is made up of light or total traffic flows (primarily cars and
vans) on 10 selected roads around the country. It gives a six-month lead on GDP growth.
http://www.anz.co.nz/about-us/economic-markets-research/truckometer/
18. 3. Black Box Insurance
•Big data transforms actuarial insurance from using probability methods to estimate premiums into dynamic risk management using real data to generate
individually tailored premiums
•Estimate: for a 20 km journey to work or home, a data point is acquired every minute and the journey captures 12 points per km. Assume 1,000 km of driving per month,
generating 12,000 points per month, or 144,000 points per car per annum. Hence, 1,000 cars lead to 144 million points per annum.
•A telematics (black box) monitor helps assess driving behavior and price the policy with true driver-centric premiums by capturing:
–Number of journeys
–Distances travelled
–Types of roads
–Speed
–Time of travel
–Acceleration and braking
–Any accidents
–Location ?
•Benefits low mileage, smooth and safe drivers
•Privacy vs. saving money on insurance (Canada; http://bit.ly/Black_box)
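The data volume estimate on this slide is simple multiplication, sketched here so the assumptions can be varied. The constants come from the slide; the function and parameter names are made up for illustration.

```python
POINTS_PER_KM = 12       # from the slide: a journey captures 12 points per km
KM_PER_MONTH = 1_000     # from the slide: assumed monthly driving distance
MONTHS_PER_YEAR = 12


def points_per_year(cars, points_per_km=POINTS_PER_KM, km_per_month=KM_PER_MONTH):
    """Estimate telematics data points collected per year for a fleet."""
    per_car_month = points_per_km * km_per_month
    per_car_year = per_car_month * MONTHS_PER_YEAR
    return cars * per_car_year


print(points_per_year(1))      # one car
print(points_per_year(1_000))  # a fleet of 1,000 cars
```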
19. Psychological analytics helps put human context into Business
• Behavior data links human emotions to business -> analyse footprints left behind.
• What does customer satisfaction really mean? Is the person actually happy?
• How do we take the emotional dimension into account for customer experience?
• How do we recognize someone is dissatisfied?
• How do we recognize a “distressed” person?
• Do we use text and voice? Will sleeping patterns and eating habits help?
• Would you act differently if someone is happy?
• How do you coach employees to see how someone sounds in emotional terms?
• Understanding when distress exists and when a customer needs enhanced service
• Behavior data reveals attitude and intent. This is more predictive of future opportunities and
risk than historical data
21. 1.Gayle
3. Paris
2. Paige
4.”The occasion
was my cousin
Paige’s 16th”
5. “I am a Canadian
and get by in
French.”
6. "All I can say is WOW! We rented a 2
bedroom, 1 ½ bath apartment (two showers),
"Merlot" from ParisPerfect
http://www.parisperfect.com/ and boy was it
ever perfect! "
7. “We had a full view of the Eiffel from our
charming little terrace. ....We were within
walking distance to two metro stops (Pont
d'Alma or Ecole Militaire) "
8. "We were walkable to many good
bistros, cafes and bakeries and only a few
blocks from the wonderful market street
Rue Cler."
9. "I bought a Paris Pratique pocket-sized book at a
Metro station. This handy guide has detailed maps of
each arrondisement, as well as the metro lines, the
bus lines, the RER and the SCNF (trains). I'll never be
without this again."
10."Six months before our trip, I gave
Paige a couple of good guide books on
Paris and suggested she let me know
what her interests were since after all,
this was to be her trip."
11.Sites
•The Marais
•Notre Dame
•L'Arc de Triomphe - 248 steps up and 248 steps
down...
•Champs Elysee
•Jacquemart Museum
•Louvre Lite
•Musee D'Orsay
•Les Invalides, Napoleon's Tomb and the Napoleon
Museum
•Sacre Coeur
•Monmartre
•Rodin Museum
•Pompidou Museum
•Train to Vernon, bike to Giverny with Fat Tire Bike
Tours
•http://www.fattirebiketoursparis.com/
•Eiffel Tower
Elaboration of Trip to Paris Blog Story (Means-End & Heider)
Woodside, Sood & Miller (2008) "When Consumers and Brands Talk", Psychology & Marketing
12. Unforgettable Memories
"This trip had so many memories, but here are a few choice
highlights........On our very first night, knowing that the Eiffel
Tower light show started at 10:00 p.m.... she [Paige] dropped
her camera…down 6 flights…we were stunned…Spanish
Family below standing below [with pieces of the camera]”
15." Michael Osman is an American artists
living in Paris."
"He supplements his income by being a tour
guide." "I found out about him on Fodors"
"So I engaged Michael for two days."
16. "On our trip to Giverny, we met a young
woman from Brisbane, Australia who was
traveling on her own and we invited her to join
us. Three of us enjoyed delicious and innovative
soufflés, while Paige had the rack of lamb. We
shared two dessert soufflés, one chocolate and
the other cherry/almond. Yum"
17. "I wanted Paige to get a feel for
shopping experiences that she
would not have at home (aka the
ubiquitous mall). "
18."We went on Fat Tire's
day trip to Monet's
gardens and house in
Giverny, about an hour
outside Paris."
13."The father stretched out his cupped hands
which held all of the pieces they were able to
recover, including the memory stick and he
very solemnly said, "El muerto...".
14. "They had decide to come to Paris to
find the Harley Davidson store so they
could buy Harley Paris t-shirts."
19....."I know Paige will
treasure the memory of
this girl's trip for many
years to come."
23. The Newman Model of Deception (Pennebaker et al)
Key word categories for deception mapping:
1. Self words e.g. “I” and “me” – decrease when someone distances
themselves from content
2. Exclusive words e.g. “but” and “or” decrease with fabricated content
owing to complexity of maintaining deception
3. Negative emotion words e.g. “hate” increase in word usage owing to
shame or guilty feeling
4. Motion verbs e.g. “go” or “move” increase as exclusive words go
down to keep the story on track
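The four keyword categories above lend themselves to a simple dictionary-based word count. The sketch below is a rough illustration, assuming tiny word lists that contain only the examples given on this slide; a real dictionary such as LIWC is far larger, and the slide gives no thresholds, so the code only reports the rate of each category for comparison against a baseline text.

```python
import re
from collections import Counter

# Word lists hold only the slide's examples; real dictionaries are much larger.
CATEGORIES = {
    "self": {"i", "me"},
    "exclusive": {"but", "or"},
    "negative_emotion": {"hate"},
    "motion": {"go", "move"},
}


def category_rates(text):
    """Return the share of words in `text` that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter()
    for word in words:
        for category, vocabulary in CATEGORIES.items():
            if word in vocabulary:
                counts[category] += 1
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


sample = "I wanted to go but they told me to move"
rates = category_rates(sample)
print(rates)
```

Under the model as summarised above, a lower share of self and exclusive words together with a higher share of negative emotion words and motion verbs, relative to a baseline, would be the signal to look for.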
25. 4b. Psychological State
• LIWC (analyzewords.com)
– Reveal personality from word usage
– Uses LIWC classification of words
• TweetPsych (tweetpsych.com/)
– Linguistic analysis using:
– RID
– LIWC
Note: TweetPsych is not without critics:
http://psychcentral.com/blog/archives/2009/06/18/putting-cool-ahead-of-science-tweetpsych/
32. 7. Smart Sandbag
smart-dove.com
The first 3 columns are the x, y, z axes of the gyroscope, followed by the x, y, z axes
of the accelerometer. These are raw data from 40 repetitions of the shoulder
press exercise. Standard deviation and a moving average algorithm are used to build
the chart, and a Hidden Markov Model is used to extract features and build a model
of the exercise. All models are put into the cloud for trainee exercise scoring.
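The smoothing step described for the sandbag sensor data can be sketched with a rolling window over one axis. This is a minimal illustration, not the product's actual pipeline: the sample readings and window size are invented, the population standard deviation is an arbitrary choice, and the Hidden Markov Model step is not shown.

```python
from statistics import mean, pstdev


def rolling(values, window, fn):
    """Apply `fn` to each full window of `window` consecutive values."""
    return [
        fn(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


def moving_average(values, window):
    return rolling(values, window, mean)


def moving_std(values, window):
    # Population standard deviation of each window (an assumption).
    return rolling(values, window, pstdev)


# Made-up accelerometer x-axis readings for illustration only.
accel_x = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0]

smooth = moving_average(accel_x, window=3)
spread = moving_std(accel_x, window=3)
print(smooth)
print(spread)
```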
35. Square Kilometer Array (SKA)
• Data collected in a single day would take nearly two million years to play back on an MP3 player
• Central computer has processing power of about one hundred million PCs.
• SKA will use enough optical fiber linking up all the radio telescopes to wrap twice around the Earth.
• Dishes of SKA when fully operational will produce 10 times the global internet traffic as of 2013.
• Aperture arrays in the SKA could produce more than 100 times the global internet traffic as of 2013.
• The SKA will generate enough raw data to fill 15 million 64 GB MP3 players every day.
• The SKA supercomputer will perform 10^18 operations per second - equivalent to the number of stars in three million Milky
Way galaxies - in order to process all the data that the SKA will produce.
• So sensitive that it will be able to detect an airport radar on a planet 50 light years away.
• Thousands of antennas with collecting area of about one square kilometer (that's 1,000,000 square meters).
• Previous mapping of Centaurus A galaxy took a team 12,000 hours of observations or several years. SKA ETA 5 minutes !
• In the first six hours of operation, the SKA will generate more information than all previous radio telescopes in the world combined.
• The Square Kilometer Array will link 250,000 radio telescopes together, creating the most sensitive telescope.
To the scientists involved, however, the SKA is no testbed, it’s a transformative instrument which,
according to Luijten, will lead to “fundamental discoveries of how life and planets and matter all
came into existence. As a scientist, this is a once in a lifetime opportunity.”
Sources: http://bit.ly/amazin-facts & http://bit.ly/astro-ska
Centaurus A
Combine traditional and social data to create a Social CRM
Build social fields into customer contact information
Track social media interactions with customers.
Understand where customers hang out using social media data
Collect customer feedback from social channels.