2. Handouts & Reference Materials
1. NIST Big Data Interoperability Framework: Volume 1, Definitions, Final Version 1, 9/2015
2. Field Guide to Hadoop (preview edition)
3. Learning Spark (preview edition)
4. Databricks Spark Reference Applications
5. Spark Data Analytics projects/users
3. Areas for Conversation
Social (content, structure and analytics)
Data Science Primer and Resources: Big data, Spark ecosystem
Data Science Innovation
4. Roadmap – Evolution from Existing Operations to Predictive
Stages: Rigid, Flexible, Connected, Adaptive, Predictive
(Adapted from Solis, 2012 and Davenport, 2007)
Themes
• Rigid: siloed and rigid; hoarding information rather than collaborating.
• Flexible: information and knowledge are shared freely on an internal basis; the business acts socially with customers through two-way communications.
• Connected: connected internally and externally; listening and learning; internal and external engagement is shared via hub and spoke; employees are connected directly to customers.
• Adaptive: agile; integrates customer experiences and feedback loops; listening and learning now become analysis and insight; makes sense of data and transforms it into intelligence; responds in real time.
• Predictive: shifts from reactive to proactive and predictive; the business uses social media heavily and is flexible, connected, adaptive and predictive in terms of customer experiences, needs and new opportunities; predicts scenarios before they occur to maximise opportunity and limit risk.
Business Intelligence questions
• What if conversations continue?
• How can we lead conversations? (predictive recommendation)
• What conversations are next?
• Why are these conversations occurring?
• What actions are required?
• What is the sentiment of conversations?
• When and where are conversations taking place?
• What conversations are taking place?
17. Node Link Analytics
• Australian pioneer Dr John Galloway (AM)
• In the 1990s Ivan Milat killed 7 backpackers, making him Australia's most notorious serial killer
• Everyone in Australia was a suspect
• Large volumes of data from multiple sources:
– RTA vehicle records
– Gym memberships
– Gun licensing records
– Internal police records
• Police applied node link analysis techniques (NetMap) to the data
• Harnesses the power of the human mind
• An analyst can spot indirect links, patterns, structure, relationships and anomalies
• A bottom-up approach that uses a process of discovery to uncover structure
• Reduced the suspect list from 18 million to 230
• Further analysis using additional satellite information reduced this to 32
Data Information Knowledge
23. Language on Twitter Tracks Rates of Coronary Heart
Disease, Psychological Science, January 2015
The findings show that expressions of negative emotions such as anger, stress, and fatigue in the tweets from peo
The results suggest that using Twitter as a window into a community’s collective mental state may provide a usefu
24. Twitter and Marketing Predictions
• Tweets are “found data”, gathered without asking questions
• More meaning than typical search engine query
• Large numbers of passive participants in natural settings
• Twitter can predict the stock market (Lisa Grossman, Wired, Oct 19 2010)
• Predict movie success in first few weekends of release
– “…it also raises an interesting new question for advertisers and marketing executives. Can they change the
demand for their film, product or service by directly influencing the rate at which people tweet about it?
In other words, can they change the future that tweeters predict?”
Tech Review, http://www.technologyreview.com/blog/arxiv/25000/
25. Psychological analytics helps put human context into Business
• Behaviour data links human emotions to business -> analyse the footprints left behind.
• What does customer satisfaction really mean? Is the person actually happy?
• How do we take the emotional dimension into account for customer experience?
• How do we recognize someone is dissatisfied?
• How do we recognize a “distressed” person?
• Do we use text and voice? Will sleeping patterns and eating habits help?
• Would you act differently if someone is happy?
• How do you coach employees to see how someone sounds in emotional terms?
• Understanding when distress exists and when a customer needs enhanced service
• Behaviour data reveals attitude and intent. This is more predictive of future opportunities and
risk than historical data
27. The Newman Model of Deception (Pennebaker et al.)
Key word categories for deception mapping:
(1) Self words, e.g. “I” and “me”: decrease when someone distances themselves from the content
(2) Exclusive words, e.g. “but” and “or”: decrease with fabricated content owing to the complexity of maintaining the
deception
(3) Negative emotion words, e.g. “hate”: increase owing to feelings of shame or guilt
(4) Motion verbs, e.g. “go” or “move”: increase as exclusive words decrease, to keep the story on track
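The four word categories above can be counted mechanically. The following is a minimal sketch, assuming tiny lexicons built only from the example words on the slide; the names CATEGORIES and category_rates are hypothetical, and a real analysis would need much larger, validated word lists.

```python
import re

# Illustrative lexicons containing only the slide's example words (assumption:
# a real deception study would use far larger, validated category lists).
CATEGORIES = {
    "self": {"i", "me"},
    "exclusive": {"but", "or"},
    "negative_emotion": {"hate"},
    "motion": {"go", "move"},
}


def category_rates(text):
    """Return, per category, the share of words in `text` that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    return {
        name: sum(word in vocab for word in words) / total
        for name, vocab in CATEGORIES.items()
    }


if __name__ == "__main__":
    print(category_rates("I hate to go but I move on"))
```

Comparing these rates between statements, rather than reading any single rate in isolation, is closer to how the slide describes the signals (self and exclusive words decreasing, negative emotion and motion words increasing).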
29. Variety of Data Types & Big Data Challenge
1. Astronomical
2. Documents
3. Earthquake
4. Email
5. Environmental sensors
6. Fingerprints
7. Health (personal) Images
8. Graph data (social network)
9. Location
10. Marine
11. Particle accelerator
12. Satellite
13. Scanned survey data
14. Sound
15. Text
16. Transactions
17. Video
Big Data consists of extensive datasets primarily in the characteristics of
volume, variety, velocity, and/or variability that require a scalable
architecture for efficient storage, manipulation, and analysis.
Computational portability is the movement of the computation to the location of the data.
31. Statistics, Data Mining or Data Science?
• Statistics
– precise deterministic causal analysis over precisely collected data
• Data Mining
– deterministic causal analysis over carefully sampled, re-purposed data
• Data Science
– trending/correlation analysis over existing data using the bulk of the population, i.e. big data
– extraction of actionable knowledge directly from data through a process of discovery,
hypothesis, and hypothesis testing.
Adapted from: NIST Big Data taxonomy draft report
(see http://bigdatawg.nist.gov/show_InputDoc.php)
36. Berkeley Data Analytics Stack (BDAS)
AMPCrowd: RESTful web service for sending tasks to human workers on crowd platforms.
Used by sampleclean.org (Data Cleaning With Algorithms, Machines, and People)
37. Data Science Innovation
Data science innovation is something an
organization has not done before or even
something nobody anywhere has done before. A
data science innovation focuses on discovering
and using new or untraditional data sources to
solve new problems.
Adapted from:
Franks, B. (2012) Taming the Big Data Tidal Wave, p. 255, John Wiley & Sons
38. http://tacocopter.com/
New Sources of Information (Big Data): Social Media + Internet of Things
Accounting Analytic Innovations
Gridded Data Sources
http://smap.jpl.nasa.gov/
39. The ANZ Heavy Traffic Index comprises
flows of vehicles weighing more than 3.5
tonnes (primarily trucks) on 11 selected
roads around NZ. It is contemporaneous
with GDP growth.
The ANZ Light Traffic Index is made up of
light or total traffic flows (primarily cars and
vans) on 10 selected roads around the
country. It gives a six-month lead on GDP
growth in normal circumstances (but
cannot predict sudden adverse events such
as the Global Financial Crisis).
http://www.anz.co.nz/about-us/economic-markets-research/truckometer/
ANZ TRUCKOMETER
41. Example BigQuery query against the GDELT Global Knowledge Graph (note that the wildcard on "TAX_WEAPONS_SUICIDE_" catches suicide vests,
suicide bombers, suicide bombings, suicide jackets, and so on):
SELECT DATE, DocumentIdentifier, SourceCommonName, V2Themes, V2Locations, V2Tone, SharingImage, TranslationInfo
FROM [gdeltv2.gkg]
WHERE (V2Themes LIKE '%TAX_TERROR_GROUP_ISLAMIC_STATE%'
       OR V2Themes LIKE '%TAX_TERROR_GROUP_ISIL%'
       OR V2Themes LIKE '%TAX_TERROR_GROUP_ISIS%'
       OR V2Themes LIKE '%TAX_TERROR_GROUP_DAASH%')
  AND (V2Themes LIKE '%TERROR%TERROR%'
       OR V2Themes LIKE '%SUICIDE_ATTACK%'
       OR V2Themes LIKE '%TAX_WEAPONS_SUICIDE_%')
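The filter logic of the query can be read as "mentions one of the group themes AND mentions an attack theme". The sketch below mirrors that logic in plain Python for a single V2Themes string. It is an approximation under stated assumptions: the function name matches_query and the sample strings are hypothetical, and underscores are treated literally here, whereas in SQL LIKE an underscore matches any single character.

```python
# Theme codes copied from the query above.
GROUP_THEMES = (
    "TAX_TERROR_GROUP_ISLAMIC_STATE",
    "TAX_TERROR_GROUP_ISIL",
    "TAX_TERROR_GROUP_ISIS",
    "TAX_TERROR_GROUP_DAASH",
)


def matches_query(v2themes: str) -> bool:
    """Approximate the WHERE clause for one V2Themes value (substring matching)."""
    mentions_group = any(theme in v2themes for theme in GROUP_THEMES)
    mentions_attack = (
        v2themes.count("TERROR") >= 2          # LIKE '%TERROR%TERROR%'
        or "SUICIDE_ATTACK" in v2themes        # LIKE '%SUICIDE_ATTACK%'
        or "TAX_WEAPONS_SUICIDE_" in v2themes  # LIKE '%TAX_WEAPONS_SUICIDE_%'
    )
    return mentions_group and mentions_attack


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    samples = [
        "TAX_TERROR_GROUP_ISIS,12;TERROR,40",
        "TAX_TERROR_GROUP_ISIL,5;TAX_WEAPONS_SUICIDE_VEST,90",
        "TAX_TERROR_GROUP_ISIS,12;ECON_STOCKMARKET,33",
        "TERROR,3;TERROR,80",
    ]
    for sample in samples:
        print(matches_query(sample), sample)
```

Because every group theme already contains the word TERROR once, the '%TERROR%TERROR%' pattern effectively asks for at least one further terror-related theme in the same record.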
The GDELT Project pushes the boundaries of “big data,” weighing in at over a quarter-billion rows with 59 fields for each record,
spanning the geography of the entire planet, and covering a time horizon of more than 35 years. The GDELT Project is the largest
open-access database on human society in existence. Its archives contain nearly 400M latitude/longitude geographic coordinates
spanning over 12,900 days, making it one of the largest open-access spatio-temporal datasets as well.
GDELT + BigQuery = Query The Planet
43. MEMEX - Human Trafficking Analytics
• Human traffickers coerce victims into sex work or low-cost labour, which is advertised online
• Adverts contain embedded data on the worker's name, contact info, physical characteristics,
services offered, location, price/pay rates, and other attributes. This is useful data, but it is not accessible via
SQL or R.
• DeepDive converts a “raw set of advertisements into a single clean structured database table”
• 30 million online advertisements for sex work were obtained
• Trafficking analytic signals
✴ Traffickers move victims from place to place to keep them isolated and easier to control.
Signal: individuals in the advertisement data who post multiple advertisements from different
physical locations.
✴ Non-trafficked sex workers exhibit economic rationality: they charge as much as possible for
services and avoid engaging in risky services. Signal: charging non-market rates or engaging in risky
services.
✴ Traffickers may have multiple victims simultaneously. Signal: if the contact information for multiple
workers across multiple advertisements contains consecutive phone numbers, it might
suggest one individual purchased several phones at one time.
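Two of the signals above (one contact posting from several locations, and runs of consecutive phone numbers) are simple to express once the adverts are in a structured table. The following is a minimal sketch on made-up records; the field names, the sample data and both function names are assumptions for illustration and are not DeepDive's or MEMEX's actual schema or method.

```python
from collections import defaultdict


def multi_location_phones(ads, min_locations=2):
    """Phones that appear in adverts posted from at least `min_locations` places."""
    locations = defaultdict(set)
    for ad in ads:
        locations[ad["phone"]].add(ad["location"])
    return {
        phone: sorted(places)
        for phone, places in locations.items()
        if len(places) >= min_locations
    }


def consecutive_phone_runs(ads, min_run=2):
    """Runs of consecutive phone numbers (assumes digits-only phone strings)."""
    numbers = sorted({int(ad["phone"]) for ad in ads})
    runs, current = [], []
    for number in numbers:
        if current and number == current[-1] + 1:
            current.append(number)
        else:
            if len(current) >= min_run:
                runs.append(current)
            current = [number]
    if len(current) >= min_run:
        runs.append(current)
    return runs


if __name__ == "__main__":
    # Hypothetical records; real adverts would first need extraction and cleaning.
    sample_ads = [
        {"phone": "5550100", "location": "Denver"},
        {"phone": "5550100", "location": "Austin"},
        {"phone": "5550101", "location": "Austin"},
        {"phone": "5550102", "location": "Austin"},
        {"phone": "5550177", "location": "Boston"},
    ]
    print(multi_location_phones(sample_ads))
    print(consecutive_phone_runs(sample_ads))
```

Either result is only a lead for an analyst to review, in line with the slide's wording that such patterns "might suggest" trafficking.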
Source: http://www.scientificamerican.com/slideshow/scientific-american-exclusive-darpa-memex-data-maps/
Also see, http://humantraffickingcenter.org/posts-by-htc-associates/memex-helps-find-human-trafficking-cases-online/
44. DeepDive Data Extraction and Dataset Generation
• URL where the advertisement was found
• Phone number of the person in the advertisement
• Name of the person in the advertisement
• Location where the person offers services
• Rates for services offered
46. 3. Black Box Insurance
• Big data transforms actuarial insurance from using probability methods to estimate premiums into dynamic risk management that uses real data to generate individually tailored premiums
• Estimate: for a 20 km work or home journey, a data point is acquired every minute and the journey captures 12 points per km. Assuming 1,000 km of driving per month, this generates 12,000 points per
month, or 144,000 points per car per annum. Hence 1,000 cars lead to 144 million points per annum.
• A telematics (black box) monitor helps assess driving behaviour and prices the policy on true driver-centric premiums by capturing:
– Number of journeys
– Distances travelled
– Types of roads
– Speed
– Time of travel
– Acceleration and braking
– Any accidents
– Location?
• Benefits low-mileage, smooth and safe drivers
• Privacy vs. saving money on insurance (Canada; http://bit.ly/Black_box)
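The data-volume estimate on this slide is a chain of multiplications, reproduced here as a small sketch. The constants are the slide's own figures; the function name points_per_year is hypothetical.

```python
POINTS_PER_KM = 12      # slide: a journey captures 12 points per km
KM_PER_MONTH = 1_000    # slide: assumed monthly driving distance
MONTHS_PER_YEAR = 12


def points_per_year(cars: int = 1) -> int:
    """Telematics data points generated per year by `cars` vehicles."""
    return POINTS_PER_KM * KM_PER_MONTH * MONTHS_PER_YEAR * cars


if __name__ == "__main__":
    print(points_per_year())       # 144000 points for one car
    print(points_per_year(1_000))  # 144000000 points for 1,000 cars
```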
47. Smart Sandbag System
smart-dove.com
The first 3 columns are the x, y, z axes of the gyroscope, followed by the x, y, z
axes of the accelerometer. These are the raw data from 40 repetitions of a
shoulder press exercise. Standard deviation and a moving
average algorithm are used to build the chart, and a Hidden Markov
Model is used to extract features and build a model of the exercise. All
models are put into the cloud for scoring trainee exercises.
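The smoothing step mentioned above can be sketched as follows. This is a minimal illustration of a moving average and a rolling standard deviation over one sensor axis, assuming made-up readings and hypothetical function names; it does not reproduce the vendor's pipeline, and the Hidden Markov Model stage is left out.

```python
from statistics import pstdev


def moving_average(values, window):
    """Mean of each consecutive `window`-sized slice of `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def rolling_std(values, window):
    """Population standard deviation of each consecutive `window`-sized slice."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        pstdev(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical gyroscope x-axis readings covering a few repetitions.
    gyro_x = [0.1, 0.9, 1.8, 0.8, 0.0, -0.9, -1.7, -0.8, 0.1, 1.0]
    print(moving_average(gyro_x, window=3))
    print(rolling_std(gyro_x, window=3))
```

A larger window gives a smoother curve but blurs the boundaries between repetitions, so the window size would need tuning against the sensor's sampling rate.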
50. • The data collected in a single day take nearly two million years to playback on an MP3 player
• Generates enough raw data to fill 15 million 64GB iPods every day
• The central computer has processing power of about one hundred million PCs
• Uses enough optical fiber linking up all the radio telescopes to wrap twice around the Earth
• The dishes when fully operational will produce 10 times the global internet traffic as of 2013
• The supercomputer will perform 10^18 operations per second (equivalent to the number of stars in three million Milky
Way galaxies) in order to process all the data produced.
• Sensitivity to detect an airport radar on a planet 50 light years away.
• Thousands of antennas with a combined collecting area of 1,000,000 square meters (1 sq km)
• Previous mapping of the Centaurus A galaxy took a team 12,000 hours of observations and several years; the SKA's estimated time is 5
minutes!
To the scientists involved, however, the SKA is no testbed, it’s a transformative instrument which,
according to Luijten, will lead to “fundamental discoveries of how life and planets and matter all came
into existence. As a scientist, this is a once in a lifetime opportunity.”
Sources: http://bit.ly/amazin-facts & http://bit.ly/astro-ska
Galileo
Square Kilometer Array Construction
(SKA1 - 2018-23; SKA2 - 2023-30)
Centaurus A
52.
The future is impossible to predict.
However, one thing is certain:
the company that can excite its customers'
dreams is out ahead in the race to business success.
Selling Dreams, Gian Luigi Longinotti
Editor's Notes
Combine traditional and social data to create a Social CRM
Build social fields into customer contact information
Track social media interactions with customers.
Use social media data to understand where customers spend their time.
Collect customer feedback from social channels.
Diana: most links (degree centrality), the most connected node; a connector or hub, measured by the number of nodes connected; high influence on spreading information or a virus.
Heather: best location; a powerful figure as a broker who determines what flows and what doesn't; a single point of failure; high betweenness = high influence; the node's position as gatekeeper lets it exploit structural holes (gaps in the network).
Fernando & Garth: shortest paths to all other nodes = closeness; the bigger the total path length, the less central the node.
Eigenvector: importance of a node in the network; Google's PageRank is a similar measure; being connected to the well connected is a measure of popularity and power.
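The first three measures in these notes can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the edge list is an assumption (a kite-shaped network that uses the names from the notes), since the slide's actual figure is not reproduced here, and eigenvector centrality is omitted. Betweenness uses Brandes' algorithm for unweighted, undirected graphs.

```python
from collections import deque

# Assumed edge list for illustration; not taken from the slides.
EDGES = [
    ("Andre", "Beverly"), ("Andre", "Carol"), ("Andre", "Diana"),
    ("Andre", "Fernando"), ("Beverly", "Diana"), ("Beverly", "Ed"),
    ("Beverly", "Garth"), ("Carol", "Diana"), ("Carol", "Fernando"),
    ("Diana", "Ed"), ("Diana", "Fernando"), ("Diana", "Garth"),
    ("Ed", "Garth"), ("Fernando", "Garth"), ("Fernando", "Heather"),
    ("Garth", "Heather"), ("Heather", "Ike"), ("Ike", "Jane"),
]


def build_graph(edges):
    """Undirected adjacency sets from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree(graph):
    """Number of direct links per node (degree centrality, unnormalised)."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def farness(graph):
    """Sum of shortest-path lengths to all other nodes; smaller = more central."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        result[source] = sum(dist.values())
    return result


def betweenness(graph):
    """Unnormalised betweenness via Brandes' algorithm (undirected, unweighted)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        order = []
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}  # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2 for node, value in score.items()}


if __name__ == "__main__":
    g = build_graph(EDGES)
    deg, far, btw = degree(g), farness(g), betweenness(g)
    print("highest degree:", max(deg, key=deg.get))
    print("lowest farness:", sorted(n for n in far if far[n] == min(far.values())))
    print("highest betweenness:", max(btw, key=btw.get))
```

On this assumed network the three measures single out different people, which is the point the notes make: the best-connected node, the broker and the nodes closest to everyone else need not be the same.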