1. The Science Behind Data Science
Presented at Big Data for Decision Makers
Ruhollah Farchtchi – Director of Big Data
December 5, 2013
2. Agenda
• Introductions
• Big Data Analytics Overview
• Use Cases – Examples of Data Products
• Building Blocks
• Data Mining
• Technologies
• Operational Models
© 2013 Unisys Corporation. All rights reserved.
3. So we’ve got a lot of data…
• What can we get out of it?
• How does it help with our business decision making?
• How is this complex landscape changing?
[Figure: the data landscape, organized by types, sources and domains]
Multiple Types: Tabular / Structured, Documents, Unstructured, Emails, Pictures, Video
Multiple Sources: Sensors, Networks, Cyber Infrastructure; Web, Email, Social Media; Enterprise Applications; Mobile Devices, GPS, and many more!
Multiple Domains: Defense, Health, Finance, Other
• Logistics / Workforce analytics
• Cyber and EW
• Intelligence Analysis
• Drug Discovery
• EHR
• Epidemic/pandemic prediction
• Fraud Detection
• Identity Resolution
• Customer Support
• Supply/Demand Forecasting
• MTTB Prediction
• Context-based IR
5. Big Data and Data Analytics – A Unisys Point of View
• Unisys Point of View: Today’s big data is tomorrow’s normal data
– What remains is the need to extract insights and value out of the data
• Data Analytics is often the goal or end-product of what organizations
want to get out of their data (Big or otherwise)
– Focused around the capabilities of:
• Efficient Data Processing – get data in and processed in time to make use of it and
in a tenable manner
• Effective Information Management – ability to make the data accessible and to
manage the downstream data products as assets
• and Expressive Analytics – make sense of the data in a format that is easily
digested and incorporated into decision making; i.e., if you need a PhD to interpret the
results, you still have work to do here
– With the aim to increase business value
• It’s about understanding the data and what you can get out of it
– "…40% of business leaders had no response when asked what types of
information would transform their industries over the next 10 years." [1]
1. Anne Lapkin, 2012. Hype Cycle for Big Data, 2012, Gartner.
6. Data Analytics is the culmination of Analytics and IT
[Figure: two-axis chart. Vertical axis: Complexity, from Backward-looking (Forensic) to Forward-looking (Predictive). Horizontal axis: Data Volume, from Low Volume, Variety, Velocity to High Volume, Variety, Velocity, with a Multi-TB Turning Point. Arrows labeled Extend and Scale-out point toward Data Analytics.]
• Business Intelligence & Data Warehousing: STAR Schema, OLAP, RDBMS, SQL, ETL
• Big Data & NoSQL: Hadoop, Google BigTable, Map/Reduce, Splunk, Dynamo, Hive, MongoDB, Cassandra, EMC Greenplum, HBase. Leverage for large-scale application development & information management
• Analytic techniques shown: Modeling and Forecasting, Pattern Recognition, Linear Programming, Global Optimization, Classification, Machine Learning, Simulation. Leverage for large-scale analytics and data mining
Data Analytics is at the intersection of high volume data processing and advanced analysis. The tools
and methodologies here represent a mix of both worlds and there is currently no ‘killer app’.
7. Challenges
• Misaligned IT, Analytics, and Business Strategies
• Ineffective Data Management Strategy
• Ineffective/inefficient storage and security platforms
• Inaccessible or siloed analytics ("Cylinders of Excellence")
• Untrusted analytic products or analytics that are not timely, accurate, or repeatable (untested)
• Inability to scale analytic generation (lack of training)
8. Analytic Environment That Supports Data Processing, Enhances Information Management and Improves Decision Making
Building Analytic Environment
1. Work with business leaders and decision makers to understand and quantify data value chain
2. View data as an enterprise asset
3. Innovate through creation of new data products and services
4. Retrain staff and/or acquire Data Scientist skills
5. Integrate teams across big data, data warehousing, and business analysis
6. Revise information management strategies to incorporate big data
7. Develop new ways of capturing information, e.g., mobile and streaming data
8. Identify and leverage previously unused internal and external data
[Figure labels: Raw Data, IT Focused, Analyst Focused, Data Products]
9. Creation of data products is key to analytic reuse
• What are Data Products?
– Essentially this is the output of a data science or data mining activity
– Non-trivial; more than a simple query
– Requires a platform for processing
• They can manifest themselves as many things
– Analytical "engines" running in a larger application (Amazon's
recommender engine is a great Data Product)
– Lists (e.g., Top 10 things I need to know today)
– Entire applications (e.g., customer baseball cards)
• However once they are defined, one thing is true for all
– It takes a combination of domain agnostic analytic techniques
together with domain specific knowledge to produce something
relevant and consumable that can be monetized or operationalized.
11. Use Case #1 - Netflix Recommendation
• Netflix is about connecting people to the movies they love by leveraging their movie recommendation system: Cinematch℠
• Cinematch℠ initially was a linear model that helped to predict the user's choices
• The predictions are used to make personal movie recommendations based on a customer's unique tastes
– Challenge: Can the recommendation engine be improved upon?
– Resolution: Set the accuracy improvement target (10%) and create a contest with a $1 million prize
• Crowdsourcing: Teams merged together for an internet-enabled approach to improve results
• Netflix provided a training dataset of 100+ million ratings that 480,000 users gave to 17K movies; each record is a quadruplet of the form (user, movie, date of grade, grade)
– Goal is to predict grade
– Example of Supervised Machine Learning
– Submitted predictions are scored against the true grades in terms of Root Mean Squared Error (RMSE)
– RMSE is a frequently used measure of the difference between values predicted by a model and the values observed (i.e., residuals)
– Similarity is determined by a distance measure such as Jaccard or Cosine distance
Source: Netflixprize.com and Mining of Massive Datasets by Anand Rajaraman and Jeffrey Ullman
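The scoring and similarity measures named on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not Netflix's or any contestant's code: the function names and the representation of a user's grades as a dictionary from movie to grade are assumptions made for the example.

```python
import math


def rmse(predicted, actual):
    """Root Mean Squared Error between predicted and true grades."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def jaccard_distance(movies_a, movies_b):
    """1 - |intersection| / |union| over the sets of movies two users graded."""
    a, b = set(movies_a), set(movies_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cosine_distance(grades_a, grades_b):
    """1 - cosine similarity; each argument maps movie -> grade (assumed layout)."""
    shared = grades_a.keys() & grades_b.keys()
    dot = sum(grades_a[m] * grades_b[m] for m in shared)
    norm_a = math.sqrt(sum(g * g for g in grades_a.values()))
    norm_b = math.sqrt(sum(g * g for g in grades_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no information: treat as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

A lower RMSE means the predicted grades sit closer to the true grades; the contest target was expressed as a relative improvement in this number.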
12. Use Case #2 - Google PageRank
• Google wanted to be able to measure and rank the importance of Web pages.
– Challenge: Identify and rank the pages that a user would want to view in terms of their relevance
– Resolution: Develop an algorithm that leverages link analysis and implement it as part of Google's infrastructure
• The PageRank algorithm considers a webpage to be important if many other webpages point to it. The linking webpages that point to a given page aren't treated equally
• The algorithm takes into account both the importance (PageRank) of the linking pages and the number of outgoing links they have – similar to Social Network Analysis
• Linking pages with higher PageRank are given more weight, while pages with more outgoing links are given less weight.
• Example of Unsupervised Machine Learning
Link Matrix =
0 0 1 0
1 0 0 0
1 1 0 1
0 0 0 0
[Figure: link graph of Page 1, Page 2, Page 3 and Page 4]
Source: The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani and Jerome Friedman
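As a rough illustration of the weighting described on this slide, the sketch below runs a simple fixed-point iteration on the 4-page link matrix shown here. Two things are assumptions, not statements from the slide: the matrix is read so that the entry in row i, column j is 1 when page j links to page i, and the damping factor is set to the conventional 0.85. The function name is hypothetical.

```python
def pagerank(link, damping=0.85, iterations=200):
    """Iterate p_i = (1 - d) + d * sum_j link[i][j] * p_j / c_j.

    link[i][j] == 1 is assumed to mean "page j links to page i";
    c_j is the number of outgoing links of page j.
    """
    n = len(link)
    out_links = [sum(link[i][j] for i in range(n)) for j in range(n)]
    ranks = [1.0] * n
    for _ in range(iterations):
        ranks = [
            (1 - damping)
            + damping * sum(
                ranks[j] / out_links[j] for j in range(n) if link[i][j]
            )
            for i in range(n)
        ]
    return ranks


# Link matrix from the slide (rows and columns are Page 1 to Page 4).
LINK_MATRIX = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]

page_ranks = pagerank(LINK_MATRIX)
```

Under these assumptions Page 3 ranks highest because three pages point to it, and Page 4, which nothing links to, receives only the baseline value of 1 - damping.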
13. Use Case #3 - Walmart Data Driven Value Chain
• Walmart is the leading and largest retailer in the world.
• Walmart has been a catalyst for technology adoption amongst its suppliers, including requiring partners to leverage RFID technology to track and coordinate inventories.
• They have a great cross section of data, from individual Social Security information to geographic detail and product purchases
• They utilize econometric and marketing mix modeling (multiplicative, log-log, power additive, adstocks, lags and powers) for a number of their key analyses
• Walmart mines their data to get their product mix correct under different and changing environmental conditions.
– Challenge: Identify the correct product mix in order to protect the firm from too much or not enough inventory
– Resolution: Mine their multiple data sources for data products that will help tighten and improve operational forecasts
• For impending hurricane warnings, Walmart found that:
– Pop Tarts increase in sales (7 times their normal rate)
– Identified that the top selling premium item was beer
– Allows the firm to get the supply to the store ahead of time
[Figure: Sales by Item (Beer, Pop Tarts), with example model forms GAs = a + b(TV) and GAs = a + b(TV)^G]
Source: What Walmart Knows about Customer Habits, New York Times
14. Use Case #4 - Amazon Targeted Marketing
• Amazon is the world's largest online retailer, known for its e-commerce Web site, where it uses input about a customer's interests to generate a list of recommendations.
• Similar to Netflix, they use recommendation algorithms, but they do targeted marketing for items that a customer would want to buy based on their previous purchase patterns
• The recommendation algorithms personalize the online store for each customer, and the store changes radically based on the customer's interests
– Challenge(s): Analyze massive amounts of data, return results in real time, new customers have very little data, and customer data is very volatile
– Resolution: Cluster modeling, search-based methods and item-to-item collaborative filtering
• Cluster Modeling: Identify customers similar to the user by dividing the customer base into segments and treating the task as a classification problem. Typically uses an unsupervised learning algorithm such as K-Means or hierarchical clustering
• Search-Based Methods: Treat the recommendation problem as a search for related items. Given a user's purchased and rated items, the algorithm constructs a search query to find other popular items by the same author, artist or director, or with similar keywords
• Item-to-Item Collaborative Filtering: Customized algorithm that is able to scale to massive data sets and produces high quality recommendations in real time. This algorithm matches each of the user's purchased and rated items to similar items and then combines those similar items into a recommendation list. Offline and online components increase performance
Source: Amazon.com Recommendations: Item-to-Item Collaborative Filtering. Greg Linden, Brent Smith and Jeremy York
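The offline/online split of item-to-item collaborative filtering can be sketched as follows. This is a toy illustration, not Amazon's implementation: the function names, the customer and item identifiers, and the choice of cosine similarity over binary purchase vectors are all assumptions for the example.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_similar_items(purchases):
    """Offline step: cosine similarity between items over binary purchase vectors."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    similar = defaultdict(dict)
    for a, b in combinations(sorted(buyers), 2):
        common = len(buyers[a] & buyers[b])
        if common:
            score = common / math.sqrt(len(buyers[a]) * len(buyers[b]))
            similar[a][b] = score
            similar[b][a] = score
    return similar


def recommend(similar, owned, top_n=3):
    """Online step: combine items similar to what the user already owns."""
    scores = defaultdict(float)
    for item in owned:
        for other, score in similar.get(item, {}).items():
            if other not in owned:
                scores[other] += score
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical purchase history: customer -> set of purchased items.
purchases = {
    "alice": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "carol": {"book_b", "book_c"},
    "dave": {"book_d"},
}
similar_items = build_similar_items(purchases)
recommendations = recommend(similar_items, {"book_a"})
```

The expensive pairwise comparison happens once in `build_similar_items`; the per-customer lookup in `recommend` only touches the items that customer already has, which is what makes real-time use plausible.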
16. Big Data Analytics Methodology
Modeling Components
• Decision Making & Forecasting: Provide actionable intelligence into the future state
• Models: A statistical model is applied to input data to separate the portion of volume due to each of the variables or factors. We use the term model because it is a simplification of reality.
• Data: Internal Data, Demographic Data, 3rd Party Data
18. Data Mining - Motivations
• We’ve covered big data
– There’s a lot of it!
• New Modus Operandi
– Gather whatever data you can, whenever and wherever possible
• New Expectation
– Data gathered will have value; either for the purpose it was
collected or for a purpose not yet envisioned
• Challenge: There will never be enough analysts to sift
through it all
19. Data Mining Definitions
• Non-trivial extraction of implicit, previously unknown and potentially
useful information from data (normally large databases)
• Exploration & analysis, by automatic or semiautomatic means, of large
quantities of data in order to discover meaningful patterns.
• Part of the Knowledge Discovery in Databases Process.
Source: http://liris.cnrs.fr/abstract/abstract.html
20. Data Mining Tasks
Prediction Methods: Use some variables to predict unknown or future values of other variables
• Classification
– For a given set of attributes apply a model for the class (what you want to predict) as a function of the attributes
• Regression
– Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
• Deviation Detection
– Detect significant deviations from normal behavior
Description Methods: Find human interpretable patterns that describe the data.
• Clustering
– Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
• Data points in one cluster are more similar to one another
• Data points in separate clusters are less similar to one another
• Association Rule Discovery
– Given a set of records each of which contain some number of items from a given collection:
• Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
• Sequential Pattern Discovery
– Given a set of sequences and support threshold, find the complete set of frequent subsequences
21. Classification - Example
Tax Fraud
Training Data Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125k            No
2    No      Married         100k            No
3    No      Single          70k             No
4    Yes     Married         120k            No
5    No      Divorced        95k             Yes
6    No      Married         60k             No
7    Yes     Divorced        220k            No
8    No      Single          85k             Yes
9    No      Married         75k             No
10   No      Single          90k             Yes
Test Data Set
Refund  Marital Status  Taxable Income  Cheat
Yes     Single          125k            ?
No      Married         100k            ?
No      Single          70k             ?
Yes     Married         120k            ?
[Figure: Training Data Set → Learn Classifier → Model; the Model is then applied to the Test Data Set]
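To make the train-then-apply flow concrete, the sketch below encodes the ten training records and applies a small decision tree to the four unlabeled test records. The tree is hand-written for illustration, not produced by a learning algorithm and not taken from the slide: its splits, including the 80k income threshold, are assumptions chosen only because they agree with every training row.

```python
# (tid, refund, marital_status, taxable_income_in_thousands, cheat)
TRAINING = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Test records as listed on the slide; the Cheat label is unknown.
TEST = [
    ("Yes", "Single", 125),
    ("No", "Married", 100),
    ("No", "Single", 70),
    ("Yes", "Married", 120),
]


def classify(refund, marital_status, income_k):
    """Hand-written tree (an assumption): refund, then marital status, then income."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"


def accuracy(rows):
    """Fraction of labeled rows the tree classifies correctly."""
    hits = sum(classify(r, m, inc) == label for _, r, m, inc, label in rows)
    return hits / len(rows)


training_accuracy = accuracy(TRAINING)
test_predictions = [classify(*row) for row in TEST]
```

Perfect accuracy on ten training rows says little about unseen data; in practice the model would be evaluated on labeled records that were held out from training.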
22. Classification – Your Turn
• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
– What kind of data will you try to get?
– Can you say something about the characteristics of the data?
– Estimate the size of the data.
– What kinds of pitfalls might you run into?
23. Fraud Detection
• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
– Use credit card transactions and the information on its
accountholder as attributes.
– When does a customer buy, what do they buy, how often do they pay
on time, etc.
– Label past transactions as fraud or fair transactions. This forms the
class attribute.
– Learn a model for the class of the transactions.
– Use this model to detect fraud by observing credit card transactions
on an account.
24. Clustering - Example
• Document Clustering:
– Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different
terms. Use it to cluster.
– Gain: Search tools can utilize the clusters to relate a new document
or search term to clustered documents.
• Clustering Points: 3204 Articles of
Los Angeles Times.
• Similarity Measure: How many
words are common in these
documents (after some word
filtering).
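A minimal sketch of the approach above, using the "how many words are common" similarity after word filtering. Everything specific here is an assumption for illustration: the stop-word list, the threshold of two shared words, the sample headlines, and the simple rule that a document joins any cluster containing a sufficiently similar document. Punctuation handling is omitted.

```python
# Assumed stop-word list for the "word filtering" step.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is", "on", "for"}


def terms(text):
    """Lowercased words of a document with stop words removed."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


def common_words(doc_a, doc_b):
    """Similarity measure: number of filtered words two documents share."""
    return len(terms(doc_a) & terms(doc_b))


def cluster(documents, min_common=2):
    """Group document indexes; a document joins every cluster it is linked to."""
    clusters = []
    for index, doc in enumerate(documents):
        linked = [
            c for c in clusters
            if any(common_words(doc, documents[j]) >= min_common for j in c)
        ]
        merged = [index]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return sorted(clusters)


# Hypothetical headlines standing in for newspaper articles.
articles = [
    "Stock market rallies on strong earnings",
    "Earnings lift the stock market",
    "Local team wins the championship game",
    "Championship game draws record crowd",
]
groups = cluster(articles)
```

Here the two finance headlines and the two sports headlines end up in separate groups. Real document clustering would usually weight terms (for example by frequency) rather than count raw overlaps.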
25. Clustering - Illustration
Seems straightforward for a small number of dimensions…
what if there were more?
26. Clustering - Illustration
Source: http://salsahpc.indiana.edu/plotviz
We [human beings] have a limited ability to visualize and reason over a large
number of dimensions – clustering helps
27. Association Rules
• Classic Association Rule Example:
– If a customer buys diapers and milk, then he is very likely to buy
beer.
• Applications: Supermarket shelf management.
– Goal: To identify items that are bought together by sufficiently many
customers.
– Approach: Process the point-of-sale data collected with barcode
scanners to find dependencies among items.
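"Bought together by sufficiently many customers" is usually quantified with support and confidence. The sketch below computes both for the rule {diapers, milk} → {beer}; the baskets are made-up data and the function names are assumptions for the example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in transactions if items <= basket) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical point-of-sale baskets.
baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]

rule_support = support(baskets, {"diapers", "milk", "beer"})
rule_confidence = confidence(baskets, {"diapers", "milk"}, {"beer"})
```

In this toy data the rule holds in 2 of 5 baskets (support 0.4), and 2 of the 3 baskets with diapers and milk also contain beer (confidence about 0.67). A rule miner would keep only rules above chosen support and confidence thresholds.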
29. Hadoop -- So what is Hadoop, Really?
- Dilbert
It’s just a framework
30. Hadoop and MapReduce
Hadoop is an open-source framework (written in Java) to store and process gobs of data across many commodity computers.
Hadoop is designed to solve a different problem: the fast, reliable analysis of structured, unstructured and complex data.
Hadoop and related software are designed for the 3 V's: (1) Volume: commodity hardware and open source software lower cost and increase capacity; (2) Velocity: data ingest speed aided by append-only and schema-on-read design; and (3) Variety: multiple tools to structure, process, and access data.
Hadoop consists of two elements: reliable, very large, low-cost data storage using the Hadoop Distributed File System (HDFS), and a high-performance parallel/distributed data processing framework called MapReduce.
HDFS is self-healing, high-bandwidth clustered storage. MapReduce is essentially fault-tolerant distributed computing.
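The MapReduce programming model can be illustrated with the classic word count, simulated here in a single Python process. This sketch does not use Hadoop or any Hadoop API; the function names are assumptions, and the point is only to show the map, shuffle and reduce phases that the framework would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on an in-memory list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big insights", "data products"])
```

Because each map call looks at one record and each reduce call looks at one key, the work can be split across many machines and a failed task can simply be rerun, which is the fault tolerance the slide refers to.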
31. The Hadoop Stack
• Hadoop runs on a collection/cluster of commodity, shared-nothing x86 servers.
• You can add or remove servers in a Hadoop cluster (sizes from 50, 100 to even 2000+ nodes) at will; the system detects and compensates for hardware or system problems on any server.
• Hadoop is self-healing. It can deliver data, and can run large-scale, high-performance processing batch jobs, in spite of system changes or failures.
The four primary areas where to use Hadoop:
1) To aggregate "data exhaust": messages, posts, blog entries, photos, video clips, maps, web graph…
2) To give data context: friends networks, social graphs, recommendations, collaborative filtering…
3) To keep apps running: web logs, system logs, system metrics, database query logs…
4) To deliver novel mashup services: mobile location data, clickstream data, SKUs, pricing…
33. Data Products Become the Drivers to Identify new Insights, Cost Savings and Increase Efficiencies
• Decreased time to analytics
• Reuse of analytics tools
• Focus on analytic vs. IT integration
• More self-service
• Incorporation of external data
• Ability to scale to analytic needs
• Supports analytics lifecycle
[Figure labels: Your Customers, Feedback, Internal Data Sets, External Data Sets, Populate, Data Analytics Environment (Knowledge Repository, Analytics Engine)]
Editor's Notes
Think about the access to top talent and how crowdsourcing is allowing organizations to put a bounty on solutions to hard problems.
Think about graph analysis and the work being done with SNA today.
Think about common patterns and pattern discovery. For example, in cargo, if a ship stops at certain ports, is the probability higher or lower that it may have picked up some illegal substances on the way?
Really great example of how different techniques can be combined and reused. This is really driving the need for an enterprise analytic data set, as you can start to chain analytics together to do many types of operations.
Think about automation of analysis tasks. If I've figured out how to bucket things, I may be able to triage the data better according to priorities in my organization.
Clustering is really BIG in the big data world right now due to its wide applicability.