Big data analytics presented at meetup big data for decision makers

The Science Behind Data Science
Presented at Big Data for Decision Makers
Ruhollah Farchtchi – Director of Big Data
December 5, 2013

Agenda
• Introductions
• Big Data Analytics Overview

• Use Cases – Examples of Data Products
• Building Blocks
• Data Mining

• Technologies
• Operational Models

© 2013 Unisys Corporation. All rights reserved.


So we’ve got a lot of data…
• What can we get out of it?
• How does it help with our business decision making?
• How is this complex landscape changing?
Multiple Types
• Tabular / Structured
• Documents
• Unstructured
• Pictures
• Emails
• Video

Multiple Sources
• Sensors, Networks, Cyber Infrastructure
• Web, Email, Social Media
• Enterprise Applications
• Mobile Devices, GPS, and many more!

Multiple Domains
• Defense
– Logistics / Workforce analytics
– Cyber and EW
– Intelligence Analysis
• Health
– Drug Discovery
– EHR
– Epidemic/pandemic prediction
• Finance
– Fraud Detection
– Identity Resolution
– Customer Support
• Other
– Supply/Demand Forecasting
– MTTB Prediction
– Context-based IR


And we’ve got a lot of tools…

Source: http://www.ongridventures.com/wp-content/uploads/2012/10/Big-Data-Landscape.jpg



Big Data and Data Analytics – A Unisys Point of View
• Unisys Point of View: Today’s big data is tomorrow’s normal data
– What remains is the need to extract insights and value out of the data

• Data Analytics is often the goal or end-product of what organizations
want to get out of their data (Big or otherwise)
– Focused around the capabilities of:
• Efficient Data Processing – get data in and processed in time to make use of it and
in a tenable manner
• Effective Information Management – ability to make the data accessible and to
manage the downstream data products as assets
• and Expressive Analytics – make sense of the data in a format that is easily
digested and incorporated into decision making i.e., if you need a PhD to interpret the
results, you still have work to do here

– With the aim to increase business value

• It’s about understanding the data and what you can get out of it
– “…40% of business leaders had no response when asked what types of
information would transform their industries over the next 10 years.” [1]
[1] Anne Lapkin, 2012. Hype Cycle for Big Data, 2012, Gartner.



Data Analytics is the culmination of Analytics and IT

[Figure: analytic complexity, from backward-looking (forensic) to forward-looking (predictive), plotted against data volume, from low to high volume, variety, and velocity, with a "Multi-TB Turning Point" marked on the data volume axis. Arrow labels: Extend, Scale-out.]

• Business Intelligence & Data Warehousing: STAR Schema, OLAP, RDBMS, SQL, ETL
• Data Analytics: Modeling and Forecasting, Pattern Recognition, Linear Programming, Global Optimization, Classification, Machine Learning, Simulation
• Big Data & NoSQL: Hadoop, Google BigTable, Map/Reduce, Splunk, Dynamo, Hive, MongoDB, Cassandra, EMC Greenplum, HBase
• Leverage for large-scale analytics and data mining
• Leverage for large-scale application development & information management

Data Analytics is at the intersection of high volume data processing and advanced analysis. The tools
and methodologies here represent a mix of both worlds and there is currently no ‘killer app’.


Challenges

• Misaligned IT, Analytics, and Business Strategies
• Ineffective Data Management Strategy
• Ineffective/inefficient storage and security platforms
• Inaccessible or siloed analytics (“Cylinders of Excellence”)
• Untrusted analytic products, or analytics that are not timely, accurate, or repeatable (untested)
• Inability to scale analytic generation (lack of training)



Analytic Environment That Supports Data
Processing, Enhances Information Management and
Improves Decision Making

[Figure labels: Raw Data, IT Focused, Analyst Focused, Data Products]

Building Analytic Environment
1. Work with business leaders and decision makers to understand and quantify the data value chain
2. View data as an enterprise asset
3. Innovate through creation of new data products and services
4. Retrain staff and/or acquire Data Scientist skills
5. Integrate teams across big data, data warehousing, and business analysis
6. Revise information management strategies to incorporate big data
7. Develop new ways of capturing information, e.g., mobile and streaming data
8. Identify and leverage previously unused internal and external data


Creation of data products is key to analytic reuse
• What are Data Products?
– Essentially, this is the output of a data science or data mining activity
– Non-trivial; more than a simple query
– Requires a platform for processing

• They can manifest themselves as many things
– Analytical "engines" running in a larger application (Amazon's
recommender engine is a great Data Product)
– Lists (e.g., Top 10 things I need to know today)
– Entire applications (e.g., customer baseball cards)

• However once they are defined, one thing is true for all
– It takes a combination of domain agnostic analytic techniques
together with domain specific knowledge to produce something
relevant and consumable that can be monetized or operationalized.


Use Case #1- Netflix Recommendation
• Netflix is about connecting people to the movies they love by leveraging their movie
recommendation system: Cinematch℠
• Cinematch℠ initially was a linear model that helped to predict users' choices
• The predictions are used to make personal movie recommendations based on a customer's unique
tastes
– Challenge: Can the recommendation engine be improved upon?
– Resolution: Set the accuracy improvement target (10%) and create a contest with a $1 million prize
• Crowdsourcing: Teams merged together in an internet-enabled approach to improve results
• Netflix provided a training dataset of 100+ million ratings that 480,000 users gave to 17K movies;
each record is a quadruplet of the form (user, movie, date of grade, grade)
– Goal is to predict grade
– Example of Supervised Machine Learning
– Submitted predictions are scored against the true grades in terms of Root Mean Squared Error (RMSE)
– RMSE is a frequently used measure of the difference between values predicted by a model and the values
observed (i.e., residuals)
– Similarity is determined by a distance measure such as Jaccard or Cosine distance

Source: Netflixprize.com and Mining Massive Datasets by Anand Rajaraman and Jeffrey Ullman
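The scoring and similarity measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Netflix Prize scoring code: the function names and the sample values in the usage note are assumptions made for the example.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root Mean Squared Error between predicted and observed grades."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(sum(squared_errors) / len(actual))


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets (e.g., sets of movies rated)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm
```

For example, `rmse([4, 2], [5, 4])` averages the squared errors 1 and 4 and takes the square root, while `jaccard_distance({1, 2, 3}, {2, 3, 4})` compares two users by the overlap of the movies they rated.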



Use Case #2- Google PageRank
• Google wanted to be able to measure and rank the importance of Web Pages.
– Challenge: Identify and rank the pages that a user would want to view in terms of their relevance
– Resolution: Develop an algorithm that leverages link analysis and implement it as part of Google’s infrastructure
• The PageRank algorithm considers a webpage to be important if many other webpages point to it.
The linking webpages that point to a given page aren’t treated equally
• The algorithm takes into account both the importance (PageRank) of the linking pages and the number
of outgoing links they have (similar to Social Network Analysis)
• Linking pages with higher PageRank are given more weight while pages with more outgoing links are
given less weight.
• Example of Unsupervised Machine Learning

Link Matrix =
0 0 1 0
1 0 0 0
1 1 0 1
0 0 0 0

[Figure: link graph over Page 1, Page 2, Page 3, and Page 4]

Source: The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani and Jerome Friedman
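The weighting described above can be sketched as a simple fixed-point iteration over the slide's link matrix. This is an illustrative sketch rather than Google's implementation. It assumes that entry (i, j) is 1 when page j links to page i, as in the cited textbook, and it assumes a damping factor d = 0.85, which is not stated on the slide.

```python
def pagerank(link, d=0.85, iterations=200):
    """Iterate p_i = (1 - d) + d * sum_j(link[i][j] * p_j / c_j).

    link[i][j] == 1 means page j links to page i (assumed orientation).
    c_j is the number of outgoing links of page j. d is an assumed
    damping factor.
    """
    n = len(link)
    # Column sums: how many pages each page links out to.
    out_links = [sum(link[i][j] for i in range(n)) for j in range(n)]
    if any(c == 0 for c in out_links):
        raise ValueError("this sketch assumes every page has an outgoing link")
    rank = [1.0] * n  # start with every page equally important
    for _ in range(iterations):
        rank = [
            (1 - d) + d * sum(link[i][j] * rank[j] / out_links[j] for j in range(n))
            for i in range(n)
        ]
    return rank


# Link matrix from the slide (rows and columns are Page 1 to Page 4).
LINK = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
ranks = pagerank(LINK)
```

Under these assumptions the ranks settle at roughly 1.49, 0.78, 1.58 and 0.15: Page 3 ranks highest because three pages point to it, and Page 4, which nothing links to, keeps only the baseline value 1 - d.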



Use Case #3- Walmart Data Driven Value Chain
• Walmart is the leading and largest retailer in the world.
• Walmart has been a catalyst for technology adoption amongst its suppliers, including
requiring partners to leverage RFID technology to track and coordinate inventories.
• They have a great cross section of data, from individual Social Security
information to geographic detail and product purchases
• They utilize econometric and marketing mix modeling (multiplicative, log-log, power
additive, adstocks, lags and powers) for a number of their key analyses
• Walmart mines their data to get their product mix correct under different and changing
environmental conditions.
– Challenge: Identify the correct product mix in order to protect the firm from too much or not enough inventory
– Resolution: Mine their multiple data sources for data products that will help tighten and improve operational
forecasts
• For impending hurricane warnings, Walmart found that:
– Pop Tarts sales increase (7 times their normal rate)
– The top selling premium item was beer
– Knowing this allows the firm to get the supply to the store ahead of time

[Figure: sales by item (Beer, Pop Tarts); model forms shown: GAs = a + b(TV) and GAs = a + b(TV)G]

Source: What Walmart Knows about Customer Habits: New York Times
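One of the marketing mix transforms listed above, adstock, is commonly modeled as a geometric carry-over of past spend. The sketch below shows that common textbook form only. The decay value and the sample spend series are assumptions for illustration, and nothing here reflects Walmart's actual models.

```python
def adstock(spend, decay=0.5):
    """Geometric adstock: each period carries over `decay` of the prior effect.

    A_t = x_t + decay * A_(t-1), with A starting at 0.
    `decay` is a hypothetical carry-over rate.
    """
    carried = 0.0
    transformed = []
    for x in spend:
        carried = x + decay * carried
        transformed.append(carried)
    return transformed
```

With `adstock([100, 0, 0], decay=0.5)`, a single burst of spend keeps contributing in later periods at a halving rate, which is the lagged effect such a model would then relate to sales.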



Use Case #4- Amazon Targeted Marketing
• Amazon is the world's largest online retailer, known for its e-commerce Web site, where it uses
input about a customer’s interests to generate a list of recommendations.
• Similar to Netflix, they use recommendation algorithms, but they do targeted marketing for items that a
customer would want to buy based on their previous purchase patterns
• The recommendation algorithms personalize the online store for each customer, and the store changes
radically based on the customer's interests
– Challenge(s): Analyze massive amounts of data, return results in real time, new customers have very little data,
and customer data is very volatile
– Resolution: Cluster modeling, search-based methods, and Item-to-Item Collaborative Filtering
• Cluster Modeling: Identify customers similar to the user by dividing the customer base into segments
and treat the task as a classification problem. Typically uses an unsupervised learning algorithm such
as K-Means or hierarchical clustering
• Search-Based Methods: Treat the recommendation problem as a search for related items. Given a
user's purchased and rated items, the algorithm constructs a search query to find other popular items
by the same author, artist or director, or with similar keywords
• Item-to-Item Collaborative Filtering: Customized algorithm that is able to scale to massive data sets
and produces high quality recommendations in real time. This algorithm matches each of the user's
purchased and rated items to similar items and then combines those similar items into a
recommendation list. Offline and online components increase performance

Source: Amazon.com Recommendations: Item-to-Item Collaborative Filtering. Greg Linden, Brent Smith and Jeremy York
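The item-to-item idea can be sketched as two steps: an offline pass that scores how similar each pair of items is based on the customers who bought both, and an online pass that combines the items similar to what a customer already owns. This is a simplified illustration and not Amazon's production algorithm. The cosine measure over buyer sets, the function names, and the sample purchases are assumptions for the example.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(purchases):
    """Offline step: cosine similarity between items over their buyer sets.

    purchases maps customer -> set of items (hypothetical sample format).
    """
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a == b:
                continue
            common = len(buyers[a] & buyers[b])
            if common:
                sims[a][b] = common / sqrt(len(buyers[a]) * len(buyers[b]))
    return sims


def recommend(purchases, sims, customer, top_n=3):
    """Online step: sum similarities of items related to what the customer owns."""
    owned = purchases[customer]
    scores = defaultdict(float)
    for item in owned:
        for other, score in sims.get(item, {}).items():
            if other not in owned:
                scores[other] += score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


PURCHASES = {  # hypothetical data
    "alice": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "carol": {"book_b", "book_c"},
    "dave": {"book_a"},
}
SIMS = item_similarities(PURCHASES)
```

Here `recommend(PURCHASES, SIMS, "dave")` ranks `book_b` ahead of `book_c`, because more of the customers who bought `book_a` also bought `book_b`. Precomputing `SIMS` offline is what keeps the online lookup cheap.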



Unisys Big Data Analytics
Building Blocks

Big Data Analytics Methodology

Modeling Components
• Decision Making & Forecasting
– Provide actionable intelligence into the future state
• Models
– Statistical model applied to input data that separates the portion of volume due to each of the variables or
factors. We use the term model because it is a simplification of reality.
• Data
– Internal Data
– Demographic Data
– 3rd Party Data


Data Mining - Motivations

• We’ve covered big data
– There’s a lot of it!

• New Modus Operandi
– Gather whatever data you can, whenever and wherever possible

• New Expectation
– Data gathered will have value; either for the purpose it was
collected or for a purpose not yet envisioned

• Challenge: There will never be enough analysts to sift
through it all


Data Mining Definitions
• Non-trivial extraction of implicit, previously unknown and potentially
useful information from data (normally large databases)
• Exploration & analysis, by automatic or semiautomatic means, of large
quantities of data in order to discover meaningful patterns.
• Part of the Knowledge Discovery in Databases Process.

Source: http://liris.cnrs.fr/abstract/abstract.html



Data Mining Tasks
Prediction Methods: Use some variables to predict unknown or future
values of other variables
• Classification
– For a given set of attributes, apply a model for the class (what you want to
predict) as a function of the attributes
• Regression
– Predict a value of a given continuous valued variable based on the values of
other variables, assuming a linear or nonlinear model of dependency
• Deviation Detection
– Detect significant deviations from normal behavior

Description Methods: Find human interpretable patterns that describe the
data.
• Clustering
– Given a set of data points, each having a set of attributes, and a similarity measure
among them, find clusters such that:
• Data points in one cluster are more similar to one another
• Data points in separate clusters are less similar to one another
• Association Rule Discovery
– Given a set of records, each of which contains some number of items from a
given collection:
• Produce dependency rules which will predict occurrence of an item based on occurrences of other
items.
• Sequential Pattern Discovery
– Given a set of sequences and support threshold, find the complete set of
frequent subsequences


Classification - Example

Tax Fraud

Training Data Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125k            No
2    No      Married         100k            No
3    No      Single          70k             No
4    Yes     Married         120k            No
5    No      Divorced        95k             Yes
6    No      Married         60k             No
7    Yes     Divorced        220k            No
8    No      Single          85k             Yes
9    No      Married         75k             No
10   No      Single          90k             Yes

Test Data Set
Refund  Marital Status  Taxable Income  Cheat
Yes     Single          125k            ?
No      Married         100k            ?
No      Single          70k             ?
Yes     Married         120k            ?

[Figure: Training Data Set → Learn Classifier → Model; the Model is then applied to the Test Data Set]


Classification – Your Turn

• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
– What kind of data will you try to get?
– Can you say something about the characteristics of the data?
– Estimate the size of the data.
– What kinds of pitfalls might you run into?



Fraud Detection

• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
– Use credit card transactions and the information on its
accountholder as attributes.
– When does a customer buy, what does he buy, how often he pays
on time, etc
– Label past transactions as fraud or fair transactions. This forms the
class attribute.
– Learn a model for the class of the transactions.
– Use this model to detect fraud by observing credit card transactions
on an account.



Clustering - Example

• Document Clustering:
– Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different
terms. Use it to cluster.
– Gain: Search tools can utilize the clusters to relate a new document
or search term to clustered documents.
• Clustering Points: 3204 Articles of
Los Angeles Times.
• Similarity Measure: How many
words are common in these
documents (after some word
filtering).
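The approach above (count the words two documents share after filtering, then group documents that are similar enough) can be sketched as follows. The stop-word list, the threshold of two shared terms, the single-link grouping rule, and the sample headlines are all assumptions for illustration. The slide does not say which clustering algorithm was used on the Los Angeles Times articles.

```python
STOPWORDS = {"the", "a", "of", "in", "to", "and", "on", "for"}  # assumed filter


def terms(text):
    """Lower-cased words of a document, minus the filtered words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def shared_terms(a, b):
    """Similarity measure from the slide: number of words in common."""
    return len(terms(a) & terms(b))


def cluster(docs, min_shared=2):
    """Group documents linked by at least `min_shared` common terms."""
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if shared_terms(docs[i], docs[j]) >= min_shared:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


DOCS = [  # hypothetical headlines
    "Lakers win the playoff game in overtime",
    "Playoff game tickets for Lakers sell out",
    "City council votes on housing budget",
    "Housing budget debate divides city council",
]
```

`cluster(DOCS)` returns lists of document indexes, one list per cluster. A new document or search term can then be related to a cluster by applying the same `shared_terms` measure against its members.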



Clustering - Illustration

Seems straightforward for a small number of dimensions…
what if there were more?


Clustering - Illustration

Source: http://salsahpc.indiana.edu/plotviz

We [human beings] have a limited ability to visualize and reason over a large
number of dimensions – clustering helps


Association Rules

• Classic Association Rule Example:
– If a customer buys diapers and milk, then he is very likely to buy
beer.

• Applications: Supermarket shelf management.
– Goal: To identify items that are bought together by sufficiently many
customers.
– Approach: Process the point-of-sale data collected with barcode
scanners to find dependencies among items.
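Two standard measures make "bought together by sufficiently many customers" and "very likely to buy" concrete: support (the fraction of baskets containing an itemset) and confidence (how often the consequent appears in baskets that contain the antecedent). The sketch below computes both. The baskets are made-up sample data, not figures from the presentation.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


BASKETS = [  # hypothetical point-of-sale records
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]
```

For the rule {diapers, milk} → {beer}, `support(BASKETS, {"diapers", "milk"})` covers three of the five baskets and two of those also contain beer, so `confidence(BASKETS, {"diapers", "milk"}, {"beer"})` is two thirds. A rule is typically kept only when both measures clear chosen thresholds.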



Hadoop -- So what is Hadoop, Really?

- Dilbert
It’s just a framework


Hadoop and MapReduce

• Hadoop is an open-source framework
(written in Java) to store and process gobs
of data across many commodity
computers
• Hadoop is designed to solve a different
problem: the fast, reliable analysis of
structured, unstructured, and complex
data.
• Hadoop and related software are designed
for the 3 V's: (1) Volume: commodity
hardware and open source software
lower cost and increase capacity;
(2) Velocity: data ingest speed aided by
append-only and schema-on-read design;
and (3) Variety: multiple tools to
structure, process, and access data
• Hadoop consists of two
elements: reliable, very large, low-cost
data storage using the Hadoop
Distributed File System (HDFS) and a
high-performance parallel/distributed
data processing framework called
MapReduce.
• HDFS is self-healing, high-bandwidth
clustered storage. MapReduce is
essentially fault-tolerant distributed
computing.
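The MapReduce programming model itself can be shown without a cluster. The sketch below imitates the three stages (map, shuffle, reduce) for a word count in plain Python on one machine. It does not use the Hadoop API, and the function names are chosen for this example. On a real cluster the framework distributes the same stages across nodes and reruns failed tasks.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

`word_count(["big data", "big insights"])` counts "big" twice and the other words once. Because each map call sees only one line and each reduce call sees only one key, the work can be split across many machines.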


The Hadoop Stack
• Hadoop runs on a collection/cluster of commodity, shared-nothing x86 servers.
• You can add or remove servers in a Hadoop cluster (sizes from 50, 100 to even
2000+ nodes) at will; the system detects and compensates for hardware or
system problems on any server.
• Hadoop is self-healing. It can deliver data, and can run large-scale, high-performance
batch processing jobs, in spite of system changes or failures.

The four primary areas where to use Hadoop:
1) To aggregate “data exhaust”: messages, posts, blog entries, photos, video
clips, maps, web graph…
2) To give data context: friends networks, social graphs, recommendations, collaborative filtering…
3) To keep apps running: web logs, system logs, system metrics, database query logs…
4) To deliver novel mashup services: mobile location data, clickstream data, SKUs, pricing…


Data Products Become the Drivers to Identify new
Insights, Cost Savings and Increase Efficiencies

[Figure labels: Your Customers, Feedback, Internal Data Sets, External Data Sets, Data Analytics Environment (Knowledge Repository, Analytics Engine), Populate]

• Decreased time to analytics
• Reuse of analytics tools
• Focus on analytic vs. IT integration
• More self-service
• Incorporation of external data
• Ability to scale to analytic needs
• Supports analytics lifecycle



Thank you

