The Science Behind Data Science
Presented at Big Data for Decision Makers
Ruhollah Farchtchi – Director of Big Data
December 5, 2013
Agenda
• Introductions
• Big Data Analytics Overview

• Use Cases – Examples of Data Products
• Building Blocks
• Data Mining

• Technologies
• Operational Models

© 2013 Unisys Corporation. All rights reserved.

So we’ve got a lot of data…
• What can we get out of it?
• How does it help with our business decision making?
• How is this complex landscape changing?
[Figure: the data landscape along three dimensions]
• Multiple Types
– Tabular / Structured
– Documents
– Unstructured
– Emails
– Video
– Pictures
• Multiple Sources
– Sensors, Networks, Cyber Infrastructure
– Web, Email, Social Media
– Enterprise Applications
– Mobile Devices, GPS, and many more!
• Multiple Domains
– Defense: Logistics / Workforce analytics; Cyber and EW; Intelligence Analysis
– Health: Drug Discovery; EHR; Epidemic/pandemic prediction
– Finance: Fraud Detection; Identity Resolution; Customer Support
– Other: Supply/Demand Forecasting; MTTB Prediction; Context-based IR
And we’ve got a lot of tools…
[Figure: Big Data Landscape]
Source: http://www.ongridventures.com/wp-content/uploads/2012/10/Big-Data-Landscape.jpg

Big Data and Data Analytics – A Unisys Point of View
• Unisys Point of View: Today’s big data is tomorrow’s normal data
– What remains is the need to extract insights and value out of the data

• Data Analytics is often the goal or end-product of what organizations
want to get out of their data (Big or otherwise)
– Focused around the capabilities of:
• Efficient Data Processing – get data in and processed in time to make use of it and
in a tenable manner
• Effective Information Management – ability to make the data accessible and to
manage the downstream data products as assets
• and Expressive Analytics – make sense of the data in a format that is easily
digested and incorporated into decision making i.e., if you need a PhD to interpret the
results, you still have work to do here

– With the aim to increase business value

• It’s about understanding the data and what you can get out of it
– “…40% of business leaders had no response when asked what types of
information would transform their industries over the next 10 years.” [1]
[1] Anne Lapkin, 2012. Hype Cycle for Big Data, 2012, Gartner.

Data Analytics is the culmination of Analytics and IT
[Figure: analytic complexity plotted against data volume]
• Vertical axis, Complexity: from Backward-looking (Forensic) to Forward-looking (Predictive)
• Horizontal axis, Data Volume: from Low Volume, Variety, Velocity to High Volume, Variety, Velocity, with a Multi-TB Turning Point
• Business Intelligence & Data Warehousing: STAR Schema, OLAP, RDBMS, SQL, ETL
– Leverage for large-scale analytics and data mining
• Big Data & NoSQL: Hadoop, Google BigTable, Map/Reduce, Splunk, Dynamo, Hive, MongoDB, Cassandra, EMC Greenplum, HBase
– Leverage for large-scale application development & information management
• Data Analytics: Modeling and Forecasting, Pattern Recognition, Linear Programming, Global Optimization, Classification, Machine Learning, Simulation
• Arrow labels: Extend, Scale-out
Data Analytics is at the intersection of high volume data processing and advanced analysis. The tools
and methodologies here represent a mix of both worlds and there is currently no ‘killer app’.

Challenges

Misaligned IT, Analytics, and
Business Strategies

Ineffective Data Management
Strategy

Ineffective/inefficient storage and
security platforms

Inaccessible or siloed analytics
(“Cylinders of Excellence”)

Untrusted analytic products or
analytics that are not
timely, accurate, or repeatable
(untested)

Inability to scale analytic
generation (lack of training)

Analytic Environment That Supports Data Processing, Enhances Information Management and Improves Decision Making
Building Analytic Environment
1. Work with business leaders and decision makers to understand and quantify data value chain
2. View data as an enterprise asset
3. Innovate through creation of new data products and services
4. Retrain staff and/or acquire Data Scientist skills
5. Integrate teams across big data, data warehousing, and business analysis
6. Revise information management strategies to incorporate big data
7. Develop new ways of capturing information, e.g., mobile and streaming data
8. Identify and leverage previously unused internal and external data
[Figure labels: Data Products, Analyst Focused, IT Focused, Raw Data]

Creation of data products is key to analytic reuse
• What are Data Products?
– Essentially, this is the output of a data science or data mining activity
– Non-trivial; more than a simple query
– Requires a platform for processing

• They can manifest themselves as many things
– Analytical "engines" running in a larger application (Amazon's
recommender engine is a great Data Product)
– Lists (e.g., Top 10 things I need to know today)
– Entire applications (e.g., customer baseball cards)

• However once they are defined, one thing is true for all
– It takes a combination of domain agnostic analytic techniques
together with domain specific knowledge to produce something
relevant and consumable that can be monetized or operationalized.

Examples of Data Products
Use Case #1- Netflix Recommendation
• Netflix is about connecting people to the movies they love by leveraging its movie recommendation system, Cinematch℠
• Cinematch℠ initially was a linear model that helped to predict users’ choices
• The predictions are used to make personal movie recommendations based on a customer’s unique tastes
– Challenge: Can the recommendation engine be improved upon?
– Resolution: Set a target accuracy improvement (10%) and create a contest with a $1 million prize
• Crowdsourcing: Teams merged, taking an internet-enabled approach, and improved results
• Netflix provided a training dataset of 100+ million ratings that 480,000 users gave to 17K movies; each record is a quadruplet of the form (user, movie, date of grade, grade)
– Goal is to predict grade
– Example of Supervised Machine Learning
– Submitted predictions are scored against the true grades in terms of Root Mean Squared Error (RMSE)
– RMSE is a frequently used measure of the difference between values predicted by a model and the values observed (i.e., residuals)
– Similarity is determined by a distance measure such as Jaccard or Cosine distance

Source: Netflixprize.com and Mining of Massive Datasets by Anand Rajaraman and Jeffrey Ullman
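The scoring and similarity measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's scoring code; the function names and the sample grades are made up for the example.

```python
import math

def rmse(predicted, actual):
    """Root Mean Squared Error between predicted and observed grades."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets, e.g. the movies two users rated."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(u, v):
    """1 - cos(angle) between two non-zero rating vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

# Illustrative grades only (not taken from the Netflix dataset).
print(rmse([3, 4, 5], [3, 5, 3]))
print(jaccard_distance({"A", "B", "C"}, {"B", "C", "D"}))
print(cosine_distance([1, 0], [0, 1]))
```

A lower RMSE means the predicted grades sit closer to the true grades, which is why the contest target was stated as a percentage improvement in this number.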

Use Case #2- Google PageRank
• Google wanted to be able to measure and rank the importance of Web pages.
– Challenge: Identify and rank the pages that users would want to view in terms of their relevance
– Resolution: Develop an algorithm that leverages link analysis and implement it as part of Google’s infrastructure
• The PageRank algorithm considers a webpage to be important if many other webpages point to it. The linking webpages that point to a given page aren’t treated equally
• The algorithm takes into account both the importance (PageRank) of the linking pages and the number of outgoing links they have, similar to Social Network Analysis
• Linking pages with higher PageRank are given more weight while pages with more outgoing links are given less weight.
• Example of Unsupervised Machine Learning

Link Matrix = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}

[Figure: link graph over Page 1, Page 2, Page 3 and Page 4]

Source: The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani and Jerome Friedman
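The link matrix on this slide can be run through a small PageRank iteration. The sketch below assumes the convention used in the cited textbook, where entry (i, j) is 1 if page j links to page i, and a damping factor of 0.85; neither is stated on the slide, and the function name is illustrative.

```python
# Entry LINK[i][j] is 1 if page j+1 links to page i+1 (assumed convention).
LINK = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]

def pagerank(link, d=0.85, iterations=100):
    """Iterate p_i = (1 - d) + d * sum_j link[i][j] * p_j / c_j,
    where c_j is the number of outgoing links of page j."""
    n = len(link)
    out_links = [sum(link[i][j] for i in range(n)) for j in range(n)]
    rank = [1.0] * n
    for _ in range(iterations):
        rank = [
            (1 - d) + d * sum(
                rank[j] / out_links[j] for j in range(n) if link[i][j]
            )
            for i in range(n)
        ]
    return rank

ranks = pagerank(LINK)
for page, score in enumerate(ranks, start=1):
    print(f"Page {page}: {score:.2f}")
```

Under these assumptions Page 3 scores highest because three pages point to it, while Page 4, which nothing links to, keeps only the baseline score of 1 - d.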

Use Case #3- Walmart Data Driven Value Chain
• Walmart is the leading and largest retailer in the world.
• Walmart has been a catalyst for technology adoption amongst its suppliers, including requiring partners to leverage RFID technology to track and coordinate inventories.
• They have a great cross section of data from individual Social Security information, geographic detail and product purchases
• They utilize econometric and marketing mix modeling (multiplicative, log-log, power additive, adstocks, lags and powers) for a number of their key analyses
• Walmart mines their data to get their product mix correct under different and changing environment conditions.
– Challenge: Identify the correct product mix in order to protect the firm from too much or not enough inventory
– Resolution: Mine their multiple data sources for data products that will help tighten and improve operational forecasts
• For impending hurricane warnings, Walmart found that:
– Pop Tarts increase in sales (7 times their normal rate)
– Identified that the top selling premium item was beer
– Allows the firm to get the supply to the store ahead of time

[Chart: Sales by Item (Beer, Pop Tarts); annotation: GAs = a + b(TV)]

Source: What Walmart Knows about Customer Habits: New York Times
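Two of the marketing mix techniques listed above, adstock and a log-log model, can be sketched as follows. This is only an illustration of the general techniques, not Walmart's models; the carryover rate, the function names, and the numbers are all invented for the example.

```python
import math

def adstock(spend, carryover):
    """Geometric adstock: each period keeps `carryover` of the prior effect."""
    effect, out = 0.0, []
    for x in spend:
        effect = x + carryover * effect
        out.append(effect)
    return out

def fit_log_log(x, y):
    """Least-squares fit of log(y) = log(a) + b * log(x); returns (a, b).
    In a log-log model, b reads as an elasticity."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) / sum(
        (u - mean_x) ** 2 for u in lx
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Made-up numbers: one burst of spend decaying over later periods.
print(adstock([100, 0, 0], 0.5))
# Made-up sales that follow sales = 2 * spend ** 0.5 exactly.
print(fit_log_log([1, 4, 9, 16], [2, 4, 6, 8]))
```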

Use Case #4- Amazon Targeted Marketing
• Amazon is the world’s largest online retailer, known for its e-commerce Web site, where it uses input about a customer’s interests to generate a list of recommendations.
• Similar to Netflix, they use recommendation algorithms, but they do targeted marketing for items that a customer would want to buy based on previous purchase patterns
• The recommendation algorithms personalize the online store for each customer, and the store changes radically based on the customer’s interests
– Challenge(s): Analyze massive amounts of data, return results in real time, new customers have very little data, and customer data is very volatile
– Resolution: Cluster modeling, search-based methods and item-to-item collaborative filtering
• Cluster Modeling: Identify customers similar to the user by dividing the customer base into segments and treating the task as a classification problem. Typically uses an unsupervised learning algorithm such as K-Means or hierarchical clustering
• Search-Based Methods: Treat the recommendation problem as a search for related items. Given a user’s purchases and rated items, the algorithm constructs a search query to find other popular items by the same author, artist or director, or with similar keywords
• Item-to-Item Collaborative Filtering: Customized algorithm that is able to scale to massive data sets and produces high quality recommendations in real time. This algorithm matches each of the user’s purchased and rated items to similar items and then combines those similar items into a recommendation list. Offline and online components are used to increase performance
Source: Amazon.com Recommendations: Item-to-Item Collaborative Filtering. Greg Linden, Brent Smith and Jeremy York
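The item-to-item idea can be sketched with a tiny, made-up purchase history. The code below is a simplified illustration, not Amazon's implementation: it assumes cosine similarity over the sets of customers who bought each item, and all customer and item names are hypothetical. The similarity table plays the role of the offline component, and the lookup in `recommend` the online one.

```python
import math
from collections import defaultdict

# Hypothetical purchase history: customer -> set of purchased items.
purchases = {
    "ann": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "cat": {"book_b", "book_c"},
    "dan": {"book_a", "dvd_x"},
}

def item_similarities(purchases):
    """Offline step: cosine similarity between items based on shared buyers."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            common = len(buyers[a] & buyers[b])
            if a != b and common:
                sims[a][b] = common / math.sqrt(len(buyers[a]) * len(buyers[b]))
    return sims

def recommend(customer, purchases, sims, top_n=3):
    """Online step: sum similarities of unowned items to the items owned."""
    owned = purchases[customer]
    scores = defaultdict(float)
    for item in owned:
        for other, s in sims.get(item, {}).items():
            if other not in owned:
                scores[other] += s
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

sims = item_similarities(purchases)
print(recommend("ann", purchases, sims))
print(recommend("dan", purchases, sims))
```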

Unisys Big Data Analytics
Building Blocks
Big Data Analytics Methodology

Modeling Components
• Decision Making & Forecasting
– Provide actionable intelligence into the future state
• Models
– Statistical model applied to input data that separates the portion of volume due to each of the variables or factors. We use the term model, because it is a simplification of reality.
• Data
– Internal Data
– Demographic Data
– 3rd Party Data

Data Mining
Data Mining - Motivations

• We’ve covered big data
– There’s a lot of it!

• New Modus Operandi
– Gather whatever data you can, whenever and wherever possible

• New Expectation
– Data gathered will have value; either for the purpose it was
collected or for a purpose not yet envisioned

• Challenge: There will never be enough analysts to sift
through it all

Data Mining Definitions
• Non-trivial extraction of implicit, previously unknown and potentially
useful information from data (normally large databases)
• Exploration & analysis, by automatic or semiautomatic means, of large
quantities of data in order to discover meaningful patterns.
• Part of the Knowledge Discovery in Databases Process.

Source: http://liris.cnrs.fr/abstract/abstract.html

Data Mining Tasks
Prediction Methods: Use some variables to predict unknown or future values of other variables
• Classification
– For a given set of attributes apply a model for the class (what you want to predict) as a function of the attributes
• Regression
– Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
• Deviation Detection
– Detect significant deviations from normal behavior

Description Methods: Find human interpretable patterns that describe the data.
• Clustering
– Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
• Data points in one cluster are more similar to one another
• Data points in separate clusters are less similar to one another
• Association Rule Discovery
– Given a set of records each of which contain some number of items from a given collection:
• Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
• Sequential Pattern Discovery
– Given a set of sequences and support threshold, find the complete set of frequent subsequences
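Of the tasks above, deviation detection is the quickest to illustrate. The sketch below assumes a simple z-score rule (flag values more than a chosen number of standard deviations from the mean), which is one common approach rather than anything the slide prescribes; the readings and the threshold are invented.

```python
import statistics

def deviations(values, threshold=2.0):
    """Return indexes of values whose z-score magnitude exceeds `threshold`."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Made-up readings: six ordinary values and one outlier.
readings = [10, 11, 9, 10, 12, 10, 50]
print(deviations(readings))
```

A single large outlier inflates the standard deviation it is measured against, so robust variants (for example, based on the median) are often preferred on real data.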

Classification - Example

Tax Fraud

Training Data Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125k            No
2    No      Married         100k            No
3    No      Single          70k             No
4    Yes     Married         120k            No
5    No      Divorced        95k             Yes
6    No      Married         60k             No
7    Yes     Divorced        220k            No
8    No      Single          85k             Yes
9    No      Married         75k             No
10   No      Single          90k             Yes

Test Data Set
Refund  Marital Status  Taxable Income  Cheat
Yes     Single          125k            ?
No      Married         100k            ?
No      Single          70k             ?
Yes     Married         120k            ?

[Figure: the Training Data Set is used to learn a classifier; the resulting Model is applied to the Test Data Set]
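The learn-then-apply flow on this slide can be sketched with a small decision tree grown on the ten training records. The slide does not name a learning algorithm; a binary tree split on Gini impurity is assumed here purely for illustration, and all function names are hypothetical.

```python
from collections import Counter

# Training records from the slide: (refund, marital status, income in $k, cheat).
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Test records from the slide; the class is unknown.
TEST = [("Yes", "Single", 125), ("No", "Married", 100),
        ("No", "Single", 70), ("Yes", "Married", 120)]

def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    counts = Counter(r[-1] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def passes(row, col, value):
    """Equality test for text attributes, '<=' test for numeric ones."""
    return row[col] == value if isinstance(value, str) else row[col] <= value

def candidate_tests(rows):
    """Yield (column, value) tests; numeric columns use midpoints."""
    for col in range(len(rows[0]) - 1):
        values = sorted({r[col] for r in rows})
        if isinstance(values[0], str):
            yield from ((col, v) for v in values)
        else:
            yield from ((col, (lo + hi) / 2) for lo, hi in zip(values, values[1:]))

def build(rows):
    """Grow a tree until every leaf is pure; a leaf is just a class label."""
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1:
        return rows[0][-1]
    best = None
    for col, value in candidate_tests(rows):
        left = [r for r in rows if passes(r, col, value)]
        right = [r for r in rows if not passes(r, col, value)]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, col, value, left, right)
    if best is None:  # identical attributes, mixed labels: majority vote
        return labels.most_common(1)[0][0]
    _, col, value, left, right = best
    return (col, value, build(left), build(right))

def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        col, value, if_true, if_false = tree
        tree = if_true if passes(row, col, value) else if_false
    return tree

tree = build(TRAIN)
print([predict(tree, row) for row in TEST])
```

Because the four test rows share their attribute values with training rows 1 to 4, a tree that fits the training set exactly labels all of them "No"; a real evaluation would use held-out records.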

Classification – Your Turn

• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
– What kind of data will you try to get?
– Can you say something about the characteristics of the data?
– Estimate the size of the data.
– What kind of pitfalls might you run into?

Fraud Detection

• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
– Use credit card transactions and the information on its
accountholder as attributes.
– When does a customer buy, what does he buy, how often does he pay
on time, etc.
– Label past transactions as fraud or fair transactions. This forms the
class attribute.
– Learn a model for the class of the transactions.
– Use this model to detect fraud by observing credit card transactions
on an account.

Clustering - Example

• Document Clustering:
– Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different
terms. Use it to cluster.
– Gain: Search tools can utilize the clusters to relate a new document
or search term to clustered documents.
• Clustering Points: 3204 Articles of
Los Angeles Times.
• Similarity Measure: How many
words are common in these
documents (after some word
filtering).
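A toy version of this approach can be sketched as follows. It assumes the similarity measure described above (number of shared words after filtering) and groups documents by a simple threshold-and-merge rule; the grouping rule, the stop-word list, and the sample headlines are all assumptions made for the example, not the method used on the Los Angeles Times articles.

```python
STOPWORDS = {"a", "as", "in", "the"}

def terms(text):
    """Lower-cased words of a document with stop words filtered out."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def cluster_documents(docs, min_shared=2):
    """Merge documents that share at least `min_shared` terms.
    Returns clusters as sorted lists of document indexes."""
    term_sets = [terms(d) for d in docs]
    cluster_of = list(range(len(docs)))  # each document starts alone
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if len(term_sets[i] & term_sets[j]) >= min_shared:
                old, new = cluster_of[j], cluster_of[i]
                cluster_of = [new if c == old else c for c in cluster_of]
    groups = {}
    for index, cluster in enumerate(cluster_of):
        groups.setdefault(cluster, []).append(index)
    return sorted(groups.values())

# Made-up headlines for illustration.
docs = [
    "stocks fall as markets slide",
    "markets rally as stocks rise",
    "lakers win the game",
    "lakers lose the game in overtime",
]
print(cluster_documents(docs))
```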

Clustering - Illustration

Seems straightforward for a small number of dimensions…
what if there were more?

Clustering - Illustration

Source: http://salsahpc.indiana.edu/plotviz

We [human beings] have a limited ability to visualize and reason over a large
number of dimensions – clustering helps
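To show that the same procedure works whatever the number of dimensions, here is a minimal K-Means sketch on four-dimensional points. K-Means is only one possible clustering algorithm (it is the one named earlier in the deck); the points and the starting centroids are invented for the example.

```python
import math

def kmeans(points, centroids, max_iterations=100):
    """Plain K-Means: assign each point to its nearest centroid, then move
    each centroid to the mean of its points, until nothing changes."""
    labels = []
    for _ in range(max_iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Made-up points in four dimensions, hard to plot but easy to cluster.
points = [
    (1, 1, 1, 1), (1.5, 2, 1, 1), (1, 0.5, 1.5, 1),
    (8, 8, 8, 8), (9, 9.5, 8, 9), (8.5, 9, 9, 8),
]
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)
```

`math.dist` accepts points of any dimension, so the code does not change as attributes are added.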

Association Rules

• Classic Association Rule Example:
– If a customer buys diapers and milk, then he is very likely to buy
beer.

• Applications: Supermarket shelf management.
– Goal: To identify items that are bought together by sufficiently many
customers.
– Approach: Process the point-of-sale data collected with barcode
scanners to find dependencies among items.
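How strongly a rule such as {diapers, milk} → {beer} holds is usually measured with support and confidence. The sketch below computes both on a handful of invented baskets; the transactions are illustrative, not real point-of-sale data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_antecedent if n_antecedent else 0.0

# Made-up baskets for illustration.
baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]
print(support(baskets, {"diapers", "milk", "beer"}))
print(confidence(baskets, {"diapers", "milk"}, {"beer"}))
```

"Bought together by sufficiently many customers" corresponds to a minimum support threshold; confidence then ranks the rules that pass it.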

Technologies
Hadoop -- So what is Hadoop, Really?

- Dilbert
It’s just a framework

Hadoop and MapReduce

• Hadoop is an open-source framework (written in Java) to store and process gobs of data across many commodity computers
• Hadoop is designed to solve a different problem: the fast, reliable analysis of structured, unstructured, and complex data.
• Hadoop and related software are designed for the 3 V’s: (1) Volume: commodity hardware and open source software lower cost and increase capacity; (2) Velocity: data ingest speed aided by append-only and schema-on-read design; and (3) Variety: multiple tools to structure, process, and access data
• Hadoop consists of two elements: reliable, very large, low-cost data storage using the Hadoop Distributed File System (HDFS), and a high-performance parallel/distributed data processing framework called MapReduce.
• HDFS is self-healing, high-bandwidth clustered storage. MapReduce is essentially fault-tolerant distributed computing.
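The MapReduce programming model can be illustrated without a cluster. The sketch below simulates the map, shuffle, and reduce phases of a word count in a single Python process; it does not use Hadoop's Java API, and the function names are illustrative. On a real cluster, the framework runs the map and reduce functions in parallel across nodes and handles the shuffle and any failures.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())

print(word_count(["big data", "Big tools"]))
```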

The Hadoop Stack
• Hadoop runs on a collection/cluster of commodity, shared-nothing x86 servers.
• You can add or remove servers in a Hadoop cluster (sizes from 50, 100 to even 2000+ nodes) at will; the system detects and compensates for hardware or system problems on any server.
• Hadoop is self-healing. It can deliver data, and can run large-scale, high-performance batch processing jobs, in spite of system changes or failures.

The four primary areas where to use Hadoop:
1) To aggregate “data exhaust”: messages, posts, blog entries, photos, video clips, maps, web graph…
2) To give data context: friends networks, social graphs, recommendations, collaborative filtering…
3) To keep apps running: web logs, system logs, system metrics, database query logs…
4) To deliver novel mashup services: mobile location data, clickstream data, SKUs, pricing…

Operational Models
Data Products Become the Drivers to Identify new
Insights, Cost Savings and Increase Efficiencies

[Figure: operational model. Diagram labels: Your Customers, Feedback, Internal Data Sets, External Data Sets, Populate, Data Analytics Environment (Knowledge Repository, Analytics Engine)]

• Decreased time to analytics
• Reuse of analytics tools
• Focus on analytic vs. IT integration
• More self-service
• Incorporation of external data
• Ability to scale to analytic needs
• Supports analytics lifecycle

Thank you


Mais conteúdo relacionado

Mais procurados

Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
DataStax
 

Mais procurados (20)

Data Quality Strategies
Data Quality StrategiesData Quality Strategies
Data Quality Strategies
 
Training Week: Introduction to Neo4j 2022
Training Week: Introduction to Neo4j 2022Training Week: Introduction to Neo4j 2022
Training Week: Introduction to Neo4j 2022
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Kafka Summit NYC 2017 - The Real-time Event Driven Bank: A Kafka Story
Kafka Summit NYC 2017 - The Real-time Event Driven Bank: A Kafka Story Kafka Summit NYC 2017 - The Real-time Event Driven Bank: A Kafka Story
Kafka Summit NYC 2017 - The Real-time Event Driven Bank: A Kafka Story
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project Lifecycle
 
Delivering rapid-fire Analytics with Snowflake and Tableau
Delivering rapid-fire Analytics with Snowflake and TableauDelivering rapid-fire Analytics with Snowflake and Tableau
Delivering rapid-fire Analytics with Snowflake and Tableau
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
 
DATA & ANALYTICS
DATA & ANALYTICSDATA & ANALYTICS
DATA & ANALYTICS
 
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
 
Performance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla ClusterPerformance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla Cluster
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
(The life of a) Data engineer
(The life of a) Data engineer(The life of a) Data engineer
(The life of a) Data engineer
 
Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best Pracices
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Tableau PPT
Tableau PPTTableau PPT
Tableau PPT
 
Service Workers and APEX
Service Workers and APEXService Workers and APEX
Service Workers and APEX
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
 

Destaque

121010_Mobile Banking & Payments for Emerging Asia Summit 2012_Monitise: Mobi...
121010_Mobile Banking & Payments for Emerging Asia Summit 2012_Monitise: Mobi...121010_Mobile Banking & Payments for Emerging Asia Summit 2012_Monitise: Mobi...
121010_Mobile Banking & Payments for Emerging Asia Summit 2012_Monitise: Mobi...
spirecorporate
 
Lessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at NetflixLessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at Netflix
Justin Basilico
 

Destaque (20)

Foundation for Success: How Big Data Fits in an Information Architecture
Foundation for Success: How Big Data Fits in an Information ArchitectureFoundation for Success: How Big Data Fits in an Information Architecture
Foundation for Success: How Big Data Fits in an Information Architecture
 
General Insurance Conference 2014: Big Data for Insurance Companies
General Insurance Conference 2014: Big Data for Insurance CompaniesGeneral Insurance Conference 2014: Big Data for Insurance Companies
General Insurance Conference 2014: Big Data for Insurance Companies
 
Brisbane Health-y Data: Legislation, Ethics and Governance
Brisbane Health-y Data: Legislation, Ethics and GovernanceBrisbane Health-y Data: Legislation, Ethics and Governance
Brisbane Health-y Data: Legislation, Ethics and Governance
 
Advanced Data Analytics and Open Data - Dr Ingo Keck of CeADAR - Dublinked Da...
Advanced Data Analytics and Open Data - Dr Ingo Keck of CeADAR - Dublinked Da...Advanced Data Analytics and Open Data - Dr Ingo Keck of CeADAR - Dublinked Da...
Advanced Data Analytics and Open Data - Dr Ingo Keck of CeADAR - Dublinked Da...
 
Modus Operandi
Modus OperandiModus Operandi
Modus Operandi
 
Big data
Big dataBig data
Big data
 
121010_Mobile Banking & Payments for Emerging Asia Summit 2012_Monitise: Mobi...
121010_Mobile Banking & Payments for Emerging Asia Summit 2012_Monitise: Mobi...121010_Mobile Banking & Payments for Emerging Asia Summit 2012_Monitise: Mobi...
121010_Mobile Banking & Payments for Emerging Asia Summit 2012_Monitise: Mobi...
 
SMAC -IoT Technology
SMAC -IoT TechnologySMAC -IoT Technology
SMAC -IoT Technology
 
Big Data Day LA 2016/ Data Science Track - Data Science + Hollywood, Todd Ho...
Big Data Day LA 2016/ Data Science Track -  Data Science + Hollywood, Todd Ho...Big Data Day LA 2016/ Data Science Track -  Data Science + Hollywood, Todd Ho...
Big Data Day LA 2016/ Data Science Track - Data Science + Hollywood, Todd Ho...
 
BIG Data & Hadoop Applications in E-Commerce
BIG Data & Hadoop Applications in E-CommerceBIG Data & Hadoop Applications in E-Commerce
BIG Data & Hadoop Applications in E-Commerce
 
Modus operandi
Modus operandiModus operandi
Modus operandi
 
Lessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at NetflixLessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at Netflix
 
Extending BI with Big Data Analytics
Extending BI with Big Data AnalyticsExtending BI with Big Data Analytics
Extending BI with Big Data Analytics
 
Announcing Amazon Rekognition - Deep Learning-Based Image Analysis - December...
Announcing Amazon Rekognition - Deep Learning-Based Image Analysis - December...Announcing Amazon Rekognition - Deep Learning-Based Image Analysis - December...
Announcing Amazon Rekognition - Deep Learning-Based Image Analysis - December...
 
All you wanted to know about analytics in e commerce- amazon, ebay, flipkart
All you wanted to know about analytics in e commerce- amazon, ebay, flipkartAll you wanted to know about analytics in e commerce- amazon, ebay, flipkart
All you wanted to know about analytics in e commerce- amazon, ebay, flipkart
 
Amazon Machine Learning Case Study: Predicting Customer Churn
Amazon Machine Learning Case Study: Predicting Customer ChurnAmazon Machine Learning Case Study: Predicting Customer Churn
Amazon Machine Learning Case Study: Predicting Customer Churn
 
Big Data in e-Commerce
Big Data in e-CommerceBig Data in e-Commerce
Big Data in e-Commerce
 
Ch15 software reuse
Ch15 software reuseCh15 software reuse
Ch15 software reuse
 
Netflix case study
Netflix case studyNetflix case study
Netflix case study
 
Implementing Effective Data Governance
Implementing Effective Data GovernanceImplementing Effective Data Governance
Implementing Effective Data Governance
 

Semelhante a Big data analytics presented at meetup big data for decision makers

02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
Raul Chong
 
Data-Ed Online Presents: Data Warehouse Strategies
Data-Ed Online Presents: Data Warehouse StrategiesData-Ed Online Presents: Data Warehouse Strategies
Data-Ed Online Presents: Data Warehouse Strategies
DATAVERSITY
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Precisely
 
The Bigger They Are The Harder They Fall
The Bigger They Are The Harder They FallThe Bigger They Are The Harder They Fall
The Bigger They Are The Harder They Fall
Trillium Software
 
Data-Ed Webinar: Data Warehouse Strategies
Data-Ed Webinar: Data Warehouse StrategiesData-Ed Webinar: Data Warehouse Strategies
Data-Ed Webinar: Data Warehouse Strategies
DATAVERSITY
 

Semelhante a Big data analytics presented at meetup big data for decision makers (20)

02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Data-Ed: Data Warehousing Strategies
Data-Ed: Data Warehousing StrategiesData-Ed: Data Warehousing Strategies
Data-Ed: Data Warehousing Strategies
 
Big data analytics presented at meetup big data for decision makers

  • 1. The Science Behind Data Science Presented at Big Data for Decision Makers Ruhollah Farchtchi – Director of Big Data December 5, 2013
  • 2. Agenda • Introductions • Big Data Analytics Overview • Use Cases – Examples of Data Products • Building Blocks • Data Mining • Technologies • Operational Models © 2013 Unisys Corporation. All rights reserved. 2
  • 3. So we’ve got a lot of data… • What can we get out of it? • How does it help with our business decision making? • How is this complex landscape changing? [Diagram: Multiple Types (tabular/structured data, documents, unstructured data, emails, pictures, video); Multiple Sources (sensors, networks, and cyber infrastructure; web, email, and social media; enterprise applications; mobile devices, GPS, and many more); Multiple Domains, listed below.] • Defense: logistics/workforce analytics; cyber and EW; intelligence analysis • Health: drug discovery; EHR; epidemic/pandemic prediction • Finance: fraud detection; identity resolution; customer support • Other: supply/demand forecasting; MTTB prediction; context-based IR
  • 4. And we’ve got a lot of tools… Source: http://www.ongridventures.com/wp-content/uploads/2012/10/Big-Data-Landscape.jpg
  • 5. Big Data and Data Analytics – A Unisys Point of View • Unisys Point of View: Today’s big data is tomorrow’s normal data – What remains is the need to extract insights and value out of the data • Data Analytics is often the goal or end product of what organizations want to get out of their data (big or otherwise) – Focused around the capabilities of: • Efficient Data Processing – get data in and processed in time to make use of it, and in a tenable manner • Effective Information Management – the ability to make the data accessible and to manage the downstream data products as assets • Expressive Analytics – make sense of the data in a format that is easily digested and incorporated into decision making, i.e., if you need a PhD to interpret the results, you still have work to do here – With the aim of increasing business value • It’s about understanding the data and what you can get out of it – “…40% of business leaders had no response when asked what types of information would transform their industries over the next 10 years.” [1] [1] Anne Lapkin, 2012. Hype Cycle for Big Data, 2012, Gartner.
  • 6. Data Analytics is the culmination of Analytics and IT. [Diagram: vertical axis is Complexity, from backward-looking (forensic) to forward-looking (predictive); horizontal axis is Data Volume, from low volume, variety, velocity to high volume, variety, velocity, with a multi-TB turning point; arrows labeled Extend and Scale-out.] • Data Analytics (leverage for large-scale analytics and data mining): modeling and forecasting, pattern recognition, linear programming, global optimization, classification, machine learning, simulation • Business Intelligence & Data Warehousing: STAR schema, OLAP, RDBMS, SQL, ETL • Big Data & NoSQL (leverage for large-scale application development and information management): Hadoop, Google BigTable, Map/Reduce, Splunk, Dynamo, Hive, MongoDB, Cassandra, EMC Greenplum, HBase • Data Analytics is at the intersection of high-volume data processing and advanced analysis. The tools and methodologies here represent a mix of both worlds, and there is currently no ‘killer app’.
  • 7. Challenges • Misaligned IT, analytics, and business strategies • Ineffective data management strategy • Ineffective/inefficient storage and security platforms • Inaccessible or siloed analytics (“Cylinders of Excellence”) • Untrusted analytic products, or analytics that are not timely, accurate, or repeatable (untested) • Inability to scale analytic generation (lack of training)
  • 8. Analytic Environment That Supports Data Processing, Enhances Information Management, and Improves Decision Making. Building the analytic environment: (1) Work with business leaders and decision makers to understand and quantify the data value chain. (2) View data as an enterprise asset. (3) Innovate through creation of new data products and services. (4) Retrain staff and/or acquire Data Scientist skills. (5) Integrate teams across big data, data warehousing, and business analysis. (6) Revise information management strategies to incorporate big data. (7) Develop new ways of capturing information, e.g., mobile and streaming data. (8) Identify and leverage previously unused internal and external data. [Diagram labels: Data Products, Analyst Focused, IT Focused, Raw Data]
  • 9. Creation of data products is key to analytic reuse • What are Data Products? – Essentially, a data product is the output of a data science or data mining activity – Non-trivial; more than a simple query – Requires a platform for processing • They can manifest themselves as many things – Analytical "engines" running in a larger application (Amazon's recommender engine is a great Data Product) – Lists (e.g., Top 10 things I need to know today) – Entire applications (e.g., customer baseball cards) • However they are defined, one thing is true for all – It takes a combination of domain-agnostic analytic techniques and domain-specific knowledge to produce something relevant and consumable that can be monetized or operationalized.
  • 10. Examples of Data Products
  • 11. Use Case #1 - Netflix Recommendation • Netflix is about connecting people to the movies they love by leveraging its movie recommendation system, Cinematch(SM) • Cinematch(SM) was initially a linear model that helped to predict users' choices • The predictions are used to make personal movie recommendations based on a customer's unique tastes – Challenge: Can the recommendation engine be improved upon? – Resolution: Set a 10% accuracy improvement target and create a contest with a $1 million prize • Crowdsourcing: Teams merged, using an internet-enabled approach, to improve results • Netflix provided a training dataset of 100+ million ratings that 480,000 users gave to 17K movies, in quadruplets of the form (user, movie, date of grade, grade) – Goal is to predict the grade – Example of supervised machine learning – Submitted predictions are scored against the true grades in terms of Root Mean Squared Error (RMSE) – RMSE is a frequently used measure of the differences (i.e., residuals) between the values predicted by a model and the values observed – Similarity is determined by a distance measure such as Jaccard or cosine distance Source: Netflixprize.com and Mining of Massive Datasets by Anand Rajaraman and Jeffrey Ullman
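
The RMSE score mentioned on the Netflix slide can be illustrated with a few lines of Python. This is a minimal sketch: the function name `rmse` and the sample grades are made up for illustration and are not taken from the Netflix Prize data.

```python
import math


def rmse(predicted, actual):
    """Root Mean Squared Error between predicted and observed grades."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each residual, average the squares, then take the square root.
    squared_residuals = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_residuals) / len(actual))


# Hypothetical predicted and true grades for four (user, movie) pairs.
predicted_grades = [3.5, 4.0, 2.0, 5.0]
true_grades = [4, 4, 1, 5]
print(rmse(predicted_grades, true_grades))
```

A lower RMSE means the predicted grades sit closer to the true grades, which is why a contest target can be stated as a percentage reduction in this one number.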
  • 12. Use Case #2 - Google PageRank • Google wanted to be able to measure and rank the importance of web pages. – Challenge: Identify and rank the pages that a user would want to view in terms of their relevance. – Resolution: Develop an algorithm that leverages link analysis and implement it as part of Google’s infrastructure. • The PageRank algorithm considers a web page to be important if many other web pages point to it. The linking pages that point to a given page aren’t treated equally. • The algorithm takes into account both the importance (PageRank) of each linking page and the number of outgoing links it has. – Similar to Social Network Analysis • Linking pages with higher PageRank are given more weight, while linking pages with more outgoing links are given less weight. • Example of unsupervised machine learning • Link matrix for the four-page example (Page 1 to Page 4), row by row: [0 0 1 0], [1 0 0 0], [1 1 0 1], [0 0 0 0] Source: The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman
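
The weighting rule on the PageRank slide can be sketched as a short fixed-point iteration. The assumptions here are mine: the slide's link matrix is read row by row from the flattened transcript, `link[i][j] == 1` is taken to mean that page j links to page i, and the damping factor `d = 0.85` is a commonly used value that the slide does not state.

```python
def pagerank(link, d=0.85, iterations=100):
    """Iterate p_i = (1 - d) + d * sum_j link[i][j] * p_j / c_j,
    where c_j is the number of outgoing links of page j."""
    n = len(link)
    # Column sums give the number of outgoing links of each page.
    out_links = [sum(link[i][j] for i in range(n)) for j in range(n)]
    p = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each linking page passes on its rank divided by its out-degree,
            # so pages with many outgoing links carry less weight.
            inbound = sum(
                link[i][j] * p[j] / out_links[j]
                for j in range(n)
                if out_links[j] > 0
            )
            updated.append((1 - d) + d * inbound)
        p = updated
    return p


# Four-page example; rows are target pages, columns are linking pages.
LINK_MATRIX = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
ranks = pagerank(LINK_MATRIX)
print([round(r, 2) for r in ranks])
```

Under these assumptions Page 3 ranks highest because three pages point to it, and Page 4 ranks lowest because nothing points to it.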
  • 13. Use Case #3 - Walmart Data Driven Value Chain • Walmart is the largest retailer in the world. • Walmart has been a catalyst for technology adoption among its suppliers, including requiring partners to use RFID technology to track and coordinate inventories. • They have a great cross section of data, from individual Social Security information and geographic detail to product purchases. • They use econometric and marketing mix modeling (multiplicative, log-log, power additive, adstocks, lags and powers) for a number of their key analyses. • Walmart mines its data to get the product mix right under different and changing environmental conditions. – Challenge: Identify the correct product mix in order to protect the firm from too much or too little inventory – Resolution: Mine their multiple data sources for data products that will help tighten and improve operational forecasts • For impending hurricane warnings, Walmart found that: – Pop-Tarts sales increase to 7 times their normal rate – The top-selling premium item was beer – Knowing this allows the firm to get the supply to the stores ahead of time [Chart: Sales by Item (Beer, Pop Tarts), with model labels GAs = a + b(TV) and GAs = a + b(TV)G] Source: What Walmart Knows about Customer Habits, New York Times
  • 14. Use Case #4 - Amazon Targeted Marketing • Amazon is the world's largest online retailer and is known for its e-commerce web site, where it uses input about a customer’s interests to generate a list of recommendations. • Like Netflix, Amazon uses recommendation algorithms, but it does targeted marketing for items that a customer would want to buy based on previous purchase patterns. • The recommendation algorithms personalize the online store for each customer, and the store changes radically based on the customer's interests. – Challenge(s): Analyze massive amounts of data and return results in real time; new customers have very little data, and customer data is very volatile – Resolution: Cluster modeling, search-based methods, and item-to-item collaborative filtering • Cluster Modeling: Identify customers similar to the user by dividing the customer base into segments, and treat the task as a classification problem. Typically uses an unsupervised learning algorithm such as K-Means or hierarchical clustering • Search-Based Methods: Treat the recommendation problem as a search for related items. Given a user's purchased and rated items, the algorithm constructs a search query to find other popular items by the same author, artist, or director, or with similar keywords • Item-to-Item Collaborative Filtering: Customized algorithm that scales to massive data sets and produces high-quality recommendations in real time. This algorithm matches each of the user's purchased and rated items to similar items and then combines those similar items into a recommendation list. It has offline and online components to increase performance Source: Amazon.com Recommendations: Item-to-Item Collaborative Filtering. Greg Linden, Brent Smith, and Jeremy York
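
The item-to-item collaborative filtering idea on the Amazon slide (an offline step that builds a similar-items table, and an online step that combines similar items into a list) can be sketched as follows. This is a toy illustration, not Amazon's implementation: the purchase data, the function names, and the choice of cosine similarity over binary purchase vectors are all assumptions made for the example.

```python
import math
from collections import defaultdict

# Hypothetical purchase history: customer -> set of purchased items.
PURCHASES = {
    "alice": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "carol": {"book_b", "book_c"},
    "dave": {"book_d"},
}


def build_similar_items(purchases):
    """Offline step: cosine similarity between items, using the set of
    buyers of each item as a binary vector."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    table = defaultdict(dict)
    for x in buyers:
        for y in buyers:
            if x == y:
                continue
            shared = len(buyers[x] & buyers[y])
            if shared:
                # For binary vectors the dot product is the overlap size.
                table[x][y] = shared / math.sqrt(len(buyers[x]) * len(buyers[y]))
    return table


def recommend(customer, purchases, table, top_n=3):
    """Online step: sum the similarities of items related to what the
    customer already owns and return the best unseen items."""
    owned = purchases[customer]
    scores = defaultdict(float)
    for item in owned:
        for other, similarity in table.get(item, {}).items():
            if other not in owned:
                scores[other] += similarity
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


SIMILAR_ITEMS = build_similar_items(PURCHASES)
print(recommend("alice", PURCHASES, SIMILAR_ITEMS))
```

Splitting the work this way keeps the expensive pairwise comparison out of the request path, which is the performance point the slide makes about offline and online components.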
  • 15. Unisys Big Data Analytics Building Blocks
  • 16. Big Data Analytics Methodology: Modeling Components • Decision Making & Forecasting: Provide actionable intelligence into the future state • Models: Statistical model applied to input data that separates the portion of volume due to each of the variables or factors. We use the term model because it is a simplification of reality. • Data: Internal Data, Demographic Data, 3rd Party Data
  • 18. Data Mining - Motivations • We’ve covered big data – There’s a lot of it! • New Modus Operandi – Gather whatever data you can, whenever and wherever possible • New Expectation – Data gathered will have value, either for the purpose for which it was collected or for a purpose not yet envisioned • Challenge: There will never be enough analysts to sift through it all
  • 19. Data Mining Definitions • Non-trivial extraction of implicit, previously unknown and potentially useful information from data (normally large databases) • Exploration & analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns. • Part of the Knowledge Discovery in Databases Process. Source: http://liris.cnrs.fr/abstract/abstract.html
  • 20. Data Mining Tasks • Prediction Methods: Use some variables to predict unknown or future values of other variables. – Classification: For a given set of attributes, apply a model for the class (what you want to predict) as a function of the attributes – Regression: Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency – Deviation Detection: Detect significant deviations from normal behavior • Description Methods: Find human-interpretable patterns that describe the data. – Clustering: Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that data points in one cluster are more similar to one another and data points in separate clusters are less similar to one another – Association Rule Discovery: Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items – Sequential Pattern Discovery: Given a set of sequences and a support threshold, find the complete set of frequent subsequences
  • 21. Classification - Example: Tax Fraud. A classifier model is learned from the training data set and then applied to the test data set.
      Training Data Set
      Tid  Refund  Marital Status  Taxable Income  Cheat
      1    Yes     Single          125k            No
      2    No      Married         100k            No
      3    No      Single          70k             No
      4    Yes     Married         120k            No
      5    No      Divorced        95k             Yes
      6    No      Married         60k             No
      7    Yes     Divorced        220k            No
      8    No      Single          85k             Yes
      9    No      Married         75k             No
      10   No      Single          90k             Yes
      Test Data Set
      Refund  Marital Status  Taxable Income  Cheat
      Yes     Single          125k            ?
      No      Married         100k            ?
      No      Single          70k             ?
      Yes     Married         120k            ?
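
The learn-then-apply flow on the tax fraud slide can be sketched with a small decision tree written from scratch. The slide does not say which classifier is used, so the Gini-based tree below is an assumption chosen for illustration; the rows are the training and test records from the slide, with income in thousands.

```python
from collections import Counter

# (refund, marital status, taxable income in $k) -> cheat label
TRAIN = [
    (("Yes", "Single", 125), "No"), (("No", "Married", 100), "No"),
    (("No", "Single", 70), "No"), (("Yes", "Married", 120), "No"),
    (("No", "Divorced", 95), "Yes"), (("No", "Married", 60), "No"),
    (("Yes", "Divorced", 220), "No"), (("No", "Single", 85), "Yes"),
    (("No", "Married", 75), "No"), (("No", "Single", 90), "Yes"),
]
TEST = [("Yes", "Single", 125), ("No", "Married", 100),
        ("No", "Single", 70), ("Yes", "Married", 120)]


def gini(labels):
    """Gini impurity of a list of class labels."""
    return 1.0 - sum((c / len(labels)) ** 2 for c in Counter(labels).values())


def goes_left(features, col, value):
    """Categorical test is equality; numeric test is a threshold."""
    if isinstance(value, str):
        return features[col] == value
    return features[col] <= value


def build(rows):
    """Grow a binary tree until every leaf holds a single class."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:
        return labels[0]
    best = None
    for col in range(len(rows[0][0])):
        values = sorted({f[col] for f, _ in rows})
        if not isinstance(values[0], str):
            # Numeric attribute: try midpoints between neighbouring values.
            values = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for value in values:
            left = [r for r in rows if goes_left(r[0], col, value)]
            right = [r for r in rows if not goes_left(r[0], col, value)]
            if not left or not right:
                continue
            score = (len(left) * gini([l for _, l in left])
                     + len(right) * gini([l for _, l in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, col, value, left, right)
    if best is None:  # no usable split: fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    _, col, value, left, right = best
    return (col, value, build(left), build(right))


def predict(tree, features):
    """Walk from the root to a leaf; leaves are plain label strings."""
    while isinstance(tree, tuple):
        col, value, left, right = tree
        tree = left if goes_left(features, col, value) else right
    return tree


TREE = build(TRAIN)
print([predict(TREE, row) for row in TEST])
```

Because the four test records on the slide repeat the attribute values of training records 1 to 4, a tree grown to purity labels all of them "No"; with genuinely unseen records the predictions would depend on where the learned splits fall.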
  • 22. Classification – Your Turn • Fraud Detection • Goal: Predict fraudulent cases in credit card transactions. • Approach: – What kind of data will you try to get? – Can you say something about the characteristics of the data? – Estimate the size of the data. – What kinds of pitfalls might you run into?
  • 23. Fraud Detection • Goal: Predict fraudulent cases in credit card transactions. • Approach: – Use credit card transactions and the information on the account holder as attributes. – When does a customer buy, what does he buy, how often does he pay on time, etc. – Label past transactions as fraudulent or fair. This forms the class attribute. – Learn a model for the class of the transactions. – Use this model to detect fraud by observing credit card transactions on an account.
  • 24. Clustering - Example • Document Clustering: – Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. – Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. – Gain: Search tools can utilize the clusters to relate a new document or search term to clustered documents. • Clustering Points: 3204 Articles of Los Angeles Times. • Similarity Measure: How many words are common in these documents (after some word filtering).
  • 25. Clustering - Illustration: Seems straightforward for a small number of dimensions… what if there were more?
  • 26. Clustering - Illustration: We [human beings] have a limited ability to visualize and reason over a large number of dimensions – clustering helps. Source: http://salsahpc.indiana.edu/plotviz
  • 27. Association Rules • Classic Association Rule Example: – If a customer buys diapers and milk, then he is very likely to buy beer. • Applications: Supermarket shelf management. – Goal: To identify items that are bought together by sufficiently many customers. – Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
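
Two standard measures make "bought together by sufficiently many customers" and "very likely to buy" concrete: support (the share of baskets containing an itemset) and confidence (the share of baskets with the antecedent that also contain the consequent). The sketch below uses five made-up baskets; they are illustrative data, not point-of-sale figures from the talk.

```python
# Hypothetical point-of-sale baskets, one set of items per transaction.
BASKETS = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"diapers", "beer"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule
    antecedent -> consequent."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0  # the rule never fires, so there is nothing to estimate
    return support(antecedent | consequent, baskets) / base


rule_support = support({"diapers", "milk", "beer"}, BASKETS)
rule_confidence = confidence({"diapers", "milk"}, {"beer"}, BASKETS)
print(rule_support, rule_confidence)
```

In this toy data the rule {diapers, milk} -> {beer} appears in two of five baskets and holds in two of the three baskets that contain diapers and milk. A rule miner keeps only rules whose support and confidence clear chosen thresholds.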
  • 29. Hadoop -- So what is Hadoop, Really? - Dilbert It’s just a framework
  • 30. Hadoop and MapReduce • Hadoop is an open-source framework (written in Java) to store and process gobs of data across many commodity computers • Hadoop is designed to solve a different problem: the fast, reliable analysis of structured, unstructured, and complex data. • Hadoop and related software are designed for the 3 V’s: (1) Volume – Commodity hardware and open-source software lower cost and increase capacity; (2) Velocity – Data ingest speed aided by append-only and schema-on-read design; and (3) Variety – Multiple tools to structure, process, and access data • Hadoop consists of two elements: reliable, very large, low-cost data storage using the Hadoop Distributed File System (HDFS), and a high-performance parallel/distributed data processing framework called MapReduce. • HDFS is self-healing, high-bandwidth clustered storage. MapReduce is essentially fault-tolerant distributed computing.
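
The MapReduce programming model named on the Hadoop slide can be illustrated with a word count that runs in a single Python process. This is only a sketch of the map, shuffle, and reduce steps: it does not use the Hadoop API, and the function names are made up for the example. On a real cluster the framework distributes these steps across machines and handles failures.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big analytics"]))
```

Because each map call looks at one record and each reduce call looks at one key, the work can be split across many servers without the steps needing to share state.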
  • 31. The Hadoop Stack • Hadoop runs on a collection/cluster of commodity, shared-nothing x86 servers. • You can add or remove servers in a Hadoop cluster (sizes from 50, 100 to even 2000+ nodes) at will; the system detects and compensates for hardware or system problems on any server. • Hadoop is self-healing. It can deliver data, and can run large-scale, high-performance batch processing jobs, in spite of system changes or failures. • The four primary areas in which to use Hadoop: 1) To aggregate “data exhaust”: messages, posts, blog entries, photos, video clips, maps, web graph… 2) To give data context: friends networks, social graphs, recommendations, collaborative filtering… 3) To keep apps running: web logs, system logs, system metrics, database query logs… 4) To deliver novel mashup services: mobile location data, clickstream data, SKUs, pricing…
  • 33. Data Products Become the Drivers to Identify New Insights, Cost Savings, and Increased Efficiencies • Decreased time to analytics • Reuse of analytics tools • Focus on analytics vs. IT integration • More self-service • Incorporation of external data • Ability to scale to analytic needs • Supports the analytics lifecycle [Diagram labels: Your Customers, Feedback, Internal Data Sets, External Data Sets, Data Analytics Environment, Knowledge Repository, Populate, Analytics Engine]
  • 34. Thank you

Editor's Notes

  1. Think about access to top talent and how crowdsourcing is allowing organizations to put a bounty on solutions to hard problems.
  2. Think about graph analysis and the work being done with SNA today.
  3. Think about common patterns and pattern discovery. For example, in cargo shipping, if a ship stops at certain ports, is the probability higher or lower that it picked up illegal substances on the way?
  4. Really great example of how different techniques can be combined and reused. This is really driving the need for an enterprise analytic data set as you can start to chain analytics together to do many types of operations.
  5. Think about automation of analysis tasks. If I’ve figured out how to bucket things, I may be able to triage the data better according to priorities in my organization.
  6. Clustering is really BIG in the big data world right now due to the wide applicability.