Predictive analytics has always been about the future, and the age of big data has made that future an increasingly dynamic place, filled with opportunity and risk.
The evolution of advanced analytics technologies and the continual development of new analytical methodologies can help optimize financial results, enable systems and services based on machine learning, prevent or mitigate fraud, and reduce cybersecurity risks, among many other things.
Caserta Concepts, Zementis, and a guest speaker from FICO presented the strategies, technologies, and use cases driving predictive analytics in a big data environment.
For more information, visit www.casertaconcepts.com or contact us at info@casertaconcepts.com.
@joe_Caserta #BDWMeetup
About Caserta Concepts
• Technology innovation company with expertise in data analysis:
  • Big Data Solutions
  • Data Warehousing
  • Business Intelligence
• Solve highly complex business data challenges
• Award-winning solutions
• Business Transformation
• Maximize Data Value
• Industry Recognized Workforce
• Core focus in the following industries:
  • eCommerce / Retail / Marketing
  • Financial Services / Insurance
  • Healthcare / Ad Tech / Higher Ed
• Services:
  • Strategy, Roadmap, Implementation
  • Data Science & Analytics
  • Data on the Cloud
  • Data Interaction & Visualization
The Progression of Analytics
(Source: Gartner)
• Descriptive Analytics: What happened? (Reports)
• Diagnostic Analytics: Why did it happen? (Correlations)
• Predictive Analytics: What will happen? (Predictions)
• Prescriptive Analytics: How can we make it happen? (Recommendations)
Business value increases with data analytics sophistication.
Today’s Data Environment
(Architecture diagram showing two environments side by side:)
• Traditional BI: source systems (Enrollments, Claims, Finance, Others…) feed an ETL process into a traditional EDW, which serves ad-hoc/canned reporting.
• Big Data Analytics: ETL lands data in a Big Data Lake on the Hadoop Distributed File System (HDFS, nodes N1–N5), a horizontally scalable environment optimized for analytics. Spark, MapReduce, and Pig/Hive process the data alongside NoSQL databases, supporting data science, ad-hoc query, and canned reporting.
The Big Data Pyramid
From bottom to top:
• Landing Area – source data in “full fidelity”: raw machine data collection; collect everything.
• Data Lake – integrated sandbox: data is ready to be turned into information: organized, well defined, complete.
• Data Science Workspace: agile business insight through data munging, machine learning, blending with external data, and development of to-be BDW facts.
• Big Data Warehouse: fully data governed (trusted); user community arbitrary queries and reporting.
Governance is layered up the pyramid: a Metadata Catalog and ILM (who has access, how long do we “manage it”) apply at every tier, and Data Quality and Monitoring (monitoring of completeness of data) applies from the Data Lake upward.
Data has different governance demands at each tier. Only the top tier of the pyramid is fully governed; we refer to this as the Trusted tier of the Big Data Warehouse.
Notable Predictive Analytic Tools
Open Source Tools:
scikit-learn
KNIME
OpenNN
Orange
R
Weka
GNU Octave
Apache Mahout
Commercial Tools:
Alpine Data Labs
BIRT Analytics
Angoss KnowledgeSTUDIO
IBM SPSS Statistics and IBM SPSS Modeler
KXEN Modeler
Mathematica
MATLAB
Minitab
Oracle Data Mining (ODM)
Pervasive
Predixion Software
RapidMiner
RCASE
Most Popular:
SAS
SPSS
Statistica
R
The Data Scientist Winning Trifecta
• Modern Data Engineering / Data Preparation
• Domain Knowledge / Business Expertise
• Advanced Mathematics / Statistics
Are there Standards?
CRISP-DM: Cross Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
• ETL
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
1. Business Understanding
In this initial phase of the project we need to speak to humans.
• It would be premature to jump into the data or begin selecting the appropriate model(s) or algorithm.
• Understand the project objective.
• Review the business requirements.
• The output of this phase is the conversion of business requirements into a preliminary technical design (decision model) and plan.
Since this is an iterative process, this phase will be revisited throughout the entire project.
2. Data Understanding
• Data Discovery: understand where the data you need comes from.
• Data Profiling: interrogate the data at the entity level; understand the key entities and fields that are relevant to the analysis.
• Cleansing Requirements: understand data quality, data density, skew, etc.
• Data Munging: collocate, blend, and analyze data for early insights! Valuable information can be gained from simple group-by aggregate queries, and even more with SQL jujitsu!
There is significant iteration between the Business Understanding and Data Understanding phases.
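The kind of early profiling described above can be sketched with pandas; the claims dataset and column names below are hypothetical, purely for illustration.

```python
# A minimal sketch of early data profiling: group-by aggregates and
# completeness checks on a hypothetical claims dataset.
import pandas as pd

claims = pd.DataFrame({
    "member_id": [101, 101, 102, 103, 103, 103],
    "claim_type": ["RX", "MED", "RX", "MED", "MED", None],
    "amount": [25.0, 430.0, 18.5, None, 260.0, 75.0],
})

# Simple group-by aggregate queries often surface early insights:
profile = claims.groupby("claim_type", dropna=False).agg(
    n_claims=("amount", "size"),
    total=("amount", "sum"),
    missing_amount=("amount", lambda s: s.isna().sum()),
)
print(profile)

# Data density / completeness per column:
print(claims.isna().mean())
```

Even these few lines answer profiling questions (which entities dominate, where values are missing) before any modeling begins.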
3. Data Preparation
ETL (Extract, Transform, Load)
90+% of data science time goes into data preparation!
• Select required entities/fields
• Address data quality issues: missing or incomplete values, whitespace, bad data points
• Join/enrich disparate datasets
• Transform/aggregate data for intended use:
  • Sample
  • Aggregate
  • Pivot
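A compact sketch of these preparation steps in pandas; the dataset, columns, and cleansing rules are hypothetical examples, not a prescribed pipeline.

```python
# Data preparation sketch: fix whitespace and bad data points,
# then aggregate for the intended use.
import pandas as pd

raw = pd.DataFrame({
    "member_id": [1, 1, 2, 2, 3],
    "state": [" NY", "ny ", "NJ", "NJ", None],
    "charge": ["100.0", "80.0", "bad", "40.0", "60.0"],
})

# Address data quality: trim whitespace, normalize case, and coerce
# bad data points to NaN instead of failing the load.
clean = raw.assign(
    state=raw["state"].str.strip().str.upper(),
    charge=pd.to_numeric(raw["charge"], errors="coerce"),
).dropna(subset=["charge"])

# Aggregate for the intended use:
by_member = clean.groupby("member_id", as_index=False)["charge"].sum()
print(by_member)
```

The same select/cleanse/join/aggregate pattern applies whether the tool is pandas, Spark, or SQL.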
Data Quality and Monitoring
• Build a robust data quality subsystem:
  • Metadata and error event facts
  • Orchestration
• Based on the Data Warehouse ETL Toolkit
• Each error instance of each data quality check is captured
• Implemented as a subsystem after ingestion
• Each fact stores the unique identifier of the defective source row
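A minimal sketch of the error-event-fact pattern described above, in the spirit of the Data Warehouse ETL Toolkit: each failing quality check records one fact row carrying the defective source row's key. The checks and schema here are illustrative, not a real subsystem.

```python
# Error event facts: one row per failed check per defective source row.
from datetime import datetime, timezone

# Each data quality check is a (name, predicate) pair; hypothetical rules.
checks = [
    ("amount_not_null", lambda row: row.get("amount") is not None),
    ("amount_non_negative",
     lambda row: row.get("amount") is None or row["amount"] >= 0),
]

error_event_facts = []

def run_quality_checks(rows, key="row_id"):
    """Run every check against every row; capture each error instance."""
    for row in rows:
        for check_name, passes in checks:
            if not passes(row):
                error_event_facts.append({
                    "check": check_name,
                    "source_row_id": row[key],  # unique id of defective row
                    "captured_at": datetime.now(timezone.utc),
                })

rows = [
    {"row_id": 1, "amount": 10.0},
    {"row_id": 2, "amount": None},
    {"row_id": 3, "amount": -5.0},
]
run_quality_checks(rows)
print([(f["check"], f["source_row_id"]) for f in error_event_facts])
```

In a production subsystem the fact list would be a fact table, and orchestration would run the checks after ingestion.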
4. Modeling
The Lovers of Algebra & Statistics
• Evaluate various models/algorithms:
  • Classification
  • Clustering
  • Regression
  • Many others…
• Tune parameters
• Iterative experimentation
• Different models may require different data preparation techniques (e.g., sparse vector format)
• Additionally, we may discover the need for additional data points, or uncover additional data quality issues!
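The iterative experimentation above can be sketched with scikit-learn: try several algorithm families on the same prepared data and compare cross-validated scores. The dataset is synthetic and the candidate models are illustrative choices.

```python
# Compare candidate models via cross-validation on the same data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for prepared data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Evaluate each candidate with 5-fold cross-validation.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Parameter tuning (e.g., grid search over `max_depth`) would follow the same loop, which is why this phase is inherently iterative.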
5. Evaluation
What problem are we trying to solve again?
• Our final solution needs to be evaluated against the original Business Understanding
• Did we meet our objectives?
• Did we address all issues?
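One way to make this check concrete is to compare a holdout metric against a target agreed upon during Business Understanding. The data, metric, and threshold below are hypothetical.

```python
# Tie model evaluation back to the business objective.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
recall = recall_score(y_te, model.predict(X_te))

# Hypothetical business objective: catch at least 70% of positive cases.
meets_objective = bool(recall >= 0.70)
print(f"holdout recall = {recall:.2f}, meets objective: {meets_objective}")
```

If the objective is not met, the process loops back to earlier phases rather than proceeding to deployment.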
6. Deployment
Engineering Time!
• It’s time for the work products of data science to “graduate” from “new insights” to real applications.
• Processes must be hardened, repeatable, and perform well too!
• Full data governance applied
• PMML (Predictive Model Markup Language): an XML-based interchange format
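Because PMML is XML-based, a model travels between tools as a plain document. A trimmed, illustrative fragment (the field names here are hypothetical) shows the general shape:

```xml
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Illustrative example; fields are hypothetical"/>
  <DataDictionary numberOfFields="2">
    <DataField name="claim_amount" optype="continuous" dataType="double"/>
    <DataField name="is_fraud" optype="categorical" dataType="string"/>
  </DataDictionary>
  <!-- A model element (e.g. RegressionModel or TreeModel) follows here,
       produced by the training tool and consumed by the scoring engine. -->
</PMML>
```

This is what makes the interchange work: the training environment exports the document, and any PMML-aware scoring engine can execute it.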
Some Thoughts
Big Science requires the convergence of data governance, advanced data engineering, math and statistics, and business smarts.
Data science must be guided by best practices and standards.
Tools and techniques must ultimately be platform agnostic (portable).
Work with experts who have done it before!
Data science is not about Hadoop, but it is about modern data engineering. Think polyglot persistence – the right tool for the job.
Visualization can be Tableau, Excel, ggplot2, or d3.js. Or anything.