Predictive analytics has always been about the future, and the age of big data has made that future an increasingly dynamic place, filled with opportunity and risk.
The evolution of advanced analytics technologies and the continual development of new analytical methodologies can help optimize financial results, enable systems and services based on machine learning, prevent or mitigate fraud, and reduce cybersecurity risks, among many other things.
Caserta Concepts, Zementis, and a guest speaker from FICO presented the strategies, technologies, and use cases driving predictive analytics in a big data environment.
For more information, visit www.casertaconcepts.com or contact us at info@casertaconcepts.com.
@joe_Caserta #BDWMeetup
About Caserta Concepts
• Technology innovation company with expertise in data analysis:
  • Big Data Solutions
  • Data Warehousing
  • Business Intelligence
• Solve highly complex business data challenges
• Award-winning solutions
• Business Transformation
• Maximize Data Value
• Industry Recognized Workforce
• Core focus in the following industries:
  • eCommerce / Retail / Marketing
  • Financial Services / Insurance
  • Healthcare / Ad Tech / Higher Ed
• Services:
  • Strategy, Roadmap, Implementation
  • Data Science & Analytics
  • Data on the Cloud
  • Data Interaction & Visualization
The Progression of Analytics
(Source: Gartner)
• Descriptive Analytics: What happened? (Reports)
• Diagnostic Analytics: Why did it happen? (Correlations)
• Predictive Analytics: What will happen? (Predictions)
• Prescriptive Analytics: How can we make it happen? (Recommendations)
Business value increases with data analytics sophistication.
Today’s Data Environment
(Architecture diagram showing two environments side by side:)
• Traditional BI: source systems (Enrollments, Claims, Finance, Others…) feed an ETL process into a traditional EDW, which serves ad-hoc/canned reporting.
• Big Data Analytics: ETL lands data in a Big Data Lake on the Hadoop Distributed File System (HDFS, nodes N1–N5), a horizontally scalable environment optimized for analytics. Spark, MapReduce, and Pig/Hive process the data alongside NoSQL databases, supporting data science, ad-hoc query, and canned reporting.
The Big Data Pyramid
From bottom to top:
• Landing Area – source data in “full fidelity”: raw machine data collection; collect everything.
• Data Lake – integrated sandbox: data is ready to be turned into information: organized, well defined, complete.
• Data Science Workspace: agile business insight through data munging, machine learning, blending with external data, and development of to-be BDW facts.
• Big Data Warehouse: fully data governed (trusted); user community arbitrary queries and reporting.
Governance is layered up the pyramid: a Metadata Catalog and ILM (who has access, how long do we “manage it”) apply at every tier, and Data Quality and Monitoring (monitoring of completeness of data) applies from the Data Lake upward.
Data has different governance demands at each tier. Only the top tier of the pyramid is fully governed; we refer to this as the Trusted tier of the Big Data Warehouse.
Notable Predictive Analytic Tools
Open Source Tools:
scikit-learn
KNIME
OpenNN
Orange
R
Weka
GNU Octave
Apache Mahout
Commercial Tools:
Alpine Data Labs
BIRT Analytics
Angoss KnowledgeSTUDIO
IBM SPSS Statistics and IBM SPSS Modeler
KXEN Modeler
Mathematica
MATLAB
Minitab
Oracle Data Mining (ODM)
Pervasive
Predixion Software
RapidMiner
RCASE
Most Popular:
SAS
SPSS
Statistica
R
The Data Scientist Winning Trifecta
• Modern Data Engineering / Data Preparation
• Domain Knowledge / Business Expertise
• Advanced Mathematics / Statistics
Are there Standards?
CRISP-DM: Cross Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
• ETL
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
1. Business Understanding
In this initial phase of the project we need to speak to humans.
• It would be premature to jump into the data or begin selecting the appropriate model(s) or algorithm.
• Understand the project objective.
• Review the business requirements.
• The output of this phase is the conversion of business requirements into a preliminary technical design (decision model) and plan.
Since this is an iterative process, this phase will be revisited throughout the entire project.
2. Data Understanding
• Data Discovery: understand where the data you need comes from.
• Data Profiling: interrogate the data at the entity level; understand the key entities and fields that are relevant to the analysis.
• Cleansing Requirements: understand data quality, data density, skew, etc.
• Data Munging: collocate, blend, and analyze data for early insights! Valuable information can be gained from simple group-by aggregate queries, and even more with SQL jujitsu!
There is significant iteration between the Business Understanding and Data Understanding phases.
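The kind of early profiling described above can be sketched with pandas; the claims dataset and column names below are hypothetical, purely for illustration.

```python
# A minimal sketch of early data profiling: group-by aggregates and
# completeness checks on a hypothetical claims dataset.
import pandas as pd

claims = pd.DataFrame({
    "member_id": [101, 101, 102, 103, 103, 103],
    "claim_type": ["RX", "MED", "RX", "MED", "MED", None],
    "amount": [25.0, 430.0, 18.5, None, 260.0, 75.0],
})

# Simple group-by aggregate queries often surface early insights:
profile = claims.groupby("claim_type", dropna=False).agg(
    n_claims=("amount", "size"),
    total=("amount", "sum"),
    missing_amount=("amount", lambda s: s.isna().sum()),
)
print(profile)

# Data density / completeness per column:
print(claims.isna().mean())
```

Even these few lines answer profiling questions (which entities dominate, where values are missing) before any modeling begins.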
3. Data Preparation
ETL (Extract, Transform, Load)
90+% of data science time goes into data preparation!
• Select required entities/fields
• Address data quality issues: missing or incomplete values, whitespace, bad data points
• Join/enrich disparate datasets
• Transform/aggregate data for intended use:
  • Sample
  • Aggregate
  • Pivot
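A compact sketch of these preparation steps in pandas; the dataset, columns, and cleansing rules are hypothetical examples, not a prescribed pipeline.

```python
# Data preparation sketch: fix whitespace and bad data points,
# then aggregate for the intended use.
import pandas as pd

raw = pd.DataFrame({
    "member_id": [1, 1, 2, 2, 3],
    "state": [" NY", "ny ", "NJ", "NJ", None],
    "charge": ["100.0", "80.0", "bad", "40.0", "60.0"],
})

# Address data quality: trim whitespace, normalize case, and coerce
# bad data points to NaN instead of failing the load.
clean = raw.assign(
    state=raw["state"].str.strip().str.upper(),
    charge=pd.to_numeric(raw["charge"], errors="coerce"),
).dropna(subset=["charge"])

# Aggregate for the intended use:
by_member = clean.groupby("member_id", as_index=False)["charge"].sum()
print(by_member)
```

The same select/cleanse/join/aggregate pattern applies whether the tool is pandas, Spark, or SQL.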
Data Quality and Monitoring
• Build a robust data quality subsystem:
  • Metadata and error event facts
  • Orchestration
• Based on the Data Warehouse ETL Toolkit
• Each error instance of each data quality check is captured
• Implemented as a subsystem after ingestion
• Each fact stores the unique identifier of the defective source row
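A minimal sketch of the error-event-fact pattern described above, in the spirit of the Data Warehouse ETL Toolkit: each failing quality check records one fact row carrying the defective source row's key. The checks and schema here are illustrative, not a real subsystem.

```python
# Error event facts: one row per failed check per defective source row.
from datetime import datetime, timezone

# Each data quality check is a (name, predicate) pair; hypothetical rules.
checks = [
    ("amount_not_null", lambda row: row.get("amount") is not None),
    ("amount_non_negative",
     lambda row: row.get("amount") is None or row["amount"] >= 0),
]

error_event_facts = []

def run_quality_checks(rows, key="row_id"):
    """Run every check against every row; capture each error instance."""
    for row in rows:
        for check_name, passes in checks:
            if not passes(row):
                error_event_facts.append({
                    "check": check_name,
                    "source_row_id": row[key],  # unique id of defective row
                    "captured_at": datetime.now(timezone.utc),
                })

rows = [
    {"row_id": 1, "amount": 10.0},
    {"row_id": 2, "amount": None},
    {"row_id": 3, "amount": -5.0},
]
run_quality_checks(rows)
print([(f["check"], f["source_row_id"]) for f in error_event_facts])
```

In a production subsystem the fact list would be a fact table, and orchestration would run the checks after ingestion.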
4. Modeling
The Lovers of Algebra & Statistics
• Evaluate various models/algorithms:
  • Classification
  • Clustering
  • Regression
  • Many others…
• Tune parameters
• Iterative experimentation
• Different models may require different data preparation techniques (e.g., sparse vector format)
• Additionally, we may discover the need for additional data points, or uncover additional data quality issues!
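The iterative experimentation above can be sketched with scikit-learn: try several algorithm families on the same prepared data and compare cross-validated scores. The dataset is synthetic and the candidate models are illustrative choices.

```python
# Compare candidate models via cross-validation on the same data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for prepared data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Evaluate each candidate with 5-fold cross-validation.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Parameter tuning (e.g., grid search over `max_depth`) would follow the same loop, which is why this phase is inherently iterative.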
5. Evaluation
What problem are we trying to solve again?
• Our final solution needs to be evaluated against the original Business Understanding
• Did we meet our objectives?
• Did we address all issues?
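One way to make this check concrete is to compare a holdout metric against a target agreed upon during Business Understanding. The data, metric, and threshold below are hypothetical.

```python
# Tie model evaluation back to the business objective.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
recall = recall_score(y_te, model.predict(X_te))

# Hypothetical business objective: catch at least 70% of positive cases.
meets_objective = bool(recall >= 0.70)
print(f"holdout recall = {recall:.2f}, meets objective: {meets_objective}")
```

If the objective is not met, the process loops back to earlier phases rather than proceeding to deployment.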
6. Deployment
Engineering Time!
• It’s time for the work products of data science to “graduate” from “new insights” to real applications.
• Processes must be hardened, repeatable, and perform well too!
• Full data governance applied
• PMML (Predictive Model Markup Language): an XML-based interchange format
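Because PMML is XML-based, a model travels between tools as a plain document. A trimmed, illustrative fragment (the field names here are hypothetical) shows the general shape:

```xml
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Illustrative example; fields are hypothetical"/>
  <DataDictionary numberOfFields="2">
    <DataField name="claim_amount" optype="continuous" dataType="double"/>
    <DataField name="is_fraud" optype="categorical" dataType="string"/>
  </DataDictionary>
  <!-- A model element (e.g. RegressionModel or TreeModel) follows here,
       produced by the training tool and consumed by the scoring engine. -->
</PMML>
```

This is what makes the interchange work: the training environment exports the document, and any PMML-aware scoring engine can execute it.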
Some Thoughts
Big Science requires the convergence of data governance, advanced data engineering, math and statistics, and business smarts.
Data science must be guided by best practices and standards.
Tools and techniques must ultimately be platform agnostic (portable).
Work with experts who have done it before!
Data science is not about Hadoop, but it is about modern data engineering. Think polyglot persistence – the right tool for the job.
Visualization can be Tableau, Excel, ggplot2, or d3.js. Or anything.