SlideShare uma empresa Scribd logo
1 de 154
Baixar para ler offline
The Journey to Becoming a Data-
Driven Enterprise
Pivotal Big Data Roadshow 2015
2© Copyright 2015 Pivotal. All rights reserved.
Where we’re going today…
3 Great Keynotes
• Journey to a Data-driven Enterprise
• Data Science Use Cases
• Streaming Data and Predictive Analytics
Stock Inference Demo and Architecture Overview
Intensive hands-on training sessions
3© Copyright 2015 Pivotal. All rights reserved.
Today’s Agenda
12:00 PM – 12:45 PM - Check-In & Lunch
12:45 PM – 1:00 PM – Welcome and Agenda Review
1:00 PM – 1:PM AM – How Pivotal’s Tools Help Drive Value from Data Science
1:20 PM – 2:20 PM – Accelerating the Generation of New Insights – R&D Use Case
Review and Demo
2:20 PM – 2:30 PM – Coffee Break
2:30 PM – 3:00 PM – Manufacturing Use Case Review and Demo
3:00 PM – 3:15 PM – Closing Remarks
© Copyright 2015 Pivotal. All rights reserved.
MASHING BIG DATA WITH BIG MACHINES
IS ‘BEAUTIFUL, DESIRABLE, INVESTABLE’
- IT COULD TRANSFORM GE'S BUSINESS -
AND THE ECONOMY.
“
”JEFF IMMELT, CEO, GE
© Copyright 2015 Pivotal. All rights reserved.
THE POWER OF 1
R
X
Increasing
Freight Utilization Rail
Predictive
Maintenance Healthcare
Predictive
Diagnostics Power
Driving Outcomes That Matter
One Percent Improvement Equals
$27B
Industry Value by
Reducing System
Inefficiency
$63B
Industry Value by
Reducing Process
Inefficiency
$66B
Industry Value with
Efficiency Improvements
In Gas-fired Power
Plant Fleets
Source: General Electric
© Copyright 2015 Pivotal. All rights reserved.
DATA-DRIVEN ENTERPRISE JOURNEY
STORE
• Structured
• Unstructured
• High Volume
• High Velocity
ANALYZE
• Predictive Analytics
• Machine Learning
• Advance Data Science
• Realtime Analytics
DEVELOP
• Advanced Analytic Pipelines
• Realtime Analytical Applications
• Global Scale Data-Driven
Applications
• Enterprise, Consumer, and
Mobile
INNOVATE
• Agile Dev Expertise
• DevOps
• Microservice
• Continuous Delivery
• Closed Loop Applications
AGILE DEVELOPMENT
BIG DATA
PREDICTIVE ANALYTICS
CLOUD NATIVE PLATFORM
8© Copyright 2015 Pivotal. All rights reserved.
0% of CIOs think
their IT infrastructure
is fully prepared for
big data (3)
30% of
companies have
deployed advanced
analytics, 11% big
data analysis (4)
44% of new applications
failed to meet performance
expectations (5)
2X
90% of companies
allocate at least 2X
more cloud capacity
than needed to ensure
performance (6)
But…
80% of
CEOs thinking
data mining and
analysis are
strategically important
(1)
4% of
companies use
analytics
effectively (2)
(1) 2015 PWC CEO Survey; (2)2013 Baine and Company - The Value of Big Data; (3) 2014 IT Infrastructure Conversation - IBM; (4) Ernest and Young - 2014 Enterprise IT Trends and
Investments; (5) 2014 Riverbed Tecnologies - The Transformers; (6) 2014 ElasticHosts CIO Study
LARGE ENTERPRISE BIG DATA TROUBLE
9© Copyright 2015 Pivotal. All rights reserved.
BIG DATA
CHASM
70%
of data
generated by
customers
80%
of data stored
3%
prepared for
analysis
0.5%
being
analyzed
<0.5%
being
operationalized
9
THE DATA DIVIDE
10© Copyright 2015 Pivotal. All rights reserved.
Software Is Eating The World
Data Is Fueling Software
SOFTWARE IS EATING THE WORLD
11© Copyright 2015 Pivotal. All rights reserved.
WE CHOSE PIVOTAL BECAUSE WE BELIEVE IT
PROVIDES A 360-DEGREE VIEW OF THE PROCESS.
FROM A DATA SCIENCE AND DATA TECHNOLOGY
PERSPECTIVE, IT MEANS DELIVERING BEST-IN-
CLASS DATA TECHNOLOGIES AND ENABLING THEM
ON THEIR PLATFORM.
“
”
12© Copyright 2015 Pivotal. All rights reserved.
ACROSS INDUSTRIES
13© Copyright 2015 Pivotal. All rights reserved.
THE NEW DATA IMPERATIVES
Converged
Data & Cloud
OpenData-Driven
Apps
14© Copyright 2015 Pivotal. All rights reserved.
THE BIG DATA PROBLEM
Fragmentation ConstraintsComplexity
15© Copyright 2015 Pivotal. All rights reserved.
• Remove Lock-in
• Leverage Ecosystem
• Co-innovate
GUIDING PRINCIPLES IN THE NEW ERA
OPEN AGILE CLOUD-READY
• Shorten innovation
cycles
• Reduce TCO
• Improve TTM
• Solve business
problems
• Avoid lock-in
• Appropriate security
16© Copyright 2015 Pivotal. All rights reserved.
JOURNEY TO A DATA-DRIVEN ENTERPRISE
Deploy analytic apps and
automate at scale
Perform advanced analytics
Discover insights
Modernize data
infrastructure
17© Copyright 2015 Pivotal. All rights reserved.
Deploy analytic apps and
automate at scale
Perform advanced analytics
Discover insights
Modernize data
infrastructure
DATA-DRIVEN COMPANIES:
USE MODERN DATA INFRASTRUCTURE
18© Copyright 2015 Pivotal. All rights reserved.
MODERNIZE DATA INFRASTRUCTURE
Elastic, Scale-out
storage and processing
Flexible data types and
pipelining
ETL on demand: low operational cost
Expanded use cases
Higher quality analytics
Lowered storage/processing cost
Less fragmented ecosystem
Reduced vendor lock-in
REQUIREMENTS BENEFITS
Cloud friendly and
open-source based
19© Copyright 2015 Pivotal. All rights reserved.
Modernize data
infrastructure
Deploy analytic apps and
automate at scale
Perform advanced analytics
Discover insights
DATA-DRIVEN COMPANIES:
STRATEGICALLY USE ADVANCED ANALYTICS
20© Copyright 2015 Pivotal. All rights reserved.
ADVANCED ANALYTICS
Leverage existing skills and tools
Rapid time to insights
Internet of Things use cases
Rapid time to insights
Solve business problems
Predictive insights: proactive execution
REQUIREMENTS BENEFITS
Machine learning and
advanced analytics
01010101010101
01001010101010
10101100101010
SQL- compliant batch
and interactive queries
Massive stream
processing
0101010101010101001010
1010101010110010101010
10101010
21© Copyright 2015 Pivotal. All rights reserved.
Modernize data
infrastructure
Perform advanced analytics
Discover insights
DATA-DRIVEN COMPANIES:
INNOVATE AT SCALE
Deploy analytic apps and
automate at scale
22© Copyright 2015 Pivotal. All rights reserved.
ANALYTIC APPS AND AUTOMATION AT SCALE
Reduced time to action
Low ‘analytics  app-dev’ integration cost
Reduced time to insights
Flexible ingestion: low operating cost
High performance: low operating cost
Transactional safety: business critical ops
REQUIREMENTS BENEFITS
Low-latency, distributed
in-memory transactions
Resilient, scale-out
messaging and object storage
Agile analytic app-dev
with enterprise PaaSPaaS
23© Copyright 2015 Pivotal. All rights reserved.
JOURNEY TO A DATA-DRIVEN ENTERPRISE
Deploy analytic apps and
automate at scale
Perform advanced analytics
Discover insights
Modernize data
infrastructure
Pivotal Data Science
helps you move from BI to
Data Science
Pivotal Labs
helps you move to an
agile development of apps
at scale
Pivotal Data Engineering
helps you move from data
administration to data engineering
26© Copyright 2015 Pivotal. All rights reserved.
PIVOTAL BIG DATA SUITE
27© Copyright 2015 Pivotal. All rights reserved.
Open sourcing all Pivotal Big Data Suite components including:
WORLD’S FIRST OPEN SOURCED BIG DATA
PORTFOLIO
BUILDING ON SUCCESS OF CLOUD FOUNDRY FOUNDATION
BUILT FOR ENTERPRISES
Pivotal GemFire
Apache Geode
Apache HAWQ
Pivotal HDB
Pivotal
Greenplum Database
28© Copyright 2015 Pivotal. All rights reserved.
BUILT FOR ENTERPRISES
Value added features: enterprise grade performance + robustness without lock-in
• Advanced Query Optimization in analytics
• WAN replication and continuous query in transactional processing
Flexible Deployment models: align to business objectives and needs
• Balance cost objectives with policy and compliance requirements
• Leverage Pivotal’s pre-integration + certification on supported configurations
Enterprise grade support: one throat to choke for the suite
• Focus on business problems – not on lifecycle management
• Expert support on Big Data Suite means reduced business risk
29© Copyright 2015 Pivotal. All rights reserved.
• Common core for Hadoop ecosystem
• Rapidly accelerated certifications, ecosystem
development and enterprise-grade quality
OpenDataPlatform.org
OPEN
30© Copyright 2015 Pivotal. All rights reserved.
AGILE
Deploy analytic apps and
automate at scale
Perform advanced analytics
Discover insights
Modernize data infrastructure
Spring XD Spark Pivotal HD & Open Data Platform
Pivotal Greenplum Database
Pivotal HDB Rabbit MQ
Redis
Pivotal GemFire Pivotal BDS on PCF
Pivotal Cloud Foundry
31© Copyright 2015 Pivotal. All rights reserved.
CLOUD-READY
COMMODITY
HARDWARE
APPLIANCE HYBRID CLOUDCLOUD
IaaS IaaS
PAAS
32© Copyright 2015 Pivotal. All rights reserved.
DATA-DRIVEN ENTERRPRISE JOURNEY WITH
PIVOTAL BIG DATA SUITE
STORE
• Structured
• Unstructured
• High Volume
• High Velocity
ANALYZE
• Predictive Analytics
• Machine Learning
• Advance Data Science
• Realtime Analytics
DEVELOP
• Advanced Analytic Pipelines
• Realtime Analytical Applications
• Global Scale Data-Driven
Applications
• Enterprise, Consumer, IoT, and
Mobile
INNOVATE
• Agile Dev Expertise
• DevOps
• Microservices
• Continuous Delivery
• Closed Loop Applications
AGILE DEVELOPMENT
BIG DATA
PREDICTIVE ANALYTICS
CLOUD NATIVE PLATFORM
Spring XD
Spark
Pivotal HD &
Open Data Platform
Spring XD
Pivotal Greenplum
Database
Pivotal HDB
Spring XD
Pivotal GemFire
Rabbit MQ
Spring Cloud
Pivotal BDS on PCF
Pivotal Cloud Foundry
Pivotal LabsData ScienceData Engineering
35© Copyright 2015 Pivotal. All rights reserved.
FOR FURTHER INFO, CHECKOUT…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
Or reach out to your local Pivotal Account Executive…
36© Copyright 2015 Pivotal. All rights reserved. 36© Copyright 2013 Pivotal. All rights reserved.
Pivotal Data
Science Overview
and Use Cases
Pivotal Big Data Roadshow
37© Copyright 2015 Pivotal. All rights reserved.
DATA SCIENCE?
App Development
Analytics
Business Intelligence
Reporting
Visualization
Dashboards
Insights
Big Data
Machine Learning
Statistics
Mathematics
Time Series
Algorithms
Databases
Software
Modeling
Queries
Real-Time
Sensors
Predictive Models
ETL
Research
Hadoop
Distributed Computing
MapReduce
SQL
In-Memory
OLAP
Text Mining
Unstructured Data
Open Source
Decision Science
Ad Hoc Queries
Hacking
In-Database Analytics
Internet of Things
Data Cleansing
Sentiment
38© Copyright 2015 Pivotal. All rights reserved.
• ETL
• Unstructured
• Data Cleansing
• Sensors
Data Related
• Algorithms
• Mathematics
• Statistics
• Econometrics
• Predictive Modeling
• Machine Learning
• Text Mining
• Sentiment
• Map Reduce
Fields of Study &
Techniques
• Dashboards
• Insights
• Visualization
• Ad Hoc Queries
• Reporting
Business
Intelligence
• Software
• In-Database
Analysis
• Distributed
Computing
• Hadoop
• Open Source
Implementation
• Big Data
• Decision
Science
• Internet of
Things
• Real-Time
• Hacking
• In-Memory
Industry
Buzzwords
39© Copyright 2015 Pivotal. All rights reserved.
What is Data Science?
The use of statistical and machine learning techniques on
big multi-structured data in a distributed computing
environment to identify correlations and causal
relationships, classify and predict events, identify patterns
and anomalies, and infer probabilities, interest, and
sentiment.
DRIVE AUTOMATED, LOW-LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST
40© Copyright 2015 Pivotal. All rights reserved.
Gene Sequencing
Smart Grids
COST TO SEQUENCE
ONE GENOME
HAS FALLEN FROM
$100M IN 2001
TO $10K IN 2011
TO $1K IN 2014
READING SMART METERS
EVERY 15 MINUTES IS
3000X MORE
DATA INTENSIVE
Stock Market
Social Media
FACEBOOK UPLOADS
250 MILLION
PHOTOS EACH DAY
Billions of Data Points
Oil Exploration
Video Surveillance
OIL RIGS GENERATE
25000
DATA POINTS
PER SECOND
Medical Imaging
Mobile Sensors
41© Copyright 2015 Pivotal. All rights reserved.
What is Big Data Analytics?
Descriptive
Analytics
WHAT HAPPENED?
Diagnostic
Analytics
WHY DID IT HAPPENED?
Predictive
Analytics
WHAT WILL HAPPEN?
Prescriptive
Analytics
HOW CAN WE MAKE IT HAPPEN?
Complexity
Value of
Analytics
($)
42© Copyright 2015 Pivotal. All rights reserved.
P L A T F O R M
Data Science Toolkit
KEY TOOLS KEY LANGUAGES
SQL
43© Copyright 2015 Pivotal. All rights reserved.
Scalable, In-Database ML
• Open Source https://github.com/madlib/madlib
• Works on Greenplum DB, HAWQ and PostgreSQL
• In active development by Pivotal
• Downloads and Docs: http://madlib.net/
44© Copyright 2015 Pivotal. All rights reserved.
Functions
Supervised Learning
Regression Models
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Ordinal Regression
• Robust Variance, Clustered Variance
• Support Vector Machines
Tree Methods
• Decision Tree
• Random Forest
Other Methods
• Conditional Random Field
• Naïve Bayes
Unsupervised Learning
• Association Rules (Apriori)
• Clustering (K-means)
• Topic Modeling (LDA)
Statistics
Descriptive
• Cardinality Estimators
• Correlation
• Summary
Inferential
• Hypothesis Tests
Other Statistics
• Probability Functions
Other Modules
• Conjugate Gradient
• Linear Solvers
• PMML Export
• Random Sampling
• Term Frequency for Text
Time Series
• ARIMA
Aug 2015
Data Types and Transformations
• Array Operations
• Dimensionality Reduction (PCA)
• Encoding Categorical Variables
• Matrix Operations
• Matrix Factorization (SVD, Low Rank)
• Norms and Distance Functions
• Sparse Vectors
Model Evaluation
• Cross Validation
Predictive Analytics Library
45© Copyright 2015 Pivotal. All rights reserved.
A single address for everything analytics
Analytics with Pivotal
Time-to-Insights
FORECASTING CLUSTERING
REGRESSION
CLASSIFICATION
OPTIMIZATION
46© Copyright 2015 Pivotal. All rights reserved.
Smart Systems = Sensors + Digital Brain + Actuators
Problem
Formulation
Modeling
Step
Data Step
Application
Step
Data Science for
Building Models
Sensors &
Actuators
Data Lake
47© Copyright 2015 Pivotal. All rights reserved. 47© Copyright 2013 Pivotal. All rights reserved.
Data Science
Use Cases
48© Copyright 2015 Pivotal. All rights reserved. 48© Copyright 2013 Pivotal. All rights reserved.
Financial Services
49© Copyright 2015 Pivotal. All rights reserved.
Identifying and Pricing Cross-Sell Opportunities
CUSTOMER
A global financial services provider
BUSINESS PROBLEM
Identify cross-sell opportunities between two
business arms of a financial institution.
CHALLENGES
Integration of large-scale data originating from
multiple data warehouses. Developing predictive
models to identify novel cross-sell opportunities
within the financial institution. Evaluate the
identified cross-sell opportunities by their revenue
potential.
SOLUTIONS
 Fast integration of data in Pivotal Greenplum
Database.
 Predictive models and evaluation of profitability:
– Association rule.
– Logistic regression for each product
offered.
– Estimation of revenue opportunity.
 On-demand reporting and visualization via
custom dashboards connected to
in-database models.
 Identified multi-million dollar opportunities for the
bank.
50© Copyright 2015 Pivotal. All rights reserved.
Financial Compliance
BUSINESS PROBLEM
• Ensure compliance with Dodd-Frank and Basel
Committee regulations
• Identify underlying risk and fraud while reducing the
compliance department’s overburdened
Emails Chats Trades
Transactions Policy Securities
Phone Calls Watch Lists …
Financial compliance
Data Lake
Data
integration
Data
clean up
Modeling
Classification
and ranking
Analyst user interfaces
Feedback
Analytics
Analyst feedback
Data integration: e.g., append
trade information with email
and chat communications
Data cleanup: e.g.,
identify newsletters and
spam emails
Modeling:
• Predictive modeling to flag
messages and trades
• Graph and cohort analysis
Analyst feedback
Reviewed fraud instances
included in periodic model
refreshes
SOLUTION
 A data lake platform coupled with cutting edge data
science techniques
 Flexible user interface to promote an adaptive,
continuously learning compliance framework
51© Copyright 2015 Pivotal. All rights reserved. 51© Copyright 2013 Pivotal. All rights reserved.
Telco & Mobile
52© Copyright 2015 Pivotal. All rights reserved.
Subscriber Micro-Segmentation
CUSTOMER
A major telco with cable & VOD, internet, and
phone business units
BUSINESS PROBLEM
Better understand aggregated subscriber behavior
to drive business strategy using newly available
data sources
CHALLENGES
▪ Large quantities of deep packet inspection data
and set top box data that had not been analyzed
before
▪ Needed to incorporate internet usage and TV
consumption information into pre-existing
subscriber segments
SOLUTION
▪ Generated new subscriber segments that
incorporated features based on consumption of TV
and internet services across a variety of devices
▪ Crossed new segments with existing segments to
generate new micro-segments for cross-sell/upsell
and new product development opportunities
Customized Micro-Segments
53© Copyright 2015 Pivotal. All rights reserved.
Newly Identified Behavior-Based SegmentsSubscribers
Moderates
OTT & Data Heavyweights
Portable OTT Entertainment Seekers
iPhone Heavy
Android Heavy
iPad Heavy
In-Home OTT Entertainment Seekers
In-Home Native Content Seekers
VOD Heavy
TV Heavy
54© Copyright 2015 Pivotal. All rights reserved.
Opportunities for
Data-Driven Decisions
in Pharma
55© Copyright 2015 Pivotal. All rights reserved.
Data driven drugs: From discovery to delivery
RICH DATA SOURCES
• Molecular data
• Cellular drug screens
• Animal models
• Clinical data including notes, images,
markers (e.g. genomics, lab results)
• Sensor and assay data
• Internal and partner/purchased
external data
• Contact center data
• Patient registries, public and federal
data, clinical partnerships
Clinical
Trials
Manufacturing
Marketing
Distribution and
surveillance
Drug discovery
+ development
56© Copyright 2015 Pivotal. All rights reserved.
A pipeline of sensors and opportunities for optimizing output
Internet of Things in Manufacturing
Input materials Mix Incubate Filter Centrifuge Final Product
0
5
10
15
20
25
30
0 50 100 150 200
Sensors
High-Content
Screens
TEMP
TIME
Absorbance
Elution volume
Velocity
Time
Automated raw
materials mixing
57© Copyright 2015 Pivotal. All rights reserved.
Vaccine Potency Prediction
CUSTOMER
A major pharmaceutical company
BUSINESS PROBLEM
Predict potency and antigen levels of live virus
vaccines based on manufacturing sensor data
and manual data collected throughout the
process.
CHALLENGES
 Customer’s data model was not optimal for
running analytical queries
 Manual data quality issues
 Data capture was performed with varying
consistency due to high cost associated with
manual data collection
SOLUTION
 Introduced a new data model to make data
accessible and enable analytics (including
LIMS and DeltaV)
 Built automated outlier detection/correction
methods to address manual data entry
quality issues
 Devised imputation methods to deal with
data completeness issues
 Built predictive models with high accuracy
58© Copyright 2015 Pivotal. All rights reserved.
http://blog.pivotal.io/data-science-pivotal
Check out the Pivotal Data Science Blog!
59© Copyright 2015 Pivotal. All rights reserved.
FOR FURTHER INFO…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
• Or reach out to your local Pivotal Account Executive…
60© Copyright 2015 Pivotal. All rights reserved.
Pivotal Data Science Labs: Packaged Services
• Analytics
Roadmap
• Prioritized
Opportunities
• Architectural
Recommendati
ons
• Hands-on
training
• Hosted data on
Pivotal Data
stack
• Results review
&
assessment
• On-site MPP
analytics
training
• Analytics tool-
kit
• Quick insight
(2 weeks)
• Prof. services
• Data science
model building
• Ready-to-
deploy
model(s)
• Prof. services
• Data science
model building
• Ready-to-
deploy
model(s)
LAB PRIMER
(2-Week
Roadmapping)
LAB 600
(6-Week Lab)
LAB 1200
(12-Week Lab)
LAB 100
(Analytics Bundle)
DATA JAM
(Internal DS
Contest)
61© Copyright 2015 Pivotal. All rights reserved.
Data Streaming and
Predictive Analytics
Using Pivotal Big Data Suite
62© Copyright 2015 Pivotal. All rights reserved.
Converging
Trends
Innovation
New Data New
Processes
New Insights
The Journey to the Data-Driven Enterprise
Data Science
and Machine
Learning
Big Data
IoT, Mobile Apps,
Social Media
63© Copyright 2015 Pivotal. All rights reserved.
HDFS
Data Lake
Ingest Store Analytics
Hard to change
Labor intensive
Inefficient
Coding based
No real-time information
Based on expensive ETL
Migrating from a Reactive, Static and
Constrained Model…
64© Copyright 2015 Pivotal. All rights reserved.
HDFSData Lake
Expert System /
Machine Learning
In-Memory Real-
Time Data
Continuous Learning
Continuous Improvement
Continuous Adapting
Data Stream Pipeline
Multiple Data Sources
Real-Time Processing
Store Everything
To Pro-Active, Self-Improving, Machine
Learning Systems
65© Copyright 2015 Pivotal. All rights reserved.
New York Times Research: http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
“
50-80% OF THE TIME ON DATA
SCIENCE PROJECTS IS SPENT ON
DATA WRANGLING
”
66© Copyright 2015 Pivotal. All rights reserved.
Data Feeds
Stream Processing
Expert Systems
Machine Learning
Historical Data
Business Value
Smart Decisions
Still…
HDFS
Data Lake
67© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform Sink
SpringXD
GemFire
Data Stream Needs an Agile, Scalable and
Fast Solution
HAWQ GPDB
Data
Lake
68© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform Sink
SpringXD
Distributed
Computing
In-Memory
Real-Time Data
Spring XD Orchestrates and Automates all the
Steps on Data Stream Pipelining
Expert System /
Machine Learning
Extensible
Open-Source
Fault-Tolerant
Horizontally Scalable HAWQ GPDB
Data
Lake
69© Copyright 2015 Pivotal. All rights reserved.
INGEST / SINK PROCESS ANALYZE
• No coding required
• Dozens of built-in
connectors
• Seamless integration with
Kafka, Sqoop
• Create new connectors
easily using Spring
• Call Spark, Reactor or
RxJava
• Built-in configurable
filtering, splitting and
transformation
• Out-of-box configurable
jobs for batch processing
• Import and invoke PMML
jobs easily
• Call Python, R, Madlib and
other tools
• Built-in configurable
counters and gauges
Spring XD
State of the Art Data Pipeline Automation
70© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform Sink
SpringXD
Distributed
Computing
GemFire Provides Scalable, Low-Latency
Data Access, Storage and Event Processing
Expert System /
Machine Learning
GemFire
Extensible
Open-Source
Fault-Tolerant
Horizontally Scalable HAWQ GPDB
Data
Lake
71© Copyright 2015 Pivotal. All rights reserved.
GemFire
• In-Memory Enterprise Data Grid
• Horizontally Scalable, Consistent, Highly
Available
• Event handling
• Continuous Queries
• Enterprise Data Geo Distribution
In-memory Real Time Data
72© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform Sink
SpringXD
Distributed
Computing
Pivotal Provides SQL Based
Advanced Analytics
Expert System /
Machine Learning
GemFire
Extensible
Open-Source
Fault-Tolerant
Horizontally Scalable
Data
Lake
HAWQ GPDB
73© Copyright 2015 Pivotal. All rights reserved.
HAWQ
• Massively Parallel Processing
RDBMS on HADOOP
• ANSI SQL on Hadoop
• Extremely high performance for
analytics (not like Hive)
• Stores all data directly on
HDFS
• Open-Source
Advanced SQL analytics in Hadoop
Combining SQL with Hadoop is key for analytics
SQL remains #1 choice for Data Science
74© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform Sink
SpringXD
Developers and Data Scientists Can Focus on
the Business Value of Data
GemFire
Extensible
Open-Source
Fault-Tolerant
Horizontally Scalable
Data
Lake
HAWQ GPDB
75© Copyright 2015 Pivotal. All rights reserved.
Data Streaming Reference Architecture
Data Feeds Transactional Apps Analytic Apps
Data Stream Pipeline
Distributed
Computing
Real-Time Data
Expert Systems &
Machine Learning
Advanced
Analytics
HDFSData Lake
76© Copyright 2015 Pivotal. All rights reserved.
Data Streaming Reference Architecture
Data Feeds Transactional Apps Analytic Apps
Data Stream Pipeline
HDFSData Lake
GemFire HAWQ GPDB
SpringXD
77© Copyright 2015 Pivotal. All rights reserved.
“
SO WE ARE MOVING TO A WORLD WHERE THE
MACHINES WE WORK WITH ARE NOT JUST
INTELLIGENT; THEY ARE BRILLIANT.THEY ARE
SELF-AWARE, THEY ARE PREDICTIVE, REACTIVE
AND SOCIAL. IT'S A WORLD WHERE INFORMATION
ITSELF BECOMES INTELLIGENT AND COMES TO US
AUTOMATICALLY WHEN WE NEED IT WITHOUT
HAVING TO LOOK FOR IT.
”MARCO ANNUNZIATA, GE
78© Copyright 2015 Pivotal. All rights reserved.
Demo
Powered by Pivotal Big Data Suite
79© Copyright 2015 Pivotal. All rights reserved.
It's all about DATA
Data Sources
Look for patterns
Prediction
Transform Sink
SpringXD
Extensible
Open-Source
Fault-Tolerant
Horizontally Scalable
Cloud-Native
Machine Learning
Enrich Filter
Split
Dashboard
Indicators
1
2
Predict
3
Real data
Simulator
/Stocks
/TechIndicators
/Predictions
81© Copyright 2015 Pivotal. All rights reserved.
Driving Real Insights Through Data Science
91© Copyright 2015 Pivotal. All rights reserved.
“
THE REAL OPPORTUNITY FOR
CHANGE...SURPASSING THE MAGNITUDE OF THE
CONSUMER INTERNET...IS THE INDUSTRIAL
INTERNET, AN OPEN, GLOBAL NETWORK THAT
CONNECTS PEOPLE, DATA AND MACHINES.
”JEFF IMMELT, CEO, GE
100© Copyright 2015 Pivotal. All rights reserved.
FOR FURTHER INFO, CHECKOUT…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
• Or reach out to your local Pivotal Account Executive…
BUILT FOR THE SPEED OF BUSINESS
102© Copyright 2015 Pivotal. All rights reserved. 102© Copyright 2013 Pivotal. All rights reserved.
Accelerating the Generation of
New Insights
October 27, 2015
Sarah Aerni, Data Science
Antonio Petrole, Data Engineering
BUILT FOR THE SPEED OF BUSINESS
104© Copyright 2015 Pivotal. All rights reserved.
Gene Sequencing
Smart Grids
COST TO SEQUENCE
ONE GENOME
HAS FALLEN FROM
$100M IN 2001
TO $10K IN 2011
TO $1K IN 2014
READING SMART METERS
EVERY 15 MINUTES IS
3000X MORE
DATA INTENSIVE
Stock Market
Social Media
FACEBOOK UPLOADS
250 MILLION
PHOTOS EACH DAY
Technology to process and store data is needed in
all industries
Oil Exploration
Video Surveillance
OIL RIGS GENERATE
25000
DATA POINTS
PER SECOND
Medical Imaging
Mobile Sensors
105© Copyright 2015 Pivotal. All rights reserved.
What is Big Data Analytics?
Descriptive
Analytics
WHAT HAPPENED?
Diagnostic
Analytics
WHY DID IT HAPPEN?
Predictive
Analytics
WHAT WILL HAPPEN?
Prescriptive
Analytics
HOW CAN WE MAKE IT HAPPEN?
Complexity
Value of
Analytics
($)
106© Copyright 2015 Pivotal. All rights reserved.
Opportunities for
Data-Driven Decisions
in Pharma
107© Copyright 2015 Pivotal. All rights reserved.
Data driven drugs: From discovery to delivery
RICH DATA SOURCES
• Molecular data
• Cellular drug screens
• Animal models
• Clinical data including notes, images,
markers (e.g. genomics, lab results)
• Sensor and assay data
• Internal and partner/purchased
external data
• Contact center data
• Patient registries, public and federal
data, clinical partnerships
Clinical
Trials
Manufacturing
Marketing
Distribution and
surveillance
Drug discovery
+ development
108© Copyright 2015 Pivotal. All rights reserved.
A pipeline of sensors and opportunities for optimizing output
Internet of Things in Manufacturing
Input materials Mix Incubate Filter Centrifuge Final Product
0
5
10
15
20
25
30
0 50 100 150 200
Sensors
High-Content
Screens
TEMP
TIME
Absorbance
Elution volume
Velocity
Time
Automated raw
materials mixing
109© Copyright 2015 Pivotal. All rights reserved.
Vaccine Potency Prediction
CUSTOMER
A major pharmaceutical company
BUSINESS PROBLEM
Predict potency and antigen levels of live virus
vaccines based on manufacturing sensor data
and manual data collected throughout the
process.
CHALLENGES
 Customer’s data model was not optimal for
running analytical queries
 Manual data quality issues
 Data capture was performed with varying
consistency due to high cost associated with
manual data collection
SOLUTION
 Introduced a new data model to make data
accessible and enable analytics (including
LIMS and DeltaV)
 Built automated outlier detection/correction
methods to address manual data entry
quality issues
 Devised imputation methods to deal with
data completeness issues
 Built predictive models with high accuracy
110© Copyright 2015 Pivotal. All rights reserved.
Interpreting the utility of a measure obtained during manufacturing based
on model outcomes
Sample model insights
 Some features may reveal
tunable parameters to
alter potency, others may
simply be markers
 Features consistently
absent from models may
be uninformative for
predicting potency
 Opportunities to provide
real-time feedback on data
entry errors and predicted
potency outcomes
Assayed value Duration of a step
Potency
Potency
Correlation=0.45 Correlation=0.38
111© Copyright 2015 Pivotal. All rights reserved.
Need for new environments to process big data?
HDFS STORAGE AND MPP
ARCHITECTURES DISTRIBUTE STORAGE
AND PREVENT DATA MOVEMENT
VARIETY/VELOCITY
DISTRIBUTED COMPUTATION FOR
PARALLELIZATION
PETABYTES OF DATA
OPEN-SOURCE LIBRARY FOR MACHINE
LEARNING AT SCALE AND FRAMEWORK
TO ACCESS COMMON LANGUAGES
RAPIDLY EVOLVING FIELD
OF DATA SCIENCE AND
TOOLS
SQL ENGINE AND ODBC/JDBC
CONNECTIONS TO HADOOP
MANY EXISTING LIBRARIES,
TOOLS AND EXPERTISE
FLEXIBLE
SCALABLE
ENABLING
ACCESSIBLE
112© Copyright 2015 Pivotal. All rights reserved.
Multiple tools with a single, simple goal: Distributed
storage with in-place computation
Pivotal
Hadoop
Pivotal Greenplum
Database
HAWQ
113© Copyright 2015 Pivotal. All rights reserved.
Multiple tools with a single, simple goal: Distributed
storage with in-place computation
Think of it as multiple PostGreSQL servers
Segments/Workers
Master
Rows are distributed across segments by a particular field (or randomly)
Pivotal
Hadoop
Pivotal Greenplum
Database
HAWQ
114© Copyright 2015 Pivotal. All rights reserved.
Identifying duplicates: counting with grouping
Opportunities for
performance improvements
– Sorting and re-sorting is
required in many
pipelines
– Single-threaded
processes create
bottlenecks in speed and
need to move data
https://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-1-Map_and_Dedup.pdf
Solution: Leverage Pivotal’s
distributed MPP environment by using
common database functions
115© Copyright 2015 Pivotal. All rights reserved.
Identifying duplicates: counting with grouping
Reference genome
Mapped reads
116© Copyright 2015 Pivotal. All rights reserved.
Identifying duplicates: counting with grouping
Duplicates
1
3
1
1
1
1
5
1
3
select locus, count(*) from reads
group by locusReference genome
Mapped reads
117© Copyright 2015 Pivotal. All rights reserved.
Reference genome
Mapped reads
Counting numbers of reads mapped to features
select exon, count(*) from
reads JOIN refseq
ON(<reads overlap exon>)
group by exon
5
17
12
118© Copyright 2015 Pivotal. All rights reserved.
Multiple tools with a single, simple goal: Distributed
storage with in-place computation
Think of it as distributed file system with very large blocks of data
Schema on read allows flexibility for a variety of datasets
Compute using a variety of paradigms (e.g. MapReduce)
Pivotal
Hadoop
Pivotal Greenplum
Database
HAWQ
Name Node
Data
Node 1
Data
Node 2
Data
Node 3
Data
Node 4
1 2 3 2 3 1 1 2
119© Copyright 2015 Pivotal. All rights reserved.
Multiple tools with a single, simple goal: Distributed
storage with in-place computation
• SQL compliant
• World-class query optimizer
• Interactive query
• Horizontal scalability
• Robust data management
• Common Hadoop formats
• Deep analytics
Pivotal
Hadoop
Pivotal Greenplum
Database
HAWQ
Think of it as distributed PostGreSQL (GPDB) on
Hadoop
• SQL compliant
• World-class query optimizer
• Interactive query
• Horizontal scalability
• Robust data management
• Common Hadoop formats
• Deep analytics
120© Copyright 2015 Pivotal. All rights reserved.
A single address for everything analytics
Analytics with Pivotal
Time-to-Insights
FORECASTING CLUSTERING
REGRESSION
CLASSIFICATION
OPTIMIZATION
121© Copyright 2015 Pivotal. All rights reserved.
P L A T F O R M
Data Science Toolkit
KEY TOOLS KEY LANGUAGES
SQL
122© Copyright 2015 Pivotal. All rights reserved.
Historically data was studied in silos
BRCA dataset
Treatments
Protein Assays
Imaging
VariantsPatient History &
Follow-Ups
Gene
Expression
Copy Number
Variation
miRNA
123© Copyright 2015 Pivotal. All rights reserved.
Genomics
Data Center
Researcher
Computing
ClusterUnnecessary data
movement
Network usage
Need for new environment: Data movement
Computation and storage in a single location
124© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
Network
Interconnect
Master
Severs
Segment
Severs
SQL & RCOVARIATES GENOTYPES
Indiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
SNP
1 2 M
A
A
C
C
TT
A
T
C
G
TT
A
A
G
G
T
C
TT C
G
T
C
125© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
Network
Interconnect
Master
Severs
Segment
Severs
SNP1 SNP2 SNPM
SQL & R
Indiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SN
P
Gen
o
1 1 AA
2 1 AT
3 1 AA
1 2 CC
2 2 CG
3 2 GG
N M TC
COVARIATES GENOTYPES
126© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
Network
Interconnect
Master
Severs
Segment
Severs
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
SQL & R
Indiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SN
P
Gen
o
1 1 AA
2 1 AT
3 1 AA
1 2 CC
2 2 CG
3 2 GG
N M TC
COVARIATES GENOTYPES
127© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
Network
Interconnect
Master
Severs
Segment
Severs
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
SQL & R
Indiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SN
P
Gen
o
1 1 AA
2 1 AT
3 1 AA
1 2 CC
2 2 CG
3 2 GG
N M TC
COVARIATES GENOTYPES
128© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
Network
Interconnect
Master
Severs
Segment
Severs
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
SQL & R
Indiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SN
P
Gen
o
1 1 AA
2 1 AT
3 1 AA
1 2 CC
2 2 CG
3 2 GG
N M TC
SNP P-value
1 2.34x10-21
2 0.395
3 7.15x10-17
M 0.000142
COVARIATES GENOTYPES RESULTS
129© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
Network
Interconnect
Master
Severs
Segment
Severs
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
SQL & R
Indiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SN
P
Gen
o
1 1 AA
2 1 AT
3 1 AA
1 2 CC
2 2 CG
3 2 GG
N M TC
SNP P-value
1 2.34x10-21
2 0.395
3 7.15x10-17
M 0.000142
COVARIATES GENOTYPES RESULTS
• In-database computation of 1 million loci for thousands of
individuals in seconds
• Results are easily manipulated and explored
130© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
Network
Interconnect
Master
Severs
Segment
Severs
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
LOR1 LOR2 LORM
SQL & R
Indiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SN
P
Gen
o
1 1 AA
2 1 AT
3 1 AA
1 2 CC
2 2 CG
3 2 GG
N M TC
SNP P-value
1 2.34x10-21
2 0.395
3 7.15x10-17
M 0.000142
COVARIATES GENOTYPES RESULTS
• In-database computation of 1 million loci for thousands of
individuals in seconds
• Results are easily manipulated and explored
131© Copyright 2015 Pivotal. All rights reserved.
Procedural Languages in Big Data Science
 HAWQ & PL/X can take advantage of “data
parallel” tasks by performing analyses in
parallel – embarrassingly parallel tasks
– Little/no effort required to break up the problem
into parallel tasks
– No dependency (or communication) between
tasks
 Examples of ‘data parallel’ problems:
– Counting words in documents
– Genome-Wide Association Study
– Studying network anomalies
 Sample Implementations by PDL
– Digital image processing
– Bayesian Inference with MCMC
– Parallel Bagged Decision Trees
Doc1 Doc2 DocM
Stem1 Stem2 StemM
SQL & R
Count1 Count2 Count
M
Network
Interconnect
Master
Severs
Segment
Severs
132© Copyright 2015 Pivotal. All rights reserved.
Finding Causal Variants in Lupus
Customer
Biotech Company
Business Problem
The customer wants to establish internal data
science capabilities: building a culture and
acquiring hardware and people to support it
Challenges
 Customer needs to establish a culture around
sharing and analyzing datasets for value
 Current in-house technology is unable to
support large-scale analysis (e.g. unable to
analyze genomics datasets)
 Need to learn new paradigms for analyzing data
at-scale
Solution
 Train customer employees on our solution stack,
provide one-on-one consulting and run a
hackathon
 Greatly reduced computation time enabling
implementation and results during a 30 hour
period
– Module previously requiring 30min took only 5sec
in using SQL in database
 Novel scientific discovery on untouched data
– Dataset previously untouched for 2 years due to
limited resources
– Mine for statistically significant associations
between ~400,000 variants for 1000 phenotypes
133© Copyright 2015 Pivotal. All rights reserved.
Processing images and building integrated
models at scale
134© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop
Sequence
File
Thousands of
Images
Image
Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS
Map
reduce
Map
reduce
One sequence file
135© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop
Sequence
File
Thousands of
Images
Image
Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS HAWQ/GPDB
Map
reduce
Map
reduce
Join to
additional
datasets
Proteomics
Medical
History
Variants
Additional Datasets
Build Models at-Scale
SQLOne sequence file
136© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop
Sequence
File
Thousands of
Images
One sequence file
Image
Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS HAWQ/GPDB
Map
reduce
Map
reduce
Raw Pixels
img1 [rgb1-rgbK]
img2 [rgb1-rgbK]
imgN [rgb1-rgbK]
Map
reduce
Join to
additional
datasets
Proteomics
Medical
History
Variants
Additional Datasets
Build Models at-Scale
SQL
137© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop
Sequence
File
Thousands of
Images
One sequence file
Image
Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS HAWQ/GPDB
Map
reduce
Map
reduce
Raw Pixels
img1 [rgb1-rgbK]
img2 [rgb1-rgbK]
imgN [rgb1-rgbK]
Map
reduce PL/X
SQL
Join to
additional
datasets
Feature
Generation
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Proteomics
Medical
History
Variants
Additional Datasets
Build Models at-Scale
SQL
138© Copyright 2015 Pivotal. All rights reserved.
Representing an image in HAWQ
HAWQ enables rapid processing of multiple or extremely
large images in parallel without memory limitations
Source Image:
Col
Row
0 1 2
0
1
2
0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
col
row
intsy
Structured:
139© Copyright 2015 Pivotal. All rights reserved.
Translating image processing to simple SQL
Function Distribution of pixel intensities
SQL SELECT intsy, count(*)
FROM tbl
GROUP BY intsy
Output 150, 5
215, 4
HAWQ enables rapid processing of multiple or extremely
large images in parallel without memory limitations
 No data movement required
 Simple SQL queries for data exploration
0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
Source Image:
Col
Row
0 1 2
0
1
2
col
row
intsy
Structured:
140© Copyright 2015 Pivotal. All rights reserved.
Image Processing Pipeline
For Object Counting
Original
Image name # Cells
Tma_001.jpg 359
Tma_002.jpg 1892
Tma_003.jpg 871
… …
Smoothing
Average over
window of pixels
Thresholding
Select pixels under
intensity threshold
Cleanup
Min/max over
window of pixels
Object Detection
Connected
components
Object Counting
Select components
with size filter
141© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop
Sequence
File
Thousands of
Images
One sequence file
Image
Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS HAWQ/GPDB
Map
reduce
Map
reduce
Raw Pixels
img1 [rgb1-rgbK]
img2 [rgb1-rgbK]
imgN [rgb1-rgbK]
Map
reduce PL/X
SQL
Join to
additional
datasets
Feature
Generation
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Proteomics
Medical
History
Variants
Additional Datasets
Build Models at-Scale
SQL
142© Copyright 2015 Pivotal. All rights reserved.
A Drug-Centric Data Lake to Enable Drug Discovery
Customer
A major pharmaceutical company
Business Problem
Identifying promising drug targets leveraging
and integrating the vast datasets available
will reduce time and cost to bring a new
product to market
Challenges
• Data for drugs screens across multiple
modalities cannot be easily integrated
• Current environments cannot support the
growing data form high-content screens
• Researchers are unable to leverage the
entirety of datasets, instead work with
aggregates or summaries
Solution
• Proved that current customer models can be
ported, sped up and scaled in the Pivotal
environment
• Created richer models integrating multiple
types of data (genomics, images, etc)
• Improve models using the raw, most-granular
data
• Demonstrate how the availability of tools and
data enables scientists to interrogate models
and derive a deeper, actionable insight
143© Copyright 2015 Pivotal. All rights reserved.
http://blog.pivotal.io/data-science-pivotal
Check out the Pivotal Data Science Blog!
144© Copyright 2015 Pivotal. All rights reserved.
FOR FURTHER INFO, CHECKOUT…
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Academy @ https://pivotal.biglms.com
145© Copyright 2015 Pivotal. All rights reserved. 145© Copyright 2013 Pivotal. All rights reserved.
Driving Insights
from Data Lakes
Manufacturing
Demo
Matthew Ross & Antonio Petrole
Pivotal Data Engineering
146© Copyright 2015 Pivotal. All rights reserved.
Internet of What????
147© Copyright 2015 Pivotal. All rights reserved.
Industrial Internet of Things?
148© Copyright 2015 Pivotal. All rights reserved.
IoT Goes Mainstream
According to Gartner, Inc. (a technology research and
advisory corporation), there will be nearly 26 billion
devices on the Internet of Things by 2020.
149© Copyright 2015 Pivotal. All rights reserved.
GE Doubles Down
GE invests in IIoT cloud and creates Predix cloud built upon Pivotal Cloud Foundry.
GE estimates that connecting these industrial machines to the IoT could boost
global GDP by $10 trillion to $15 trillion in 20 years. McKinsey Global Institute
research holds that IoT in general could add $6.2 billion to the global economy by
2025.
150© Copyright 2015 Pivotal. All rights reserved.
Converging
Trends
Innovation
New Data New
Processes
New Insights
The Journey to the Data-Driven Enterprise
Data Science
and Machine
Learning
Big Data
Internet
of Things
151© Copyright 2015 Pivotal. All rights reserved.
IoT Key Workflows
Data Flow
Management
Reliable
Infrastructure
Enterprise Level
Tooling
● High Availability
● Fault Tolerant
● Scalable
● Data Overload
● Normalization
● Multiple Sources
and Destinations
● Workflow
Orchestration
● Admin Tooling
● Developer
Enablement
152© Copyright 2015 Pivotal. All rights reserved.
IoT Key Workflows
Data Flow
Management
Reliable
Infrastructure
Enterprise Level
Tooling
● High Availability
● Fault Tolerant
● Scalable
● Data Overload
● Normalization
● Multiple Sources
and Destinations
● Workflow
Orchestration
● Admin Tooling
● Developer
Enablement
153© Copyright 2015 Pivotal. All rights reserved.
Data Flow Management-Data Overload
● The ability to stream
and process massive
amounts of data
● Must have a platform
that can handle that
much data without
losing or corrupting
any of it
154© Copyright 2015 Pivotal. All rights reserved.
Data Flow Management-Data Normalization
and Cleansing
● Organizing Fields
to fit into a
relational structure
● Adding extra fields
or removing
unneeded ones
155© Copyright 2015 Pivotal. All rights reserved.
Data Flow Management-Multiple Sources and
Destinations
● Stream data from
multiple different
sources
● Persist it so multiple
different destinations
● Process multiple
different data formats
156© Copyright 2015 Pivotal. All rights reserved.
IoT Key Workflows
Data Flow
Management
Reliable
Infrastructure
Enterprise Level
Tooling
● High Availability
● Fault Tolerant
● Scalable
● Data Overload
● Normalization
● Multiple Sources
and Destinations
● Workflow
Orchestration
● Admin Tooling
● Developer
Enablement
157© Copyright 2015 Pivotal. All rights reserved.
Reliable Infrastructure-High Availability and
Fault Tolerance
● Resources are
available under any
conditions
● System stays up
even if some
resources go down
158© Copyright 2015 Pivotal. All rights reserved.
Reliable Infrastructure-Scalability
● Must be able to
handle more traffic
demand at any time
● Easily process big
data workloads
159© Copyright 2015 Pivotal. All rights reserved.
IoT Key Workflows
Data Flow
Management
Reliable
Infrastructure
Enterprise Level
Tooling
● High Availability
● Fault Tolerant
● Scalable
● Data Overload
● Normalization
● Multiple Sources
and Destinations
● Workflow
Orchestration
● Admin Tooling
● Developer
Enablement
160© Copyright 2015 Pivotal. All rights reserved.
Enterprise Level Tooling-Workflows and Tooling
● Manage complex
flows of data
● Provide rich User
Interface
Applications
161© Copyright 2015 Pivotal. All rights reserved.
Enterprise Level Tooling-Developer Enablement
● Full Featured APIs
● Extreme module
customization if
needed
162© Copyright 2015 Pivotal. All rights reserved.
Are you making the most out of your data?
163© Copyright 2015 Pivotal. All rights reserved.
Bringing it all together
164© Copyright 2015 Pivotal. All rights reserved. 164© Copyright 2013 Pivotal. All rights reserved.
Reporting is nice, but being
able to take action is what
drives the value of a platform
165© Copyright 2015 Pivotal. All rights reserved.
Predictive
Analytics
Proactive
Monitoring
Reactive
Maintenance
166© Copyright 2015 Pivotal. All rights reserved.
ETL vs Streaming
● Data is loaded in large
batches
● Typically happens once a
day
● Analysis can only be done
once data is transformed
and persisted to data
warehouse
167© Copyright 2015 Pivotal. All rights reserved.
ETL vs Streaming
● Continuous, data streams
are “listening” for data
being emitted from
sensors
● Data can be analyzed in
stream
● Can be integrated into
data driven applications
168© Copyright 2015 Pivotal. All rights reserved.
Reactive Maintenance
● Alert is sent out to someone
on factory floor
● Worker must receive alert, and
be able to react
● Equipment is down for
however long it takes for
worker to receive notification
and complete repairs.
169© Copyright 2015 Pivotal. All rights reserved.
Proactive Maintenance Workflow
● Manager has Dashboard with
all gauges of the system
● Historical Log of Recent Alerts
are visible on the Dashboard
● Has the ability to dispatch a
worker to investigate a
specific line or robot
170© Copyright 2015 Pivotal. All rights reserved.
Predictive Analytics Workflow
● Run Machine Learning and
Data Science models
● Incorporate Business
Intelligence Tools
● Persist data to Data Lake and
run advanced queries
171© Copyright 2015 Pivotal. All rights reserved.
Data Streaming Needs an Agile, Scalable and Fast
Solution
Data Lake
Data Ingestion
Business Intelligence
Real Time
Analytics
Mobile
App
172© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform Sink
SpringXD
Spring XD Orchestrates and Automates all the Steps
on Data Stream Pipelining
HDB PHD
Data
Lake
Extensible
Open-Source
Fault-Tolerant
Horizontally Scalable
173© Copyright 2015 Pivotal. All rights reserved.
INGEST / SINK PROCESS ANALYZE
• No coding required
• Dozens of built-in
connectors
• Seamless integration with
Kafka, Sqoop
• Create new connectors
easily using Spring
• Call Spark, Reactor or
RxJava
• Built-in configurable
filtering, splitting and
transformation
• Out-of-box configurable
jobs for batch processing
• Import and invoke PMML
jobs easily
• Call Python, R, Madlib and
other tools
• Built-in configurable
counters and gauges
Spring XD
State of the Art Data Pipeline Automation
174© Copyright 2015 Pivotal. All rights reserved.
Pivotal HDB
Hadoop Native SQL
• Exceptional Hadoop Native
SQL Performance
• No compatibility risks to
SQL developers or SQL BI
tools and applications
• Support query roll-ups,
dynamic partitions and joins
• Massive MPP scalability to
petabytes
• On premise or on the cloud
• Scale your cluster out, not
up
• World class parallel loading
and unloading
• Fast performance for
complex and advanced data
analytics
• Integrated with MADLib for
advanced machine learning
• Powerful Cost-based Query
Optimizer
BUILT FOR THE SPEED OF BUSINESS

Mais conteúdo relacionado

Mais procurados

Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
Utilities Digital Data Driven Innovation
Utilities Digital Data Driven Innovation Utilities Digital Data Driven Innovation
Utilities Digital Data Driven Innovation Riccardo Romani
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Cloudera + Syncsort: Fuel Business Insights, Analytics, and Next Generation T...
Cloudera + Syncsort: Fuel Business Insights, Analytics, and Next Generation T...Cloudera + Syncsort: Fuel Business Insights, Analytics, and Next Generation T...
Cloudera + Syncsort: Fuel Business Insights, Analytics, and Next Generation T...Precisely
 
Why Are Digital Disruptors Successful And How Can You Become One?
Why Are Digital Disruptors Successful And How Can You Become One? Why Are Digital Disruptors Successful And How Can You Become One?
Why Are Digital Disruptors Successful And How Can You Become One? VMware Tanzu
 
Dell | Your Path – Our Platform & Great Partnerships
Dell | Your Path – Our Platform & Great PartnershipsDell | Your Path – Our Platform & Great Partnerships
Dell | Your Path – Our Platform & Great PartnershipsDataWorks Summit
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...TheInevitableCloud
 
Ensuring Cloud Native Success: The Greenfield Journey
Ensuring Cloud Native Success: The Greenfield JourneyEnsuring Cloud Native Success: The Greenfield Journey
Ensuring Cloud Native Success: The Greenfield JourneyVMware Tanzu
 
Machine Learning in the Enterprise 2019
Machine Learning in the Enterprise 2019   Machine Learning in the Enterprise 2019
Machine Learning in the Enterprise 2019 Timothy Spann
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewVMware Tanzu
 
Optimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataOptimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataCloudera, Inc.
 
Cloudera for Internet of Things
Cloudera for Internet of ThingsCloudera for Internet of Things
Cloudera for Internet of ThingsCloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Accelerating government agility with cloud computing v1
Accelerating government agility with cloud computing v1Accelerating government agility with cloud computing v1
Accelerating government agility with cloud computing v1David Linthicum
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewVMware Tanzu
 

Mais procurados (20)

Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
Utilities Digital Data Driven Innovation
Utilities Digital Data Driven Innovation Utilities Digital Data Driven Innovation
Utilities Digital Data Driven Innovation
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Cloudera + Syncsort: Fuel Business Insights, Analytics, and Next Generation T...
Cloudera + Syncsort: Fuel Business Insights, Analytics, and Next Generation T...Cloudera + Syncsort: Fuel Business Insights, Analytics, and Next Generation T...
Cloudera + Syncsort: Fuel Business Insights, Analytics, and Next Generation T...
 
Why Are Digital Disruptors Successful And How Can You Become One?
Why Are Digital Disruptors Successful And How Can You Become One? Why Are Digital Disruptors Successful And How Can You Become One?
Why Are Digital Disruptors Successful And How Can You Become One?
 
Dell | Your Path – Our Platform & Great Partnerships
Dell | Your Path – Our Platform & Great PartnershipsDell | Your Path – Our Platform & Great Partnerships
Dell | Your Path – Our Platform & Great Partnerships
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
 
Ensuring Cloud Native Success: The Greenfield Journey
Ensuring Cloud Native Success: The Greenfield JourneyEnsuring Cloud Native Success: The Greenfield Journey
Ensuring Cloud Native Success: The Greenfield Journey
 
Machine Learning in the Enterprise 2019
Machine Learning in the Enterprise 2019   Machine Learning in the Enterprise 2019
Machine Learning in the Enterprise 2019
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical Overview
 
Optimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataOptimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big Data
 
Cloudera for Internet of Things
Cloudera for Internet of ThingsCloudera for Internet of Things
Cloudera for Internet of Things
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Cloud Digital Transformation
Cloud Digital TransformationCloud Digital Transformation
Cloud Digital Transformation
 
Top 5 IoT Use Cases
Top 5 IoT Use CasesTop 5 IoT Use Cases
Top 5 IoT Use Cases
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Accelerating government agility with cloud computing v1
Accelerating government agility with cloud computing v1Accelerating government agility with cloud computing v1
Accelerating government agility with cloud computing v1
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical Overview
 

Semelhante a Driving Real Insights Through Data Science

Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Data Con LA
 
A New Day for Oracle Analytics
A New Day for Oracle AnalyticsA New Day for Oracle Analytics
A New Day for Oracle AnalyticsRich Clayton
 
Informatica Becomes Part of the Business Data Lake Ecosystem
Informatica Becomes Part of the Business Data Lake EcosystemInformatica Becomes Part of the Business Data Lake Ecosystem
Informatica Becomes Part of the Business Data Lake EcosystemCapgemini
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsDenodo
 
Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...
Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...
Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...CA Technologies
 
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaIs your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaCloudera, Inc.
 
Hadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata CompanyHadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata CompanyDataWorks Summit
 
10 Reasons Why Smart Organizations are Moving to Cloud BI
10 Reasons Why Smart Organizations are Moving to Cloud BI10 Reasons Why Smart Organizations are Moving to Cloud BI
10 Reasons Why Smart Organizations are Moving to Cloud BIGoodData
 
There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?Aerospike, Inc.
 
Data and its Role in Your Digital Transformation
Data and its Role in Your Digital TransformationData and its Role in Your Digital Transformation
Data and its Role in Your Digital TransformationVMware Tanzu
 
Capgemini Leap Data Transformation Framework with Cloudera
Capgemini Leap Data Transformation Framework with ClouderaCapgemini Leap Data Transformation Framework with Cloudera
Capgemini Leap Data Transformation Framework with ClouderaCapgemini
 
Role of Data in Digital Transformation
Role of Data in Digital TransformationRole of Data in Digital Transformation
Role of Data in Digital TransformationVMware Tanzu
 
Big Data Management: A Unified Approach to Drive Business Results
Big Data Management: A Unified Approach to Drive Business ResultsBig Data Management: A Unified Approach to Drive Business Results
Big Data Management: A Unified Approach to Drive Business ResultsCA Technologies
 
How to implement hadoop successfuly
How to implement hadoop successfulyHow to implement hadoop successfuly
How to implement hadoop successfulyAdir Sharabi
 
Drive Business Outcomes for Big Data Environments
Drive Business Outcomes for Big Data EnvironmentsDrive Business Outcomes for Big Data Environments
Drive Business Outcomes for Big Data EnvironmentsCisco Services
 
Big Iron + Big Data = BIG DEAL! Unlock The Power of Your Mainframe Data
Big Iron + Big Data = BIG DEAL! Unlock The Power of Your Mainframe DataBig Iron + Big Data = BIG DEAL! Unlock The Power of Your Mainframe Data
Big Iron + Big Data = BIG DEAL! Unlock The Power of Your Mainframe DataCA Technologies
 
Data Day - Escuchando la red
Data Day - Escuchando la redData Day - Escuchando la red
Data Day - Escuchando la redSoftware Guru
 
Operationalizing Data Analytics
Operationalizing Data AnalyticsOperationalizing Data Analytics
Operationalizing Data AnalyticsVMware Tanzu
 
The Enabling Power of Distributed SQL for Enterprise Digital Transformation I...
The Enabling Power of Distributed SQL for Enterprise Digital Transformation I...The Enabling Power of Distributed SQL for Enterprise Digital Transformation I...
The Enabling Power of Distributed SQL for Enterprise Digital Transformation I...NuoDB
 

Semelhante a Driving Real Insights Through Data Science (20)

Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
 
A New Day for Oracle Analytics
A New Day for Oracle AnalyticsA New Day for Oracle Analytics
A New Day for Oracle Analytics
 
Informatica Becomes Part of the Business Data Lake Ecosystem
Informatica Becomes Part of the Business Data Lake EcosystemInformatica Becomes Part of the Business Data Lake Ecosystem
Informatica Becomes Part of the Business Data Lake Ecosystem
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
 
Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...
Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...
Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...
 
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaIs your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
 
Hadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata CompanyHadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata Company
 
10 Reasons Why Smart Organizations are Moving to Cloud BI
10 Reasons Why Smart Organizations are Moving to Cloud BI10 Reasons Why Smart Organizations are Moving to Cloud BI
10 Reasons Why Smart Organizations are Moving to Cloud BI
 
There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?
 
Data and its Role in Your Digital Transformation
Data and its Role in Your Digital TransformationData and its Role in Your Digital Transformation
Data and its Role in Your Digital Transformation
 
Capgemini Leap Data Transformation Framework with Cloudera
Capgemini Leap Data Transformation Framework with ClouderaCapgemini Leap Data Transformation Framework with Cloudera
Capgemini Leap Data Transformation Framework with Cloudera
 
Role of Data in Digital Transformation
Role of Data in Digital TransformationRole of Data in Digital Transformation
Role of Data in Digital Transformation
 
Big Data Management: A Unified Approach to Drive Business Results
Big Data Management: A Unified Approach to Drive Business ResultsBig Data Management: A Unified Approach to Drive Business Results
Big Data Management: A Unified Approach to Drive Business Results
 
How to implement hadoop successfuly
How to implement hadoop successfulyHow to implement hadoop successfuly
How to implement hadoop successfuly
 
Drive Business Outcomes for Big Data Environments
Drive Business Outcomes for Big Data EnvironmentsDrive Business Outcomes for Big Data Environments
Drive Business Outcomes for Big Data Environments
 
Big Iron + Big Data = BIG DEAL! Unlock The Power of Your Mainframe Data
Big Iron + Big Data = BIG DEAL! Unlock The Power of Your Mainframe DataBig Iron + Big Data = BIG DEAL! Unlock The Power of Your Mainframe Data
Big Iron + Big Data = BIG DEAL! Unlock The Power of Your Mainframe Data
 
Big data for product managers
Big data for product managersBig data for product managers
Big data for product managers
 
Data Day - Escuchando la red
Data Day - Escuchando la redData Day - Escuchando la red
Data Day - Escuchando la red
 
Operationalizing Data Analytics
Operationalizing Data AnalyticsOperationalizing Data Analytics
Operationalizing Data Analytics
 
The Enabling Power of Distributed SQL for Enterprise Digital Transformation I...
The Enabling Power of Distributed SQL for Enterprise Digital Transformation I...The Enabling Power of Distributed SQL for Enterprise Digital Transformation I...
The Enabling Power of Distributed SQL for Enterprise Digital Transformation I...
 

Mais de VMware Tanzu

What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItVMware Tanzu
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023VMware Tanzu
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleVMware Tanzu
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023VMware Tanzu
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductVMware Tanzu
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready AppsVMware Tanzu
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And BeyondVMware Tanzu
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfVMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023VMware Tanzu
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptxVMware Tanzu
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchVMware Tanzu
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishVMware Tanzu
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVMware Tanzu
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - FrenchVMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023VMware Tanzu
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootVMware Tanzu
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerVMware Tanzu
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeVMware Tanzu
 
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsSpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsVMware Tanzu
 

Mais de VMware Tanzu (20)

What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About It
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at Scale
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a Product
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And Beyond
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptx
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - French
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - English
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - English
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - French
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software Engineer
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs Practice
 
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsSpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
 

Último

Rock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptxRock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptxFinatron037
 
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptxCCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptxdhiyaneswaranv1
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
Optimal Decision Making - Cost Reduction in Logistics
Optimal Decision Making - Cost Reduction in LogisticsOptimal Decision Making - Cost Reduction in Logistics
Optimal Decision Making - Cost Reduction in LogisticsThinkInnovation
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 

Último (16)

Rock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptxRock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptx
 
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptxCCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
Optimal Decision Making - Cost Reduction in Logistics
Optimal Decision Making - Cost Reduction in LogisticsOptimal Decision Making - Cost Reduction in Logistics
Optimal Decision Making - Cost Reduction in Logistics
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 

Driving Real Insights Through Data Science

  • 1. The Journey to Becoming a Data- Driven Enterprise Pivotal Big Data Roadshow 2015
  • 2. 2© Copyright 2015 Pivotal. All rights reserved. Where we’re going today… 3 Great Keynotes • Journey to a Data-driven Enterprise • Data Science Use Cases • Streaming Data and Predictive Analytics Stock Inference Demo and Architecture Overview Intensive hands-on training sessions
  • 3. 3© Copyright 2015 Pivotal. All rights reserved. Today’s Agenda 12:00 PM – 12:45 PM - Check-In & Lunch 12:45 PM – 1:00 PM – Welcome and Agenda Review 1:00 PM – 1:PM AM – How Pivotal’s Tools Help Drive Value from Data Science 1:20 PM – 2:20 PM – Accelerating the Generation of New Insights – R&D Use Case Review and Demo 2:20 PM – 2:30 PM – Coffee Break 2:30 PM – 3:00 PM – Manufacturing Use Case Review and Demo 3:00 PM – 3:15 PM – Closing Remarks
  • 4. © Copyright 2015 Pivotal. All rights reserved. MASHING BIG DATA WITH BIG MACHINES IS ‘BEAUTIFUL, DESIRABLE, INVESTABLE’ - IT COULD TRANSFORM GE'S BUSINESS - AND THE ECONOMY. “ ”JEFF IMMELT, CEO, GE
  • 5. © Copyright 2015 Pivotal. All rights reserved. THE POWER OF 1 R X Increasing Freight Utilization Rail Predictive Maintenance Healthcare Predictive Diagnostics Power Driving Outcomes That Matter One Percent Improvement Equals $27B Industry Value by Reducing System Inefficiency $63B Industry Value by Reducing Process Inefficiency $66B Industry Value with Efficiency Improvements In Gas-fired Power Plant Fleets Source: General Electric
  • 6. © Copyright 2015 Pivotal. All rights reserved. DATA-DRIVEN ENTERPRISE JOURNEY STORE • Structured • Unstructured • High Volume • High Velocity ANALYZE • Predictive Analytics • Machine Learning • Advance Data Science • Realtime Analytics DEVELOP • Advanced Analytic Pipelines • Realtime Analytical Applications • Global Scale Data-Driven Applications • Enterprise, Consumer, and Mobile INNOVATE • Agile Dev Expertise • DevOps • Microservice • Continuous Delivery • Closed Loop Applications AGILE DEVELOPMENT BIG DATA PREDICTIVE ANALYTICS CLOUD NATIVE PLATFORM
  • 7. 8© Copyright 2015 Pivotal. All rights reserved. 0% of CIOs think their IT infrastructure is fully prepared for big data (3) 30% of companies have deployed advanced analytics, 11% big data analysis (4) 44% of new applications failed to meet performance expectations (5) 2X 90% of companies allocate at least 2X more cloud capacity than needed to ensure performance (6) But… 80% of CEOs thinking data mining and analysis are strategically important (1) 4% of companies use analytics effectively (2) (1) 2015 PWC CEO Survey; (2)2013 Baine and Company - The Value of Big Data; (3) 2014 IT Infrastructure Conversation - IBM; (4) Ernest and Young - 2014 Enterprise IT Trends and Investments; (5) 2014 Riverbed Tecnologies - The Transformers; (6) 2014 ElasticHosts CIO Study LARGE ENTERPRISE BIG DATA TROUBLE
  • 8. 9© Copyright 2015 Pivotal. All rights reserved. BIG DATA CHASM 70% of data generated by customers 80% of data stored 3% prepared for analysis 0.5% being analyzed <0.5% being operationalized 9 THE DATA DIVIDE
  • 9. 10© Copyright 2015 Pivotal. All rights reserved. Software Is Eating The World Data Is Fueling Software SOFTWARE IS EATING THE WORLD
  • 10. 11© Copyright 2015 Pivotal. All rights reserved. WE CHOSE PIVOTAL BECAUSE WE BELIEVE IT PROVIDES A 360-DEGREE VIEW OF THE PROCESS. FROM A DATA SCIENCE AND DATA TECHNOLOGY PERSPECTIVE, IT MEANS DELIVERING BEST-IN- CLASS DATA TECHNOLOGIES AND ENABLING THEM ON THEIR PLATFORM. “ ”
  • 11. 12© Copyright 2015 Pivotal. All rights reserved. ACROSS INDUSTRIES
  • 12. 13© Copyright 2015 Pivotal. All rights reserved. THE NEW DATA IMPERATIVES Converged Data & Cloud OpenData-Driven Apps
  • 13. 14© Copyright 2015 Pivotal. All rights reserved. THE BIG DATA PROBLEM Fragmentation ConstraintsComplexity
  • 14. 15© Copyright 2015 Pivotal. All rights reserved. • Remove Lock-in • Leverage Ecosystem • Co-innovate GUIDING PRINCIPLES IN THE NEW ERA OPEN AGILE CLOUD-READY • Shorten innovation cycles • Reduce TCO • Improve TTM • Solve business problems • Avoid lock-in • Appropriate security
  • 15. 16© Copyright 2015 Pivotal. All rights reserved. JOURNEY TO A DATA-DRIVEN ENTERPRISE Deploy analytic apps and automate at scale Perform advanced analytics Discover insights Modernize data infrastructure
  • 16. 17© Copyright 2015 Pivotal. All rights reserved. Deploy analytic apps and automate at scale Perform advanced analytics Discover insights Modernize data infrastructure DATA-DRIVEN COMPANIES: USE MODERN DATA INFRASTRUCTURE
  • 17. 18© Copyright 2015 Pivotal. All rights reserved. MODERNIZE DATA INFRASTRUCTURE Elastic, Scale-out storage and processing Flexible data types and pipelining ETL on demand: low operational cost Expanded use cases Higher quality analytics Lowered storage/processing cost Less fragmented ecosystem Reduced vendor lock-in REQUIREMENTS BENEFITS Cloud friendly and open-source based
  • 18. 19© Copyright 2015 Pivotal. All rights reserved. Modernize data infrastructure Deploy analytic apps and automate at scale Perform advanced analytics Discover insights DATA-DRIVEN COMPANIES: STRATEGICALLY USE ADVANCED ANALYTICS
  • 19. 20© Copyright 2015 Pivotal. All rights reserved. ADVANCED ANALYTICS Leverage existing skills and tools Rapid time to insights Internet of Things use cases Rapid time to insights Solve business problems Predictive insights: proactive execution REQUIREMENTS BENEFITS Machine learning and advanced analytics 01010101010101 01001010101010 10101100101010 SQL- compliant batch and interactive queries Massive stream processing 0101010101010101001010 1010101010110010101010 10101010
  • 20. 21© Copyright 2015 Pivotal. All rights reserved. Modernize data infrastructure Perform advanced analytics Discover insights DATA-DRIVEN COMPANIES: INNOVATE AT SCALE Deploy analytic apps and automate at scale
  • 21. 22© Copyright 2015 Pivotal. All rights reserved. ANALYTIC APPS AND AUTOMATION AT SCALE Reduced time to action Low ‘analytics  app-dev’ integration cost Reduced time to insights Flexible ingestion: low operating cost High performance: low operating cost Transactional safety: business critical ops REQUIREMENTS BENEFITS Low-latency, distributed in-memory transactions Resilient, scale-out messaging and object storage Agile analytic app-dev with enterprise PaaSPaaS
  • 22. 23© Copyright 2015 Pivotal. All rights reserved. JOURNEY TO A DATA-DRIVEN ENTERPRISE Deploy analytic apps and automate at scale Perform advanced analytics Discover insights Modernize data infrastructure Pivotal Data Science helps you move from BI to Data Science Pivotal Labs helps you move to an agile development of apps at scale Pivotal Data Engineering helps you move from data administration to data engineering
  • 23. 26© Copyright 2015 Pivotal. All rights reserved. PIVOTAL BIG DATA SUITE
  • 24. 27© Copyright 2015 Pivotal. All rights reserved. Open sourcing all Pivotal Big Data Suite components including: WORLD’S FIRST OPEN SOURCED BIG DATA PORTFOLIO BUILDING ON SUCCESS OF CLOUD FOUNDRY FOUNDATION BUILT FOR ENTERPRISES Pivotal GemFire Apache Geode Apache HAWQ Pivotal HDB Pivotal Greenplum Database
  • 25. 28© Copyright 2015 Pivotal. All rights reserved. BUILT FOR ENTERPRISES Value added features: enterprise grade performance + robustness without lock-in • Advanced Query Optimization in analytics • WAN replication and continuous query in transactional processing Flexible Deployment models: align to business objectives and needs • Balance cost objectives with policy and compliance requirements • Leverage Pivotal’s pre-integration + certification on supported configurations Enterprise grade support: one throat to choke for the suite • Focus on business problems – not on lifecycle management • Expert support on Big Data Suite means reduced business risk
  • 26. 29© Copyright 2015 Pivotal. All rights reserved. • Common core for Hadoop ecosystem • Rapidly accelerated certifications, ecosystem development and enterprise-grade quality OpenDataPlatform.org OPEN
  • 27. 30© Copyright 2015 Pivotal. All rights reserved. AGILE Deploy analytic apps and automate at scale Perform advanced analytics Discover insights Modernize data infrastructure Spring XD Spark Pivotal HD & Open Data Platform Pivotal Greenplum Database Pivotal HDB Rabbit MQ Redis Pivotal GemFire Pivotal BDS on PCF Pivotal Cloud Foundry
  • 28. 31© Copyright 2015 Pivotal. All rights reserved. CLOUD-READY COMMODITY HARDWARE APPLIANCE HYBRID CLOUDCLOUD IaaS IaaS PAAS
  • 29. 32© Copyright 2015 Pivotal. All rights reserved. DATA-DRIVEN ENTERRPRISE JOURNEY WITH PIVOTAL BIG DATA SUITE STORE • Structured • Unstructured • High Volume • High Velocity ANALYZE • Predictive Analytics • Machine Learning • Advance Data Science • Realtime Analytics DEVELOP • Advanced Analytic Pipelines • Realtime Analytical Applications • Global Scale Data-Driven Applications • Enterprise, Consumer, IoT, and Mobile INNOVATE • Agile Dev Expertise • DevOps • Microservices • Continuous Delivery • Closed Loop Applications AGILE DEVELOPMENT BIG DATA PREDICTIVE ANALYTICS CLOUD NATIVE PLATFORM Spring XD Spark Pivotal HD & Open Data Platform Spring XD Pivotal Greenplum Database Pivotal HDB Spring XD Pivotal GemFire Rabbit MQ Spring Cloud Pivotal BDS on PCF Pivotal Cloud Foundry Pivotal LabsData ScienceData Engineering
  • 30. 35© Copyright 2015 Pivotal. All rights reserved. FOR FURTHER INFO, CHECKOUT… • Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data • Pivotal Blog @ http://blog.pivotal.io • Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal • Pivotal Academy @ https://pivotal.biglms.com Or reach out to your local Pivotal Account Executive…
  • 31. 36© Copyright 2015 Pivotal. All rights reserved. 36© Copyright 2013 Pivotal. All rights reserved. Pivotal Data Science Overview and Use Cases Pivotal Big Data Roadshow
  • 32. 37© Copyright 2015 Pivotal. All rights reserved. DATA SCIENCE? App Development Analytics Business Intelligence Reporting Visualization Dashboards Insights Big Data Machine Learning Statistics Mathematics Time Series Algorithms Databases Software Modeling Queries Real-Time Sensors Predictive Models ETL Research Hadoop Distributed Computing MapReduce SQL In-Memory OLAP Text Mining Unstructured Data Open Source Decision Science Ad Hoc Queries Hacking In-Database Analytics Internet of Things Data Cleansing Sentiment
  • 33. 38© Copyright 2015 Pivotal. All rights reserved. • ETL • Unstructured • Data Cleansing • Sensors Data Related • Algorithms • Mathematics • Statistics • Econometrics • Predictive Modeling • Machine Learning • Text Mining • Sentiment • Map Reduce Fields of Study & Techniques • Dashboards • Insights • Visualization • Ad Hoc Queries • Reporting Business Intelligence • Software • In-Database Analysis • Distributed Computing • Hadoop • Open Source Implementation • Big Data • Decision Science • Internet of Things • Real-Time • Hacking • In-Memory Industry Buzzwords
  • 34. 39© Copyright 2015 Pivotal. All rights reserved. What is Data Science? The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment. DRIVE AUTOMATED, LOW-LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST
  • 35. 40© Copyright 2015 Pivotal. All rights reserved. Gene Sequencing Smart Grids COST TO SEQUENCE ONE GENOME HAS FALLEN FROM $100M IN 2001 TO $10K IN 2011 TO $1K IN 2014 READING SMART METERS EVERY 15 MINUTES IS 3000X MORE DATA INTENSIVE Stock Market Social Media FACEBOOK UPLOADS 250 MILLION PHOTOS EACH DAY Billions of Data Points Oil Exploration Video Surveillance OIL RIGS GENERATE 25000 DATA POINTS PER SECOND Medical Imaging Mobile Sensors
  • 36. 41© Copyright 2015 Pivotal. All rights reserved. What is Big Data Analytics? Descriptive Analytics WHAT HAPPENED? Diagnostic Analytics WHY DID IT HAPPENED? Predictive Analytics WHAT WILL HAPPEN? Prescriptive Analytics HOW CAN WE MAKE IT HAPPEN? Complexity Value of Analytics ($)
  • 37. 42© Copyright 2015 Pivotal. All rights reserved. P L A T F O R M Data Science Toolkit KEY TOOLS KEY LANGUAGES SQL
  • 38. 43© Copyright 2015 Pivotal. All rights reserved. Scalable, In-Database ML • Open Source https://github.com/madlib/madlib • Works on Greenplum DB, HAWQ and PostgreSQL • In active development by Pivotal • Downloads and Docs: http://madlib.net/
  • 39. 44© Copyright 2015 Pivotal. All rights reserved. Functions Supervised Learning Regression Models • Cox Proportional Hazards Regression • Elastic Net Regularization • Generalized Linear Models • Linear Regression • Logistic Regression • Marginal Effects • Multinomial Regression • Ordinal Regression • Robust Variance, Clustered Variance • Support Vector Machines Tree Methods • Decision Tree • Random Forest Other Methods • Conditional Random Field • Naïve Bayes Unsupervised Learning • Association Rules (Apriori) • Clustering (K-means) • Topic Modeling (LDA) Statistics Descriptive • Cardinality Estimators • Correlation • Summary Inferential • Hypothesis Tests Other Statistics • Probability Functions Other Modules • Conjugate Gradient • Linear Solvers • PMML Export • Random Sampling • Term Frequency for Text Time Series • ARIMA Aug 2015 Data Types and Transformations • Array Operations • Dimensionality Reduction (PCA) • Encoding Categorical Variables • Matrix Operations • Matrix Factorization (SVD, Low Rank) • Norms and Distance Functions • Sparse Vectors Model Evaluation • Cross Validation Predictive Analytics Library
  • 40. 45© Copyright 2015 Pivotal. All rights reserved. A single address for everything analytics Analytics with Pivotal Time-to-Insights FORECASTING CLUSTERING REGRESSION CLASSIFICATION OPTIMIZATION
  • 41. 46© Copyright 2015 Pivotal. All rights reserved. Smart Systems = Sensors + Digital Brain + Actuators Problem Formulation Modeling Step Data Step Application Step Data Science for Building Models Sensors & Actuators Data Lake
  • 42. 47© Copyright 2015 Pivotal. All rights reserved. 47© Copyright 2013 Pivotal. All rights reserved. Data Science Use Cases
  • 43. 48© Copyright 2015 Pivotal. All rights reserved. 48© Copyright 2013 Pivotal. All rights reserved. Financial Services
  • 44. 49© Copyright 2015 Pivotal. All rights reserved. Identifying and Pricing Cross-Sell Opportunities CUSTOMER A global financial services provider BUSINESS PROBLEM Identify cross-sell opportunities between two business arms of a financial institution. CHALLENGES Integration of large-scale data originating from multiple data warehouses. Developing predictive models to identify novel cross-sell opportunities within the financial institution. Evaluate the identified cross-sell opportunities by their revenue potential. SOLUTIONS  Fast integration of data in Pivotal Greenplum Database.  Predictive models and evaluation of profitability: – Association rule. – Logistic regression for each product offered. – Estimation of revenue opportunity.  On-demand reporting and visualization via custom dashboards connected to in-database models.  Identified multi-million dollar opportunities for the bank.
  • 45. 50© Copyright 2015 Pivotal. All rights reserved. Financial Compliance BUSINESS PROBLEM • Ensure compliance with Dodd-Frank and Basel Committee regulations • Identify underlying risk and fraud while reducing the compliance department’s overburdened Emails Chats Trades Transactions Policy Securities Phone Calls Watch Lists … Financial compliance Data Lake Data integration Data clean up Modeling Classification and ranking Analyst user interfaces Feedback Analytics Analyst feedback Data integration: e.g., append trade information with email and chat communications Data cleanup: e.g., identify newsletters and spam emails Modeling: • Predictive modeling to flag messages and trades • Graph and cohort analysis Analyst feedback Reviewed fraud instances included in periodic model refreshes SOLUTION  A data lake platform coupled with cutting edge data science techniques  Flexible user interface to promote an adaptive, continuously learning compliance framework
  • 46. 51© Copyright 2015 Pivotal. All rights reserved. 51© Copyright 2013 Pivotal. All rights reserved. Telco & Mobile
  • 47. 52© Copyright 2015 Pivotal. All rights reserved. Subscriber Micro-Segmentation CUSTOMER A major telco with cable & VOD, internet, and phone business units BUSINESS PROBLEM Better understand aggregated subscriber behavior to drive business strategy using newly available data sources CHALLENGES ▪ Large quantities of deep packet inspection data and set top box data that had not been analyzed before ▪ Needed to incorporate internet usage and TV consumption information into pre-existing subscriber segments SOLUTION ▪ Generated new subscriber segments that incorporated features based on consumption of TV and internet services across a variety of devices ▪ Crossed new segments with existing segments to generate new micro-segments for cross-sell/upsell and new product development opportunities Customized Micro-Segments
  • 48. 53© Copyright 2015 Pivotal. All rights reserved. Newly Identified Behavior-Based SegmentsSubscribers Moderates OTT & Data Heavyweights Portable OTT Entertainment Seekers iPhone Heavy Android Heavy iPad Heavy In-Home OTT Entertainment Seekers In-Home Native Content Seekers VOD Heavy TV Heavy
  • 49. 54© Copyright 2015 Pivotal. All rights reserved. Opportunities for Data-Driven Decisions in Pharma
  • 50. 55© Copyright 2015 Pivotal. All rights reserved. Data driven drugs: From discovery to delivery RICH DATA SOURCES • Molecular data • Cellular drug screens • Animal models • Clinical data including notes, images, markers (e.g. genomics, lab results) • Sensor and assay data • Internal and partner/purchased external data • Contact center data • Patient registries, public and federal data, clinical partnerships Clinical Trials Manufacturing Marketing Distribution and surveillance Drug discovery + development
  • 51. 56© Copyright 2015 Pivotal. All rights reserved. A pipeline of sensors and opportunities for optimizing output Internet of Things in Manufacturing Input materials Mix Incubate Filter Centrifuge Final Product 0 5 10 15 20 25 30 0 50 100 150 200 Sensors High-Content Screens TEMP TIME Absorbance Elution volume Velocity Time Automated raw materials mixing
  • 52. 57© Copyright 2015 Pivotal. All rights reserved. Vaccine Potency Prediction CUSTOMER A major pharmaceutical company BUSINESS PROBLEM Predict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process. CHALLENGES  Customer’s data model was not optimal for running analytical queries  Manual data quality issues  Data capture was performed with varying consistency due to high cost associated with manual data collection SOLUTION  Introduced a new data model to make data accessible and enable analytics (including LIMS and DeltaV)  Built automated outlier detection/correction methods to address manual data entry quality issues  Devised imputation methods to deal with data completeness issues  Built predictive models with high accuracy
  • 53. 58© Copyright 2015 Pivotal. All rights reserved. http://blog.pivotal.io/data-science-pivotal Check out the Pivotal Data Science Blog!
  • 54. 59© Copyright 2015 Pivotal. All rights reserved. FOR FURTHER INFO… • Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data • Pivotal Blog @ http://blog.pivotal.io • Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal • Pivotal Academy @ https://pivotal.biglms.com • Or reach out to your local Pivotal Account Executive…
  • 55. 60© Copyright 2015 Pivotal. All rights reserved. Pivotal Data Science Labs: Packaged Services • Analytics Roadmap • Prioritized Opportunities • Architectural Recommendati ons • Hands-on training • Hosted data on Pivotal Data stack • Results review & assessment • On-site MPP analytics training • Analytics tool- kit • Quick insight (2 weeks) • Prof. services • Data science model building • Ready-to- deploy model(s) • Prof. services • Data science model building • Ready-to- deploy model(s) LAB PRIMER (2-Week Roadmapping) LAB 600 (6-Week Lab) LAB 1200 (12-Week Lab) LAB 100 (Analytics Bundle) DATA JAM (Internal DS Contest)
  • 56. 61© Copyright 2015 Pivotal. All rights reserved. Data Streaming and Predictive Analytics Using Pivotal Big Data Suite
  • 57. 62© Copyright 2015 Pivotal. All rights reserved. Converging Trends Innovation New Data New Processes New Insights The Journey to the Data-Driven Enterprise Data Science and Machine Learning Big Data IoT, Mobile Apps, Social Media
  • 58. 63© Copyright 2015 Pivotal. All rights reserved. HDFS Data Lake Ingest Store Analytics Hard to change Labor intensive Inefficient Coding based No real-time information Based on expensive ETL Migrating from a Reactive, Static and Constrained Model…
  • 59. 64© Copyright 2015 Pivotal. All rights reserved. HDFSData Lake Expert System / Machine Learning In-Memory Real- Time Data Continuous Learning Continuous Improvement Continuous Adapting Data Stream Pipeline Multiple Data Sources Real-Time Processing Store Everything To Pro-Active, Self-Improving, Machine Learning Systems
  • 60. 65© Copyright 2015 Pivotal. All rights reserved. New York Times Research: http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html “ 50-80% OF THE TIME ON DATA SCIENCE PROJECTS IS SPENT ON DATA WRANGLING ”
  • 61. 66© Copyright 2015 Pivotal. All rights reserved. Data Feeds Stream Processing Expert Systems Machine Learning Historical Data Business Value Smart Decisions Still… HDFS Data Lake
  • 62. 67© Copyright 2015 Pivotal. All rights reserved. Ingest Transform Sink SpringXD GemFire Data Stream Needs an Agile, Scalable and Fast Solution HAWQ GPDB Data Lake
  • 63. 68© Copyright 2015 Pivotal. All rights reserved. Ingest Transform Sink SpringXD Distributed Computing In-Memory Real-Time Data Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining Expert System / Machine Learning Extensible Open-Source Fault-Tolerant Horizontally Scalable HAWQ GPDB Data Lake
  • 64. 69© Copyright 2015 Pivotal. All rights reserved. INGEST / SINK PROCESS ANALYZE • No coding required • Dozens of built-in connectors • Seamless integration with Kafka, Sqoop • Create new connectors easily using Spring • Call Spark, Reactor or RxJava • Built-in configurable filtering, splitting and transformation • Out-of-box configurable jobs for batch processing • Import and invoke PMML jobs easily • Call Python, R, Madlib and other tools • Built-in configurable counters and gauges Spring XD State of the Art Data Pipeline Automation
  • 65. 70© Copyright 2015 Pivotal. All rights reserved. Ingest Transform Sink SpringXD Distributed Computing GemFire Provides Scalable, Low-Latency Data Access, Storage and Event Processing Expert System / Machine Learning GemFire Extensible Open-Source Fault-Tolerant Horizontally Scalable HAWQ GPDB Data Lake
  • 66. 71© Copyright 2015 Pivotal. All rights reserved. GemFire • In-Memory Enterprise Data Grid • Horizontally Scalable, Consistent, Highly Available • Event handling • Continuous Queries • Enterprise Data Geo Distribution In-memory Real Time Data
  • 67. 72© Copyright 2015 Pivotal. All rights reserved. Ingest Transform Sink SpringXD Distributed Computing Pivotal Provides SQL Based Advanced Analytics Expert System / Machine Learning GemFire Extensible Open-Source Fault-Tolerant Horizontally Scalable Data Lake HAWQ GPDB
  • 68. 73© Copyright 2015 Pivotal. All rights reserved. HAWQ • Massively Parallel Processing RDBMS on HADOOP • ANSI SQL on Hadoop • Extremely high performance for analytics (not like Hive) • Stores all data directly on HDFS • Open-Source Advanced SQL analytics in Hadoop Combining SQL with Hadoop is key for analytics SQL remains #1 choice for Data Science
  • 69. 74© Copyright 2015 Pivotal. All rights reserved. Ingest Transform Sink SpringXD Developers and Data Scientists Can Focus on the Business Value of Data GemFire Extensible Open-Source Fault-Tolerant Horizontally Scalable Data Lake HAWQ GPDB
  • 70. 75© Copyright 2015 Pivotal. All rights reserved. Data Streaming Reference Architecture Data Feeds Transactional Apps Analytic Apps Data Stream Pipeline Distributed Computing Real-Time Data Expert Systems & Machine Learning Advanced Analytics HDFSData Lake
  • 71. 76© Copyright 2015 Pivotal. All rights reserved. Data Streaming Reference Architecture Data Feeds Transactional Apps Analytic Apps Data Stream Pipeline HDFSData Lake GemFire HAWQ GPDB SpringXD
  • 72. 77© Copyright 2015 Pivotal. All rights reserved. “ SO WE ARE MOVING TO A WORLD WHERE THE MACHINES WE WORK WITH ARE NOT JUST INTELLIGENT; THEY ARE BRILLIANT.THEY ARE SELF-AWARE, THEY ARE PREDICTIVE, REACTIVE AND SOCIAL. IT'S A WORLD WHERE INFORMATION ITSELF BECOMES INTELLIGENT AND COMES TO US AUTOMATICALLY WHEN WE NEED IT WITHOUT HAVING TO LOOK FOR IT. ”MARCO ANNUNZIATA, GE
  • 73. 78© Copyright 2015 Pivotal. All rights reserved. Demo Powered by Pivotal Big Data Suite
  • 74. 79© Copyright 2015 Pivotal. All rights reserved. It's all about DATA Data Sources Look for patterns Prediction
  • 75. Transform Sink SpringXD Extensible Open-Source Fault-Tolerant Horizontally Scalable Cloud-Native Machine Learning Enrich Filter Split Dashboard Indicators 1 2 Predict 3 Real data Simulator /Stocks /TechIndicators /Predictions
  • 76. 81© Copyright 2015 Pivotal. All rights reserved.
  • 78. 91© Copyright 2015 Pivotal. All rights reserved. “ THE REAL OPPORTUNITY FOR CHANGE...SURPASSING THE MAGNITUDE OF THE CONSUMER INTERNET...IS THE INDUSTRIAL INTERNET, AN OPEN, GLOBAL NETWORK THAT CONNECTS PEOPLE, DATA AND MACHINES. ”JEFF IMMELT, CEO, GE
  • 79. 100© Copyright 2015 Pivotal. All rights reserved. FOR FURTHER INFO, CHECKOUT… • Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data • Pivotal Blog @ http://blog.pivotal.io • Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal • Pivotal Academy @ https://pivotal.biglms.com • Or reach out to your local Pivotal Account Executive…
  • 80. BUILT FOR THE SPEED OF BUSINESS
  • 81. 102© Copyright 2015 Pivotal. All rights reserved. 102© Copyright 2013 Pivotal. All rights reserved. Accelerating the Generation of New Insights October 27, 2015 Sarah Aerni, Data Science Antonio Petrole, Data Engineering
  • 82. BUILT FOR THE SPEED OF BUSINESS
  • 83. 104© Copyright 2015 Pivotal. All rights reserved. Gene Sequencing Smart Grids COST TO SEQUENCE ONE GENOME HAS FALLEN FROM $100M IN 2001 TO $10K IN 2011 TO $1K IN 2014 READING SMART METERS EVERY 15 MINUTES IS 3000X MORE DATA INTENSIVE Stock Market Social Media FACEBOOK UPLOADS 250 MILLION PHOTOS EACH DAY Technology to process and store data is needed in all industries Oil Exploration Video Surveillance OIL RIGS GENERATE 25000 DATA POINTS PER SECOND Medical Imaging Mobile Sensors
  • 84. 105© Copyright 2015 Pivotal. All rights reserved. What is Big Data Analytics? Descriptive Analytics WHAT HAPPENED? Diagnostic Analytics WHY DID IT HAPPEN? Predictive Analytics WHAT WILL HAPPEN? Prescriptive Analytics HOW CAN WE MAKE IT HAPPEN? Complexity Value of Analytics ($)
  • 85. 106© Copyright 2015 Pivotal. All rights reserved. Opportunities for Data-Driven Decisions in Pharma
  • 86. 107© Copyright 2015 Pivotal. All rights reserved. Data driven drugs: From discovery to delivery RICH DATA SOURCES • Molecular data • Cellular drug screens • Animal models • Clinical data including notes, images, markers (e.g. genomics, lab results) • Sensor and assay data • Internal and partner/purchased external data • Contact center data • Patient registries, public and federal data, clinical partnerships Clinical Trials Manufacturing Marketing Distribution and surveillance Drug discovery + development
  • 87. 108© Copyright 2015 Pivotal. All rights reserved. A pipeline of sensors and opportunities for optimizing output Internet of Things in Manufacturing Input materials Mix Incubate Filter Centrifuge Final Product 0 5 10 15 20 25 30 0 50 100 150 200 Sensors High-Content Screens TEMP TIME Absorbance Elution volume Velocity Time Automated raw materials mixing
  • 88. 109© Copyright 2015 Pivotal. All rights reserved. Vaccine Potency Prediction CUSTOMER A major pharmaceutical company BUSINESS PROBLEM Predict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process. CHALLENGES  Customer’s data model was not optimal for running analytical queries  Manual data quality issues  Data capture was performed with varying consistency due to high cost associated with manual data collection SOLUTION  Introduced a new data model to make data accessible and enable analytics (including LIMS and DeltaV)  Built automated outlier detection/correction methods to address manual data entry quality issues  Devised imputation methods to deal with data completeness issues  Built predictive models with high accuracy
  • 89. 110© Copyright 2015 Pivotal. All rights reserved. Interpreting the utility of a measure obtained during manufacturing based on model outcomes Sample model insights  Some features may reveal tunable parameters to alter potency, others may simply be markers  Features consistently absent from models may be uninformative for predicting potency  Opportunities to provide real-time feedback on data entry errors and predicted potency outcomes Assayed value Duration of a step Potency Potency Correlation=0.45 Correlation=0.38
  • 90. 111© Copyright 2015 Pivotal. All rights reserved. Need for new environments to process big data? HDFS STORAGE AND MPP ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT VARIETY/VELOCITY DISTRIBUTED COMPUTATION FOR PARALLELIZATION PETABYTES OF DATA OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK TO ACCESS COMMON LANGUAGES RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND TOOLS SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE FLEXIBLE SCALABLE ENABLING ACCESSIBLE
  • 91. 112© Copyright 2015 Pivotal. All rights reserved. Multiple tools with a single, simple goal: Distributed storage with in-place computation Pivotal Hadoop Pivotal Greenplum Database HAWQ
  • 92. 113© Copyright 2015 Pivotal. All rights reserved. Multiple tools with a single, simple goal: Distributed storage with in-place computation Think of it as multiple PostGreSQL servers Segments/Workers Master Rows are distributed across segments by a particular field (or randomly) Pivotal Hadoop Pivotal Greenplum Database HAWQ
  • 93. 114© Copyright 2015 Pivotal. All rights reserved. Identifying duplicates: counting with grouping Opportunities for performance improvements – Sorting and re-sorting is required in many pipelines – Single-threaded processes create bottlenecks in speed and need to move data https://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-1-Map_and_Dedup.pdf Solution: Leverage Pivotal’s distributed MPP environment by using common database functions
  • 94. 115© Copyright 2015 Pivotal. All rights reserved. Identifying duplicates: counting with grouping Reference genome Mapped reads
  • 95. 116© Copyright 2015 Pivotal. All rights reserved. Identifying duplicates: counting with grouping Duplicates 1 3 1 1 1 1 5 1 3 select locus, count(*) from reads group by locusReference genome Mapped reads
  • 96. 117© Copyright 2015 Pivotal. All rights reserved. Reference genome Mapped reads Counting numbers of reads mapped to features select exon, count(*) from reads JOIN refseq ON(<reads overlap exon>) group by exon 5 17 12
  • 97. 118© Copyright 2015 Pivotal. All rights reserved. Multiple tools with a single, simple goal: Distributed storage with in-place computation Think of it as distributed file system with very large blocks of data Schema on read allows flexibility for a variety of datasets Compute using a variety of paradigms (e.g. MapReduce) Pivotal Hadoop Pivotal Greenplum Database HAWQ Name Node Data Node 1 Data Node 2 Data Node 3 Data Node 4 1 2 3 2 3 1 1 2
  • 98. 119© Copyright 2015 Pivotal. All rights reserved. Multiple tools with a single, simple goal: Distributed storage with in-place computation • SQL compliant • World-class query optimizer • Interactive query • Horizontal scalability • Robust data management • Common Hadoop formats • Deep analytics Pivotal Hadoop Pivotal Greenplum Database HAWQ Think of it as distributed PostGreSQL (GPDB) on Hadoop • SQL compliant • World-class query optimizer • Interactive query • Horizontal scalability • Robust data management • Common Hadoop formats • Deep analytics
  • 99. 120© Copyright 2015 Pivotal. All rights reserved. A single address for everything analytics Analytics with Pivotal Time-to-Insights FORECASTING CLUSTERING REGRESSION CLASSIFICATION OPTIMIZATION
  • 100. 121© Copyright 2015 Pivotal. All rights reserved. P L A T F O R M Data Science Toolkit KEY TOOLS KEY LANGUAGES SQL
  • 101. 122© Copyright 2015 Pivotal. All rights reserved. Historically data was studied in silos BRCA dataset Treatments Protein Assays Imaging VariantsPatient History & Follow-Ups Gene Expression Copy Number Variation miRNA
  • 102. 123© Copyright 2015 Pivotal. All rights reserved. Genomics Data Center Researcher Computing ClusterUnnecessary data movement Network usage Need for new environment: Data movement Computation and storage in a single location
  • 103. 124© Copyright 2015 Pivotal. All rights reserved. In-database genome-wide association study Network Interconnect Master Severs Segment Severs SQL & RCOVARIATES GENOTYPES Indiv Covariates 1 2 10 1 F 23 18 2 M 39 41 3 M 50 23 N F 19 24 SNP 1 2 M A A C C TT A T C G TT A A G G T C TT C G T C
  • 104. 125© Copyright 2015 Pivotal. All rights reserved. In-database genome-wide association study Network Interconnect Master Severs Segment Severs SNP1 SNP2 SNPM SQL & R Indiv Covariates 1 2 10 1 F 23 18 2 M 39 41 3 M 50 23 N F 19 24 Indiv SN P Gen o 1 1 AA 2 1 AT 3 1 AA 1 2 CC 2 2 CG 3 2 GG N M TC COVARIATES GENOTYPES
  • 105. 126© Copyright 2015 Pivotal. All rights reserved. In-database genome-wide association study Network Interconnect Master Severs Segment Severs SNP1 SNP2 SNPM Pval1 Pval2 PvalM SQL & R Indiv Covariates 1 2 10 1 F 23 18 2 M 39 41 3 M 50 23 N F 19 24 Indiv SN P Gen o 1 1 AA 2 1 AT 3 1 AA 1 2 CC 2 2 CG 3 2 GG N M TC COVARIATES GENOTYPES
  • 106. 127© Copyright 2015 Pivotal. All rights reserved. In-database genome-wide association study Network Interconnect Master Severs Segment Severs SNP1 SNP2 SNPM Pval1 Pval2 PvalM SQL & R Indiv Covariates 1 2 10 1 F 23 18 2 M 39 41 3 M 50 23 N F 19 24 Indiv SN P Gen o 1 1 AA 2 1 AT 3 1 AA 1 2 CC 2 2 CG 3 2 GG N M TC COVARIATES GENOTYPES
  • 107. 128© Copyright 2015 Pivotal. All rights reserved. In-database genome-wide association study Network Interconnect Master Severs Segment Severs SNP1 SNP2 SNPM Pval1 Pval2 PvalM SQL & R Indiv Covariates 1 2 10 1 F 23 18 2 M 39 41 3 M 50 23 N F 19 24 Indiv SN P Gen o 1 1 AA 2 1 AT 3 1 AA 1 2 CC 2 2 CG 3 2 GG N M TC SNP P-value 1 2.34x10-21 2 0.395 3 7.15x10-17 M 0.000142 COVARIATES GENOTYPES RESULTS
  • 108. 129© Copyright 2015 Pivotal. All rights reserved. In-database genome-wide association study Network Interconnect Master Severs Segment Severs SNP1 SNP2 SNPM Pval1 Pval2 PvalM SQL & R Indiv Covariates 1 2 10 1 F 23 18 2 M 39 41 3 M 50 23 N F 19 24 Indiv SN P Gen o 1 1 AA 2 1 AT 3 1 AA 1 2 CC 2 2 CG 3 2 GG N M TC SNP P-value 1 2.34x10-21 2 0.395 3 7.15x10-17 M 0.000142 COVARIATES GENOTYPES RESULTS • In-database computation of 1 million loci for thousands of individuals in seconds • Results are easily manipulated and explored
  • 109. 130© Copyright 2015 Pivotal. All rights reserved. In-database genome-wide association study Network Interconnect Master Severs Segment Severs SNP1 SNP2 SNPM Pval1 Pval2 PvalM LOR1 LOR2 LORM SQL & R Indiv Covariates 1 2 10 1 F 23 18 2 M 39 41 3 M 50 23 N F 19 24 Indiv SN P Gen o 1 1 AA 2 1 AT 3 1 AA 1 2 CC 2 2 CG 3 2 GG N M TC SNP P-value 1 2.34x10-21 2 0.395 3 7.15x10-17 M 0.000142 COVARIATES GENOTYPES RESULTS • In-database computation of 1 million loci for thousands of individuals in seconds • Results are easily manipulated and explored
  • 110. 131© Copyright 2015 Pivotal. All rights reserved. Procedural Languages in Big Data Science  HAWQ & PL/X can take advantage of “data parallel” tasks by performing analyses in parallel – embarrassingly parallel tasks – Little/no effort required to break up the problem into parallel tasks – No dependency (or communication) between tasks  Examples of ‘data parallel’ problems: – Counting words in documents – Genome-Wide Association Study – Studying network anomalies  Sample Implementations by PDL – Digital image processing – Bayesian Inference with MCMC – Parallel Bagged Decision Trees Doc1 Doc2 DocM Stem1 Stem2 StemM SQL & R Count1 Count2 Count M Network Interconnect Master Severs Segment Severs
  • 111. 132© Copyright 2015 Pivotal. All rights reserved. Finding Causal Variants in Lupus Customer Biotech Company Business Problem The customer wants to establish internal data science capabilities: building a culture and acquiring hardware and people to support it Challenges  Customer needs to establish a culture around sharing and analyzing datasets for value  Current in-house technology is unable to support large-scale analysis (e.g. unable to analyze genomics datasets)  Need to learn new paradigms for analyzing data at-scale Solution  Train customer employees on our solution stack, provide one-on-one consulting and run a hackathon  Greatly reduced computation time enabling implementation and results during a 30 hour period – Module previously requiring 30min took only 5sec in using SQL in database  Novel scientific discovery on untouched data – Dataset previously untouched for 2 years due to limited resources – Mine for statistically significant associations between ~400,000 variants for 1000 phenotypes
  • 112. 133© Copyright 2015 Pivotal. All rights reserved. Processing images and building integrated models at scale
  • 113. 134© Copyright 2015 Pivotal. All rights reserved. Image Computation Framework Hadoop Sequence File Thousands of Images Image Pre-Processing Features img1 [x1-xM] img2 [x1-xM] imgN [x1-xM] Feature Generation HDFS Map reduce Map reduce One sequence file
  • 114. 135© Copyright 2015 Pivotal. All rights reserved. Image Computation Framework Hadoop Sequence File Thousands of Images Image Pre-Processing Features img1 [x1-xM] img2 [x1-xM] imgN [x1-xM] Feature Generation HDFS HAWQ/GPDB Map reduce Map reduce Join to additional datasets Proteomics Medical History Variants Additional Datasets Build Models at-Scale SQLOne sequence file
  • 115. 136© Copyright 2015 Pivotal. All rights reserved. Image Computation Framework Hadoop Sequence File Thousands of Images One sequence file Image Pre-Processing Features img1 [x1-xM] img2 [x1-xM] imgN [x1-xM] Feature Generation HDFS HAWQ/GPDB Map reduce Map reduce Raw Pixels img1 [rgb1-rgbK] img2 [rgb1-rgbK] imgN [rgb1-rgbK] Map reduce Join to additional datasets Proteomics Medical History Variants Additional Datasets Build Models at-Scale SQL
  • 116. 137© Copyright 2015 Pivotal. All rights reserved. Image Computation Framework Hadoop Sequence File Thousands of Images One sequence file Image Pre-Processing Features img1 [x1-xM] img2 [x1-xM] imgN [x1-xM] Feature Generation HDFS HAWQ/GPDB Map reduce Map reduce Raw Pixels img1 [rgb1-rgbK] img2 [rgb1-rgbK] imgN [rgb1-rgbK] Map reduce PL/X SQL Join to additional datasets Feature Generation Features img1 [x1-xM] img2 [x1-xM] imgN [x1-xM] Proteomics Medical History Variants Additional Datasets Build Models at-Scale SQL
  • 117. 138© Copyright 2015 Pivotal. All rights reserved. Representing an image in HAWQ HAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations Source Image: Col Row 0 1 2 0 1 2 0 0 0 1 0 2 1 0 1 1 1 2 2 0 2 1 2 2 col row intsy Structured:
  • 118. 139© Copyright 2015 Pivotal. All rights reserved. Translating image processing to simple SQL Function Distribution of pixel intensities SQL SELECT intsy, count(*) FROM tbl GROUP BY intsy Output 150, 5 215, 4 HAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations  No data movement required  Simple SQL queries for data exploration 0 0 0 1 0 2 1 0 1 1 1 2 2 0 2 1 2 2 Source Image: Col Row 0 1 2 0 1 2 col row intsy Structured:
  • 119. 140© Copyright 2015 Pivotal. All rights reserved. Image Processing Pipeline For Object Counting Original Image name # Cells Tma_001.jpg 359 Tma_002.jpg 1892 Tma_003.jpg 871 … … Smoothing Average over window of pixels Thresholding Select pixels under intensity threshold Cleanup Min/max over window of pixels Object Detection Connected components Object Counting Select components with size filter
  • 120. 141© Copyright 2015 Pivotal. All rights reserved. Image Computation Framework Hadoop Sequence File Thousands of Images One sequence file Image Pre-Processing Features img1 [x1-xM] img2 [x1-xM] imgN [x1-xM] Feature Generation HDFS HAWQ/GPDB Map reduce Map reduce Raw Pixels img1 [rgb1-rgbK] img2 [rgb1-rgbK] imgN [rgb1-rgbK] Map reduce PL/X SQL Join to additional datasets Feature Generation Features img1 [x1-xM] img2 [x1-xM] imgN [x1-xM] Proteomics Medical History Variants Additional Datasets Build Models at-Scale SQL
  • 121. 142© Copyright 2015 Pivotal. All rights reserved. A Drug-Centric Data Lake to Enable Drug Discovery Customer A major pharmaceutical company Business Problem Identifying promising drug targets leveraging and integrating the vast datasets available will reduce time and cost to bring a new product to market Challenges • Data for drugs screens across multiple modalities cannot be easily integrated • Current environments cannot support the growing data form high-content screens • Researchers are unable to leverage the entirety of datasets, instead work with aggregates or summaries Solution • Proved that current customer models can be ported, sped up and scaled in the Pivotal environment • Created richer models integrating multiple types of data (genomics, images, etc) • Improve models using the raw, most-granular data • Demonstrate how the availability of tools and data enables scientists to interrogate models and derive a deeper, actionable insight
  • 122. 143© Copyright 2015 Pivotal. All rights reserved. http://blog.pivotal.io/data-science-pivotal Check out the Pivotal Data Science Blog!
  • 123. 144© Copyright 2015 Pivotal. All rights reserved. FOR FURTHER INFO, CHECKOUT… • Pivotal Blog @ http://blog.pivotal.io • Pivotal Academy @ https://pivotal.biglms.com
  • 124. 145© Copyright 2015 Pivotal. All rights reserved. 145© Copyright 2013 Pivotal. All rights reserved. Driving Insights from Data Lakes Manufacturing Demo Matthew Ross & Antonio Petrole Pivotal Data Engineering
  • 125. 146© Copyright 2015 Pivotal. All rights reserved. Internet of What????
  • 126. 147© Copyright 2015 Pivotal. All rights reserved. Industrial Internet of Things?
  • 127. 148© Copyright 2015 Pivotal. All rights reserved. IoT Goes Mainstream According to Gartner, Inc. (a technology research and advisory corporation), there will be nearly 26 billion devices on the Internet of Things by 2020.
  • 128. 149© Copyright 2015 Pivotal. All rights reserved. GE Doubles Down GE invests in IIoT cloud and creates Predix cloud built upon Pivotal Cloud Foundry. GE estimates that connecting these industrial machines to the IoT could boost global GDP by $10 trillion to $15 trillion in 20 years. McKinsey Global Institute research holds that IoT in general could add $6.2 billion to the global economy by 2025.
  • 129. 150© Copyright 2015 Pivotal. All rights reserved. Converging Trends Innovation New Data New Processes New Insights The Journey to the Data-Driven Enterprise Data Science and Machine Learning Big Data Internet of Things
  • 130. 151© Copyright 2015 Pivotal. All rights reserved. IoT Key Workflows Data Flow Management Reliable Infrastructure Enterprise Level Tooling ● High Availability ● Fault Tolerant ● Scalable ● Data Overload ● Normalization ● Multiple Sources and Destinations ● Workflow Orchestration ● Admin Tooling ● Developer Enablement
  • 131. 152© Copyright 2015 Pivotal. All rights reserved. IoT Key Workflows Data Flow Management Reliable Infrastructure Enterprise Level Tooling ● High Availability ● Fault Tolerant ● Scalable ● Data Overload ● Normalization ● Multiple Sources and Destinations ● Workflow Orchestration ● Admin Tooling ● Developer Enablement
  • 132. 153© Copyright 2015 Pivotal. All rights reserved. Data Flow Management-Data Overload ● The ability to stream and process massive amounts of data ● Must have a platform that can handle that much data without losing or corrupting any of it
  • 133. 154© Copyright 2015 Pivotal. All rights reserved. Data Flow Management-Data Normalization and Cleansing ● Organizing Fields to fit into a relational structure ● Adding extra fields or removing unneeded ones
  • 134. 155© Copyright 2015 Pivotal. All rights reserved. Data Flow Management-Multiple Sources and Destinations ● Stream data from multiple different sources ● Persist it so multiple different destinations ● Process multiple different data formats
  • 135. 156© Copyright 2015 Pivotal. All rights reserved. IoT Key Workflows Data Flow Management Reliable Infrastructure Enterprise Level Tooling ● High Availability ● Fault Tolerant ● Scalable ● Data Overload ● Normalization ● Multiple Sources and Destinations ● Workflow Orchestration ● Admin Tooling ● Developer Enablement
  • 136. 157© Copyright 2015 Pivotal. All rights reserved. Reliable Infrastructure-High Availability and Fault Tolerance ● Resources are available under any conditions ● System stays up even if some resources go down
  • 137. 158© Copyright 2015 Pivotal. All rights reserved. Reliable Infrastructure-Scalability ● Must be able to handle more traffic demand at any time ● Easily process big data workloads
  • 138. 159© Copyright 2015 Pivotal. All rights reserved. IoT Key Workflows Data Flow Management Reliable Infrastructure Enterprise Level Tooling ● High Availability ● Fault Tolerant ● Scalable ● Data Overload ● Normalization ● Multiple Sources and Destinations ● Workflow Orchestration ● Admin Tooling ● Developer Enablement
  • 139. 160© Copyright 2015 Pivotal. All rights reserved. Enterprise Level Tooling-Workflows and Tooling ● Manage complex flows of data ● Provide rich User Interface Applications
  • 140. 161© Copyright 2015 Pivotal. All rights reserved. Enterprise Level Tooling-Developer Enablement ● Full Featured APIs ● Extreme module customization if needed
  • 141. 162© Copyright 2015 Pivotal. All rights reserved. Are you making the most out of your data?
  • 142. 163© Copyright 2015 Pivotal. All rights reserved. Bringing it all together
  • 143. 164© Copyright 2015 Pivotal. All rights reserved. 164© Copyright 2013 Pivotal. All rights reserved. Reporting is nice, but being able to take action is what drives the value of a platform
  • 144. 165© Copyright 2015 Pivotal. All rights reserved. Predictive Analytics Proactive Monitoring Reactive Maintenance
  • 145. 166© Copyright 2015 Pivotal. All rights reserved. ETL vs Streaming ● Data is loaded in large batches ● Typically happens once a day ● Analysis can only be done once data is transformed and persisted to data warehouse
  • 146. 167© Copyright 2015 Pivotal. All rights reserved. ETL vs Streaming ● Continuous, data streams are “listening” for data being emitted from sensors ● Data can be analyzed in stream ● Can be integrated into data driven applications
  • 147. 168© Copyright 2015 Pivotal. All rights reserved. Reactive Maintenance ● Alert is sent out to someone on factory floor ● Worker must receive alert, and be able to react ● Equipment is down for however long it takes for worker to receive notification and complete repairs.
  • 148. 169© Copyright 2015 Pivotal. All rights reserved. Proactive Maintenance Workflow ● Manager has Dashboard with all gauges of the system ● Historical Log of Recent Alerts are visible on the Dashboard ● Has the ability to dispatch a worker to investigate a specific line or robot
  • 149. 170© Copyright 2015 Pivotal. All rights reserved. Predictive Analytics Workflow ● Run Machine Learning and Data Science models ● Incorporate Business Intelligence Tools ● Persist data to Data Lake and run advanced queries
  • 150. 171© Copyright 2015 Pivotal. All rights reserved. Data Streaming Needs an Agile, Scalable and Fast Solution Data Lake Data Ingestion Business Intelligence Real Time Analytics Mobile App
  • 151. 172© Copyright 2015 Pivotal. All rights reserved. Ingest Transform Sink SpringXD Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining HDB PHD Data Lake Extensible Open-Source Fault-Tolerant Horizontally Scalable
  • 152. 173© Copyright 2015 Pivotal. All rights reserved. INGEST / SINK PROCESS ANALYZE • No coding required • Dozens of built-in connectors • Seamless integration with Kafka, Sqoop • Create new connectors easily using Spring • Call Spark, Reactor or RxJava • Built-in configurable filtering, splitting and transformation • Out-of-box configurable jobs for batch processing • Import and invoke PMML jobs easily • Call Python, R, Madlib and other tools • Built-in configurable counters and gauges Spring XD State of the Art Data Pipeline Automation
  • 153. 174© Copyright 2015 Pivotal. All rights reserved. Pivotal HDB Hadoop Native SQL • Exceptional Hadoop Native SQL Performance • No compatibility risks to SQL developers or SQL BI tools and applications • Support query roll-ups, dynamic partitions and joins • Massive MPP scalability to petabytes • On premise or on the cloud • Scale your cluster out, not up • World class parallel loading and unloading • Fast performance for complex and advanced data analytics • Integrated with MADLib for advanced machine learning • Powerful Cost-based Query Optimizer
  • 154. BUILT FOR THE SPEED OF BUSINESS