Data is at the center of digital transformation; using data to drive action is how transformation happens. But data is messy, and it’s everywhere. It’s in the cloud and on-premises. It’s in different types and formats. By the time all this data is moved, consolidated, and cleansed, it can take weeks to build a predictive model.
Even with data lakes, efficiently integrating multi-structured data from different data sources and streams is a major challenge. Enterprises struggle with a stew of data integration tools, application integration middleware, and various data quality and master data management software. How can we simplify this complexity to accelerate and de-risk analytic projects?
The data warehouse, once seen as suitable only for traditional business intelligence applications, has learned new tricks. Join James Curtis from 451 Research and Pivotal’s Bob Glithero for an interactive discussion about the modern analytic data warehouse. In this webinar, we’ll share insights such as:
- Why, after much experimentation with other architectures such as data lakes, the data warehouse has reemerged as the platform for integrated operational analytics
- How consolidating structured and unstructured data in one environment (including text, graph, and geospatial data) makes highly parallel, in-database analytics practical
- How bringing open-source machine learning, graph, and statistical methods to the data accelerates analytical projects
- How open-source contributions from a vibrant community of Postgres developers reduce adoption risk and accelerate innovation
We thank you in advance for joining us.
Presenters: Bob Glithero, PMM, Pivotal, and James Curtis, Senior Analyst, 451 Research
2. Welcome!
James Curtis
Sr Analyst, Data Platforms & Analytics
james.curtis@451research.com
@jmscrts
www.451research.com
Bob Glithero
Principal Product Marketing Mgr
linkedin.com/in/glithero
@bglithero
www.pivotal.io
Bharath Sitaraman
Principal Product Manager
linkedin.com/in/bsitaraman
@bharath1028
www.pivotal.io
3. Agenda
● Expanding Analytics with EDW
● Integrating Data for Analytical Transformation
● Use Case: Layered Analytics in Cybersecurity
● Q&A
4. 451 Research is a leading IT research & advisory company
Founded in 2000
300+ employees, including over 120 analysts
2,000+ clients: Technology & Service providers, corporate advisory, finance, professional services, and IT decision makers
70,000+ IT professionals, business users and consumers in our research community
Over 52 million data points published each quarter and 4,500+ reports published each year
3,000+ technology & service providers under coverage
451 Research and its sister company, Uptime Institute, are the two divisions of The 451 Group
Headquartered in New York City, with offices in London, Boston, San Francisco, Washington DC, Mexico, Costa Rica, Brazil, Spain, UAE, Russia, Taiwan, Singapore and Malaysia
Research & Data
Advisory
Events
Go 2 Market
11. [Diagram labels] Enterprise applications, cloud storage, mobile apps, bots, IoT devices and sensors, social media, business users, data-driven applications, data scientists, decision makers, Hadoop, Spark, AI+ML, data analysts, IT pros, log and clickstream data, OT users, data warehouse
Which Leads to More Advanced Decision-Making Processes
13. Consolidate Analytical Frameworks
• Fewer systems to maintain
• Minimize data movement
• Analytic optimization
• Resource efficiency
[Diagram labels] Cloud storage, Hadoop, Spark, AI+ML, data warehouse
14. Consolidated Systems Enable In-Database Machine Learning
✔︎ Operate on all of the data, including varied types
✔︎ Algorithms optimized to the architecture
✔︎ No moving of data
✔︎ Leverage the use of SQL
15. Level Set on Machine Learning
● The terms ‘algorithm’ and ‘model’ are often used interchangeably, but they are not the same thing.
● An algorithm is a set of computational instructions, such as Random Forest.
● A model, in that context, is the result of applying a Random Forest to a dataset: the output that the algorithm produces from the data.
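The distinction can be sketched in a few lines of Python. This is an illustration only, not from the webinar: a one-threshold decision stump stands in for Random Forest to keep the example short, and the names `train_stump` and `StumpModel` are hypothetical. The function is the algorithm; the object it returns is the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StumpModel:
    """The model: the fitted output of the algorithm (one learned threshold)."""
    threshold: float

    def predict(self, x: float) -> int:
        return 1 if x > self.threshold else 0


def train_stump(samples: list[tuple[float, int]]) -> StumpModel:
    """The algorithm: fixed instructions that turn a dataset into a model.

    Tries each midpoint between adjacent distinct feature values and keeps
    the threshold with the fewest misclassified samples. Assumes at least
    two distinct feature values.
    """
    xs = sorted({x for x, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(threshold: float) -> int:
        return sum((1 if x > threshold else 0) != y for x, y in samples)

    return StumpModel(threshold=min(candidates, key=errors))


# One algorithm, two datasets, two different models.
model_a = train_stump([(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)])
model_b = train_stump([(10.0, 0), (20.0, 0), (30.0, 1), (40.0, 1)])
print(model_a.threshold, model_b.threshold)
```

The same `train_stump` call yields a different `StumpModel` for each dataset, which is why retraining on new data changes the model without changing the algorithm.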
21. In-Database Machine Learning Works Best When You...
(1) Understand the business problem (and rules) thoroughly
(2) Have a decent amount of data that is consolidated
(3) Have algorithms and tools for data analysis and preparation (optimized)
(4) Have algorithms for machine learning development (optimized)
(5) Have algorithms and tools for carrying out maintenance, validation, and updating
(6) Have methods for machine learning model deployment
[Diagram labels] Cloud storage, Hadoop, Spark, AI+ML, data warehouse
26. So How Can We Use Data Effectively?
● Over 30% of organizations have failed on big data projects
● Recent research says it takes an average of 52 days to build a predictive model
● Consolidating data and analytics in fewer environments simplifies modeling and deployment
27. 1. Converge Analytics and Data
● Run algorithms in a database, as close to the data as possible
● Leverage MPP architectures for rapid data science
● Avoid ETL by moving data only when necessary
● Integrate structured and unstructured data in one environment, reducing footprint of specialist databases
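A minimal sketch of running an algorithm close to the data. It assumes SQLite (Python's built-in sqlite3 module) as a small stand-in for an MPP database such as Greenplum; the table `readings` and its values are hypothetical. The database computes the aggregates for a simple linear regression, so only five numbers leave it instead of every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 7), (4, 9)],  # hypothetical data following y = 2x + 1
)

# The heavy lifting (scanning and summing rows) happens inside the database.
n, sx, sy, sxy, sxx = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x) FROM readings"
).fetchone()

# Closed-form least-squares fit from the five aggregates.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)
```

In an MPP system the same aggregate query would be split across segments and combined, which is the reason sums of this kind parallelize well.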
28. 2. Remove Friction from Data Science
● Test as many hypotheses as possible in parallel to find the most relevant features quickly
● Don’t repeatedly push and pull data into other environments to train and test
● Instead, develop a process that lets you
○ train a model with consolidated, cleansed data sets in the database
○ deploy the model as a pre-computed object
○ quickly retrain if data patterns change
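The train, deploy, and retrain loop above can be sketched as follows. This is an illustration under stated assumptions, not Pivotal's implementation: SQLite stands in for the data warehouse, the "pre-computed object" is a one-row table of coefficients, and the table names, data, and `RETRAIN_THRESHOLD` are all hypothetical.

```python
import sqlite3


def fit(conn):
    """Fit y = slope * x + intercept using SQL aggregates over `events`."""
    n, sx, sy, sxy, sxx = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x) FROM events"
    ).fetchone()
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n


def deploy(conn, slope, intercept):
    """Store the coefficients as a one-row table: the pre-computed model."""
    conn.execute("DELETE FROM model")
    conn.execute("INSERT INTO model VALUES (?, ?)", (slope, intercept))


def mean_abs_error(conn):
    """Score every row against the deployed model without moving the data."""
    (mae,) = conn.execute(
        "SELECT AVG(ABS(e.y - (m.slope * e.x + m.intercept))) "
        "FROM events e, model m"
    ).fetchone()
    return mae


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (x REAL, y REAL)")
conn.execute("CREATE TABLE model (slope REAL, intercept REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, 2), (2, 4), (3, 6)])
deploy(conn, *fit(conn))
error_before = mean_abs_error(conn)

# The data pattern changes: the same inputs now map to different outputs.
conn.execute("DELETE FROM events")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, 5), (2, 10), (3, 15)])
error_drifted = mean_abs_error(conn)

RETRAIN_THRESHOLD = 1.0  # hypothetical tolerance for model error
if error_drifted > RETRAIN_THRESHOLD:
    deploy(conn, *fit(conn))  # retrain in place, next to the data
error_after = mean_abs_error(conn)
print(error_before, error_drifted, error_after)
```

Because scoring is a join against the stored coefficients, applications can use the model with plain SQL, and a retrain only replaces one row.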
29. 3. Choose a Data Platform for Rapid Analytics
Some algorithms are iterative, like clustering or graphing
Some algorithms can be parallelized, like random forests
Sometimes patterns in the underlying data change, so models become obsolete
30. 4. Start with Standard Statistical Methods and ML
A lot of useful data science can be done with standard algorithms
Exotic algorithms need more data to train
Standard algorithms can be combined into ensembles for greater predictive power
Starting with simpler algorithms and ensembles paves the way for more advanced data science
Source: “A Few Useful Things to Know About Machine Learning,” Pedro Domingos, Communications of the ACM, October 2012
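As a rough illustration of combining simple methods into an ensemble, the sketch below uses majority voting over three hand-written rules. The rules, field names, and thresholds are hypothetical placeholders for trained standard models; only the voting step is the point.

```python
from collections import Counter
from typing import Callable, Sequence

Classifier = Callable[[dict], str]


# Three deliberately simple classifiers (hypothetical rules and thresholds).
def by_failed_logins(event: dict) -> str:
    return "suspicious" if event["failed_logins"] > 5 else "normal"


def by_off_hours(event: dict) -> str:
    return "suspicious" if event["hour"] < 6 else "normal"


def by_new_destinations(event: dict) -> str:
    return "suspicious" if event["new_destinations"] > 3 else "normal"


def majority_vote(classifiers: Sequence[Classifier], event: dict) -> str:
    """Ensemble prediction: the label chosen by the most classifiers."""
    votes = Counter(classifier(event) for classifier in classifiers)
    return votes.most_common(1)[0][0]


ensemble = [by_failed_logins, by_off_hours, by_new_destinations]
flagged = majority_vote(ensemble, {"failed_logins": 8, "hour": 3, "new_destinations": 1})
cleared = majority_vote(ensemble, {"failed_logins": 0, "hour": 2, "new_destinations": 0})
print(flagged, cleared)
```

A single off-hours login does not trip the ensemble on its own, which is the sense in which combining weak signals can reduce false positives.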
31. But Don’t Just Take Our Word for It...
“As we built the ML model, we were
surprised to learn that none of the most
hyped data science tools — such as deep
learning, AutoML, and ‘AI that creates AI’ —
were needed to make it work.”
32. Pivotal Solutions for Data and Analytics
● Pivotal Greenplum: Multi-cloud MPP data platform for complex analytics with diverse data locality and data types
● Pivotal GemFire: Fast, transactional in-memory grid for rapid data refresh
● Pivotal Cloud Cache: On-demand in-memory caching for cloud native apps
● Pivotal Cloud Foundry: Proven solution for operationalization of analytics and software-led digital transformation
● Pivotal Data Science: World-class data science consulting to drive more insights from data for data-driven applications
● Apache MADlib: Distributed, in-database analytical library for large-scale data sets
Portfolio attributes: complete portfolio; multi-cloud and on premises; based on open source; flexible licensing; advanced data services
33. Consolidating Diverse Data Enables New Use Cases
New use cases for locations, flows, connections, relationships, and intent
● Native Graph (Greenplum): Relationship Intelligence
● GPText (Greenplum): Fast Search and Semantic Intelligence
● PostGIS (Greenplum): Location Intelligence
● Integrated, Cleansed Data
34. Text Analytics with GPText
Extracts content and structure from many binary formats
Fast heterogeneous document indexing, search, and retrieval
Massive parallelism for
● Topic modeling
● Named entity recognition
● Term frequency
● Stemming
● Topic graph
● Topic cloud
● NLP
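To make two of the listed operations concrete, here is a small sketch of stemming followed by term frequency. It does not use the GPText API; the suffix list is a toy rule set (a real system would use a proper stemmer), and the function names are hypothetical.

```python
import re
from collections import Counter

SUFFIXES = ("ing", "ed", "s")  # toy rules for illustration, not a real stemmer


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequency(text: str) -> Counter:
    """Count stemmed, lowercased alphabetic tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(token) for token in tokens)


tf = term_frequency("Indexing documents: the index indexed two documents.")
print(tf.most_common(2))
```

Stemming first means "indexing", "index", and "indexed" count as one term, so frequency reflects the concept rather than the surface form.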
36. Attacks go unnoticed for long periods
Insider Threats Increasingly Evade Network-Level Visibility
Source: Verizon, 2017, n = 77
37. Layering Analytic Techniques for a More Complete Picture
● Understand activity: Network-level intelligence (firewalls, IDS, SIEMs)
● Understand behavior: Understanding user/entity behavior (graphing, clustering, predictive analytics)
● Understand intent: Semantic understanding (e.g., text analytics, NLP)
[Diagram axis] General to Specific
38. Engineering-level view misses higher-level activity
Understanding Activity - Network Intelligence
● SIEM/IDS are useful if activity matches a signature or rule
● Small-ish data sets, while APTs unfold over months or years
● Inflexible schema that is difficult to extend with user-level attributes
● A user is not the same as a device or IP address
[Diagram labels] SIEM: log collection, log analysis, event correlation, log forensics, object access auditing, alerts, reports, log monitoring, log retention, file integrity monitoring
39. Advanced Analytics Reveal Hidden Patterns
Reveal latent/invisible patterns that stand out from normal behavior
● Reconnaissance
● Privilege escalation attempts
● Access attempts
● Unusual data flows or exfiltration attempts
40. Predictive Analytics for Understanding Behavior
Chaining models increases predictive power, decreases false positives
Training data
● Kerberos authentication events (source, destination, account, type, success/failure, etc.)
● 10K users, 13K nodes
● >110M events
Model features
● # of distinct destinations logged in to
● # of distinct sources logged in from
● # of distinct destination user accounts
● # of distinct processes started by user
Lateral Movement Ensemble Model
• Regression analysis
• Constrained diameter authentication graph
• Robust rank aggregation
Outcomes
Signals of credential takeover, running scripts, and other behavioral anomalies
https://content.pivotal.io/blog/insider-threat-detection-detecting-variance-in-user-behavior-using-an-ensemble-approach
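Two of the model features listed on this slide (distinct destinations and distinct sources per user) can be derived from authentication events roughly as follows. This is a sketch only: the `AuthEvent` fields, host names, and sample events are hypothetical and do not reflect the actual Kerberos log schema or the pipeline in the linked blog post.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class AuthEvent(NamedTuple):
    """Hypothetical, simplified authentication event."""
    user: str
    source: str
    destination: str
    success: bool


def distinct_counts(events: Iterable[AuthEvent]) -> dict:
    """Per user, count distinct destinations logged in to and sources logged in from."""
    destinations = defaultdict(set)
    sources = defaultdict(set)
    for event in events:
        destinations[event.user].add(event.destination)
        sources[event.user].add(event.source)
    return {
        user: {
            "distinct_destinations": len(destinations[user]),
            "distinct_sources": len(sources[user]),
        }
        for user in destinations
    }


events = [
    AuthEvent("alice", "ws1", "db1", True),
    AuthEvent("alice", "ws1", "db1", True),
    AuthEvent("alice", "ws1", "mail", True),
    AuthEvent("bob", "ws2", "db1", True),
    AuthEvent("bob", "ws9", "db2", False),
    AuthEvent("bob", "ws7", "hr", True),
]
features = distinct_counts(events)
print(features)
```

In a database, the same features would typically be a `COUNT(DISTINCT ...)` grouped by user, computed where the events already live.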
41. Ensemble of methods reveals hidden patterns that signal problem behavior
Revealing the Needle in the Haystack
Surge in access attempts at regular intervals indicates a possible background script
Regular surges in logins from unusual user accounts indicate possible account takeovers
Graph reveals two access attempts to a sensitive server indirectly via other servers
42. Understanding Intent with Semantic Intelligence
Semantic intelligence can clarify ambiguous signals from other analytics or security appliances
Scan for interesting words, words in proximity, variations
Scan and index content - “Is this document leaving the network similar to other known sensitive documents?”
Scan document to suggest possible classification tags
✔ “sorry, i sent this file by mistake”: Benign, no further action
! “please don’t share this file outside the organization”: Investigate
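The "words in proximity" scan can be illustrated with a short sketch. It is a simplification under stated assumptions: the function name, the three-token window, and the choice of the words "file" and "outside" are hypothetical, and a real classifier would weigh far more context than one word pair.

```python
import re


def words_in_proximity(text: str, first: str, second: str, window: int = 3) -> bool:
    """Return True if `first` and `second` occur within `window` tokens of each other."""
    tokens = re.findall(r"[a-z']+", text.lower())
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    return any(abs(i - j) <= window for i in first_positions for j in second_positions)


# The two example messages from the slide.
warning = "please don't share this file outside the organization"
apology = "sorry, i sent this file by mistake"

print(words_in_proximity(warning, "file", "outside"))
print(words_in_proximity(apology, "file", "outside"))
```

A hit like this is only a signal to route a message for investigation; it says nothing by itself about intent.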
44. Digital Transformation is Real
T-Mobile goes from 7 months and 72 steps to update software to same-day deployments.
Liberty Mutual builds and deploys an MVP in one month and delivers a revenue-generating version just months later.
Comcast supports over 1,500 developers with an operator team of 4 people.
The Home Depot ships to production 1,500 times a month, and 17,000 times a month to all environments.
Leading companies trust Pivotal as a transformation partner
46. Data is the Key to Transformation
● Consolidating data simplifies modeling and deployment
● Executing massively parallel analytics in the database speeds complex use cases
● Pivotal Greenplum integrates diverse data with machine learning at scale for faster value with less risk
47. Start Your Data Transformation Journey Today!
Pivotal Greenplum
pivotal.io/pivotal-greenplum
Pivotal Data Science
pivotal.io/data-science
Apache MADlib
madlib.apache.org
Greenplum Database Channel