The blessing and curse of today's database market? So many choices! While relational databases still dominate day-to-day business, a host of alternatives has evolved around very specific use cases: graph, document, NoSQL, hybrid (HTAP), column store; the list goes on. And the database tools market is teeming with activity as well. Register for this special Research Webcast to hear Dr. Robin Bloor share his early findings about the evolving database market. He'll be joined by Steve Sarsfield of HPE Vertica and Robert Reeves of Datical in a roundtable discussion with Bloor Group CEO Eric Kavanagh. Send any questions to info@insideanalysis.com, or tweet with #DBSurvival.
2. An Early Thought
Data lakes and databases are not very different things, irrespective of what the data lake enthusiasts claim.
3. Everything in Flux
• Hardware (network, storage, servers)
• Data Sources
• Data Staging
• Data Volumes
• Data Flow
• Data Governance
• Query Languages
• Data Usage
• Data Structures
• Schema Definition
• Ingest Speeds
• Data Workloads
• Applications
4. The Data Lake Picture
[Diagram: the DATA LAKE sits at the center. Ingest feeds it from sources such as servers, desktops, mobile, network devices, embedded chips, RFID, IoT, the cloud, OSes, VMs, log files, systems management apps, ESBs, web services, SaaS, business apps, office apps, BI apps, workflow, data streams, social, and more. Around the lake sit data cleansing, data security, metadata management, transform & aggregate, search & query, real-time apps, BI/visualization & analytics, other apps, data lake management, and data governance. Life cycle management extracts flow out to databases, data marts, other apps, and archive.]
• Data Lakes (Yes!):
  • Ingest points for data, for the sake of governance
  • Analytics sandboxes
  • Good places for cool and cold data, and hence archive
• Data Lakes (No!):
  • OLTP databases
  • Fast query engines
  • High user concurrency
  • Big Data analytics apps
  • Unusually structured data (NoSQL, graph, etc.)

You don't have one data lake; you have many.
Data lakes do not manage data well.
5. Streaming
There is a spectrum of streaming capability, and thus a spectrum of streaming platforms: Spark, in-memory DBMSs, SQLstream.
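Whatever the platform, the core streaming operation is the same: aggregate an unbounded event feed over time windows. A minimal sketch of a tumbling-window count in plain Python (the event stream and window size are invented for illustration; real platforms add distribution, fault tolerance, and late-data handling):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs):
    """Group (timestamp, key) events into fixed-size time windows
    and count occurrences of each key per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_secs)  # align to window boundary
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# A small simulated stream: (epoch seconds, event type).
stream = [(100, "click"), (103, "view"), (104, "click"),
          (112, "click"), (119, "view"), (121, "click")]

print(tumbling_window_counts(stream, window_secs=10))
# {100: {'click': 2, 'view': 1}, 110: {'click': 1, 'view': 1}, 120: {'click': 1}}
```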
6. Database Workload Parameters
• Read-intensive vs. write-intensive
• Mutable vs. immutable data
• Immediate vs. eventual consistency
• Short vs. long data latency
• Predictable vs. unpredictable data access patterns
• Simple vs. complex data types
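Of these parameters, immediate vs. eventual consistency is the one that most often surprises application teams. A toy two-replica store (purely an illustration of the trade-off, not a real database design) shows the difference:

```python
class Replica:
    def __init__(self):
        self.data = {}

class ToyStore:
    """Toy primary/secondary store contrasting immediate vs. eventual
    consistency. Illustration only; real systems use quorums, logs, etc."""
    def __init__(self, consistency="immediate"):
        self.consistency = consistency
        self.primary = Replica()
        self.secondary = Replica()
        self.pending = []  # writes not yet applied to the secondary

    def write(self, key, value):
        self.primary.data[key] = value
        if self.consistency == "immediate":
            self.secondary.data[key] = value   # synchronous replication
        else:
            self.pending.append((key, value))  # replicate later

    def sync(self):
        """Background replication catching the secondary up."""
        for key, value in self.pending:
            self.secondary.data[key] = value
        self.pending.clear()

    def read_secondary(self, key):
        return self.secondary.data.get(key)

eventual = ToyStore(consistency="eventual")
eventual.write("balance", 100)
print(eventual.read_secondary("balance"))  # None: the replica is stale
eventual.sync()
print(eventual.read_secondary("balance"))  # 100: the replicas converged
```

With `consistency="immediate"` every write blocks until both replicas agree, which is exactly the latency cost that eventual-consistency systems avoid.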
7. Horses for Courses
• Relational row-store databases for conventional OLTP
• Relational databases for ACID requirements
• Parallel databases (row or column) for unpredictable or variable query workloads
• Specialized databases for complex data query workloads (graph, etc.)
• NoSQL (KVS, DHT) for high-scale OLTP
• NoSQL (KVS, DHT) for low-latency, read-mostly data access
• NoSQL / Hadoop / Spark for scale-out batch analytic workloads
• Cloud databases can be any of the above
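The pairings above can be read as a rough decision table. A hedged sketch that encodes them as a heuristic (the trait names and the priority order are invented for illustration, not a product-selection guide):

```python
def suggest_engine(workload):
    """Map workload traits to the database families paired above.
    A rough heuristic only; real selection weighs many more factors."""
    if workload.get("oltp"):
        if workload.get("high_scale"):
            return "NoSQL (KVS/DHT)"
        return "Relational row store"
    if workload.get("acid"):
        return "Relational database"
    if workload.get("complex_data"):        # graph-shaped queries, etc.
        return "Specialized database (e.g. graph)"
    if workload.get("batch_analytics"):
        return "NoSQL / Hadoop / Spark (scale-out batch)"
    # Default: unpredictable or variable query workloads.
    return "Parallel database (row or column)"

print(suggest_engine({"oltp": True}))                      # Relational row store
print(suggest_engine({"oltp": True, "high_scale": True}))  # NoSQL (KVS/DHT)
```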
8. Database Tools
• Have you noticed that databases are not self-running?
• DBAs are in short supply, and the need for them is increasing.
• Database diversity doesn't help in this area.
• DBA tools:
  • SQL analysis
  • Performance analysis
  • Security management
  • Capacity planning
  • Database deployment
• We meet the same problem with data lakes, except that there are very few tools.
27. Picking a DB
Structure
• Does the data fit into a nice, clean data model?
• Will the schema lack clarity or be dynamic?
Analytics
• What question(s) do you want to ask of the data?
• Short-running queries
• Long, deep analytics, including predictive
Size
• Is the data "Big Data", or will it ever be big data?
Also:
• Cost per terabyte
• Staffing considerations
• Familiarity with technologies
• Company financials
• Company ancillary portfolio
• Community & openness
28. Security Analytics
Needing different kinds of analysis is common. Compare a weather application with security analytics:

Weather Application
• Short, fast queries: Tell me the current temperature and pressure.
• Deeper analytics with bigger data sets: What was the high/low for my area? For my region? What was the average temperature? The highest and lowest of all time?
• Machine learning and predictive: Can we predict conditions tomorrow?

Security Analytics
• Short, fast queries: Are there any attacks happening right now?
• Deeper analytics with bigger data sets: What IP and where are most of my events coming from? Has traffic spiked compared to historical levels? Has any event like this happened over the last three years?
• Machine learning and predictive: What new events should we be tracking to predict security events?
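The same table can serve both the short, fast query and the deeper aggregate; what differs is how much data each touches. A small sketch using Python's built-in sqlite3 module (the weather readings are made up for illustration):

```python
import sqlite3

# In-memory weather table standing in for both workload styles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, day TEXT, temp REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("Austin", "2017-05-01", 31.0),
    ("Austin", "2017-05-02", 29.5),
    ("Austin", "2017-05-03", 33.0),
    ("Boston", "2017-05-01", 18.0),
])

# Short, fast query: the latest reading for one city.
current = conn.execute(
    "SELECT temp FROM readings WHERE city = ? ORDER BY day DESC LIMIT 1",
    ("Austin",)).fetchone()[0]

# Deeper analytic query: high/low/average over the whole history.
high, low, avg = conn.execute(
    "SELECT MAX(temp), MIN(temp), AVG(temp) FROM readings WHERE city = ?",
    ("Austin",)).fetchone()

print(current, high, low, round(avg, 2))  # 33.0 33.0 29.5 31.17
```

The point query reads one row; the aggregate scans every row for the city, which is where columnar storage and compression start to pay off at scale.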
29. HPE Vertica
Core HPE Vertica SQL Engine:
• Advanced analytics
• Open ANSI SQL standards ++
• R, Python, Java, Spark, Scala
• In-database machine learning

HPE Vertica Enterprise:
• Columnar storage and advanced compression
• Maximum performance and scalability

HPE Vertica for SQL on Hadoop:
• Native support for ORC and Parquet
• Support for industry-leading distributions
• No helper node or single point of failure

HPE Vertica in the Cloud:
• Get up and running quickly in the cloud
• Flexible, enterprise-class cloud deployment options

All built on the same trusted and proven HPE Vertica core SQL engine.
30. The Appeal of Vertica
Each requirement, with its proof points:

Extreme Optimization
• Columnar design for high-performance analytics
• Aggressive compression
• Scalable to petabyte scale

Total Cost of Ownership
• Simple and predictable pricing
• No penalty for additional hardware or connected users

Ready for Your Enterprise
• SQL compliant with 100% of the TPC-DS benchmark queries
• Secure and ACID compliant
• No single point of failure

Open and Compatible
• Open platform: standards-compliant SQL, Python, Java
• Working with the open source community on Spark, Hadoop, Kafka, etc.
31. Vertica Enterprise: Unique Value to Expand the Data Warehouse

Customer information in Hadoop (Hive DDL):

    CREATE TABLE customer_visits (
        customer_id bigint,
        visit_num int)
    PARTITIONED BY (page_view_dt date)
    STORED AS ORC;

Customer information in the data warehouse (Vertica ROS storage); for example, finding customers who have placed no orders:

    SELECT customers.customer_id
    FROM orders RIGHT OUTER JOIN customers
        ON orders.customer_id = customers.customer_id
    GROUP BY customers.customer_id
    HAVING COUNT(orders.customer_id) = 0;

The Vertica engine can query data that sits BOTH in the data warehouse and in Hadoop; that is our unique value. Most solutions require that you move the data.

Use cases:
• Leveraging web logs to gain customer insight
• Sensor and IoT data for pre-emptive service
• Marketing program tracking
• Tracking the impact of application updates
• Many more
32. Machine Learning in Vertica 8.0.1
Algorithms, with example applications:

Linear Regression (Demand Forecasting): Model the demand for a service or good (response) based on its features (predictors); for example, demand for different models of laptops based on monitor size, weight, price, operating system, etc.

Logistic Regression (Engineering): Predict the likelihood that a particular mechanical part of a system will malfunction or require maintenance (response) based on operating conditions and diagnostic measurements (predictors).

K-means (Fraud Detection): Identify individual observations that don't align with a distinct group (cluster), and identify the types of clusters that are more likely to be at risk of fraud.

Naïve Bayes (Categorization): Probabilistically identify which group an item belongs to; used in email spam detection, language detection, sentiment analysis, and document sorting.

Vertica supports the whole workflow of predictive analytics.
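Outside the database, the k-means idea behind the fraud-detection example can be sketched in a few lines of plain Python: cluster the observations, then flag the one farthest from its own cluster center. The transaction data is invented, and the centers are seeded with the first k points for determinism; real systems would use the in-database functions or a proper library.

```python
import math

def kmeans(points, k, iters=10):
    """Tiny k-means (centers seeded with the first k points for determinism)."""
    centers = list(points[:k])
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return centers, assign

# Transactions as (amount, items) pairs: two normal groups plus one outlier.
txns = [(10, 1), (200, 5), (12, 1), (11, 2),
        (9, 1), (205, 6), (198, 4), (500, 50)]
centers, assign = kmeans(txns, k=2)

# Flag the observation farthest from its own cluster center as suspect.
dists = [math.dist(p, centers[a]) for p, a in zip(txns, assign)]
print(txns[dists.index(max(dists))])  # (500, 50)
```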
33. Perhaps the Ultimate Architecture Is All-Inclusive
Apache Spark, Hadoop, and Kafka alongside the data warehouse:

Data Warehouse (Vertica)
• Optimal use case: deep analysis, massive scale, many concurrent users

Data Lake (Hadoop)
• Optimal use case: data lake; warm and cold storage; data discovery; ETL
• Features: analyze in place without data movement via native ORC and Parquet readers; any Hadoop; run ON the Hadoop cluster or ON the Vertica cluster

Operational Analytics (Spark)
• Optimal use case: small, fast-running queries; ETL and complex event processing; operational analytics
• Features: Vertica performs optimized data load from Spark; Spark runs queries on Vertica data

Kafka
• Features: share data between applications that support Kafka; data streaming into Vertica
34. Vertica Makes Data Matter
Purpose-built for Big Data from the first line of code.
• Fast analytics: gain insight into your data 50x to 1,000x faster than legacy products.
• Massive scalability: infinitely scale your solution by adding an unlimited number of low-cost nodes.
• Open architecture: built-in support for Hadoop, R, and a range of ETL and BI tools.
• Optimized data storage: store 10x to 30x more data per server than row databases, with patented columnar compression.

HPE Vertica Community Edition
Download and install the community edition. Manage and analyze up to 1 TB of data across three nodes for an unlimited time. Try it on my.vertica.com.