Learn how HPE uses visual analytics within a data lake to create an “Industrial Internet of Things” model that solves their data analytics problem at scale.
How Hewlett Packard Enterprise Gets Real with IoT Analytics
1. Arcadia Data. Proprietary and Confidential
How Hewlett Packard Enterprise Gets Real with IoT Analytics
June 26, 2018
2. Featured Speakers
Dale Kim
Sr. Director, Products/Solutions
Arcadia Data
Siamak Nazari
Chief Software Architect, HP Fellow
Hewlett Packard Enterprise
3. Before We Begin Our Presentation
– If you have any questions along the way, please type them into the chat window.
– If you have audio problems, please chat with us for help.
– A recording of this presentation will be sent to you in a few days.
– Please live tweet! @arcadiadata #bigdata #analytics
5. Introduction to HPE Storage
As a start…
– What my business unit does
– HPE Storage is a business unit of Hewlett Packard Enterprise
– Produces efficient application-integrated data storage solutions ("Tier 1 storage arrays")
– Lets customers start small and scale without limits
– What I do
– Chief software architect for HPE 3PAR
– Set technical direction for the software
– Current focus on solid state storage systems and software systems for a new class of storage
6. What We Were Trying to Do
– Storage arrays create a massive amount of metadata
– Software, hardware, device, and sensor data
– Ongoing status/health, performance, configuration changes, diagnostic events
– Data analysis could benefit multiple functional teams
– Product management – improve the product
– Sales and marketing – identify new opportunities in the market
– Customer satisfaction – troubleshooting
– Needed to analyze data in a consumable way, with granular drill-down
– Data was already being collected as millions of text files
– Important to compare real-time data against historical trends
– Needed quick access across different teams to an accurate and complete picture
7. The Primary Challenges
We needed to address:
– Scale
– Tens of thousands of storage arrays at customer data centers
– Hundreds of millions of data points every 24 hours
– Volumes continued to grow
– Data format
– Text files needed to be converted into an analyzable format
– Speed and functional requirements for analytics
– Previous pilots on RDBMSs fell short
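The deck doesn't show how the text-to-analyzable conversion was done, but the "data format" challenge above can be sketched roughly as follows. The line format, field names, and regex here are hypothetical illustrations, not HPE's actual diagnostic format:

```python
import re
from datetime import datetime

# Hypothetical diagnostic line format, for illustration only:
#   2018-06-26T10:15:02 array=A1234 component=drive-17 event=MEDIA_ERROR sev=3
LINE_RE = re.compile(
    r"(?P<ts>\S+)\s+array=(?P<array>\S+)\s+component=(?P<component>\S+)"
    r"\s+event=(?P<event>\S+)\s+sev=(?P<sev>\d+)"
)

def parse_line(line):
    """Convert one raw text line into a structured record (dict), or None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None  # unparseable lines could be routed to a quarantine area
    rec = m.groupdict()
    rec["ts"] = datetime.fromisoformat(rec["ts"])  # typed timestamp
    rec["sev"] = int(rec["sev"])                   # typed severity
    return rec

record = parse_line(
    "2018-06-26T10:15:02 array=A1234 component=drive-17 event=MEDIA_ERROR sev=3"
)
```

At scale, a parser like this would run as a distributed job over the millions of collected files, emitting typed records into the lake rather than being run line by line on a single machine.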
8. Our Big Data Solution
– Adopt a true big data platform
– We turned to Apache Hadoop for the data storage
– Hadoop was not previously used
– Became the platform of record for big data
– Added a "BI on Hadoop" analytics platform
– Arcadia Data was architected for big data
– Runs directly on the Hadoop cluster
– Provided the speed, scale, global view, and access to granular data
– Use of Arcadia Data simplified many aspects of getting started with Hadoop
9. Our Results
We achieved:
– Successful deployment despite inexperience with Hadoop
– Moved to production within 6 months of starting from scratch
– Loading 60 GB/hour
– End-user experience requirements were addressed
– Single global view of utilization, duty cycles, upgrades, feature uptake, utilization trends, failure notifications, failure trends, etc.
– Arcadia Data query acceleration provided fast query results
– Data security was enabled to limit access to authorized users
– Current and historical data analysis
– Unstructured data analytics possible for the first time
10. Our Results (continued)
We achieved:
– Business opportunities are being realized
– Identify potential sales opportunities
– Examine and fix underutilized and poorly provisioned systems
– Study equipment and component reliability
– Offer suggestions to customers on product usage and future plans
– Processing time was drastically reduced
– Event analysis scripts formerly took hours or days
– Previously had to be customized for every analytical problem
– With the new solution, complicated queries return in seconds
11. Install Base Overview
Overall snapshot of the entire install base, as well as drill-down into the smallest component details
12. Trends and Usage
– Feature licenses over time brought insight to the product team about uptake, and to marketing for pricing
– Installations and software versions trend analysis
13. Failure Analysis and Anomaly Detection
Drive failures by type, model, and firmware, analyzed over time, helped quality control and customer service
14. Sales Opportunities and Threats Based on Free Space
– Systems with low free capacity reflect an opportunity to sell more capacity
– Systems with high free capacity reflect disuse and potential competitor threats
– Drill-down of various component capacities over time enables sales reps to have a more detailed conversation
15. Unstructured Data Analysis
– Filters for date-hour and system ID
– Filters to include and exclude event string patterns
– Subsequent event aggregation by system
– Exploratory query: 1m 0s
– Full data query: 8m 38s
– All events for 1 wk: 31 B
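The include/exclude pattern filters and per-system aggregation shown on this slide can be sketched in a few lines. The record fields and sample messages below are invented for illustration and don't reflect the actual product's data model:

```python
import re
from collections import Counter

# Hypothetical event records; real data would be queried from the data lake.
events = [
    {"system_id": "sys-01", "hour": "2018-06-25-14", "msg": "cache flush timeout"},
    {"system_id": "sys-01", "hour": "2018-06-25-15", "msg": "drive media error"},
    {"system_id": "sys-02", "hour": "2018-06-25-14", "msg": "drive media error"},
    {"system_id": "sys-02", "hour": "2018-06-25-14", "msg": "routine health check"},
]

def filter_events(events, include=None, exclude=None, hour=None, system_id=None):
    """Apply date-hour/system filters plus include/exclude string patterns."""
    out = []
    for e in events:
        if hour and e["hour"] != hour:
            continue
        if system_id and e["system_id"] != system_id:
            continue
        if include and not re.search(include, e["msg"]):
            continue
        if exclude and re.search(exclude, e["msg"]):
            continue
        out.append(e)
    return out

# Aggregate the matching events by system, as the dashboard does.
matches = filter_events(events, include=r"media error", exclude=r"health")
by_system = Counter(e["system_id"] for e in matches)
```

In the real deployment this filtering runs as a distributed query over billions of events, which is why the exploratory and full-data query times above matter.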
16. Combined Structured and Unstructured Analysis
1. Search for an event pattern and find the release versions that most hit that pattern
2. Clicking on the release version reveals the systems that most hit the pattern
3. Clicking on a system retrieves its most recent raw event log
4. The raw event log shows the surrounding context around the events
17. Recommendations
– If you have IoT data to analyze, think in terms of big data
– IoT analytics is a big data problem
– Start with big data technologies, especially if scale is an issue
– You might be successful with standard technologies on big data, but more than likely you'll spend more time on them than necessary
– Give multiple teams access to the data lake
– You can start small, but think long term as well
20. "Data" and "Platforms" Have Changed – Why Haven't BI Tools?
– Data – from: rows and columns; batch; smaller data volumes; limited # of sources; mainly internal
– Data – to: rows and columns and multi-structured; batch and interactive and real-time; small and large volumes; many sources; internal and external
– Platforms – from: tables; schema on write; proprietary hardware; ETL; data warehouses
– Platforms – to: tables and docs, search indexes, events; schema on write and schema on read; commodity hardware; ETL and ELT and ELDT; data warehouses and data lakes
– BI tools – from: SQL queries; extracts; cubes; BI servers; small/medium scale
– BI tools – to: why haven't BI tools evolved?
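One of the shifts listed above, schema on write versus schema on read, can be illustrated with a minimal sketch: raw records land untyped in the lake, and each query projects the fields it needs at read time, tolerating records written before a field existed. The field names here are hypothetical:

```python
import json

# Schema-on-write (RDBMS style) enforces the shape when data lands.
# Schema-on-read lets raw records land as-is; the schema is applied at
# query time. Field names below are invented for illustration.
raw_lines = [
    '{"array": "A1", "free_gb": 120, "fw": "3.3.1"}',
    '{"array": "A2", "free_gb": 950}',               # older record, no "fw"
]

def project(lines, fields, defaults=None):
    """Apply a schema at read time, tolerating missing fields."""
    defaults = defaults or {}
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f, defaults.get(f)) for f in fields}

rows = list(project(raw_lines, ["array", "free_gb", "fw"], {"fw": "unknown"}))
```

The design trade-off: schema on read defers modeling work and accepts evolving formats, at the cost of doing interpretation on every query.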
21. BI Built for Data Warehouses Fails Us in Data Lakes, Because…
– Agile only in name: the pathway to production is slow, requiring multiple steps, data duplication, and pre-summarization; time-to-insight is delayed. (Extract to an EDW? Summarize on a BI server? Replicate security? Acquire new hardware?)
– Inefficient scale: scaling to large data comes at the cost of reduced concurrent access for users (good at small data volumes and user counts, bad at large ones).
– Cannot handle data variety: big data is structured + real-time and streaming + complex + unstructured, spanning small and big, batch and streaming, internal and external, structured and multi-structured.
22. BI for Data Lakes Must Be Architected for Scale and Performance
– Data warehouse BI architecture (BI front-end → BI server → data warehouse):
– BI server can't scale out
– Significant data movement, modeling, and security management
– "Big data" BI architecture (BI front-end → edge node BI server → JDBC → DataNodes in the data lake cluster):
– Edge node BI server only scales via long planning
– Performance optimizations require heavy IT intervention
– Only passes SQL, with no semantic information (e.g., filters)
– Native BI within the data lake architecture (BI front-end → DataNodes + Arcadia in the data lake cluster):
– Scales linearly with DataNodes while retaining agility
– The semantic model is "pushed down" and distributed
– Highly optimized physical model "based on usage"
– No data movement; single security model
Native BI = "lossless," high-definition analytics
23. Data Warehouse BI Architecture
– Analytic process on a BI server over a data warehouse (RDBMS): load data → secure data → semantic layer → optimize physical
– Big data requirements: native connection, semi-structured, parallel, real-time
24. Data Lake BI Architecture
– Analytic process on a BI server, now over a data lake (HDFS, cloud object storage) as well as a data warehouse (RDBMS): load data → secure data → semantic layer → optimize physical
– Big data requirements: native connection, semi-structured, parallel, real-time
– You need a BI platform that runs natively within data lakes
25. Smart Acceleration Leverages What Is Learned during Data Discovery
– Query acceleration for scale, performance, and concurrency
– Ad hoc queries run against all granular data on the Hadoop cluster
– Arcadia Enterprise makes recommendations – build analytical views with a click
– Analytical views serve accelerated application queries
– Fast query responses
– Minimal modeling
– Live acceleration (no downtime)
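An analytical view is essentially a precomputed aggregate that answers repeated dashboard queries without rescanning the granular data. This is a minimal sketch of the idea, not Arcadia's actual implementation; the fact rows and column names are made up:

```python
from collections import defaultdict

# Hypothetical granular fact data: one row per array per day.
facts = [
    {"array": "A1", "region": "EMEA", "day": "2018-06-24", "io_ops": 10_000},
    {"array": "A1", "region": "EMEA", "day": "2018-06-25", "io_ops": 12_000},
    {"array": "A2", "region": "AMER", "day": "2018-06-25", "io_ops": 7_000},
]

def build_analytical_view(facts, group_by, measure):
    """Precompute an aggregate keyed by the grouping columns."""
    view = defaultdict(int)
    for row in facts:
        key = tuple(row[c] for c in group_by)
        view[key] += row[measure]
    return dict(view)

# Built once (e.g., from a usage-based recommendation), reused by many queries.
view = build_analytical_view(facts, group_by=["region"], measure="io_ops")

def query_region_total(region):
    # Answered from the small precomputed view instead of scanning all facts.
    return view.get((region,), 0)
```

The concurrency benefit follows from the size difference: many users hitting a small precomputed view is far cheaper than many users each scanning the full fact data.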
26. Sample Query Acceleration Comparisons
– Accelerated query times shown next to unaccelerated
– No numbers for unaccelerated queries – they did not return in a reasonable time frame
27. Visual Analytics and BI Native to Data Lakes
BI Native to Data Lakes Simplifies the Analytic Process
– Traditional BI server (time to insight/value in weeks or months; BI deployment delayed by weeks): land/secure data → transform to 3NF or star schema → build semantic layer → performance modeling (cubes/aggregates, in both places) → extract and secure → data movement → analytical discovery → production
– BI native to the data warehouse or data lake (time to insight/value in days): land/secure data → self-service discovery → AI-driven performance modeling → production ready; one security model, no movement of data
28. Native BI Unleashes the Power and Flexibility of Your Data Lake
– Scale without compromise
– Enable real-time, streaming analytics
– Unlock complex data not easily reachable before
– Act directly from your data discovery
– Optimize and productionize based on usage and need
29. Arcadia Is Native BI Built from the Ground Up for Data Lakes
– Advanced visualizations and semantic layer
– In-cluster for high performance and high concurrency
– Distributed BI on every data node of the data lake (Hadoop cluster)
– No data movement
– Unified security
– Single semantic layer
– Blend with additional data sources, including S3, Azure Data Lake Store, streaming (via KSQL), and other external data sources
30. Thank You
– Social media: @arcadiadata | arcadiadata.com
– Find more IoT information in our Resource Center: arcadiadata.com/resources
– Try Arcadia Instant – free download: arcadiadata.com/Instant
– Read our blog for more about big data: arcadiadata.com/blog
– Read more about how we help with the Internet of Things: https://www.arcadiadata.com/solutions/iot
– Gartner named Arcadia Data a 2017 Cool Vendor for IoT Analytics, April 2017
Editor's Notes
Before blockchain, there was the Internet of Things (IoT). But beyond the hype, how does IoT apply to real-world use cases?
Hewlett Packard Enterprise (HPE) is a Fortune 500 enterprise information technology company that makes tier 1 storage arrays used by data centers worldwide. To ensure quality service, HPE needed a data analytics platform that could:
(a) Monitor millions of incoming diagnostic data points each day
(b) Visualize this data at scale for internal business users and offices
Join us Jun 26th at 11 am PT to learn how HPE uses visual analytics within a data lake to create an “Industrial Internet of Things” model that solves their data analytics problem at scale. We will discuss:
How to scale IoT Analytics for 24/7 operations
The metrics and insights that benefit customers, support, and product line managers
Before and after: key metrics and results of scaling analytics securely across diverse users and teams to achieve business goals
How visualizing data at scale across diverse internal teams can help achieve business goals
How data lakes can be used to manage data at scale
Siamak Nazari is the chief software architect for HP 3PAR. In this role he is responsible for setting technical direction for HP 3PAR and its portfolio of software enhancements. His current area of focus is solid state storage systems and the software systems for a new class of storage systems. Nazari has over 25 years of experience working on distributed and highly available systems. He has been working on HP 3PAR technology since 2000, responsible for designing and implementing distributed memory management and the high availability features of the system.
Dale Kim is the senior director of products/solutions at Arcadia Data. His background includes a variety of technical and management roles at information technology companies. While Dale’s experience includes work with relational databases, much of his career pertains to nonrelational data in the areas of search, content management, NoSQL, and Hadoop/Spark, and includes senior roles in technical marketing, sales engineering, and support engineering. Dale holds an MBA from Santa Clara University, and a BA in computer science from Berkeley.
The range of data has expanded over the years to include complex structure, real-time, greater volumes, and many more sources.
As a result, platforms have evolved to handle the new types and sources of data. Platforms like Apache Hadoop have emerged as primary technologies for data lakes.
However, BI tools have not evolved to address the changing landscape. Most businesses still look to shoehorn BI tools into a modern data platform.
Context of discussion with Boris at Forrester:
Lossless (both ways) – metadata (we push down all the way to Hadoop; Tableau only passes through SQL), multidimensional, and security from Sentry up (it's not visible outside the cluster, and otherwise requires replication and online sync) – plus cost.
PP – What's the point of a cubing engine sitting behind a BI server? It performs calculations for the BI tool by leveraging information, fields, aggregates, common hierarchies, and filters – collectively called the semantic model. A cube merely takes those definitions and builds an optimized structure within that view/model of the world.
If the tool is not native, the optimization happens outside the cluster – unlike us, where we take this semantic model (either manually or learned on the fly with Smart Acceleration) and process it on the data nodes of the cluster.
The reason we can do this is that those engines have been built to be NON-PARALLEL … you cannot just run MSTR on 20 nodes and make it work.
We are built from the ground up as an MPP system … this is why we can do this and scale infinitely better.
This is why we are lossless. They have to resort to the lowest common denominator of SQL … they only pass SQL, which does not convey semantic information on filters, aggregations, etc. Those optimization opportunities are LOST.
How does this compare to BO issuing a query through a universe? PP – we short-circuit the 50 queries with AVs; SQL tools like Impala do not have aggregate join indexes and other optimizations.
The final point I want to call out is how we enable production deployments of analytical applications across thousands of concurrent users.
User concurrency is a big problem when using traditional tools on a data lake, which is why our analytical views provide a significant speed boost.
Our software makes recommendations on what to build, and with the click of a button, you can speed up your users’ queries with our patented data structures.
Arcadia Data lets you start with semantic modeling, perform discovery, and then quickly adjust your models in a tight feedback loop.
This eliminates a lot of time-consuming work that’s inserted between semantic modeling and discovery.
The optimize step requires no modeling by humans, as the AI-driven Smart Acceleration quickly identifies query optimizations.