Breaking the Silos: Storage for Analytics & AI

Agenda
• IBM Software Defined Storage for Analytics & AI
• IBM AI Infrastructure Reference Architecture
• Why are customers choosing IBM Spectrum Scale storage for Hadoop?
• Popular analytics use cases with IBM Spectrum Scale storage

IBM Spectrum Scale is a flexible and scalable software defined file storage
[Diagram: a global namespace, powered by IBM Spectrum Scale, with automated data placement and data migration]
• Storage tiers: Flash, Disk, Tape, Shared Nothing Cluster, Transparent Cloud Tier, JBOD/JBOF (Spectrum Scale RAID)
• Access protocols: NFS, SMB, POSIX, HDFS, Object
• Workloads: HPC, Genomics, traditional applications, new-gen applications
Enterprise class functionality:
Encryption
Compression
Synchronous Replication
Asynchronous Replication
Backup
Disaster Recovery
Audit Logging
4000+ clients
IBM Spectrum Scale supports file systems with sizes of tens of petabytes that contain billions of files and can be
accessed by thousands of nodes in a cluster.
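
The "automated data placement and data migration" idea on this slide can be illustrated with a small sketch. This is not Spectrum Scale's policy engine or its rule syntax: the tier names, the age thresholds and the `choose_tier` and `plan_migration` helpers are hypothetical, chosen only to show how a rule might route files to flash, disk or tape by last-access age.

```python
from dataclasses import dataclass

# Hypothetical thresholds (days since last access); not IBM defaults.
HOT_MAX_DAYS = 7
WARM_MAX_DAYS = 90


@dataclass
class FileInfo:
    path: str
    days_since_access: int


def choose_tier(f: FileInfo) -> str:
    """Return the illustrative tier for a file based on last-access age."""
    if f.days_since_access <= HOT_MAX_DAYS:
        return "flash"
    if f.days_since_access <= WARM_MAX_DAYS:
        return "disk"
    return "tape"


def plan_migration(files):
    """Group file paths by the tier the rule assigns them to."""
    plan = {"flash": [], "disk": [], "tape": []}
    for f in files:
        plan[choose_tier(f)].append(f.path)
    return plan


# Example files with made-up paths and ages.
files = [
    FileInfo("/data/today.csv", 0),
    FileInfo("/data/last_month.csv", 30),
    FileInfo("/data/2015_archive.csv", 900),
]
plan = plan_migration(files)
print(plan)
```

Because all tiers sit under one namespace, a rule of this kind would change where a file is stored without changing the path applications use to reach it.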

IBM Spectrum Scale – Deployment models
• Software: install the software on your own choice of industry-standard x86 or POWER servers
• Pre-built systems: Elastic Storage Server (ESS) with Spectrum Scale SW RAID
• Cloud services: Spectrum Scale can be deployed on IBM Cloud and Amazon Web Services (AWS)
[Figure: hardware illustrations labeled Spectrum Scale, ESS 5U84 Storage, EXP3524 and System x3650 M4]

The Power of ONE
One enterprise end-to-end solution for big data
#1 open source Hadoop platform + IBM’s leading value adds
• #1 pure open source Hadoop distribution
• 1300+ customers and 2100+ ecosystem partners
• Employs the original architects, developers and operators of Hadoop from Yahoo!
• Best-in-class 24x7 customer support
• Leading professional services and training
+
• #1 SQL engine for complex, analytical workloads
• #1 data science platform (Source: Gartner)
• Leader in on-premises and hybrid cloud solutions
• OpenPOWER performance leadership
• Software defined storage with unmatched scalability

IBM Systems: A Reference Architecture for AI Infrastructure
June 2018

June 19th Announcement
IBM Systems is announcing IBM PowerAI Enterprise and an AI infrastructure reference architecture for on-premises AI deployments.
IBM Systems is addressing the challenges organizations face as they experiment with PoCs, grow into multitenant production systems, and then expand to enterprise scale, all while integrating into the organization’s existing IT infrastructure.
With a set of easy-to-use, integrated software tools built on optimized, accelerated hardware, the architecture enables organizations to jump-start AI and deep learning projects, speeds time to model accuracy, and provides enterprise-grade security, interoperability and support.

AI Examples in Every Industry
• Autonomous driving
• Accident avoidance
• Location-based advertising
• Sentiment analysis of what’s hot, problems
• Market prediction
• Fraud/risk
• Experiment sensor analysis
• Drilling exploration sensor analysis
• Consumer sentiment analysis
• Sensor analysis for optimal traffic flows
• Smart meter analysis for network capacity
• Threat analysis: social media monitoring, video surveillance
• Clinical trials, drug discovery, genomics
• People & career matching
• Patient sensors, medical image interpretation
• Captioning, search, real-time translation
• Mfg. quality
• Warranty analysis

Data Science is a Team Sport and Iterative
• Workflow stages: Extract Data, Prepare Data, Build Models, Train Models, Evaluate, Deploy, Use Models, Monitor, Monetize ($$$)
• Roles: Biz Analyst, Data Engineer, Data Scientist, App Developer, Dev Ops
• IT supports and services the complete workflow
Building cognitive apps using deep learning requires multiple skillsets and a connected infrastructure for data, development and iteration.
A common data platform and workflow are crucial for enterprise success.

This is not easy…
91% of I&O leaders across inquiries cited "data" as a main inhibitor of AI initiatives.
Source: Gartner, "AI State of The Market - and Where HPC intersects"

Workflow and data flow are complex
• Data sources (years of data): traditional business, IoT & sensors, collaboration partners, mobile apps & social media, legacy
• Data preparation (weeks & months, heavy IO): data cleansing & pre-processing, producing a training dataset and a testing dataset
• Build, train, optimize models (days & weeks, iterate): AI deep learning frameworks (TensorFlow & Caffe), distributed & elastic deep learning, parallel hyper-parameter search & optimization, network models, hyper-parameters, instrumentation, monitor & advise
• Inference (seconds to results): deploy the trained model in production and apply it to new data

Data requirements vary significantly
• Data source (years of data): production data, sensor data, data from collaboration partners, data from mobile apps and social media, legacy data
• Data preparation (hours of preparation): pre-processing into a training dataset and a testing dataset
• Model training (weeks and months of training, iterate): AI deep learning frameworks (TensorFlow & IBM Caffe), distributed & elastic deep learning (Fabric), parallel hyper-parameter search & optimization, network models, hyper-parameters, instrumentation, monitor & advise
• Inference (seconds to results): deploy the trained model in production and apply it to new data
Data requirements across the workflow:
• Data variety
• Data quantity
• Geo-dispersed, on-prem & cloud
• Data efficiency
• Data quality
• Data gravity
• HDFS/Spark
• Model velocity
• Workflow integration
• Data access density
• Data velocity: low latency, high throughput
• Data caching
• Data security, governance and resilience
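
Two steps named in this workflow, splitting data into training and testing datasets and running a parallel hyper-parameter search, can be illustrated with a toy sketch. It uses only the Python standard library and a one-weight linear model rather than TensorFlow or Caffe; the data, the candidate learning rates and the `train_and_evaluate` helper are assumptions made for illustration, and the thread pool only demonstrates the dispatch pattern, not the distributed training the slide refers to.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Toy data: y = 3x, so the ideal weight is 3.0.
points = [(i / 10, 3 * i / 10) for i in range(100)]
random.Random(0).shuffle(points)
train, test = points[:80], points[80:]  # training and testing datasets


def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train_and_evaluate(lr, epochs=50):
    """Fit w by gradient descent on the training set, score on the test set."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= lr * grad
    return {"lr": lr, "w": w, "test_mse": mse(w, test)}


# Each hyper-parameter candidate is trained and scored in its own worker.
learning_rates = [0.0001, 0.001, 0.01, 0.1]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_and_evaluate, learning_rates))

# Keep the candidate with the lowest error on the held-out testing dataset.
best = min(results, key=lambda r: r["test_mse"])
print(best["lr"], round(best["w"], 3))
```

Every candidate reads the full training set on every epoch, which is why this stage of the workflow is sensitive to storage throughput and caching.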

© Copyright IBM Corporation 2017
IBM AI Architecture from Experimentation to Expansion
• Experimentation (single tenant): data scientists’ workstations; IBM Power Systems AC922 with internal SAS drives & NVMs
• Stabilization & Production (secure multitenant): training & inference cluster (IBM Power Systems AC922, LC921 & LC922); high-speed network subsystem; IBM Elastic Storage Server (ESS); existing organization infrastructure
• Expansion (enterprise scale / multiple lines of business): master & failover master nodes (IBM Power Systems LC921 & LC922); login nodes (IBM Power Systems LC921 & LC922); training cluster (IBM Power Systems AC922); high-speed network subsystem; IBM Elastic Storage Server (ESS); existing organization infrastructure
One software stack from experimentation to expansion:
• IBM PowerAI Enterprise
• Red Hat Enterprise Linux (RHEL)
• IBM Power System & x86 servers
• IBM Spectrum Scale / IBM Elastic Storage Server (ESS)
• Services & Support

AI Adoption Cycle
Stages: Experimentation, Production, Expansion
- Single node
- Single user/tenant
- Small scale data
- Algorithm prototyping, hyperparameter optimization
- Expanding use cases
- Multi-node
- Cluster
- Medium scale data
- Security
- Data Science Shared Service
- Multitenant
- Upstream data pipeline
- Model iteration
- Scalable inference

AI Data Journey
Hadoop and Spark are the choice for the data pipeline.

Why are customers choosing IBM Spectrum Scale storage with Hadoop?

Why are customers choosing Spectrum Scale storage for Hadoop?
Faster ingest, unmatched scalability, up to 60% less storage footprint for Hadoop workloads
1. Reduce datacenter footprint and get faster ingest with in-place analytics
- Access the data using any of the industry-standard protocols: NFS, SMB, POSIX, Object, HDFS API
- No need to maintain separate copies for different applications
2. Grow storage independent of compute with the best data protection technology
- Grow storage independent of compute with the pre-integrated ESS system
- Eliminate the need for 3 copies of data with SW RAID
- Faster disk rebuilds, no data corruption
3. Extreme scalability with parallel file system architecture
- Every node serves both data and metadata
- Scale to billions of files
- No centralized metadata node bottleneck
4. Global namespace that spans geographies
- Stretch clusters and active-active replicas of data for real-time global collaboration
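
To make the "one copy, many protocols" point concrete, the sketch below translates between a POSIX path and an HDFS URI that would name the same file. It assumes the HDFS connector exposes one directory under the file system mount point as the HDFS root. The mount point, directory name, endpoint and both helper functions are hypothetical placeholders, not values or APIs taken from the slides or from IBM documentation.

```python
from pathlib import PurePosixPath

# All three values are hypothetical placeholders for illustration.
MOUNT_POINT = PurePosixPath("/gpfs/fs1")     # where the file system is mounted
HDFS_ROOT_DIR = MOUNT_POINT / "datalake"     # directory assumed to back the HDFS root
NAMENODE = "namenode.example.com:8020"       # assumed HDFS endpoint


def posix_to_hdfs_uri(posix_path: str) -> str:
    """Translate a POSIX path under HDFS_ROOT_DIR into the matching hdfs:// URI."""
    # relative_to raises ValueError if the path lies outside HDFS_ROOT_DIR.
    relative = PurePosixPath(posix_path).relative_to(HDFS_ROOT_DIR)
    return f"hdfs://{NAMENODE}/{relative}"


def hdfs_uri_to_posix(uri: str) -> str:
    """Translate an hdfs:// URI on NAMENODE back into the POSIX path."""
    prefix = f"hdfs://{NAMENODE}/"
    if not uri.startswith(prefix):
        raise ValueError(f"URI is not served by {NAMENODE}: {uri}")
    return str(HDFS_ROOT_DIR / uri[len(prefix):])


posix_path = "/gpfs/fs1/datalake/sensors/2018/day1.csv"
uri = posix_to_hdfs_uri(posix_path)
print(uri)
print(hdfs_uri_to_posix(uri))
```

Under this assumption, an ingest job could write the file through POSIX and a Hadoop job could read it through the HDFS API without an intermediate copy.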

Data Lake: Up to 60% less storage footprint
[Diagram: Ingest (Object, File), Direct Access (POSIX), Raw Data, Analysis]
Less hardware
• HDFS Shared Nothing: 15 PB of physical for 5 PB usable
• Spectrum Scale on ESS: 6.5 PB of physical for 5 PB usable
Analytics in place
• No need to maintain copies of data for traditional applications
and analytics applications
Multi-purpose shared data lake
• Shared by Hadoop and many other use cases
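
The "up to 60%" figure can be checked against the two capacity numbers quoted on this slide. The sketch below uses only those numbers; the variable and function names are illustrative.

```python
# Capacity figures quoted on the slide, in petabytes.
USABLE_PB = 5.0
HDFS_PHYSICAL_PB = 15.0   # HDFS shared nothing
ESS_PHYSICAL_PB = 6.5     # Spectrum Scale on ESS


def raw_per_usable(physical_pb: float, usable_pb: float) -> float:
    """Physical capacity required per unit of usable capacity."""
    return physical_pb / usable_pb


hdfs_ratio = raw_per_usable(HDFS_PHYSICAL_PB, USABLE_PB)
ess_ratio = raw_per_usable(ESS_PHYSICAL_PB, USABLE_PB)

# Fraction of physical capacity saved by the ESS configuration.
reduction = 1 - ESS_PHYSICAL_PB / HDFS_PHYSICAL_PB

print(f"HDFS shared nothing: {hdfs_ratio:.1f}x physical per usable PB")
print(f"Spectrum Scale on ESS: {ess_ratio:.1f}x physical per usable PB")
print(f"Physical footprint reduction: {reduction:.1%}")
```

The 3.0x ratio for HDFS matches keeping three copies of the data, and the reduction works out to roughly 57%, which is consistent with the "up to 60%" headline.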

HDP on Power with Elastic Storage Server
• Improve TCO
Up to 3X reduction of storage and compute infrastructure when moving to Power Systems and Elastic Storage Server versus commodity scale-out x86. Less infrastructure means reduced costs in many areas (energy, cooling, server administration, floor space, SW licensing).
• Position for future growth and avoid hitting the data center wall with cluster sprawl
Separating storage from compute enables the selection of the best compute node for the workload, and Power has the greatest range of options.
[Diagram: IBM Power nodes running HDP services and the Spectrum Scale client (C = Spectrum Scale Client + HDFS Connector) connect over InfiniBand (RDMA) / 40 GigE / 10 GigE to ESS (ESS = Elastic Storage Server, powered by Spectrum Scale)]

Popular analytics use cases with IBM Spectrum Scale storage

Challenges …
• Expensive EDW (Enterprise Data Warehouse) setups
• Silos of infrastructure for various analytics workflows
• Multiple copies of the same data
• Time-consuming data ingest cycle
• Unmanageable analytics cluster sprawl

Popular use cases that help eliminate analytics silos
I. EDW Optimization
Optimize the data warehouse by shifting the right workload to Hadoop
Reduce cost & improve efficiency
II. Integrated HPC and Hadoop
Efficiently transform data into insights with a single data lake for HPC & Hadoop
Faster & better insights
III. Hadoop Storage Tiering
Disaggregate storage and compute for better utilization
Reduce cluster sprawl
IV. Unified Analytics Workflows
Single data lake for Hadoop and non-Hadoop analytics
Improve data governance

I. EDW Optimization
Optimize the data warehouse by shifting the right workload to Hadoop
Archive data away from the EDW
- Move cold or rarely used data to Hadoop as an active archive
- Store more data for longer
Offload costly ETL processes
- Free your EDW to perform high-value functions like analytics & operations, not ETL
- Use Hadoop for advanced ETL
Optimize the value of your EDW
- Use Hadoop to refine new data sources, such as web and machine data, for new analytical context
Reduce migration effort & skillset gap
- Use existing investment in Oracle/DB2/Netezza skills
- BigSQL allows you to migrate applications without major code rewrites and additional SQL development
Control cluster sprawl
- Grow storage independent of compute with ESS
- POWER servers deliver 1.7x throughput compared to Hortonworks on x86
- Up to 60% less storage footprint
[Diagram: the Enterprise Data Warehouse (DB2 / dashDB / Oracle / Netezza / Teradata …) holds hot data; Hadoop (HDP on Power) holds cold data, archive data and new sources; new data sources (streaming / IoT data) arrive through HDF on Power; BigSQL on Power provides the SQL interface; analytics software (business analytics and visualization such as SAS Grid, SAP HANA, etc.) sits on top; Spectrum Scale spans ESS for Speed and ESS for Data Lake]
Example: A financial services company in Europe is optimizing its DB2 warehouse using a combination of HDP, BigSQL, Power and ESS.

II. Integrated HPC and Hadoop
Efficiently transform data into insights with a single data lake for HPC & Hadoop
Extend HPC to add modern analytics capabilities
- Efficient movement of data between modern and traditional applications with a common namespace
- Spectrum Scale in-place analytics capabilities enable accessing the same data using NFS/SMB/Object/POSIX/HDFS without requiring any modifications to the data
- Improve data reliability and governance with a single data lake
Ingest fast and improve time to insight
- The POSIX interface combined with ESS Flash storage gives very fast ingest
- The common namespace enables running some edge analytics at the ingest layer as well
- POWER servers deliver 1.7x throughput compared to Hortonworks on x86
[Diagram: Spectrum Scale spans ESS for Speed (fast ingest through the POSIX interface) and ESS for Data Lake. Traditional HPC (open, read, write, MPI, C code, Python, etc.) uses the POSIX interface; Hadoop (MapReduce, Spark, ML/DL, etc.) on HDP on Power uses the HDFS interface; a Spectrum Scale protocol node provides the NFS/SMB/Object interface]
Example: NASA and a healthcare company in the Middle East are using a common Spectrum Scale data lake to efficiently get insights using traditional HPC and Hadoop analytics.

III. Hadoop Storage Tiering
Disaggregate storage and compute for better utilization
Use ESS as an ingest tier for an existing Hadoop setup
- Get very fast ingest with POSIX and Flash storage
- Run in-place analytics directly on tier 1 storage
Use ESS as a secondary tier for an existing Hadoop setup
- Grow storage independent of compute
- Reduce cluster sprawl
- Share data between old & new Hadoop setups
- Avoid copying data between the two clusters with a common data lake
- Introduce new IBM Power-based HDP clusters for demanding next-gen analytics workflows on the same data lake
[Diagram: ESS for Speed (fast ingest, POSIX interface) and ESS for Data Lake serve both a new Hadoop cluster (HDP on Power) and the existing Hadoop cluster with its native HDFS storage, each through an HDFS interface]
Example: An Indian conglomerate is adding an ESS-based ingest tier to its existing Hadoop data lake.

IV. Unified Analytics Workflows
Single data lake for Hadoop and non-Hadoop analytics
All analytics workflows on common storage
- Improve data reliability and governance with a single data lake for Hadoop and non-Hadoop analytics setups
- Build ML/DL workflows that use multiple analytics platforms
- Share data across analytics workflows as appropriate
Ingest fast and improve time to insight
- The POSIX interface combined with ESS Flash storage gives very fast ingest
- POWER servers deliver 1.7x throughput compared to Hortonworks on x86
[Diagram: Spectrum Scale spans ESS for Speed (fast ingest, POSIX interface) and ESS for Data Lake. Other analytics platforms (SAS Grid, SAP HANA/Vora, ML/DL, Conductor with Spark, etc.) use the POSIX interface; Hadoop (MapReduce, Spark, ML/DL, etc.) on HDP on Power uses the HDFS interface]
Example: A bank in South Africa is implementing HDP and SAS Grid software on a common ESS-based infrastructure.
