Data storage for AI
What should you consider, how do you build it, and what difference can IBM's infrastructure make?
Talare: Christofer Jensen, Storage Technical Specialist, IBM
Presentationen hölls på Watson Kista Summit 2018
3. What is AI?
• Machine learning
• Deep learning
• Artificial intelligence
In a rule-based "set framework" you "tell the system" the features of what to recognize — four legs, narrow eyes, sharp teeth, a tail, etc. — whereas a learning system infers them from examples and can then "take action".
4. Three ways IT uses data … today
• Procedural (if…then) — "one truth"
• Statistical (big data) — "qualified guess"
• Artificial intelligence — "learning systems"
5. … and in 10 years
• Procedural (if…then)
• Statistical (big data)
• AI
6. Current examples
• Procedural: business as usual, classic / legacy IT — structured processing of plausible, credible data
• Statistical: shopping, profiling, fraud detection … — accumulation of data where 100% precision is not required (e.g. recommendations)
• AI: autonomous driving, image classification, chatbots, gaming … — training data with true and false examples, plus independent test data
7. Why this will happen
The amount of data used grows from procedural to statistical to AI:
• Procedural: manual modeling; legacy systems; structured models
• Statistical: accumulation of examples; data generation; "just store the data"
• AI: automatic modeling; new-generation programmers; automatic consumption; "set the system free"
8. How is data stored?
• Procedural: if…then…else; archive for auditing — structured data
• Statistical: store all data for parallel processing (GB/s) — unstructured data
• Machine learning: train on sample data [1], then offer it for data trade [2] — unstructured + structured data
9. What is important for statistical (big data)
Image: Business over Broadway
• Collected data is analyzed in parallel (GB/s)
• The number of analyses per second is important
• Data must be close to the CPU
• Transaction latency is irrelevant
• Data consistency is irrelevant
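The parallel-analysis point above can be sketched in miniature. The chunked data and the `analyze` kernel are illustrative assumptions, not from the talk; a real big-data stack would use a distributed framework rather than a local thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(chunk):
    # Stand-in for a real analysis kernel (e.g. an aggregation over sensor data).
    return sum(chunk)

# Split the collected data into chunks so they can be analyzed in parallel.
chunks = [list(range(i, i + 1000)) for i in range(0, 10_000, 1000)]

with ThreadPoolExecutor() as pool:
    # Throughput (analyses per second) matters here, not per-item latency.
    results = list(pool.map(analyze, chunks))

print(len(results), sum(results))  # → 10 49995000
```

The same pattern is what makes "data close to the CPU" matter: every worker streams its chunk, so aggregate storage bandwidth, not single-request latency, sets the ceiling.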
10. What is important for machine learning
• Sample data is trained on and then archived
• Short training time = many training cycles = high quality
• The better the data, the better the result
• High throughput at one point in the life cycle [1]
• As low as possible maintenance cost afterwards [2]
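The "short training = many training cycles" bullet can be quantified: with a fixed tuning budget, faster training runs buy proportionally more full cycles. The budget and run times below are illustrative assumptions, not figures from the talk:

```python
TIME_BUDGET_H = 24   # hypothetical hyperparameter-tuning budget, in hours
SLOW_RUN_H = 4       # training-run time when storage throughput is the bottleneck
FAST_RUN_H = 1       # training-run time with high-throughput storage

slow_cycles = TIME_BUDGET_H // SLOW_RUN_H   # 6 full training cycles
fast_cycles = TIME_BUDGET_H // FAST_RUN_H   # 24 full training cycles
print(f"{slow_cycles} vs {fast_cycles} training cycles in the same budget")
```

More cycles in the same budget means more hyperparameter and data variants explored, which is where the quality gain comes from.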
11. Storage requirements summary
Primary:
• High throughput for analysis and training
• Scalable due to high data growth
• Low cost long term storage
Secondary:
• Automated archiving
• Data resiliency
• Availability
How does IBM solve this?
13. The automotive industry generates large amounts of data
• Sensors
• Video
• CAN
• FlexRay
• Radar
• LiDAR
• etc.
Data must be synchronously captured, stored, modified and executed.
14. Dev / test is challenging
• Test drives: 50 TB / day / car
• R&D lab: tagging
• R&D labs: developing & testing
• > 5 PB / car model (project)
• > 200 h / 1 h of driving
16. Major IT Challenges
1. How to implement & operate an efficient storage, workflow and management system ("The Foundation")
2. How to distribute data globally within an enterprise
3. How to preserve digital data for decades
4. How to analyze the data — esp. sensor and video data analytics
5. How to do efficient IT workload and resource scheduling
6. How to embed analytics / data management into the R&D environment
17. Summary — Solution Elements ADAS/AD
• Orchestration: AREMA agents, engine and interfaces (SOAP, REST, OSLC); job management and media portal; test & lab management with linkages to development; manage & control the video & testing workflow
• Intelligence: IBM Video Analytics (IBM Research) for automatic video tagging/labelling
• Test execution: HiL station(s), MiL / SiL and HPC environments; AREMA clients; Elektrobit ADTF and other testing tools; Spectrum Scale client OS
• Archive: IBM Spectrum Protect, IBM Spectrum Archive, LTFS tape library, and others
• Storage & distribution ("the foundation"): IBM Spectrum Scale, IBM Cloud Object Storage
20. Recap
Primary:
• High throughput for analysis and training (GB/s)
• Scalable due to high data growth
• Low cost long term storage
Secondary:
• Automated archiving
• Data resiliency
• Availability
Solution attributes: flexible, commodity components, built-in intelligence, data integrity check, multi-site.
21. First thing to consider: storage virtualisation
Diagram: users, applications, compute and big-data analytics access heterogeneous storage arrays (A, B, C, D) through a virtualisation layer over the SAN / LAN.
• Availability
• Reliability
• Performance
• Ease of use
• Automation
• Consolidation
• Hardware agnostic
• Utilisation
• "Built in AI"
22. IBM Spectrum Storage Family
Runs on FlashSystem, any storage, and private, public or hybrid cloud.
Platform computing: Spectrum LSF, Spectrum Symphony, Spectrum Conductor.
• Analytics-driven data management to reduce costs by up to 50 percent
• Optimized data protection to reduce backup costs by up to 53 percent
• Fast data retention that reduces TCO for active archive data by up to 90%
• Virtualization of mixed environments stores up to 5x more data
• Enterprise storage for cloud deployed in minutes instead of months
• High-performance, highly scalable storage for unstructured data
• Web-scale secure object storage
• Data where and when you need it
• Copy data management for modern IT
23. Spectrum Scale topology
IBM Spectrum Scale presents a single global namespace to users and applications (GB/s).
• Access protocols: POSIX, NFS, SMB/CIFS, HDFS, iSCSI, and OpenStack (Cinder, Swift, Glance, Manila)
• Storage tiers: flash, disk, tape and storage-rich servers, with automated encrypted data placement and data migration
• Transparent cloud tiering and cloud data sharing, on/off premise
• Multi-site operation (sites A, B and C)
24. Spectrum Scale Deployment Options
• Software only: a software license that can be deployed on standard hardware
• Solution bundles: pre-packaged with IBM Spectrum Scale software, Spectrum Scale RAID, I/O servers, drives, support & subscription
• Off-premises: deploy Spectrum Scale in IBM Softlayer (whitepaper); High Performance Computing offerings with Spectrum Scale
30. IBM Spectrum Storage Family (recap of slide 22)
31. File storage vs. object storage
File storage:
• Stores hundreds of millions of files
• File system hierarchy
• Can be complex to scale
• Best for file-based workflows
• I/O performance, low-latency access
• Structured to be understood by humans
• File system maintains metadata
Object storage:
• Stores hundreds of billions of objects
• One storage pool, object IDs
• Scales uniformly
• Low TCO
• High-latency access
• Structured to be understood by applications
• Application maintains metadata
32. What is object storage?
An application PUTs data into the store and receives an object ID (1); it later GETs the data back by presenting that ID (2), typically through an S3-style API.
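The put/get flow above can be sketched with a toy in-memory store. `MiniObjectStore` and its methods are illustrative constructs, not a real S3 client; the point is only the semantics: one flat pool, store-assigned IDs, metadata left to the application:

```python
import uuid

class MiniObjectStore:
    """Toy in-memory object store illustrating PUT/GET by object ID."""

    def __init__(self):
        self._pool = {}  # one flat pool: object ID -> bytes (no hierarchy)

    def put(self, data: bytes) -> str:
        object_id = uuid.uuid4().hex  # the store assigns the ID, not the caller
        self._pool[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._pool[object_id]

store = MiniObjectStore()
oid = store.put(b"sensor frame 0001")  # step 1: PUT returns an object ID
data = store.get(oid)                  # step 2: GET the data back by ID
print(data)  # → b'sensor frame 0001'
```

Contrast this with file storage from slide 31: there is no path hierarchy to traverse and the store keeps no metadata beyond the ID, which is what lets the pool scale uniformly to billions of objects.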