2. Forward-Looking Statements
During our meeting today we will make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or
circumstances is a forward-looking statement, including those relating to market growth, industry
trends, future products, product performance and product capabilities. This presentation also
contains forward-looking statements attributed to third parties, which reflect their projections as of the
date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due
to a number of risks and uncertainties, including the factors detailed under the caption “Risk Factors”
and elsewhere in the documents we file from time to time with the SEC, including our annual and
quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as
of the date hereof or as of the date of issuance by a third party, as the case may be.
5. Old Model
Monolithic, large upfront investments, and fork-lift upgrades
Proprietary storage OS
Costly: $$$$$
New SD-AFS Model
Disaggregate storage, compute, and software for better scaling and costs
Best-in-class solution components
Open source software - no vendor lock-in
Cost-efficient: $
Software-defined All-Flash Storage
7. Storage performance is hugely affected by seemingly small details
All HW is not equal – switches, NICs, HBAs, and SSDs all matter
• Driver abstraction doesn't hide dynamic device behavior
All SW is not equal – distro, patches, drivers, and configuration all matter
• There is typically a large delta between "default" and "tuned" system performance
What's a user to do?
Software-Defined Storage – what's NOT new
9. The InfiniFlash™ System
64-512 TB JBOD of flash in 3U
Up to 2M IOPS, <1ms latency, up to 15 GB/s throughput
Energy efficient: ~400W power draw
Connects up to 8 servers
Simple yet scalable
12. InfiniFlash™
8TB Flash-Card Innovation
• Enterprise-grade, power-fail safe
• Latching integrated & monitored
• Directly samples air temperature
• New flash form factor, not SSD-based
Non-disruptive Scale-Up & Scale-Out
• Capacity on demand
• Serves high-growth Big Data
• 3U chassis starting at 64TB, up to 512TB
• 8 to 64 8TB flash cards (SAS)
• Compute on demand
• Serves dynamic apps without IOPS/TB bottlenecks
• Add up to 8 servers
17. InfiniFlash IF550 (HW + SW)
Ultra-dense High Capacity Flash storage
Highly scalable performance
Cinder, Glance and Swift storage
Enterprise-Class storage features
Ceph Optimized for SanDisk flash
18. InfiniFlash SW + HW Advantage
Software Storage System
Software tuned for hardware
• Extensive Ceph modifications
Hardware configured for software
• Density, power, architecture
Ceph has over 50 tuning parameters that together yield a 5x-6x performance improvement (a sketch of the kind of options involved follows below).
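The exact parameter set is SanDisk's, but a minimal sketch of what flash-oriented Ceph tuning looks like is shown below; the option names are Jewel-era upstream Ceph options, and the values are illustrative placeholders rather than the tuned settings referenced in this deck.

```python
# Illustrative only: a handful of Jewel-era Ceph options commonly raised for
# all-flash deployments. Option names come from upstream Ceph documentation of
# that era; the values are placeholders, not SanDisk's tuned settings.
import configparser

flash_tuning = {
    "osd_op_num_shards": "8",              # more sharded op queues per OSD
    "osd_op_num_threads_per_shard": "2",   # worker threads per shard
    "filestore_op_threads": "8",           # parallel FileStore operations
    "filestore_max_sync_interval": "10",   # seconds between syncfs() passes
    "journal_max_write_entries": "1000",   # bigger journal write batches
    "ms_tcp_nodelay": "true",              # keep Nagle's algorithm disabled
}

conf = configparser.ConfigParser()
conf["osd"] = flash_tuning
with open("ceph-flash-tuning.conf", "w") as f:
    conf.write(f)  # writes an [osd] section that could be merged into ceph.conf
```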
19. IF550 - Enhancing Ceph for Enterprise Consumption
• SanDisk Ceph Distro packages stable, production-ready code with consistent quality
• All Ceph performance improvements developed by SanDisk are contributed back to the community
SanDisk / Red Hat or community distribution
Out-of-the-box configurations tuned for performance with flash
Sizing & planning tool
InfiniFlash drive management integrated into Ceph management
Ceph installer built for InfiniFlash
High-performance iSCSI storage
Log collection tool
Enterprise-hardened SW + HW QA
21. Ceph and SanDisk
Started working with Ceph over 2.5 years ago (Dumpling)
Aligned on a vision of scale-out enterprise storage
• Multi-protocol design
• Cluster / cloud oriented
• Open source commitment
SanDisk's engagement with Ceph
• Flash levels of performance
• Enterprise quality
• Support tools for our product offerings
22. Optimising Ceph for the All-Flash Future
Ceph was optimized for HDD
• Both tuning AND algorithm changes are needed for flash optimization
Quickly determined that the OSD was the major bottleneck
• An OSD maxed out at about 1,000 IOPS on the fastest CPUs (using ~4.5 cores)
Examined and rejected multiple OSDs per SSD
• Failure domains / CRUSH rules would be a nightmare
23. SanDisk: OSD Read Path Optimisation
Context switches matter at flash rates
• Too much "put it in a queue for another thread" and lock contention
Socket handling matters too
• Too many "get 1 byte" calls to the kernel for sockets
• Disable Nagle's algorithm to shorten operation latency (see the sketch after this list)
Lots of other simple things
• Eliminate repeated look-ups, redundant string copies, etc.
Contributed improvements to the Emperor, Firefly and Giant releases
Now obtain over 200K IOPS per OSD using around 12 CPU cores per OSD (Jewel)
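Disabling Nagle's algorithm and avoiding tiny kernel reads are generic socket-level changes; a minimal Python sketch of both ideas (not Ceph's messenger code) follows.

```python
# Generic illustration (not Ceph's messenger code) of two of the points above:
# disable Nagle's algorithm so small messages go out immediately, and read in
# large chunks instead of issuing many tiny recv() calls to the kernel.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # turn Nagle off

def read_exact(conn, nbytes, bufsize=64 * 1024):
    """Read exactly nbytes using large recv() calls rather than 1-byte reads."""
    chunks, remaining = [], nbytes
    while remaining > 0:
        chunk = conn.recv(min(bufsize, remaining))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```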
24. SanDisk: OSD Write Path Optimization
The write path strategy was classic HDD
• Journal writes for minimum foreground latency
• Process the journal in batches in the background
This is inefficient for flash
Modified the buffering/writing strategy for flash (Jewel!)
• 2.5x write throughput, and average latency is half that of Hammer (a generic sketch of the journal-batching pattern follows)
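For readers unfamiliar with the pattern being replaced, here is a minimal, generic sketch of journal-then-batch writeback; it illustrates the HDD-era strategy described above and is not Ceph's actual code.

```python
# Generic sketch of journal-then-batch writeback (not Ceph code): each write
# is appended to a fast journal and acknowledged, then drained to the backing
# store in batches. On HDD this hides seek latency; on flash the extra copy
# mostly adds write amplification, which is why the Jewel-era work changed
# the buffering strategy.
from collections import deque

class BatchedWriter:
    def __init__(self, backend, batch_size=64):
        self.backend = backend            # object exposing write(offset, data)
        self.journal = deque()            # stand-in for the on-disk journal
        self.batch_size = batch_size

    def submit(self, offset, data):
        self.journal.append((offset, data))   # fast append; caller is acked here
        if len(self.journal) >= self.batch_size:
            self.flush()

    def flush(self):
        while self.journal:               # drain the journal in one background pass
            offset, data = self.journal.popleft()
            self.backend.write(offset, data)
```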
26. Test Configuration
2x InfiniFlash systems, 256TB each
8x OSD nodes
• 2x E5-2697 v3 (14C, 2.6GHz), 8x 16GB DDR4 ECC 2133MHz
• 1x Mellanox ConnectX-3 dual-port 40GbE
• Ubuntu 14.04.02 LTS 64-bit
8-10 client nodes
Ceph version: sndk-ifos-1.3.0.317, based on Ceph 10.2.1 (Jewel)
27. IFOS Block IOPS Performance
Highlights
• 4K numbers are CPU bound; an increase in server CPU will improve IOPS by ~11%
• At 64K and larger block sizes, bandwidth is close to the raw box bandwidth
• 256K random read numbers can be increased further with more clients; >90% drive saturation is achievable with 14 clients
Random Read
  4K:   1,521,231 IOPS, ~6 GB/s
  64K:    347,628 IOPS, ~22 GB/s
  256K:    82,456 IOPS, ~21 GB/s
Random Write (on a 2x copy configuration)
  4K:   201,465 IOPS, ~0.8 GB/s
  64K:   55,648 IOPS, ~3.5 GB/s
  256K:  16,289 IOPS, ~4.1 GB/s
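As a sanity check on the reconstruction of the chart above, bandwidth is roughly IOPS multiplied by block size; a few lines of Python reproduce the GB/s column from the IOPS column.

```python
# Consistency check for the read numbers above: bandwidth ≈ IOPS × block size.
for block_kib, iops in [(4, 1_521_231), (64, 347_628), (256, 82_456)]:
    gb_per_s = iops * block_kib * 1024 / 1e9
    print(f"{block_kib}K random read: {gb_per_s:.1f} GB/s")
# Prints ~6.2, ~22.8 and ~21.6 GB/s, matching the ~6 / ~22 / ~21 GB/s chart values.
```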
29. IFOS Block Workload Latency Performance
Environment
• librbd IO read latency measured on the Golden Config with 2-way replication at host level, 8 OSD nodes, IO duration 20 min
• fio read IO profile: 64K block, 2 numjobs, iodepth 16, 10 clients (each client with one RBD)
• fio write IO profile: 64K block, 2 numjobs, iodepth 16, 10 clients (each client with one RBD)

64K Random Read: average latency ~1.7 ms; 99% of the IOs complete within 5 ms
  Latency range (µs)   Percentile
  500                   2.21
  750                   0.22
  1,000                 7.43
  2,000                62.72
  4,000                26.11
  10,000                1.27
  20,000                0.03
  50,000                0.01

64K Random Write: average latency ~6.3 ms
  Latency range (µs)   Percentile
  1,000                 0.33
  2,000                43.16
  4,000                28.11
  10,000               21.00
  20,000                3.31
  50,000                2.17
  100,000               1.35
  250,000               0.40
  500,000               0.11
  750,000               0.04
  1,000,000             0.01
  2,000,000             0.01

99% IOPS: 178,367.31 and 227,936.61 (as reported under the read and write histograms, respectively)
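The average-latency figures can be cross-checked against the bucket percentiles; the short calculation below weights each bucket's midpoint (a rough estimate, since only bucket upper bounds are given).

```python
# Rough cross-check: estimate mean latency from the percentile histograms above
# by weighting each bucket's midpoint. Buckets are (upper bound in µs, percentile).
def approx_mean_us(buckets):
    prev, total = 0.0, 0.0
    for upper, pct in buckets:
        total += (prev + upper) / 2.0 * pct / 100.0
        prev = upper
    return total

read_64k = [(500, 2.21), (750, 0.22), (1000, 7.43), (2000, 62.72),
            (4000, 26.11), (10000, 1.27), (20000, 0.03), (50000, 0.01)]
write_64k = [(1000, 0.33), (2000, 43.16), (4000, 28.11), (10000, 21.0),
             (20000, 3.31), (50000, 2.17), (100000, 1.35), (250000, 0.4),
             (500000, 0.11), (750000, 0.04), (1000000, 0.01), (2000000, 0.01)]

print(f"64K read  ~ {approx_mean_us(read_64k) / 1000:.1f} ms")   # ~1.9 ms vs. 1.7 ms quoted
print(f"64K write ~ {approx_mean_us(write_64k) / 1000:.1f} ms")  # ~6.8 ms vs. 6.3 ms quoted
```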
31. IFOS Object Performance
Erasure coding provides protection equivalent to 3x replication with only ~1.2x storage
Object performance is on par with block performance
Higher node counts = wider EC ratios = more storage savings (see the worked overhead calculation below)
• Replication configuration
• OSD-level replication with 2 copies
• Erasure coding configuration
• Node-level erasure coding with Cauchy-Good 6+2
• Cauchy-Good is better suited to InfiniFlash than Reed-Solomon
[Chart: 4M Object Throughput (GB/s) – Erasure Coding (6+2) vs. Replication (2x), read and write]
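The overhead claim follows from the raw-to-usable ratio (k+m)/k for a k+m erasure-coded layout; a quick comparison is below. Note that 6+2 works out to about 1.33x, while a wider layout such as 10+2 (used here purely for illustration) reaches the ~1.2x figure on larger clusters.

```python
# Raw-to-usable storage overhead: N-way replication costs Nx, while a k+m
# erasure-coded layout costs (k+m)/k and still tolerates m failures.
def ec_overhead(k, m):
    return (k + m) / k

for k, m in [(6, 2), (10, 2)]:
    print(f"EC {k}+{m}: {ec_overhead(k, m):.2f}x raw per usable byte")
print("2x replication: 2.00x, 3x replication: 3.00x")
# EC 6+2 -> 1.33x and EC 10+2 -> 1.20x: wider stripes (which need more nodes)
# cut overhead further while still tolerating two failures, which is the
# "higher node clusters = more storage savings" point above.
```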
32. IF550 Reference Configurations

Small Block I/O
• Small:  2x IF150, 128TB to 256TB flash per enclosure using the performance card (4TB); 1 OSD server per 4-8 cards (dual E5-2680, 64GB RAM)
• Medium: 2x IF150, 128TB to 256TB flash per enclosure using the performance card (4TB); 1 OSD server per 4-8 cards (dual E5-2687, 128GB RAM)
• Large:  2+ IF150, 128TB to 256TB flash per enclosure using the performance card (4TB); 1 OSD server per 4-8 cards (dual E5-2697+, 128GB RAM)

Throughput
• Small:  2x IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 16 cards (dual E5-2660, 64GB RAM)
• Medium: 2x IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 16 cards (dual E5-2680, 128GB RAM)
• Large:  2+ IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 16 cards (dual E5-2680, 128GB RAM)

Mixed
• Small:  2x IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 8-16 cards (dual E5-2680, 64GB RAM)
• Medium: 2x IF150, 128TB to 512TB flash per enclosure; 1 OSD server per 8-16 cards (dual E5-2690+, 128GB RAM)
• Large:  2+ IF150, 128TB to 256TB flash per enclosure (optional performance card); 1 OSD server per 8-16 cards (dual E5-2695+, 128GB RAM)
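A small, purely hypothetical helper (not a SanDisk tool; the function name and ratio table are made up for illustration) showing how the cards-per-server ratios in the table translate into OSD server counts.

```python
# Hypothetical sizing helper mirroring the ratios in the table above
# (illustrative only, not a SanDisk tool): small-block profiles use 4-8 cards
# per OSD server, throughput profiles up to 16, mixed profiles 8-16.
import math

CARDS_PER_OSD_SERVER = {"small_block": 8, "throughput": 16, "mixed": 16}

def osd_servers_needed(total_capacity_tb, card_tb=8, workload="throughput"):
    cards = math.ceil(total_capacity_tb / card_tb)
    return math.ceil(cards / CARDS_PER_OSD_SERVER[workload])

# Example: a fully populated 512TB enclosure of 8TB cards, throughput profile.
print(osd_servers_needed(512, card_tb=8, workload="throughput"))  # -> 4
```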
33. InfiniFlash TCO Advantage
Reduce the replica count from 3 to 2
Less compute, less HW and SW
• TCO analysis based on a US customer's OPEX and cost data for a 5PB deployment
[Charts: 5-Year TCO (CAPEX + OPEX, in $ millions), Data Center Racks, and Total Energy Cost (in $ thousands), each comparing InfiniFlash, External AFA, DAS SSD node and DAS 10K HDD node]
37. SanDisk: Potential Future Improvements
RDMA intra-cluster communication
• Significant reduction in CPU per IOP
BlueStore
• Significant reduction in write amplification -> even higher write performance
Memory allocation
• tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs. default*
Erasure coding for blocks (native)
* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
38. Time to Fix the Write Path Algorithm
Review of FileStore
• What's wrong with FileStore
• XFS + LevelDB
• Missing transactional semantics for metadata and data
• Missing virtual-copy and merge semantics
• The BTRFS implementation of these isn't general enough
  – Snapshot/rollback overhead is too expensive for frequent use
  – Transaction semantics aren't crash-proof
• Bad write amplification
• Bad jitter due to an unpredictable file system
• Bad CPU utilization; syncfs is VERY expensive
39. BlueStore
One, two or three raw block devices
• Data, metadata/WAL and KV journaling
• When combined, no fixed partitioning is needed
Use a single transactional KV store for all metadata
• Semantics are well matched to ObjectStore transactions
Use a raw block device for data storage
• Supports flash, PMR and SMR HDD
[Diagram: BlueStore architecture – client/gateway operations and peer-to-peer cluster management arrive via the network interface and an operation decoder; behind the ObjectStore interface, BlueStore keeps metadata in a KeyValueDB (running on BlueFS) with a journal, and writes data directly to the raw device]
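A minimal sketch of how the one/two/three-device split can be expressed; the osd_objectstore and bluestore_block_* option names follow upstream Ceph, while the device paths are placeholders for this example.

```python
# Sketch of the one/two/three-device split. Option names follow upstream Ceph
# BlueStore; the device paths below are placeholders for this example.
bluestore_fragment = """
[osd]
osd_objectstore = bluestore
bluestore_block_path     = /dev/disk/by-partlabel/osd0-data   # bulk object data
bluestore_block_db_path  = /dev/disk/by-partlabel/osd0-db     # KV metadata store
bluestore_block_wal_path = /dev/disk/by-partlabel/osd0-wal    # write-ahead log
"""
# Leaving the db/wal paths unset keeps everything on the data device, with no
# fixed partitioning between data and metadata (the "combined" case above).
print(bluestore_fragment)
```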
40. BlueStore vs FileStore
Test setup: 1x 800GB P3700 card (4 OSDs per card), 64GB RAM, 2x Intel Xeon E5-2650 v3 @ 2.30GHz, 1x Intel 40GbE link
Client fio processes and the mon ran on the same nodes as the OSDs.
41. KV Store Options
RocksDB is a Facebook extension of LevelDB
– Log-Structured Merge (LSM) based
– Ideal when metadata is on HDD
– Merge is effectively host-based GC when run on flash
ZetaScale™ from SanDisk®, now open sourced
– B-tree based
– Ideal when metadata is on flash
– Uses device-based GC for maximum performance
42. BlueStore ZetaScale vs. RocksDB Performance
Test setup: 1 OSD, 8TB SAS SSD, 10GB RAM, Intel Xeon E5-2680 v2 @ 2.80GHz, fio, 32 threads, iodepth 64, 6TB dataset, 30 min runs

Random 4K read/write IOPS per OSD (in thousands):
  Read/Write ratio    BlueStore (RocksDB)   BlueStore (ZetaScale)
  0/100 (all write)         0.436                 0.95
  70/30                     1.005                 2.83
  100/0 (all read)          3.970                 9.29
43. The InfiniFlash™ System ...
Power: 70% less
Speed: 40x faster than SAN
Density: 10x higher
Reliability: 20x better AFR
Cost: up to 80% lower TCO
Video continues to drive the need for storage, and Point-Of-View cameras like GoPro are producing compelling high resolution videos on our performance cards. People using smartphones to make high resolution videos choose our performance mobile cards also, driving the need for higher capacities.
There is a growing customer base for us around the world, with one billion additional people joining the Global Middle Class between 2013 and 2020. These people will use smart mobile devices as their first choice to spend discretionary income on, and will expand their storage using removable cards and USB drives.
We are not standing still, but creating new product categories to allow people to expand and share their most cherished memories.
___________________________________________________________
E.g., if you're using it for small blocks, you need more CPUs; large objects, however, can use fewer servers. It's your choice how you want to deploy it.
All of these are listed as various inefficiencies. Originally about 10K IOPS before doing all the optimisations.
Ran a 1PB test on Hammer, scaling in 256TB steps. Scaling was almost linear.
You can’t get these numbers easily elsewhere.
Typically latency is around 10ms for reads and 20-40ms for writes, even on flash!
EC is a customer-configurable option. There are a lot more writes with 3-copy replication.
SanDisk is working on EC for block; right now it is object only.
The point here is flash is about the same as HDD
FileStore is the existing backend storage for Ceph and has many deficits. BlueStore is the new architecture. This is a preview: a tech preview for the rest of this year. By the L release it will be production.
We will switch to a KV store and get rid of the journal. A journal incurs too many writes, almost 4x.
The KV store will be RocksDB, but SanDisk will introduce a flash-optimised KV store based on ZetaScale.
Today's flash solutions and arrays can address most of these problems: they are low power, high performance, somewhat scalable (though not to tens of PBs) and highly reliable. But one thing holds them back, something missing: favorable economics. Flash is simply way too expensive for at-scale workloads, putting it out of reach. So we went to work as a team. Our first investment was the best investment we ever made: a very clean sheet of paper. We knew that today's HDD-based storage solutions and today's all-flash arrays would not do the trick. We had to create something brand new that looks like nothing the world has seen.
Substantiation
Low power: 20 enclosures down to 2, at 100 watts per enclosure and 24 drives per enclosure (9W per HDD, 7W per SSD) = ~93% power reduction, or roughly 1/16 the power (from his TCO calculator).
• HDD: 480 drives at 9W + 20 enclosures at 100W = 4,320W + 2,000W = 6,320W
• SSD: 46 SSDs at 7W (176TB) + 2 enclosures at 100W = 536W
Extreme performance: 30x faster NoSQL transactions (MongoDB Solution Brief)
Scalable: 4,500 virtual desktops in one rack (Fusion ioVDI and VMware Horizon View: Reference Architecture for VDI)
Reliable: Accelerate Oracle Backup Using SanDisk Solid State Drives (SSDs)
Breakthrough economics: ~3x faster Hadoop jobs with half the servers (Increasing Hadoop Performance with SanDisk SSDs)