2. Agenda
• About Ambedded
• What are the issues of using a single server node with multiple Ceph OSDs?
• The 1 microserver : 1 OSD architecture
• The benefits
• The basic high-availability Ceph cluster
• Scale it out
• Why the network matters
• How fast a failed OSD self-heals on this architecture
• How much energy we save
• An easy way to run a Ceph cluster
3. About Ambedded Technology
• 2013: Founded in Taipei, Taiwan; office in the National Taiwan University Innovation Center
• 2014: Launched the Gen 1 microserver-architecture storage server; demoed at the ARM Global Partner Meeting in Cambridge, UK
• 2015: Partnership with a European customer for their cloud storage service; 1,800+ microservers (5.5 PB) installed and in operation since 2014
• 2016: Launched the first-ever Ceph storage appliance powered by the Gen 2 ARM microserver; awarded 2016 Best of INTEROP Las Vegas storage product, beating VMware Virtual SAN
4. Issues of Using a Single Server Node with Multiple Ceph OSDs
• The smallest failure domain is the set of OSDs inside a server: one server failure takes many OSDs down.
• CPU utilization is only 30-40% when the network is saturated; the bottleneck is the network, not computing.
• Power consumption and thermal output eat your money.
5. High Availability: Minimize the HW Failure Domain
[Diagram: an ARM microserver cluster, where each disk is served by its own microserver behind 40Gb uplinks, compared with traditional x86 servers (Traditional Server #1-#3) behind 10Gb uplinks, both serving Client #1 and Client #2 over the network. On the microserver side, if 1 microserver fails, ONLY 1 disk fails; on the x86 side, if 1 motherboard fails, a whole bunch of disks fail.]
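The difference in blast radius can be sketched with some simple arithmetic. The topology below is an illustrative assumption (three 8-node microserver chassis, i.e. 24 OSDs, versus three x86 servers with 8 OSDs each), not a configuration stated on the slide:

```python
# Fraction of a Ceph cluster's OSDs lost to a single hardware failure.
# Assumed topology (for illustration): 3 chassis x 8 microservers with
# 1 OSD each, vs. 3 x86 servers with 8 OSDs each.

def osds_lost_fraction(osds_per_failure_unit, total_osds):
    """Share of the cluster that goes down when one failure unit dies."""
    return osds_per_failure_unit / total_osds

# One-to-one architecture: the failure unit is a single microserver/OSD.
micro = osds_lost_fraction(1, 3 * 8)
# Traditional architecture: the failure unit is a whole server node.
x86 = osds_lost_fraction(8, 3 * 8)

print(f"microserver failure: {micro:.1%} of OSDs down")  # 4.2%
print(f"x86 server failure:  {x86:.1%} of OSDs down")    # 33.3%
```

Losing a third of the OSDs at once forces a large rebalance and risks data unavailability; losing one OSD is routine for Ceph.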
6. The Benefits of the One-to-One Architecture
• Truly no single point of failure.
• The smallest failure domain is one OSD.
• The MTBF of a microserver is much higher than that of an all-in-one motherboard.
• Dedicated hardware resources give a stable OSD service.
• Aggregated network bandwidth with failover.
• Low power consumption and cooling cost.
• OSD, MON, and gateway all run in the same boxes.
• 3 units form a high-availability cluster.
7. The Basic High Availability Cluster
Basic configuration: three chassis, each running 1x MON and 7x OSD. Scale out the cluster by adding more chassis.
9. Scale Out Unified Virtual Storage
More OSDs, more capacity, more performance.
[Chart: capacity and performance both grow as chassis are added, from 3x to 6x Mars 200, in both all-HDD and all-flash (SSD) configurations.]
10. Network Does Matter!
Test setup: 16x OSD on HGST He10 10TB HDDs; uplink increased from 20Gb to 40Gb.

                 20Gb uplink         40Gb uplink         IOPS
4K write         BW (MB/s)  IOPS     BW (MB/s)  IOPS     increase
1 client         7.2        1,800    11         2,824    57%
2 clients        13         3,389    20         5,027    48%
4 clients        22         5,570    35         8,735    57%
10 clients       39         9,921    60         15,081   52%
20 clients       53         13,568   79         19,924   47%
30 clients       63         15,775   90         22,535   43%
40 clients       68         16,996   96         24,074   42%

The purpose of this test is to measure how much performance improves when the uplink bandwidth is increased from 20Gb to 40Gb (the Mars 200 has 4x 10Gb uplink ports). The results show a 42-57% improvement in IOPS.
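The percentage column can be reproduced directly from the raw IOPS numbers in the table:

```python
# Reproduce the "IOPS increase" column of the 20Gb vs. 40Gb uplink test.
results = {  # clients: (IOPS at 20Gb uplink, IOPS at 40Gb uplink)
    1: (1_800, 2_824),
    2: (3_389, 5_027),
    4: (5_570, 8_735),
    10: (9_921, 15_081),
    20: (13_568, 19_924),
    30: (15_775, 22_535),
    40: (16_996, 24_074),
}

def iops_gain(before, after):
    """Relative IOPS improvement from the faster uplink."""
    return (after - before) / before

for clients, (b, a) in results.items():
    print(f"{clients:>2} clients: +{iops_gain(b, a):.0%}")
```

Note the gain shrinks as the client count rises: with 40 clients the 40Gb uplink itself starts to saturate, so the bottleneck moves back toward the storage nodes.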
11. Self-Healing Is Much Faster and Safer than RAID

Test method                    Ambedded Ceph storage             Disk array (RAID)
# of HDDs / capacity           16x HDD, 10TB each                16x HDD, 3TB each
Data protection                Replica = 2                       RAID 5
Data stored on each HDD        3TB                               Irrelevant: the whole
                                                                 disk is rebuilt regardless
Time to recover after          5 hours 10 minutes                41 hours
losing ONE HDD                 (system keeps working normally)
Operator involvement           Auto-detection and self-healing;  Someone has to replace
                               autonomic                         the damaged disk
Data recovery method           Self-healing: the cluster         The whole replacement disk
                               regenerates data copies from      must be rebuilt from the
                               the remaining healthy disks       entire disk array
Recovery time vs. # of disks   More disks, SHORTER time          More disks, LONGER time
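The last row follows from how the two systems recover. A toy model (my own sketch with an assumed per-disk recovery rate, not Ambedded's measured numbers) makes the scaling visible: Ceph re-replicates the lost OSD's data in parallel across all surviving disks, while RAID 5 must write one entire replacement disk at a single disk's speed.

```python
def ceph_recovery_hours(data_tb, surviving_disks, per_disk_mb_s=50):
    """Ceph backfill: lost replicas are regenerated in parallel, so
    aggregate recovery bandwidth grows with the disk count.
    per_disk_mb_s is an assumed per-disk recovery rate."""
    aggregate_mb_s = surviving_disks * per_disk_mb_s
    return data_tb * 1e6 / aggregate_mb_s / 3600

def raid5_rebuild_hours(disk_tb, write_mb_s=50):
    """RAID 5: the whole replacement disk is written at one disk's
    speed, regardless of how much data it actually held."""
    return disk_tb * 1e6 / write_mb_s / 3600

print(f"Ceph, 3TB across 15 surviving disks: {ceph_recovery_hours(3, 15):.1f} h")
print(f"RAID 5 rebuild of one 3TB disk:      {raid5_rebuild_hours(3):.1f} h")
# Doubling the Ceph cluster roughly halves recovery time:
print(f"Ceph, 3TB across 31 surviving disks: {ceph_recovery_hours(3, 31):.1f} h")
```

The model also shows why bigger disks make RAID worse: rebuild time grows linearly with disk capacity, while Ceph's recovery time depends on the data actually stored divided by the whole cluster's recovery bandwidth.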
12. Mars 200: 8-Node ARM Microserver Cluster
• 8x hot-swappable microservers, each with:
  - 1.6GHz dual-core ARM v7 CPU
  - 2GB DRAM
  - 8GB flash
  - 5 Gbps LAN
  - 5 W power consumption
• Storage: 8x hot-swappable SATA3 HDDs/SSDs plus 8x SATA3 journal SSDs
• 300 W redundant power supply
• Out-of-band BMC port
• Dual hot-swappable uplink switches: 4x 10 Gbps total, SFP+/10G Base-T combo ports
13. Green Storage: Saving More in Operation
(320 W - 60 W) x 24 h x 365 days / 1000 x $0.2/kWh x 40 units x 2 (power + cooling)
= ~USD 36,000 per rack per year
This electricity cost is based on the Taiwan rate; it could be double or triple in Japan or Germany.
Note: the x86 comparison system is a SuperMicro 6028R-WTRT server with 8x 3.5" HDD bays.
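The savings formula on this slide can be checked step by step, using only the slide's own inputs (260 W per-unit delta, 40 units per rack, $0.2/kWh, and a 2x factor for cooling):

```python
# Annual electricity-cost saving per rack, per the slide's assumptions.
delta_watts = 320 - 60        # x86 server vs. ARM microserver unit
hours_per_year = 24 * 365
rate_usd_per_kwh = 0.2        # Taiwan electricity rate (per the slide)
units_per_rack = 40
cooling_factor = 2            # every watt saved also saves cooling power

kwh_saved_per_unit = delta_watts * hours_per_year / 1000  # 2,277.6 kWh
saving = kwh_saved_per_unit * rate_usd_per_kwh * units_per_rack * cooling_factor
print(f"${saving:,.0f} per rack per year")  # → $36,442, i.e. ~USD 36,000
```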
15. What You Can Do with UVS Manager
• Deploy OSDs, MONs, and MDSs
• Create pools, RBD images, iSCSI LUNs, and S3 users
• Supports replicated pools (replica 1-10) and erasure-coded pools (K+M)
• Manage OpenStack backend storage
• Create CephFS
• Snapshot, clone, and flatten images
• Configure the CRUSH map
• Scale out the cluster
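UVS Manager drives standard Ceph administration under the hood. As a rough sketch, the equivalent operations in the stock ceph/rbd CLI look like this (pool, profile, and image names are made-up examples, and the commands assume a running cluster with admin credentials):

```shell
# Create a replicated pool (128 placement groups) and set replica count to 3
ceph osd pool create rbd_pool 128 128 replicated
ceph osd pool set rbd_pool size 3

# Create an erasure-coded pool with a K+M = 4+2 profile
ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create ec_pool 128 128 erasure ec42

# Create an RBD image, then snapshot, clone, and flatten it
rbd create rbd_pool/vol01 --size 10240          # size in MB
rbd snap create rbd_pool/vol01@snap1
rbd snap protect rbd_pool/vol01@snap1           # clones require a protected snapshot
rbd clone rbd_pool/vol01@snap1 rbd_pool/vol01-clone
rbd flatten rbd_pool/vol01-clone                # detach the clone from its parent
```

The appliance's value proposition is that UVS Manager performs these operations from a web UI, so no command-line Ceph knowledge is required.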