4. Evolution of Virtualization
Resources Disaggregation (True utility Cloud)
Flexible Resources Management
Basic (Cloud)
Consolidation
No virtualization
© 2012 SAP AG. All rights reserved. 4
5. Why Disaggregate Resources?
Better Performance
Replacing slow local devices (e.g. disk) with fast remote devices (e.g. DRAM).
Many remote devices working in parallel (e.g. DRAM, disk, compute)
Superior Scalability
Going beyond boundaries of the single node
Improved Economics
Do more with existing hardware
Reach better hardware utilization levels
6. The Hecatonchires Project
Hecatonchires: the “Hundred-Handed Ones”
Original idea: provide Distributed Shared Memory (DSM) capabilities to the cloud
Strategic goal: full resource liberation brought to the cloud by:
Providing more resource flexibility to the current cloud paradigm by breaking nodes down into their basic elements (CPU, memory, I/O)
Extending the existing cloud software stack (KVM, Qemu, libvirt, OpenStack) without degrading any existing capabilities
Using commodity cloud hardware: medium-sized hosts (typically 64 GB and 8/16 cores) and standard interconnects (such as 1 GbE or 10 GbE)
Initiated by Benoit Hudzia in 2011. Currently developed by two
small teams of researchers from the TI Practice located in
Belfast and Ra’anana
7. High Level Architecture
No special hardware is required beyond RDMA-enabled NICs, which support the low-overhead, low-latency communication layer
VMs are no longer bounded by host size, as resources such as memory, I/O and compute can be aggregated
Different-sized VMs can share the infrastructure, so smaller VMs not requiring dedicated hosts are still supported
The application stack runs unmodified
[Diagram: guest VMs (App/OS) spanning Server #1 … Server #n, each server contributing CPUs, memory and I/O, linked by fast RDMA communication]
8. The Team - Panoramic View
10. CPUs stopped getting faster
Moore’s law prevailed until 2003, when core speed hit a practical limit of about 3.4 GHz
In data centers cores are even slower, running at 2.0 - 2.8 GHz for power conservation reasons
Since 2000 you do get more cores, but that does not affect compute cycle and compute instruction latencies
Effectively, arbitrary sequential algorithms have not gotten faster since
Source: http://www.intel.com/pressroom/kits/quickrefyr.htm
11. DRAM latency has remained constant
CPU clock speed and memory bandwidth
increased steadily (at least until 2000)
But memory latency remained constant – so
local memory has gotten slower from the CPU
perspective
Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
12. Disk latency has virtually not improved
A 1980s standard disk spins at 3,600 RPM; a 2010s standard disk spins at 7,200 RPM
A 2x speedup in 30 years is negligible: effectively, disk has become slower from the CPU perspective
Average rotational latency (ms) by spindle speed (RPM):
3,600 → 8.3 | 4,200 → 7.1 | 4,500 → 6.7 | 4,900 → 6.1 | 5,200 → 5.8
5,400 → 5.6 | 7,200 → 4.2 | 10,000 → 3.0 | 12,000 → 2.5 | 15,000 → 2.0
Source: Panda et al., Supercomputing 2009
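The latency figures above follow directly from spindle speed: on average the head waits half a revolution, so average latency in ms is 0.5 × 60,000 / RPM. A quick sketch:

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average rotational latency: the time for half a revolution, in ms."""
    return 0.5 * 60_000 / rpm

for rpm in (3_600, 7_200, 15_000):
    print(rpm, round(avg_rotational_latency_ms(rpm), 1))
# 3600 -> 8.3, 7200 -> 4.2, 15000 -> 2.0, matching the figures above
```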
13. But: Networks are Steadily Getting Faster
Since 1979 we went from 0.01 Gbit/s up to 64 Gbit/s, a 6,400x speedup
A competitive marketplace:
10 and 40 Gbps Ethernet, which originated from network interconnects
40 Gbps QDR InfiniBand, which originated from computer internal bus technology
InfiniBand/Ethernet convergence:
Virtual Protocol Interconnects
InfiniBand over Ethernet
RDMA over Converged Enhanced Ethernet
Using standard semantics defined by OFED
[Chart: network performance (Gbit/s) over time]
Source: Panda et al., Supercomputing 2009
14. And: Communication Stacks Are Becoming Faster
Network stack deficiencies
Application / OS context switches
Intermediate buffer copies
Transport processing
RDMA OFED Verbs API provides
Zero copy
Offloading TCP to NIC using RoCE
Flexibility to use IB, GE or IWARP
Resulting in
Reduced latency
Processor offloading
Operational flexibility
15. Benchmarking Modern Interconnects
Intel MPI Benchmark (IMB), typically used in HPC and parallel computing, measuring broadcast latency and exchange bandwidth
Comparing:
4x DDR IB using the Verbs API
10 GE TOE (TCP offloading engine) iWARP
1 GE
Measured broadcast latencies:
IB: 2 us
10 GE: 8.23 us
1 GE: 46.52 us
Source: Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet, IBM
16. Conclusion: Remote Nodes Have Gotten Closer
Interconnects have become much faster
Fast interconnects have become a commodity
and are moving out of the High Performance
Computing (HPC) niche
IB latency of 2,000 ns is only 20x slower than RAM, and 100x faster than SSD
Remote page faulting is much faster than
traditional disk backed page swapping!
HANA Performance Analysis, Chaim Bendelac, 2011
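The 20x and 100x claims are simple ratios of the latencies quoted; the SSD figure is my assumption (~200 µs), since the slide gives only the ratio:

```python
# Latency figures in nanoseconds; SSD_NS is assumed (~200 us), the
# DRAM and InfiniBand numbers are the ones quoted in the slides.
DRAM_NS = 100        # local memory access
IB_NS = 2_000        # InfiniBand remote access
SSD_NS = 200_000     # typical SSD read (assumed)

print(IB_NS / DRAM_NS)   # IB is 20x slower than local RAM
print(SSD_NS / IB_NS)    # and 100x faster than SSD
```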
17. Result: Blurring of the physical node boundaries
[Diagram: latency from the CPU: local cache 0 ns, local DRAM 100 ns, remote DRAM over the interconnect 2,000 ns, disk 10,000,000 ns]
19. Enabling Live Migration of SAP Workloads
Business Problem
Typical SAP workloads (e.g. SAP ERP) are transactional,
large (possibly 64 GB), with a fast rate of memory writes.
Classic live migration fails for such workloads as rapid
memory writes cause memory pages to be re-sent over
and over again
Hecatonchire’s Solution
Enable live migration by reducing both the number of
pages re-sent and the cost of a page re-send
Non-intrusive; reduces downtime, service degradation, and total migration time
20. Live Migration Technique
Pre-migration: VM active on host A; destination host selected (block devices mirrored)
Reservation: initialize container on target host
Iterative pre-copy: copy dirty pages in successive rounds
Stop and copy: suspend VM on host A; redirect network traffic; synch remaining state
Commitment: VM state on host A released; VM activated on host B
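The iterative pre-copy flow above can be sketched as a small simulation. All names are illustrative; a real hypervisor reads dirtied pages from KVM's dirty bitmap rather than a pre-recorded log:

```python
def pre_copy_migrate(memory, dirty_log, stop_threshold=2, max_rounds=30):
    """Iterative pre-copy sketch: keep resending dirtied pages until the
    dirty set is small, then stop the VM and copy the remainder.
    `dirty_log` is a list of per-round dirty-page sets (a stand-in for
    the hypervisor's dirty bitmap).  Returns (pages_sent, rounds_used)."""
    sent = len(memory)                     # round 0: full copy of all pages
    rounds = 0
    for dirty in dirty_log[:max_rounds]:
        if len(dirty) <= stop_threshold:   # converged: go to stop-and-copy
            break
        sent += len(dirty)                 # successive rounds while live on A
        rounds += 1
    final = dirty_log[rounds] if rounds < len(dirty_log) else set()
    sent += len(final)                     # stop-and-copy of the last dirty set
    return sent, rounds

# A write-heavy guest dirties many pages per round; migration converges
# only once the dirty set shrinks below the threshold:
log = [set(range(50)), set(range(20)), {1, 2}]
print(pre_copy_migrate({i: b"" for i in range(100)}, log))  # (172, 2)
```

This also shows why rapid memory writes defeat classic pre-copy: if every round's dirty set stays large, the loop only ends at `max_rounds`, after resending pages many times.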
21. Pre-copy live migration
Reducing the number of page re-sends
Page LRU reordering, such that pages with a high chance of being re-dirtied soon are delayed until later
Reducing the cost of a page re-send
By using the XBZRLE delta encoder we can represent page changes much more efficiently
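The XBZRLE idea (XOR-based zero run-length encoding) can be sketched as follows. This is a simplified framing, not the Qemu wire format: I use fixed 16-bit run lengths (enough for a 4 KiB page), where the real encoder uses a more compact encoding:

```python
import struct

def xbzrle_encode(old: bytes, new: bytes) -> bytes:
    """XOR the old and new page, then emit alternating chunks of
    (zero-run length, literal-run length, literal new-page bytes)."""
    xor = bytes(a ^ b for a, b in zip(old, new))
    out, i, n = bytearray(), 0, len(xor)
    while i < n:
        z = i
        while i < n and xor[i] == 0:       # unchanged bytes: count only
            i += 1
        lit = i
        while i < n and xor[i] != 0:       # changed bytes: send literally
            i += 1
        out += struct.pack(">HH", z and i - lit or i - lit, 0)[:0]  # (placeholder removed)
        out += struct.pack(">HH", lit - z, i - lit) + new[lit:i]
    return bytes(out)

def xbzrle_decode(old: bytes, delta: bytes) -> bytes:
    """Rebuild the new page by patching literal runs into the old page."""
    page, i, pos = bytearray(old), 0, 0
    while i < len(delta):
        zrun, lit = struct.unpack_from(">HH", delta, i)
        i += 4
        pos += zrun                        # skip the unchanged run
        page[pos:pos + lit] = delta[i:i + lit]
        i += lit
        pos += lit
    return bytes(page)

old = bytes(4096)
new = bytearray(old); new[10:14] = b"abcd"; new = bytes(new)
delta = xbzrle_encode(old, new)
assert xbzrle_decode(old, delta) == new    # 4 KiB page re-sent in ~12 bytes
```

A page where only a few counters changed compresses to a handful of bytes, which is exactly the write pattern of transactional SAP workloads that makes plain page re-sends so expensive.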
22. More Than One Way to Live Migrate …
Pre-Copy Live-Migration:
Pre-migrate; Reservation → Iterative Pre-Copy (X rounds) → Stop and Copy → Commit
Live on A → Downtime → Live on B (Total Migration Time)

Post-Copy Live-Migration:
Pre-migrate; Reservation → Stop and Copy → Commit → Page Pushing (1 round)
Live on A → Downtime → Degraded on B → Live on B (Total Migration Time)

Hybrid Post-Copy Live-Migration:
Pre-migrate; Reservation → Iterative Pre-Copy (X rounds) → Stop and Copy → Commit → Page Pushing (1 round)
Live on A → Downtime → Degraded on B → Live on B (Total Migration Time)
23. Post-copy live migration using fast interconnects
In Post-copy live migration the state of the VM
is transferred to the destination and activated
before memory is transferred
Post-copy implementation includes
Handling of remote page faults
Background transfer of memory pages
Service degradation mitigated by
RDMA zero-copy interconnects
Pre-paging – similar in concept to pre-fetching
Hybrid Post Copy – begins with a pre-copy phase
MMU integration – eliminating need for VM pause
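A toy model of the post-copy idea: the VM runs on the destination immediately, missing pages fault and are fetched remotely, and pre-paging streams the rest in the background. All names are illustrative; real remote faults go through the kernel and the RDMA transport, not Python dicts:

```python
class PostCopyGuest:
    """Post-copy sketch: the destination starts with an empty memory map;
    a read miss triggers a (slow) remote page fault, while a background
    pass pre-pages everything else so later reads never fault."""

    def __init__(self, source_memory):
        self.source = source_memory      # stands in for host A over RDMA
        self.local = {}                  # pages already resident on host B
        self.remote_faults = 0

    def read(self, page_no):
        if page_no not in self.local:    # remote page fault
            self.remote_faults += 1
            self.local[page_no] = self.source[page_no]
        return self.local[page_no]

    def background_push(self):
        """Pre-paging, similar in concept to pre-fetching."""
        for page_no, data in self.source.items():
            self.local.setdefault(page_no, data)

vm = PostCopyGuest({i: f"page-{i}" for i in range(8)})
vm.read(3)              # faults once; the page is fetched remotely
vm.background_push()    # remaining pages arrive in the background
vm.read(5)              # no fault: already pre-paged
print(vm.remote_faults)  # 1
```

The "Degraded on B" phase in the timelines above is exactly the window in which reads still hit `remote_faults`; fast RDMA makes each such fault cheap, and pre-paging shortens the window.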
26. The Memory Cloud
Turns memory into a distributed memory service
[Diagram: Server 1 … Server 3, each hosting a VM with applications, RAM and storage, contributing to a shared memory pool]
Breaks memory from the bounds of the physical box
Yields double-digit percentage gains in IT economics
Transparent deployment with performance at scale and reliability
27. RRAIM: Remote Redundant Array of Inexpensive Memory
Supporting Large Memory Instances On-Demand
Business Problem
Current instance memory sizes are constrained by the physical host’s memory size (Amazon’s biggest VM occupies a whole physical host)
Heavy swap usage slows execution time for data-intensive applications
Hecatonchire Solution
The VM swaps to the memory cloud: remote DRAM is accessed via fast zero-copy RDMA interconnects
Remote DRAM latency is hidden by using page pre-pushing
MMU integration provides transparency to applications and VMs
Reliability by using a RAID-1 (mirroring) like schema
[Diagram: application cloud backed by a RAM memory cloud with compression / de-duplication / N-tiers storage / HR-HA]
Hecatonchire Value Proposition
Provides memory aggregation on-demand
Totally transparent to the workload (no integration needed)
No hardware investment! No dedicated servers!
28. Hecatonchire / RRAIM: Breakthrough Capability
Breaking the memory box barrier for memory intensive applications
[Chart: access speed (1 µsec to 10 msec) vs. capacity (MB to PB). Embedded resources (L1/L2 cache, DRAM) are fast but small; networked resources (SSD at ~100 µsec, local disk at ~10 msec, NAS, SAN) are large but slow; a performance barrier separates the two]
29. Lego Cloud Architecture (Memory Block)
Memory VM: contributes host memory (RAM)
Compute VM: consumes guest memory for its application
Combination VM: both guest and host memory
RRAIM joins them into the memory cloud; memory cloud management services span many physical nodes hosting a variety of VMs
30. Instant Flash Cloning On-Demand
Business Problem
Burst load / service usage that cannot be satisfied in time
Existing solutions
Vendors: Amazon / VMware / RightScale
Start a VM from a disk image
Requires the full VM OS startup sequence
Hecatonchire Solution
Using a paused VM as the source for Copy-on-Write (CoW), we perform a post-copy live migration
Hecatonchire Value Proposition
Just-in-time (sub-second) provisioning
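The CoW cloning idea in miniature (an illustrative sketch, not the Qemu implementation): the clone starts with zero private pages, so provisioning is near-instant, and a page is copied only on the clone's first write to it:

```python
class CowClone:
    """Flash-clone sketch: the clone shares the paused parent's memory
    read-only; a write copies the page into the clone's private set
    (copy-on-write), leaving the parent snapshot untouched."""

    def __init__(self, parent_pages):
        self.parent = parent_pages       # shared, read-only snapshot
        self.private = {}                # pages this clone has written

    def read(self, page_no):
        return self.private.get(page_no, self.parent[page_no])

    def write(self, page_no, data):
        self.private[page_no] = data     # copy-on-write: parent untouched

parent = {i: f"p{i}" for i in range(1000)}
clone = CowClone(parent)                 # "sub-second": nothing is copied yet
clone.write(7, "changed")
print(clone.read(7), clone.read(8), len(clone.private))
```

Compared to booting a fresh VM from a disk image, no OS startup sequence runs: the clone resumes from the parent's live state, and post-copy migration fills in its memory on demand.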
31. Instant Flash Cloning On-Demand
We can clone VMs to meet demand much faster than other solutions
Reducing infrastructure costs while still minimizing lost opportunities => just-in-time provisioning
Requires application integration:
We track OS/application metrics in running VMs or in the Load Balancer (LB)
Alerts are raised when metrics pass a pre-defined threshold
According to the alerts we scale up, adding more resources, or scale down to save on unutilized resources
Source: Amazon Web Services - Guide
33. Cost Effective “Small” HPC Grid
High Performance Computing (HPC)
Supercomputers at the frontline of processing speed, with 10k-100k cores
Typical benchmark: Top 500 (linear algebra)
Small HPC using 10-20 commodity (2 TB / 80 core) nodes
Typical Applications
Relational databases
Analytics tasks (linear algebra)
Simulations
Hecatonchire Value Proposition
Optimal price / performance by using commodity hardware
Operational flexibility: node downtime without downing the cluster
Seamless deployment within an existing cloud
34. Distributed Shared Memory (DSM)
Traditional cluster                 | ccNUMA
Distributed memory                  | Cache-coherent shared memory
Standard interconnects              | Fast interconnects
OS instance on each node            | One OS instance
Distribution handled by application | Distribution handled by hardware
Vendors: ScaleMP, Numascale, others
35. Distributed Shared Memory – Inherent Limitations
Linux provides NUMA topology discovery
Distance between compute cores
Distance between cores to memory
While the Linux OS is aware of the NUMA
layout the application may not be aware …
Cache-coherency may get very expensive
Inter-core: L3 Cache 20 ns
Inter-socket: Main Memory 100 ns
Inter-node (IB): Remote Memory 2,000 ns
Thus the ccNUMA architecture may not “really” be transparent to the application!
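A back-of-the-envelope cost model using the latency figures above makes the point concrete: the same number of accesses costs 20x more when the data lives a node away over IB.

```python
# Per-hop latencies quoted in the slide, in nanoseconds:
LATENCY_NS = {"l3": 20, "local_dram": 100, "remote_ib": 2_000}

def access_cost_ns(accesses):
    """Total memory-access cost for a mix of (hop_type, count) pairs:
    a simple model of why placement matters on ccNUMA."""
    return sum(LATENCY_NS[hop] * count for hop, count in accesses)

print(access_cost_ns([("local_dram", 1_000_000)]))  # 100,000,000 ns
print(access_cost_ns([("remote_ib", 1_000_000)]))   # 2,000,000,000 ns: 20x worse
```

An application that ignores the NUMA topology Linux exposes can therefore silently pay this 20x penalty on every cache-coherency miss, which is the "not really transparent" caveat above.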
37. Roadmap
2011: Live Migration
• Pre-copy XBZRLE delta encoding
• Pre-copy LRU page reordering
• Post-copy using RDMA interconnects
2012: Resource Aggregation
• Cloud management integration
• Memory aggregation - RAIM (Redundant Array of Inexpensive Memory)
• I/O aggregation - vRAID (virtual Redundant Array of Inexpensive Disks)
• Flash cloning
2013: Lego Landscape
• CPU aggregation - ccNUMA
• Flexible resource management
38. Key takeaways
Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware
With Hecatonchire, unmodified applications or VMs can tap into remote resources transparently
To be released as open source under the GPLv2 and LGPL licenses to the Qemu and Linux communities
Developed by SAP Research TI
39. Thank you
Contact information:
Benoit Hudzia; Sr. Researcher; SAP Research CEC Belfast
benoit.hudzia@sap.com
Hecatonchire Wiki:
https://wiki.wdf.sap.corp/wiki/display/cecbelfast/Hecatonchire%2C++Distributed+Shared+Memory+%28DSM%29+And+Datacenter+Resources+disaggregation+for+Cloud
Aidan Shribman; Sr. Researcher;
SAP Research Israel
aidan.Shribman@sap.com
41. Linux Kernel Virtual Machine (KVM)
Released as a Linux Kernel Module (LKM)
under GPLv2 license in 2007 by Qumranet
Full virtualization via Intel VT-x and AMD-V
virtualization extensions to the x86 instruction
set
Uses Qemu for invoking KVM, for handling of
I/O and for advanced capabilities such as VM
live migration
KVM is considered the primary hypervisor on most major Linux distributions, such as Red Hat and SUSE
42. Remote Page Faulting Architecture Comparison
Hecatonchire                     | Yobusame
No context switches              | Context switches into user mode
Zero-copy iWARP RDMA             | Standard TCP/IP transport
Hudzia and Shribman, SYSTOR 2012 | Horofuchi and Yamahata, KVM Forum 2011
43. Legal Disclaimer
The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of
SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP
has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or
release any functionality mentioned therein. This document, or any related presentation and SAP's strategy and possible future
developments, products and or platforms directions and functionality are all subject to change and may be changed by SAP at
any time for any reason without notice. The information on this document is not a commitment, promise or legal obligation to
deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied,
including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This
document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or
omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.
All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially
from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as
of their dates, and they should not be relied upon in making purchasing decisions.