Developing cloud infrastructure from scratch:
the tale of the ISP
Andrey Korolyov
Performix LLC

22 October 2013
Architecture planning

Scalable to infinity and beyond
Fault-tolerant (not just words for a salesman)
Easy to build...or not so easy
Cheap commodity hardware
Avoid modular design because we are not building OpenStack
Things that are tolerable in a private cloud are completely
unacceptable in a public one
The beginning: no expertise but lots of enthusiasm

Ceph for storage – just because every other distributed FS was far worse
for our needs
Flat networking with L2 filtering via libvirt's nwfilter mechanism – keeps
all logic inside libvirt (see the sketch after this list)
KVM virtualization because of its rich feature set
Monitoring via collectd because many components already have support for it
So it seemed we just needed to write a UI to build our cloud...
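A minimal sketch of the nwfilter approach: libvirt ships a built-in
clean-traffic filter that pins a guest interface to its assigned MAC/IP
(bridge name, MAC and address below are illustrative):

<interface type='bridge'>
  <source bridge='br0'/>
  <mac address='52:54:00:aa:bb:cc'/>
  <!-- drop spoofed MAC/IP/ARP traffic coming from this guest -->
  <filterref filter='clean-traffic'>
    <parameter name='IP' value='203.0.113.10'/>
  </filterref>
</interface>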
The beginning: after testing it all together (mid-2012)

Ceph (Argonaut) had very promising benchmarks but was unusable due to
I/O freezes during rebalance.
The RBD driver had a lot of leaks.
Libvirt had no OVS support and a lot of hard-to-catch leaks.
Eventually the first version of our orchestrator was launched.
VM management

All VM operations pass through a single local agent instance to ease the
task of synchronizing different kinds of requests.
VMs in cgroups: pinning and accounting (see the sketch below).
A mechanism to mitigate leak effects via delayed OOM, similar to [a].
Statistics are preprocessed and compacted on the node agent.
a https://lwn.net/Articles/317814/
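A minimal sketch of cgroup-backed pinning as libvirt exposes it (vCPU counts
and host CPU numbers are illustrative); libvirt places each VM in its own
cgroup, so pinning and accounting ride on the same hierarchy:

<vcpu placement='static'>2</vcpu>
<cputune>
  <!-- pin each vCPU to a dedicated host core -->
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
  <!-- relative CPU weight, used for accounting and fair sharing -->
  <shares>1024</shares>
</cputune>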
VM management, part 2

A guest agent is necessary for roughly half of the tasks
Any upgrade can be done via migration (see the sketch below)
Automatic migration to meet guaranteed shareable resource needs
Rack-based management granularity
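Under the hood, such an upgrade path is just a live migration; with libvirt it
boils down to a single call (host names are hypothetical):

virsh migrate --live --persistent --undefinesource vm01 \
    qemu+ssh://node02/system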
A lot of features: implement, take a look, and throw it out

Pinning with exact match to the host topology
Custom TCP congestion-control protocol
Memory automatic shrinking/extending based on statistics
Online CPU hot-add/hot-remove
NUMA topology passthrough
Performance versus needs

Most regular workloads do not need strict NUMA tuning to get large
performance improvements.
The same goes for THP: the boost is quite low.
The IDE driver supports discard operations, but its bandwidth is limited to
about 70 Mb/s, so virtio is the only option.
Btrfs shows awesome benchmark results as a Ceph backend (zero-cost
snapshot operations, workloads with tons of small operations) but is
ultimately unstable.
None of this made it into our production system.
Building on top of instabilities (end of 2012)

XFS was quite unstable under CPU-intensive workloads
Ceph daemons leak in many ways and have to be restarted periodically
Nested per-thread cgroups for QEMU may hang the system
Open vSwitch can be brought down by a relatively simple attack
Memory hot-add: ACPI hotplug – buggy but works
Reserved kernel memory takes a lot when the high and low limits differ
by an order of magnitude, so ballooning can be used only with samepage
merging.
Regular KSM does not offer sufficiently fine tuning (its whole tuning
surface is a few sysfs knobs; see below), and UKSM has memory allocation
issues.
None of us was able to fix the memory compaction issues, so we switched
to the ACPI hotplug driver.
It is out of tree and requires some additional work to play with [a].
a https://github.com/vliaskov/qemu-kvm/

<qemu:commandline>
  <qemu:arg value='-dimm'/>
  <qemu:arg value='id=dimm0,size=256M,populated=on'/>
  ...
  <qemu:arg value='id=dimm62,size=256M,populated=off'/>
</qemu:commandline>
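For comparison, the KSM knobs mentioned above: everything mainline KSM
offers is a handful of sysfs values (numbers below are illustrative), which
is exactly what makes its tuning too coarse:

echo 1    > /sys/kernel/mm/ksm/run
echo 1000 > /sys/kernel/mm/ksm/pages_to_scan
echo 20   > /sys/kernel/mm/ksm/sleep_millisecs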
Ceph

Ceph brings a lot of awesome improvements in every release, much like
Linux itself.
And as with Linux, some work is required to make it rocket fast:
Caches at every level;
Fast and low-latency links and precise time sync;
Rebalancing should be avoided as much as possible (see the sketch below);
Combine storage and compute roles but avoid resource exhaustion;
Fighting client latencies all the time.
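A hedged sketch of the throttles this implies, with option names from the
Ceph releases of that era (values are illustrative, not a recommendation):

[osd]
    # keep client I/O ahead of recovery/backfill traffic
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1
    osd client op priority = 63

Plus 'ceph osd set noout' before planned maintenance, so a rebooted node
does not trigger a rebalance at all.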
Ceph, part 2

Rotating media by itself cannot give proper latency ceilings for lots of
read requests, so a large cache is required.
dm-cache is the only solution that is pluggable, reliable, and present in
the Linux tree at the same time (see the sketch after this list).
The cache helped us to bring 99% of read latencies under 10 ms, a dramatic
decrease from 100 ms with only the HW controller cache.
Operation priorities are the most important thing.
At 10x scale, the set of problem priorities changed completely.
A lot of tunables have to be modified.
The smaller the percentage of data one node holds, the shorter the
downtime (on peering) in case of failure.
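A minimal dm-cache sketch (device paths and sizes are illustrative; the
table format follows the kernel's dm-cache documentation):

# 0 <origin length in 512B sectors> cache <metadata dev> <cache dev> \
#   <origin dev> <block size> <#features> <features> <policy> <#policy args>
dmsetup create cached-osd --table \
  "0 1953125000 cache /dev/ssd/meta /dev/ssd/cache /dev/sdb 512 1 writeback default 0"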
Selecting networking platform

Putting effort into RSockets for Ceph
The storage network may remain flat
Switch hardware should run open-source firmware;
There are a lot of options on the market today:
Arista, HP, NEC, Cisco, Juniper, even Mellanox... none of them offers an
open-source platform.
We chose the Pica8 platform because it is managed through OVS and can be
modified easily.
Networking gets an upgrade

With standalone OVS we need to build a GRE mesh or put a VLAN on each
interface to make private networking work properly (see the sketch below).
Lots of broadcast in the public network segment.
Traffic-pattern analysis for preventing abuse/spam can hardly be applied
to regular networking.
OpenFlow 1.0+ fulfills all those needs, so we made the necessary
preparations and switched at once
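For scale, this is the per-peer plumbing a standalone-OVS GRE mesh
requires, repeated for every pair of nodes (bridge name and address are
illustrative):

ovs-vsctl add-port br-int gre-node02 -- \
    set interface gre-node02 type=gre options:remote_ip=10.0.0.2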
OpenFlow

Floodlight was selected to minimize integration complexity
All OpenFlow logic is derived from the existing VM-VM and MAC-VM
relations (see the sketch below)
Future extensions like overlay networking can be built with very little effort
Multiple-controller synchronization is still not finished
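In OpenFlow 1.0 terms, such MAC-VM logic reduces to rules like these (port
number and MAC are illustrative): permit the known guest MAC on its port,
drop anything spoofed:

ovs-ofctl add-flow br0 "priority=100,in_port=5,dl_src=52:54:00:aa:bb:cc,actions=NORMAL"
ovs-ofctl add-flow br0 "priority=90,in_port=5,actions=drop"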
Software storage and software networking

Due to the nature of Ceph, VMs can be migrated and placed anywhere
using snapshot+flatten (and hopefully something easier in the future);
see the sketch after this list
Any VMs may be grouped easily;
The number of overlay networking rules that need to be built drops
dramatically, up to IDC scale;
Only filtering and forwarding tasks remain at rack scale for OpenFlow.
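The snapshot+flatten path, sketched with the stock rbd CLI (pool and image
names are illustrative):

rbd snap create rbd/vm01@base
rbd snap protect rbd/vm01@base
rbd clone rbd/vm01@base rbd/vm01-copy
# detach the clone from its parent so it can live anywhere
rbd flatten rbd/vm01-copy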
Current limitations in Linux

No automatic ballooning
No reliable online memory shrinking mechanism outside ballooning
THP does not work with virtual memory overcommit
Online partition resizing only arrived in 2012, so almost no distro
supports it out of the box
Future work

Advanced traffic engineering on top of a fat-tree Ethernet segment, using
Pica8 3295 as ToR and 3922 at the second level
Extending the Ceph experience (RSockets and driver policy improvements)
Implementing a "zero-failure" autoballooning mechanism
Any questions?
Backup slides: VM scalability

CPU and RAM cannot be scaled beyond one node without RDMA, and then
only for very specific NUMA-aware tasks.
Passing a variable NUMA topology to a VM is impossible to date, so Very
Large VMs are not in place.
A VM can be easily cloned, and a new instance with rewritten network
settings can be spun up (behind a balancer).
Rebalancing unshareable/exhausted resources can be done via migration,
but even live migration is visible to the end user; we are waiting for
RDMA migration to land in mainline.
