SlideShare uma empresa Scribd logo
1 de 16
Baixar para ler offline
Baptiste Gerondeau, Takeharu Kato, Renato Golin
Arm Workshop, 26th July 2018
OpenHPC Automation with Ansible
Outline
● Linaro’s HPC-SIG Lab
● OpenHPC Ansible Automation
● Results
The HPC-SIG Lab
Goals:
● Cluster Automation & Validation
● Benchmarking & Performance Investigation
Requirements:
● Stability & Repeatability
● Close-to-production environment
● Upstream technology (reproducibility)
● Vendor isolation (hardware, results)
3
The HPC-SIG Lab
Network Layout
● Uplink subnet
○ Access/Provision
● BMC subnet
● MPI subnet
○ 100GB InfiniBand
○ Slave provision
● File System subnet
○ 100GB Ethernet
○ Lustre/Ceph (future)
4
The HPC-SIG Lab
● Stability & Repeatability
○ Critical external components cached locally
○ Strict migration plans (staging)
● Close-to-production environment
○ Hardware and firmware updated frequently
● Upstream technology (reproducibility)
○ All components open source / upstream
● Vendor isolation (hardware, results)
○ VPN, Provisioner, Jenkins, SSH control
5
Open Source and Upstream
Lab admin & tools:
● Jenkins: https://jenkins.io/
● Mr-Provisioner: https://github.com/Linaro/mr-provisioner
Lab Automation:
● https://github.com/Linaro/ans_setup_jenkins
● https://github.com/Linaro/mr-provisioner-role
● https://github.com/Linaro/mr-provisioner-kea-dhcp4-role
● https://github.com/Linaro/ansible-role-mr-provisioner
● https://github.com/Linaro/mr-provisioner-client
HPC-specific automation:
● https://github.com/Linaro/hpc_lab_setup
● https://github.com/Linaro/hpc_deploy_benchmarks
● https://github.com/Linaro/benchmark_harness
● https://github.com/Linaro/ansible-playbook-for-ohpc
6
Existing OpenHPC Automation
● A recipe with all rules described in the official documents
● LaTex snippets containing shell code
● Are converted and merged into a (OS-dependent) bash script
● Scripts in OHPC packages, installed after OpenHPC main package is installed
● Plus a input.local file, with some cluster-specific configuration / environment
7
Existing OpenHPC Automation
Shortcomings:
● The input.local file exports shell variables
● The recipe.sh is not idempotent
● Extensibility is impossible without editing the files
8
Ansible Playbooks
Ansible is a widely used automation tool which can describe the structure and
configuration of IT infrastructure with YAML “playbooks” and “roles”.
OpenHPC with Ansible:
● Ansible playbooks can more easily be made idempotent
● Ansible can manage nodes/tasks according to the the structure of the cluster
● Configuration is passed as a YAML file (no environment handling)
● Composition, using playbooks and roles, building on third-party content
9
Ansible OpenHPC Benefits
● Flexible cluster configuration
○ Fine grained / composable
○ Cluster wide / node group wide / node specific
● Works on both x86_64 and AArch64
○ Ansible gathers information about architecture
○ Same playbook runs on both
● OS is directly inferred by Ansible (gather_facts)
○ Ansible gathers information about OS
○ Yum, apt, zypper… can be switched in the roles’ logic
10
Ansible OpenHPC Recipe
The basic structure of the Ansible playbook
playbook
+---- group_vars/
+-- all.yml cluster wide configurations
+-- group1,group2 … node group(e.g., computing nodes, login, DFS) specific configurations
+---- host_vars/
+--- host1,host2 … host specific configurations
+---- roles/ package specific tasks, templates, config files, and config variables
+--- package-name/
+--- tasks/main.yml … YAML file to describe installation method of package-name
+--- vars/main.yml … package specific configuration variables
+--- templates/*.j2 … template files to generate configuration files
11
Ansible OpenHPC Tasks
12
Upstreaming our work
After multiple discussions, our proposal is:
● Generate from LaTex sources at the same time as recipe.sh
○ Each block/section as a separate role
○ Control OS/Arch via gather_facts/config
○ xCAT vs Warewulf, Slurm vs PBS: only add what hasn’t yet
● Push to a separate repository (openhpc/ansible-role ?)
○ Control release branch/tags as to not overwrite previous recipes
○ Repo can have additional playbooks (CI, new users)
○ Once release is validated, push to Ansible Galaxy
● Direct pull requests to playbooks / support roles
○ Only recipe roles are auto-generated
○ Support roles (specific to Ansible, playbook) are maintained on the repo
13
Results
Test Suite
Most tests green, however:
● Intel-specific tests (CILK, TBB, IMB) disabled
● Others need package install (PDF, CDF, HDF), but pass when installed
● TAU fails because LMod defaults to openmpi (needs openmpi3)
● Lustre fails as package depends on kernel 4.2 (which won’t work on our machines)
● MiniDFT and PRK had make failures, but we haven’t investigated yet
● --enable-longdoesn’t really, need to look into why not
The plan from now on is:
1. Automate package install conditional on enabled tests, fix remaining errors
2. Work with members to prioritise long term ones (like Lustre)
3. Use it for additional packages, so we can test them before sending upstream
4. Add a benchmark mode, making sure to use entire cluster
15
Thank You!

Mais conteúdo relacionado

Mais procurados

OpenNebula Conf 2014: CentOS, QA an OpenNebula - Christoph Galuschka
OpenNebula Conf 2014: CentOS, QA an OpenNebula - Christoph GaluschkaOpenNebula Conf 2014: CentOS, QA an OpenNebula - Christoph Galuschka
OpenNebula Conf 2014: CentOS, QA an OpenNebula - Christoph Galuschka
NETWAYS
 
Openstack devops challenges
Openstack devops challenges Openstack devops challenges
Openstack devops challenges
openstackindia
 

Mais procurados (20)

OpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community UpdateOpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community Update
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
 
OpenNebula Conf 2014: CentOS, QA an OpenNebula - Christoph Galuschka
OpenNebula Conf 2014: CentOS, QA an OpenNebula - Christoph GaluschkaOpenNebula Conf 2014: CentOS, QA an OpenNebula - Christoph Galuschka
OpenNebula Conf 2014: CentOS, QA an OpenNebula - Christoph Galuschka
 
BKK16-305B ILP32 Performance on AArch64
BKK16-305B ILP32 Performance on AArch64BKK16-305B ILP32 Performance on AArch64
BKK16-305B ILP32 Performance on AArch64
 
Python Basics for Operators Troubleshooting OpenStack
Python Basics for Operators Troubleshooting OpenStackPython Basics for Operators Troubleshooting OpenStack
Python Basics for Operators Troubleshooting OpenStack
 
OpenPOWER foundation update new executive director and bright open future_i...
OpenPOWER  foundation update  new executive director and bright open future_i...OpenPOWER  foundation update  new executive director and bright open future_i...
OpenPOWER foundation update new executive director and bright open future_i...
 
Openstack devops challenges
Openstack devops challenges Openstack devops challenges
Openstack devops challenges
 
TripleO
 TripleO TripleO
TripleO
 
Dude, This Isn't Where I Parked My Instance?
Dude, This Isn't Where I Parked My Instance?Dude, This Isn't Where I Parked My Instance?
Dude, This Isn't Where I Parked My Instance?
 
Openstack ansible
Openstack ansibleOpenstack ansible
Openstack ansible
 
OpenStack Data Processing ("Sahara") project update - December 2014
OpenStack Data Processing ("Sahara") project update - December 2014OpenStack Data Processing ("Sahara") project update - December 2014
OpenStack Data Processing ("Sahara") project update - December 2014
 
Andy McCrae, Rackspace - Using Ansible to Deploy and Automate OpenStack, Open...
Andy McCrae, Rackspace - Using Ansible to Deploy and Automate OpenStack, Open...Andy McCrae, Rackspace - Using Ansible to Deploy and Automate OpenStack, Open...
Andy McCrae, Rackspace - Using Ansible to Deploy and Automate OpenStack, Open...
 
LAS16-301: OpenStack on Aarch64, running in production, upstream improvements...
LAS16-301: OpenStack on Aarch64, running in production, upstream improvements...LAS16-301: OpenStack on Aarch64, running in production, upstream improvements...
LAS16-301: OpenStack on Aarch64, running in production, upstream improvements...
 
BKK16-400B ODPI - Standardizing Hadoop
BKK16-400B ODPI - Standardizing HadoopBKK16-400B ODPI - Standardizing Hadoop
BKK16-400B ODPI - Standardizing Hadoop
 
Deploying OpenDaylight and OpenStack at Ease
Deploying OpenDaylight and OpenStack at EaseDeploying OpenDaylight and OpenStack at Ease
Deploying OpenDaylight and OpenStack at Ease
 
LAS16-207: Bus scaling QoS
LAS16-207: Bus scaling QoSLAS16-207: Bus scaling QoS
LAS16-207: Bus scaling QoS
 
HA in OpenStack service - meetup #9
HA in OpenStack service - meetup #9HA in OpenStack service - meetup #9
HA in OpenStack service - meetup #9
 
Guts & OpenStack migration
Guts & OpenStack migrationGuts & OpenStack migration
Guts & OpenStack migration
 
OpenStack Neutron behind the Scenes
OpenStack Neutron behind the ScenesOpenStack Neutron behind the Scenes
OpenStack Neutron behind the Scenes
 

Semelhante a OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018

LAS16-400: Mini Conference 3 AOSP (Session 1)
LAS16-400: Mini Conference 3 AOSP (Session 1)LAS16-400: Mini Conference 3 AOSP (Session 1)
LAS16-400: Mini Conference 3 AOSP (Session 1)
Linaro
 
Automation@Brainly - Polish Linux Autumn 2014
Automation@Brainly - Polish Linux Autumn 2014Automation@Brainly - Polish Linux Autumn 2014
Automation@Brainly - Polish Linux Autumn 2014
vespian_256
 

Semelhante a OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018 (20)

LAS16-209: Finished and Upcoming Projects in LMG
LAS16-209: Finished and Upcoming Projects in LMGLAS16-209: Finished and Upcoming Projects in LMG
LAS16-209: Finished and Upcoming Projects in LMG
 
Network Automation: Ansible 101
Network Automation: Ansible 101Network Automation: Ansible 101
Network Automation: Ansible 101
 
LAS16-400: Mini Conference 3 AOSP (Session 1)
LAS16-400: Mini Conference 3 AOSP (Session 1)LAS16-400: Mini Conference 3 AOSP (Session 1)
LAS16-400: Mini Conference 3 AOSP (Session 1)
 
TechDay - Cambridge 2016 - OpenNebula at Harvard Univerity
TechDay - Cambridge 2016 - OpenNebula at Harvard UniverityTechDay - Cambridge 2016 - OpenNebula at Harvard Univerity
TechDay - Cambridge 2016 - OpenNebula at Harvard Univerity
 
Introduction to ansible
Introduction to ansibleIntroduction to ansible
Introduction to ansible
 
2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard
 
PLNOG14: Automation at Brainly - Paweł Rozlach
PLNOG14: Automation at Brainly - Paweł RozlachPLNOG14: Automation at Brainly - Paweł Rozlach
PLNOG14: Automation at Brainly - Paweł Rozlach
 
PLNOG Automation@Brainly
PLNOG Automation@BrainlyPLNOG Automation@Brainly
PLNOG Automation@Brainly
 
Linux Kernel Platform Development: Challenges and Insights
 Linux Kernel Platform Development: Challenges and Insights Linux Kernel Platform Development: Challenges and Insights
Linux Kernel Platform Development: Challenges and Insights
 
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
 
Linux 开源操作系统发展新趋势
Linux 开源操作系统发展新趋势Linux 开源操作系统发展新趋势
Linux 开源操作系统发展新趋势
 
Sheepdog Status Report
Sheepdog Status ReportSheepdog Status Report
Sheepdog Status Report
 
Lec 10-linux-review
Lec 10-linux-reviewLec 10-linux-review
Lec 10-linux-review
 
LMG Lightning Talks - SFO17-205
LMG Lightning Talks - SFO17-205LMG Lightning Talks - SFO17-205
LMG Lightning Talks - SFO17-205
 
Introduction to ansible
Introduction to ansibleIntroduction to ansible
Introduction to ansible
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
 
Automation@Brainly - Polish Linux Autumn 2014
Automation@Brainly - Polish Linux Autumn 2014Automation@Brainly - Polish Linux Autumn 2014
Automation@Brainly - Polish Linux Autumn 2014
 
Backing up Wikipedia Databases
Backing up Wikipedia DatabasesBacking up Wikipedia Databases
Backing up Wikipedia Databases
 
#OktoCampus - Workshop : An introduction to Ansible
#OktoCampus - Workshop : An introduction to Ansible#OktoCampus - Workshop : An introduction to Ansible
#OktoCampus - Workshop : An introduction to Ansible
 
LCE13: Test and Validation Summit: The future of testing at Linaro
LCE13: Test and Validation Summit: The future of testing at LinaroLCE13: Test and Validation Summit: The future of testing at Linaro
LCE13: Test and Validation Summit: The future of testing at Linaro
 

Mais de Linaro

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Linaro
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
Linaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
Linaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
Linaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
Linaro
 
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
Linaro
 
HKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready ProgramHKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready Program
Linaro
 
HKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NNHKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NN
Linaro
 
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
Linaro
 
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
Linaro
 
HKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: IntroductionHKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: Introduction
Linaro
 

Mais de Linaro (20)

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qa
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening Keynote
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
 
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
 
HKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready ProgramHKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready Program
 
HKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NNHKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NN
 
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
 
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
 
HKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: IntroductionHKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: Introduction
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018

  • 1. Baptiste Gerondeau, Takeharu Kato, Renato Golin Arm Workshop, 26th July 2018 OpenHPC Automation with Ansible
  • 2. Outline ● Linaro’s HPC-SIG Lab ● OpenHPC Ansible Automation ● Results
  • 3. The HPC-SIG Lab Goals: ● Cluster Automation & Validation ● Benchmarking & Performance Investigation Requirements: ● Stability & Repeatability ● Close-to-production environment ● Upstream technology (reproducibility) ● Vendor isolation (hardware, results) 3
  • 4. The HPC-SIG Lab Network Layout ● Uplink subnet ○ Access/Provision ● BMC subnet ● MPI subnet ○ 100GB InfiniBand ○ Slave provision ● File System subnet ○ 100GB Ethernet ○ Lustre/Ceph (future) 4
  • 5. The HPC-SIG Lab ● Stability & Repeatability ○ Critical external components cached locally ○ Strict migration plans (staging) ● Close-to-production environment ○ Hardware and firmware updated frequently ● Upstream technology (reproducibility) ○ All components open source / upstream ● Vendor isolation (hardware, results) ○ VPN, Provisioner, Jenkins, SSH control 5
  • 6. Open Source and Upstream Lab admin & tools: ● Jenkins: https://jenkins.io/ ● Mr-Provisioner: https://github.com/Linaro/mr-provisioner Lab Automation: ● https://github.com/Linaro/ans_setup_jenkins ● https://github.com/Linaro/mr-provisioner-role ● https://github.com/Linaro/mr-provisioner-kea-dhcp4-role ● https://github.com/Linaro/ansible-role-mr-provisioner ● https://github.com/Linaro/mr-provisioner-client HPC-specific automation: ● https://github.com/Linaro/hpc_lab_setup ● https://github.com/Linaro/hpc_deploy_benchmarks ● https://github.com/Linaro/benchmark_harness ● https://github.com/Linaro/ansible-playbook-for-ohpc 6
  • 7. Existing OpenHPC Automation ● A recipe with all rules described in the official documents ● LaTex snippets containing shell code ● Are converted and merged into a (OS-dependent) bash script ● Scripts in OHPC packages, installed after OpenHPC main package is installed ● Plus a input.local file, with some cluster-specific configuration / environment 7
  • 8. Existing OpenHPC Automation Shortcomings: ● The input.local file exports shell variables ● The recipe.sh is not idempotent ● Extensibility is impossible without editing the files 8
  • 9. Ansible Playbooks Ansible is a widely used automation tool which can describe the structure and configuration of IT infrastructure with YAML “playbooks” and “roles”. OpenHPC with Ansible: ● Ansible playbooks can more easily be made idempotent ● Ansible can manage nodes/tasks according to the the structure of the cluster ● Configuration is passed as a YAML file (no environment handling) ● Composition, using playbooks and roles, building on third-party content 9
  • 10. Ansible OpenHPC Benefits ● Flexible cluster configuration ○ Fine grained / composable ○ Cluster wide / node group wide / node specific ● Works on both x86_64 and AArch64 ○ Ansible gathers information about architecture ○ Same playbook runs on both ● OS is directly inferred by Ansible (gather_facts) ○ Ansible gathers information about OS ○ Yum, apt, zypper… can be switched in the roles’ logic 10
  • 11. Ansible OpenHPC Recipe The basic structure of the Ansible playbook playbook +---- group_vars/ +-- all.yml cluster wide configurations +-- group1,group2 … node group(e.g., computing nodes, login, DFS) specific configurations +---- host_vars/ +--- host1,host2 … host specific configurations +---- roles/ package specific tasks, templates, config files, and config variables +--- package-name/ +--- tasks/main.yml … YAML file to describe installation method of package-name +--- vars/main.yml … package specific configuration variables +--- templates/*.j2 … template files to generate configuration files 11
  • 13. Upstreaming our work After multiple discussions, our proposal is: ● Generate from LaTex sources at the same time as recipe.sh ○ Each block/section as a separate role ○ Control OS/Arch via gather_facts/config ○ xCAT vs Warewulf, Slurm vs PBS: only add what hasn’t yet ● Push to a separate repository (openhpc/ansible-role ?) ○ Control release branch/tags as to not overwrite previous recipes ○ Repo can have additional playbooks (CI, new users) ○ Once release is validated, push to Ansible Galaxy ● Direct pull requests to playbooks / support roles ○ Only recipe roles are auto-generated ○ Support roles (specific to Ansible, playbook) are maintained on the repo 13
  • 15. Test Suite Most tests green, however: ● Intel-specific tests (CILK, TBB, IMB) disabled ● Others need package install (PDF, CDF, HDF), but pass when installed ● TAU fails because LMod defaults to openmpi (needs openmpi3) ● Lustre fails as package depends on kernel 4.2 (which won’t work on our machines) ● MiniDFT and PRK had make failures, but we haven’t investigated yet ● --enable-longdoesn’t really, need to look into why not The plan from now on is: 1. Automate package install conditional on enabled tests, fix remaining errors 2. Work with members to prioritise long term ones (like Lustre) 3. Use it for additional packages, so we can test them before sending upstream 4. Add a benchmark mode, making sure to use entire cluster 15