SlideShare uma empresa Scribd logo
1 de 42
Baixar para ler offline
Luciano Mammino (@loige)
Serverless for HPC 🚀
fth.link/cm22
Is Serverless a good option for
High Performance Computing?
@loige
👋 Hello, I am Luciano
Senior architect
nodejsdesignpatterns.com
Let’s connect:
🌎 loige.co
🐦 @loige
🎥 loige
🧳 lucianomammino
Middy Framework
SLIC Starter - Serverless Accelerator
SLIC Watch - Observability Plugin
Business focused technologists.
Accelerated Serverless | AI as a Service | Platform Modernisation
We host a podcast about AWS and Cloud computing
🔗 awsbites.com
🎬 YouTube Channel
🎙 Podcast
📅 Episodes every week
@loige
Get the slides: fth.link/cm22
@loige
Agenda
● The 6 Rs of Cloud Migration
● A serverless case study
○ The problem space and types of workflows
○ Original on premise implementation
○ The PoC
○ The final production version
○ The components of a serverless job scheduler
○ Challenges & Limits
@loige
fth.link/cm22
The 6 Rs of Cloud Migrations
@loige
🗑 🕸 🚚
Retire Retain Rehost
🏗 📐 💰
Replatform Refactor Repurchase
fth.link/cm22
A case study
@loige
Case study on AWS blog: fth.link/awshpc
The workloads - Risk Rollup
🏦 Financial modeling to understand the portfolio of risk
🧠 Internal, custom-built risk model on all reinsurance deals
⚙ HPC (High-Performance Computing) workload
🗄 ~45TB data processed
⏱ 2/3 rollups per day (6-8 hours each!)
@loige
The workloads - Deal Analytics
⚡ Near real-time deal pricing using the same risk model
🗃 Lower data volumes
🔁 High frequency of execution – up to 1.000 per day
@loige
Original on-prem implementation
@loige
Challenges
🐢 Long execution times, constraining business agility
🥊 Competing workloads
📈 Limits our ability to support portfolio growth
😩 Can’t deliver new features
🧾 Very high total cost of ownership
@loige
Thinking Big
💭 Imagine a solution that would …
1. Offer a dramatic increase in performance
2. Provide consistent run times
3. Support more executions, more often
4. Support future portfolio growth and new
capabilities – 15x data volumes
@loige
The Goal ⚽
Run a Risk Rollup in 1 hour!
@loige
Architecture Options for Compute/Orchestration
@loige
AWS Lambda
Amazon SQS AWS Step Functions
AWS Fargate
Com t om :
Red he b to si l ,
s a l , ev -d i n
co n s
POC Architecture
@loige
AWS Batch
S3
Step Functions
Lambda
SQS
Measure Everything! 📏
⏱ Built metrics in from the start
󰤈 AWS metrics we wish existed out of the box:
- Number of running containers
- Success/failure counts
🎨 Custom metrics:
- Scheduler overhead
- Detailed timings (job duration, I/O time, algorithm steps)
🛠 Using CloudWatch, EMF
@loige
Measure Everything! 📏
👍 Rollup in 1 hour
☁ Running on AWS Batch
👎 Cluster utilisation was <50%
✅ Goal success
🤔 Understanding of what needs to be addressed
@loige
Beyond the PoC
Production: optimise for unique workload characteristics
@loige
Job Plan
@loige
In reality, not all jobs are alike!
@loige
Horizontal scaling 🚀
1000’s of jobs
Duration: 1 second – 45 minutes
Scaling horizontally = splitting jobs
Jobs split according to their
complexity/duration
Resulting in >1 million jobs
@loige
Moving to production 🚢
@loige
Scope
@loige
Actual End to End overview
@loige
Modelling Worker
@loige
Compute Services
Scales to 1000’s of tasks (containers)
Little management overhead
Up to 4 vCPUs and 30GB Memory
Up to 200GB ephemeral storage
Scales to 1000’s of function containers (in seconds!)
Very little management overhead
Up to 6 vCPUs and 10GB Memory
Up to 10GB ephemeral storage
It wasn’t always this way!
@loige
Store all the things in S3!
The source of truth for:
● Input Data (JSON, Parquet)
● Intermediate Data (Parquet)
● Results (Parquet)
● Aggregates (Parquet)
Input data: 20GB
Output data: ~1 TB
Reads and writes: 10,000s of objects per second.
@loige
Scheduling and Orchestration
✅ We have our cluster (Fargate or Lambda)
✅ We have a plan! (list of jobs, parameters and
dependencies)
🤔 How do we feed this plan to the cluster?!
🤨 Existing schedulers use traditional clusters – there
is no serverless job scheduler for workloads like this!
@loige
Lifecycle of a Job
@loige
A new job
get queued
here 👇
A worker
picks up the
job and
executes it
The worker
emits the
job state
(success or
failure)
Event-Driven Scheduler
@loige
Job states are pulled
from a Kinesis Data
Stream
Redis stores:
- Job states
- Dependencies
This scheduler checks
new job states against
the state in Redis and
figures out if there are
new jobs that can be
scheduled next
Dynamic Runtime
Handling
@loige
We also need to handle
system failures!
Outcomes 🙌
Business
● Rollup in 1 hour
● Removed limits on number of runs
● Faster, more consistent deal
analytics
● Business spending more time on
revenue-generating activities
● Support portfolio growth and deliver
new capabilities
@loige
Technology
● Brought serverless to HPC financial
modeling
● Reduced codebase by ~70%
● Lowered total cost of ownership
● Increased dev team agility
● Reduced carbon footprint
Hitting the limits 😰
@loige
S3 Throughput
@loige
S3 Partitioning
S3 cleverly detects high-throughput prefixes and creates partitions
….normally
If this does not happen…
🚨Please reduce your request rate;
Status Code: 503; Error Code: SlowDown
@loige
The Solution
Explicit Partitioning:
○Figure out how many partitions you need
○Update code to create keys uniformly distributed over all partitions
/part/0…
/part/1…
/part/2…
/part/3…
…
/part/f…
@loige
1. Talk (a lot) to AWS SAs, Support, Account
Manager for special requirements like this!
2. Think ahead if you have multiple accounts
for different environments!
Fargate Scaling
● We want to run 3000 containers ASAP
● This took > 1 hour!
● We built a custom Fargate scaler
○Using the RunTask API (no ECS Service)
○Hidden quota increases
○Step Function + Lambda
● 3000 containers in ~20 minutes
@loige
The AWS ECS team since made lots of
improvements, making it possible to scale to
3,000 containers in under 5 minutes
How high can we go today?
🚀 10,000 concurrent Lambda functions in seconds
🎢 10,000 Fargate containers in 10 minutes
💸 No additional cost
vladionescu.me/posts/scaling-containers-on-aws-in-2022
@loige
Wrapping up 🎁
● "Serverless supercomputer" lets you do HPC with
commodity AWS compute
● Plenty of challenges, but it's doable!
● Agility and innovation benefits are massive
● Customer is now serverless-first and expert in AWS
Other interesting case studies:
☁ AWS HTC Grid - 🧬 COVID genome research
@loige
Special thanks to @eoins, @cmthorne10 and the awesome team at RenRe!
@loige
fth.link/cm22
Serverless for HPC?
IT WORKS!

Mais conteúdo relacionado

Semelhante a Serverless for High Performance Computing

The burden of a successful feature: Scaling our real time logging platform
The burden of a successful feature: Scaling our real time logging platformThe burden of a successful feature: Scaling our real time logging platform
The burden of a successful feature: Scaling our real time logging platformFastly
 
There is something about serverless
There is something about serverlessThere is something about serverless
There is something about serverlessgjdevos
 
Airflow based Video Encoding Platform
Airflow based Video Encoding PlatformAirflow based Video Encoding Platform
Airflow based Video Encoding PlatformHotstar
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegasPeter Mounce
 
AWS Observability Made Simple
AWS Observability Made SimpleAWS Observability Made Simple
AWS Observability Made SimpleLuciano Mammino
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkDataWorks Summit
 
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka Hua Chu
 
Writing and deploying serverless python applications
Writing and deploying serverless python applicationsWriting and deploying serverless python applications
Writing and deploying serverless python applicationsCesar Cardenas Desales
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsClaudiu Coman
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupKaxil Naik
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward
 
PyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsPyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsCesar Cardenas Desales
 
PyConIT 2018 Writing and deploying serverless python applications
PyConIT 2018 Writing and deploying serverless python applicationsPyConIT 2018 Writing and deploying serverless python applications
PyConIT 2018 Writing and deploying serverless python applicationsCesar Cardenas Desales
 
Scaling ELK Stack - DevOpsDays Singapore
Scaling ELK Stack - DevOpsDays SingaporeScaling ELK Stack - DevOpsDays Singapore
Scaling ELK Stack - DevOpsDays SingaporeAngad Singh
 
JUST EAT: Tools we use to enable our culture
JUST EAT: Tools we use to enable our cultureJUST EAT: Tools we use to enable our culture
JUST EAT: Tools we use to enable our culturePeter Mounce
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Kaxil Naik
 

Semelhante a Serverless for High Performance Computing (20)

The burden of a successful feature: Scaling our real time logging platform
The burden of a successful feature: Scaling our real time logging platformThe burden of a successful feature: Scaling our real time logging platform
The burden of a successful feature: Scaling our real time logging platform
 
There is something about serverless
There is something about serverlessThere is something about serverless
There is something about serverless
 
Airflow based Video Encoding Platform
Airflow based Video Encoding PlatformAirflow based Video Encoding Platform
Airflow based Video Encoding Platform
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegas
 
AWS Observability Made Simple
AWS Observability Made SimpleAWS Observability Made Simple
AWS Observability Made Simple
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
 
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
 
Writing and deploying serverless python applications
Writing and deploying serverless python applicationsWriting and deploying serverless python applications
Writing and deploying serverless python applications
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyics
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
 
PyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsPyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applications
 
PyConIT 2018 Writing and deploying serverless python applications
PyConIT 2018 Writing and deploying serverless python applicationsPyConIT 2018 Writing and deploying serverless python applications
PyConIT 2018 Writing and deploying serverless python applications
 
Scaling ELK Stack - DevOpsDays Singapore
Scaling ELK Stack - DevOpsDays SingaporeScaling ELK Stack - DevOpsDays Singapore
Scaling ELK Stack - DevOpsDays Singapore
 
JUST EAT: Tools we use to enable our culture
JUST EAT: Tools we use to enable our cultureJUST EAT: Tools we use to enable our culture
JUST EAT: Tools we use to enable our culture
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
 

Mais de Luciano Mammino

Did you know JavaScript has iterators? DublinJS
Did you know JavaScript has iterators? DublinJSDid you know JavaScript has iterators? DublinJS
Did you know JavaScript has iterators? DublinJSLuciano Mammino
 
What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...
What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...
What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...Luciano Mammino
 
Building an invite-only microsite with Next.js & Airtable - ReactJS Milano
Building an invite-only microsite with Next.js & Airtable - ReactJS MilanoBuilding an invite-only microsite with Next.js & Airtable - ReactJS Milano
Building an invite-only microsite with Next.js & Airtable - ReactJS MilanoLuciano Mammino
 
From Node.js to Design Patterns - BuildPiper
From Node.js to Design Patterns - BuildPiperFrom Node.js to Design Patterns - BuildPiper
From Node.js to Design Patterns - BuildPiperLuciano Mammino
 
Let's build a 0-cost invite-only website with Next.js and Airtable!
Let's build a 0-cost invite-only website with Next.js and Airtable!Let's build a 0-cost invite-only website with Next.js and Airtable!
Let's build a 0-cost invite-only website with Next.js and Airtable!Luciano Mammino
 
Everything I know about S3 pre-signed URLs
Everything I know about S3 pre-signed URLsEverything I know about S3 pre-signed URLs
Everything I know about S3 pre-signed URLsLuciano Mammino
 
JavaScript Iteration Protocols - Workshop NodeConf EU 2022
JavaScript Iteration Protocols - Workshop NodeConf EU 2022JavaScript Iteration Protocols - Workshop NodeConf EU 2022
JavaScript Iteration Protocols - Workshop NodeConf EU 2022Luciano Mammino
 
Building an invite-only microsite with Next.js & Airtable
Building an invite-only microsite with Next.js & AirtableBuilding an invite-only microsite with Next.js & Airtable
Building an invite-only microsite with Next.js & AirtableLuciano Mammino
 
Let's take the monolith to the cloud 🚀
Let's take the monolith to the cloud 🚀Let's take the monolith to the cloud 🚀
Let's take the monolith to the cloud 🚀Luciano Mammino
 
A look inside the European Covid Green Certificate - Rust Dublin
A look inside the European Covid Green Certificate - Rust DublinA look inside the European Covid Green Certificate - Rust Dublin
A look inside the European Covid Green Certificate - Rust DublinLuciano Mammino
 
Node.js: scalability tips - Azure Dev Community Vijayawada
Node.js: scalability tips - Azure Dev Community VijayawadaNode.js: scalability tips - Azure Dev Community Vijayawada
Node.js: scalability tips - Azure Dev Community VijayawadaLuciano Mammino
 
A look inside the European Covid Green Certificate (Codemotion 2021)
A look inside the European Covid Green Certificate (Codemotion 2021)A look inside the European Covid Green Certificate (Codemotion 2021)
A look inside the European Covid Green Certificate (Codemotion 2021)Luciano Mammino
 
Semplificare l'observability per progetti Serverless
Semplificare l'observability per progetti ServerlessSemplificare l'observability per progetti Serverless
Semplificare l'observability per progetti ServerlessLuciano Mammino
 
Finding a lost song with Node.js and async iterators - NodeConf Remote 2021
Finding a lost song with Node.js and async iterators - NodeConf Remote 2021Finding a lost song with Node.js and async iterators - NodeConf Remote 2021
Finding a lost song with Node.js and async iterators - NodeConf Remote 2021Luciano Mammino
 
Finding a lost song with Node.js and async iterators - EnterJS 2021
Finding a lost song with Node.js and async iterators - EnterJS 2021Finding a lost song with Node.js and async iterators - EnterJS 2021
Finding a lost song with Node.js and async iterators - EnterJS 2021Luciano Mammino
 
How to send gzipped requests with boto3
How to send gzipped requests with boto3How to send gzipped requests with boto3
How to send gzipped requests with boto3Luciano Mammino
 
Finding a lost song with Node.js and async iterators
Finding a lost song with Node.js and async iteratorsFinding a lost song with Node.js and async iterators
Finding a lost song with Node.js and async iteratorsLuciano Mammino
 
AWS Observability (without the Pain)
AWS Observability (without the Pain)AWS Observability (without the Pain)
AWS Observability (without the Pain)Luciano Mammino
 

Mais de Luciano Mammino (20)

Did you know JavaScript has iterators? DublinJS
Did you know JavaScript has iterators? DublinJSDid you know JavaScript has iterators? DublinJS
Did you know JavaScript has iterators? DublinJS
 
What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...
What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...
What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...
 
Building an invite-only microsite with Next.js & Airtable - ReactJS Milano
Building an invite-only microsite with Next.js & Airtable - ReactJS MilanoBuilding an invite-only microsite with Next.js & Airtable - ReactJS Milano
Building an invite-only microsite with Next.js & Airtable - ReactJS Milano
 
From Node.js to Design Patterns - BuildPiper
From Node.js to Design Patterns - BuildPiperFrom Node.js to Design Patterns - BuildPiper
From Node.js to Design Patterns - BuildPiper
 
Let's build a 0-cost invite-only website with Next.js and Airtable!
Let's build a 0-cost invite-only website with Next.js and Airtable!Let's build a 0-cost invite-only website with Next.js and Airtable!
Let's build a 0-cost invite-only website with Next.js and Airtable!
 
Everything I know about S3 pre-signed URLs
Everything I know about S3 pre-signed URLsEverything I know about S3 pre-signed URLs
Everything I know about S3 pre-signed URLs
 
JavaScript Iteration Protocols - Workshop NodeConf EU 2022
JavaScript Iteration Protocols - Workshop NodeConf EU 2022JavaScript Iteration Protocols - Workshop NodeConf EU 2022
JavaScript Iteration Protocols - Workshop NodeConf EU 2022
 
Building an invite-only microsite with Next.js & Airtable
Building an invite-only microsite with Next.js & AirtableBuilding an invite-only microsite with Next.js & Airtable
Building an invite-only microsite with Next.js & Airtable
 
Let's take the monolith to the cloud 🚀
Let's take the monolith to the cloud 🚀Let's take the monolith to the cloud 🚀
Let's take the monolith to the cloud 🚀
 
A look inside the European Covid Green Certificate - Rust Dublin
A look inside the European Covid Green Certificate - Rust DublinA look inside the European Covid Green Certificate - Rust Dublin
A look inside the European Covid Green Certificate - Rust Dublin
 
Monoliths to the cloud!
Monoliths to the cloud!Monoliths to the cloud!
Monoliths to the cloud!
 
The senior dev
The senior devThe senior dev
The senior dev
 
Node.js: scalability tips - Azure Dev Community Vijayawada
Node.js: scalability tips - Azure Dev Community VijayawadaNode.js: scalability tips - Azure Dev Community Vijayawada
Node.js: scalability tips - Azure Dev Community Vijayawada
 
A look inside the European Covid Green Certificate (Codemotion 2021)
A look inside the European Covid Green Certificate (Codemotion 2021)A look inside the European Covid Green Certificate (Codemotion 2021)
A look inside the European Covid Green Certificate (Codemotion 2021)
 
Semplificare l'observability per progetti Serverless
Semplificare l'observability per progetti ServerlessSemplificare l'observability per progetti Serverless
Semplificare l'observability per progetti Serverless
 
Finding a lost song with Node.js and async iterators - NodeConf Remote 2021
Finding a lost song with Node.js and async iterators - NodeConf Remote 2021Finding a lost song with Node.js and async iterators - NodeConf Remote 2021
Finding a lost song with Node.js and async iterators - NodeConf Remote 2021
 
Finding a lost song with Node.js and async iterators - EnterJS 2021
Finding a lost song with Node.js and async iterators - EnterJS 2021Finding a lost song with Node.js and async iterators - EnterJS 2021
Finding a lost song with Node.js and async iterators - EnterJS 2021
 
How to send gzipped requests with boto3
How to send gzipped requests with boto3How to send gzipped requests with boto3
How to send gzipped requests with boto3
 
Finding a lost song with Node.js and async iterators
Finding a lost song with Node.js and async iteratorsFinding a lost song with Node.js and async iterators
Finding a lost song with Node.js and async iterators
 
AWS Observability (without the Pain)
AWS Observability (without the Pain)AWS Observability (without the Pain)
AWS Observability (without the Pain)
 

Último

COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...naitiksharma1124
 
CompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfCompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfFurqanuddin10
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabbereGrabber
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfVictor Lopez
 
how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfMehmet Akar
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationWave PLM
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfsteffenkarlsson2
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareinfo611746
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationHelp Desk Migration
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignNeo4j
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Andrea Goulet
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024Shane Coughlan
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfmbmh111980
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems ApproachNeo4j
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...rajkumar669520
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)Max Lee
 
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfMicrosoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfQ-Advise
 
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purityAPVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purityamy56318795
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationHelp Desk Migration
 

Último (20)

AI Hackathon.pptx
AI                        Hackathon.pptxAI                        Hackathon.pptx
AI Hackathon.pptx
 
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
 
CompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfCompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdf
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
 
how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdf
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting software
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by Design
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
 
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfMicrosoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
 
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purityAPVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data Migration
 

Serverless for High Performance Computing

  • 1. Luciano Mammino (@loige) Serverless for HPC 🚀 fth.link/cm22
  • 2. Is Serverless a good option for High Performance Computing? @loige
  • 3. 👋 Hello, I am Luciano Senior architect nodejsdesignpatterns.com Let’s connect: 🌎 loige.co 🐦 @loige 🎥 loige 🧳 lucianomammino
  • 4. Middy Framework SLIC Starter - Serverless Accelerator SLIC Watch - Observability Plugin Business focused technologists. Accelerated Serverless | AI as a Service | Platform Modernisation
  • 5. We host a podcast about AWS and Cloud computing 🔗 awsbites.com 🎬 YouTube Channel 🎙 Podcast 📅 Episodes every week @loige
  • 6. Get the slides: fth.link/cm22 @loige
  • 7. Agenda ● The 6 Rs of Cloud Migration ● A serverless case study ○ The problem space and types of workflows ○ Original on premise implementation ○ The PoC ○ The final production version ○ The components of a serverless job scheduler ○ Challenges & Limits @loige fth.link/cm22
  • 8. The 6 Rs of Cloud Migrations @loige 🗑 🕸 🚚 Retire Retain Rehost 🏗 📐 💰 Replatform Refactor Repurchase fth.link/cm22
  • 9. A case study @loige Case study on AWS blog: fth.link/awshpc
  • 10. The workloads - Risk Rollup 🏦 Financial modeling to understand the portfolio of risk 🧠 Internal, custom-built risk model on all reinsurance deals ⚙ HPC (High-Performance Computing) workload 🗄 ~45TB data processed ⏱ 2/3 rollups per day (6-8 hours each!) @loige
  • 11. The workloads - Deal Analytics ⚡ Near real-time deal pricing using the same risk model 🗃 Lower data volumes 🔁 High frequency of execution – up to 1.000 per day @loige
  • 13. Challenges 🐢 Long execution times, constraining business agility 🥊 Competing workloads 📈 Limits our ability to support portfolio growth 😩 Can’t deliver new features 🧾 Very high total cost of ownership @loige
  • 14. Thinking Big 💭 Imagine a solution that would … 1. Offer a dramatic increase in performance 2. Provide consistent run times 3. Support more executions, more often 4. Support future portfolio growth and new capabilities – 15x data volumes @loige
  • 15. The Goal ⚽ Run a Risk Rollup in 1 hour! @loige
  • 16. Architecture Options for Compute/Orchestration @loige AWS Lambda Amazon SQS AWS Step Functions AWS Fargate Com t om : Red he b to si l , s a l , ev -d i n co n s
  • 18. Measure Everything! 📏 ⏱ Built metrics in from the start 󰤈 AWS metrics we wish existed out of the box: - Number of running containers - Success/failure counts 🎨 Custom metrics: - Scheduler overhead - Detailed timings (job duration, I/O time, algorithm steps) 🛠 Using CloudWatch, EMF @loige
  • 19. Measure Everything! 📏 👍 Rollup in 1 hour ☁ Running on AWS Batch 👎 Cluster utilisation was <50% ✅ Goal success 🤔 Understanding of what needs to be addressed @loige
  • 20. Beyond the PoC Production: optimise for unique workload characteristics @loige
  • 22. In reality, not all jobs are alike! @loige
  • 23. Horizontal scaling 🚀 1000’s of jobs Duration: 1 second – 45 minutes Scaling horizontally = splitting jobs Jobs split according to their complexity/duration Resulting in >1 million jobs @loige
  • 24. Moving to production 🚢 @loige
  • 26. Actual End to End overview @loige
  • 28. Compute Services Scales to 1000’s of tasks (containers) Little management overhead Up to 4 vCPUs and 30GB Memory Up to 200GB ephemeral storage Scales to 1000’s of function containers (in seconds!) Very little management overhead Up to 6 vCPUs and 10GB Memory Up to 10GB ephemeral storage It wasn’t always this way! @loige
  • 29. Store all the things in S3! The source of truth for: ● Input Data (JSON, Parquet) ● Intermediate Data (Parquet) ● Results (Parquet) ● Aggregates (Parquet) Input data: 20GB Output data: ~1 TB Reads and writes: 10,000s of objects per second. @loige
  • 30. Scheduling and Orchestration ✅ We have our cluster (Fargate or Lambda) ✅ We have a plan! (list of jobs, parameters and dependencies) 🤔 How do we feed this plan to the cluster?! 🤨 Existing schedulers use traditional clusters – there is no serverless job scheduler for workloads like this! @loige
  • 31. Lifecycle of a Job @loige A new job get queued here 👇 A worker picks up the job and executes it The worker emits the job state (success or failure)
  • 32. Event-Driven Scheduler @loige Job states are pulled from a Kinesis Data Stream Redis stores: - Job states - Dependencies This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next
  • 33. Dynamic Runtime Handling @loige We also need to handle system failures!
  • 34. Outcomes 🙌 Business ● Rollup in 1 hour ● Removed limits on number of runs ● Faster, more consistent deal analytics ● Business spending more time on revenue-generating activities ● Support portfolio growth and deliver new capabilities @loige Technology ● Brought serverless to HPC financial modeling ● Reduced codebase by ~70% ● Lowered total cost of ownership ● Increased dev team agility ● Reduced carbon footprint
  • 35. Hitting the limits 😰 @loige
  • 37. S3 Partitioning S3 cleverly detects high-throughput prefixes and creates partitions ….normally If this does not happen… 🚨Please reduce your request rate; Status Code: 503; Error Code: SlowDown @loige
  • 38. The Solution Explicit Partitioning: ○Figure out how many partitions you need ○Update code to create keys uniformly distributed over all partitions /part/0… /part/1… /part/2… /part/3… … /part/f… @loige 1. Talk (a lot) to AWS SAs, Support, Account Manager for special requirements like this! 2. Think ahead if you have multiple accounts for different environments!
  • 39. Fargate Scaling ● We want to run 3000 containers ASAP ● This took > 1 hour! ● We built a custom Fargate scaler ○Using the RunTask API (no ECS Service) ○Hidden quota increases ○Step Function + Lambda ● 3000 containers in ~20 minutes @loige The AWS ECS team since made lots of improvements, making it possible to scale to 3,000 containers in under 5 minutes
  • 40. How high can we go today? 🚀 10,000 concurrent Lambda functions in seconds 🎢 10,000 Fargate containers in 10 minutes 💸 No additional cost vladionescu.me/posts/scaling-containers-on-aws-in-2022 @loige
  • 41. Wrapping up 🎁 ● "Serverless supercomputer" lets you do HPC with commodity AWS compute ● Plenty of challenges, but it's doable! ● Agility and innovation benefits are massive ● Customer is now serverless-first and expert in AWS Other interesting case studies: ☁ AWS HTC Grid - 🧬 COVID genome research @loige
  • 42. Special thanks to @eoins, @cmthorne10 and the awesome team at RenRe! @loige fth.link/cm22 Serverless for HPC? IT WORKS!