© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ARC348
Seagull
Osman Sarood
Software Engineer @ Yelp
A Highly Fault-Tolerant Distributed System for Concurrent Task Execution
Seagull:
A Highly Fault-Tolerant Distributed System
for Concurrent Task Execution
Osman Sarood, Software Engineer, Yelp
ARC348
October 2015
Monthly Visitors Reviews Mobile Searches Countries
How Yelp:
• Runs millions of tests a day
• Downloads TBs of data in an extremely efficient manner
• Scales using our custom metric
What’s in it for me?
What to Expect
• High-level architecture talk
• No code
• Assumes basic understanding of Apache Mesos
A distributed system that allows concurrent task
execution:
• at large scale
• while maintaining high cluster utilization
• and is highly fault tolerant and resilient
What Is Seagull?
The Developer
Run millions of tests each day
Seagull @ Yelp
• How does Seagull work?
• Major problem: Artifact downloads
• Auto Scaling and fault tolerance
What We’ll Cover
1: How Does Seagull Work?
• Each day, Seagull runs tests that would take 700 days (serially)!
• On average, 350 seagull-runs a day
• Each seagull-run has ~ 70,000 tests
• Each seagull-run would take 2 days to run serially
• Volatile but predictable demand
• 30% of load in 3 hours
• Predictable peak time (3PM-6PM)
What’s the Challenge?
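The headline figures on this slide follow from simple arithmetic. A minimal sketch, using only the numbers quoted above (the variable names are illustrative, not from the talk):

```python
# Daily workload implied by the figures on the slide.
runs_per_day = 350          # average seagull-runs per day
tests_per_run = 70_000      # approximate tests per seagull-run
serial_days_per_run = 2     # serial duration of one seagull-run, in days

tests_per_day = runs_per_day * tests_per_run
serial_days_of_work = runs_per_day * serial_days_per_run

print(f"tests per day: {tests_per_day:,}")
print(f"serial work submitted per day: {serial_days_of_work} days")
```

The product of runs and tests per run is 24.5 million, which lines up with the "~25 million tests/day" figure quoted later in the deck.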
• Run 3000 tests concurrently on a 200 machine cluster
• 2 days in serial => 14 mins!
• Running at such high scale involves:
• Downloading 36 TB a day (5 TB/hr peak)
• Up to 200 simultaneous downloads for a single large file
• 1.5 million Docker containers a day (210K/hr peak)
What Do We Do?
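To put the "2 days in serial => 14 mins" claim in perspective, the sketch below derives the implied speedup and the daily averages behind the quoted peaks. It uses only the slide's numbers; the variable names are illustrative.

```python
# Speedup implied by running a 2-day serial workload in 14 minutes.
serial_minutes = 2 * 24 * 60
parallel_minutes = 14
speedup = serial_minutes / parallel_minutes

# Daily averages, for comparison with the peak rates on the slide.
containers_per_day = 1_500_000
avg_containers_per_hour = containers_per_day / 24
download_tb_per_day = 36
avg_tb_per_hour = download_tb_per_day / 24

print(f"speedup: about {speedup:.0f}x")
print(f"average containers/hour: {avg_containers_per_hour:,.0f} (peak quoted: 210K)")
print(f"average download: {avg_tb_per_hour} TB/hour (peak quoted: 5 TB)")
```

The peak rates are roughly three times the daily averages, which is consistent with the earlier remark that 30% of the load arrives in 3 hours.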
S3
Docker
Jenkins
Mesos
EC2
Elasticsearch
DynamoDB
Reporting
Monitoring
Seagull Ingredients
[Architecture diagram: Yelp Developer, UI, Prioritizer, Scheduler1 … Scheduler’y’, Slave1 … Slave’n’ (EC2), Elasticsearch, DynamoDB, S3; numbered steps 1 to 8]
Seagull Overview
• Cluster of 7 r3.8xlarges
• Builds our largest service (artifact)
• Uploads artifacts to Amazon S3
• Discover tests
Build artifact
Discover tests
S3
Yelp Developer
Jenkins
• Largest service that forms major part of website
• Several hundreds of MB in size
• Takes ~10 mins to build
• Uses lots of memory
• Huge setup cost
• Build it once and download later
Yelp Artifact
• Takes the artifact and determines test names
• Parse Python code to extract test names
• Finishes in ~2 mins
• Separate test list for each of the 7 different suites
Test Discovery
Recap
• Schedule longest tests first
• Historical test timing data from DynamoDB
• Fetch 25 million records/day
• Why DynamoDB:
• We don’t need to maintain it!
• Low cost (just $200/month!)
DynamoDB
Historical data
Test list
Prioritizer
Test Prioritization
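A minimal sketch of "schedule longest tests first", assuming historical durations are already loaded into a dictionary (in Seagull they come from DynamoDB). The greedy assignment to the least-loaded bundle is an assumption made for illustration; the slides do not say how tests are grouped into bundles, and `prioritize` is a hypothetical helper name.

```python
import heapq

def prioritize(durations, num_bundles):
    """Assign tests to bundles, longest test first.

    durations: dict mapping test name -> historical runtime in seconds.
    Returns a list of bundles, each a list of test names.
    """
    # Min-heap of (total seconds assigned so far, bundle index).
    heap = [(0.0, i) for i in range(num_bundles)]
    heapq.heapify(heap)
    bundles = [[] for _ in range(num_bundles)]

    # Longest first; ties are broken by name to keep the result deterministic.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, seconds in ordered:
        total, idx = heapq.heappop(heap)      # least-loaded bundle so far
        bundles[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))
    return bundles

history = {"test_a": 120, "test_b": 90, "test_c": 60, "test_d": 30}
print(prioritize(history, 2))
```

Scheduling the longest tests first keeps a slow test from starting late and stretching the tail of the run. Tests with no timing history are not handled in this sketch.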
Recap
• Run ~350 seagull-runs/day:
• each run ~70000 tests (~ 25 million tests/day)
• total serial time of 48 hours/run
• Challenging to run lots of tests at scale during peak times
Runs submitted per 10 mins
Peak
The Testing Problem
• Resource management system
• mesos-master: On master node
• mesos-slave: On every slave
• Slaves register their resources with the Mesos master
• Schedulers subscribe to Mesos master for
consuming resources
• Master offers resources to schedulers in a fair
manner
Mesos Master
Slave1 Slave2
Scheduler 1 Scheduler 2
Apache Mesos
Seagull leverages resource management abilities of Apache
Mesos
• Each run has a Mesos scheduler
• Each scheduler distributes work amongst ~600
workers (executors)
• 200 r3.8xlarge instances (32 cores/256 GB)
Running Tests in Parallel
Test (color coded for different schedulers)
Set of tests (bundle)
C1
Scheduler C1
Slave1(s1)
Slave S1
Terminology (Key)
C1
Yelp Devs
C2
Seagull Schedulers
Seagull Cluster
Test
Set of tests
(bundle)
Key
Mesos Master
Slave1(s1) Slave2 (s2)
S1 S2
User1 User2
S1
Parallel Test Execution
2: Key Challenges: Artifact
Downloads
• Each executor needs to have the artifact before running
tests
• 18,000 requests per hour at peak
• Each request is for a large file (hundreds of MBs)
• A single executor (out of 600) taking long to download could
delay the entire seagull-run.
Why Is Artifact Download Critical?
Recap
Docker
Amazon S3
Elasticsearch
Amazon DynamoDB
Fetch artifact
Takes 10 mins on average
Start Service
Run Tests
Report Results
Seagull Executor
Test
Set of tests (bundle)
C1
Scheduler C1
Slave1(s1)
Slave S1
Artifact for Scheduler C1
A1
Exec C1
A1
Executor of scheduler C1
Terminology (Key)
• Scheduler C1 starts and distributes work amongst 600 executors
• Each executor (a.k.a. task):
• Has its own artifact (independent)
• Runs for ~ 10 mins on average
• Each slave runs 15 executors (C1 uses a total of 40 slaves)
• 200 * 15 * 6 = 18000 reqs/hr! (13.5 TB/hour)
S3
Seagull Cluster
Slave 40
Exec C1
A1
….
Exec C1
A1
Slave 1
Exec C1
A1
….
Exec C1
A1
Artifact Handling
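The request-rate figure on this slide can be checked with a few lines. This sketch reads the factor of 6 as "six ~10-minute executors per slot per hour", which is an interpretation of the slide rather than something it states, and it assumes decimal units (1 TB = 1,000,000 MB) when deriving the implied artifact size.

```python
# Peak artifact requests when every executor downloads its own copy.
slaves = 200
executors_per_slave = 15
executors_per_slot_per_hour = 6      # assumption: ~10 minutes per executor

requests_per_hour = slaves * executors_per_slave * executors_per_slot_per_hour

# One scheduler with 600 executors spans this many slaves.
slaves_per_scheduler = 600 // executors_per_slave

# Artifact size implied by the quoted 13.5 TB/hour.
tb_per_hour = 13.5
implied_artifact_mb = tb_per_hour * 1_000_000 / requests_per_hour

print(requests_per_hour, slaves_per_scheduler, implied_artifact_mb)
```

The implied size of 750 MB per artifact is consistent with the earlier description of the artifact as "several hundreds of MB".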
• Lots of requests took as long as 30 mins!
• We choked the NAT boxes with tons of requests
• Avoiding NAT required a bigger effort
• Wanted a quick solution
Slow Download Times
• Executors from same scheduler can share artifacts
• Disadvantages:
• Executors are no longer independent
• Locking implementation for downloading artifacts
S3
Still doesn’t scale well
Seagull Cluster
Slave 40
Exec C2
A2
Exec C1
A1
Slave 1
Exec C1
A1
Exec C1
A1
Exec C2
A2
Exec C2
A1
A2
A2A2 A1
Sharing Artifacts
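The "locking implementation" mentioned above can be sketched as a download-once store. This is an illustration under stated assumptions, not Seagull's code: threads stand in for executors on one slave, `SharedArtifactStore` and `fake_download` are hypothetical names, and real executors running in separate containers would need a cross-process lock (for example a lock file) instead of `threading.Lock`.

```python
import threading

class SharedArtifactStore:
    """Per-slave store where executors of one scheduler share an artifact."""

    def __init__(self, download):
        self._download = download        # callable: artifact id -> bytes
        self._artifacts = {}             # artifact id -> downloaded content
        self._locks = {}                 # artifact id -> lock for that artifact
        self._guard = threading.Lock()   # protects the _locks dictionary

    def get(self, artifact_id):
        with self._guard:
            lock = self._locks.setdefault(artifact_id, threading.Lock())
        with lock:
            # Only the first caller downloads; later callers reuse the copy.
            if artifact_id not in self._artifacts:
                self._artifacts[artifact_id] = self._download(artifact_id)
            return self._artifacts[artifact_id]

# Usage: eight threads request the same artifact at once.
download_calls = []

def fake_download(artifact_id):
    download_calls.append(artifact_id)   # record each real download
    return b"artifact-bytes"

store = SharedArtifactStore(fake_download)
workers = [threading.Thread(target=store.get, args=("A1",)) for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(download_calls))
```

The sketch also shows the disadvantage named on the slide: executors now wait on each other, so they are no longer independent.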
• Artifactcache consisting of 9 r3.8xlarges
• Replicate each artifact across each of the 9 artifact caches
• Nginx distributes requests
• 10 Gbps network bandwidth helped
Artifactcache
Seagull Cluster
Slave 40
Exec C2 Exec C1
Slave 1
Exec C1 Exec C1 Exec C2 Exec C2
A1 A2 A2 A1
Separate Artifactcache
Number of active schedulers per 10m
Download time (secs) per 10m
Artifact Download Metrics
• Why not use all the network bandwidth available on our Amazon EC2
compute instances?
• The entire cluster serves as the artifactcache
• Cache scales as the cluster scales
• Bandwidth comparison:
• Centralized cache ~ 30 Mbps/executor
[9 (# caches) * 10 (Gbps) / 3000 (# of executors)]
• Distributed cache ~ 666 Mbps/executor
[200 (# caches) * 10 (Gbps) / 3000 (# of executors)]
Distributed Artifactcache
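The bandwidth comparison above is a straightforward division. A small sketch that reproduces it, assuming 1 Gbps = 1000 Mbps and that every cache host contributes its full 10 Gbps (`mbps_per_executor` is an illustrative helper name):

```python
def mbps_per_executor(num_caches, gbps_per_cache, num_executors):
    """Aggregate cache bandwidth divided evenly across executors, in Mbps."""
    return num_caches * gbps_per_cache * 1000 / num_executors

centralized = mbps_per_executor(9, 10, 3000)     # dedicated artifactcache hosts
distributed = mbps_per_executor(200, 10, 3000)   # every slave serves artifacts

print(f"centralized: {centralized:.0f} Mbps/executor")
print(f"distributed: {distributed:.0f} Mbps/executor")
print(f"improvement: about {distributed / centralized:.0f}x")
```

These are upper bounds: they ignore other traffic on the same links and any imbalance in which hosts are asked for an artifact.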
Random Selector
Seagull Cluster
Slave 1
Artifact Pool
Slave 2
Artifact Pool
Slave 3
Artifact Pool
A1
Slave 4
Benefits of distributed artifact caching:
• Very scalable
• No extra machines to maintain
• Significant reduction in out-of-space disk issues
• Fast downloads due to less contention
A1
Artifact Pool
Distributed Artifact Caching
Artifact Download Time (secs) per 10 min
Number of Downloads per 10 mins
Can we improve download
times further?
Distributed Artifactcache Performance
• At peak times:
• Lots of downloads happen
• Most artifacts end up being downloaded on 90% of
slaves
• Once a machine downloads an artifact it should serve
other requests for that artifact
• Disadvantage: Bookkeeping
Stealing Artifact
1. Slave 4 gets A2
Seagull Cluster
Slave 1
Artifact Pool
A2
2. Bundle starts on Slave 2
3. Slave 2 pulls A2 from Slave 4
5a. Slave 3 gets A2 from Slave 4
5b. Slave 1 steals A2 from Slave 2
Exec C2
Slave 2
Artifact Pool
A2
Exec C2
Slave 3
Artifact Pool
A2
Exec C2
Slave 4
Artifact Pool
A2
Steal
4. Bundles start on Slave 1 & 3
Random Selector
Stealing Artifact
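The stealing steps above depend on bookkeeping of which slaves already hold an artifact. The sketch below is one way to model that with a random selector; `ArtifactDirectory` and `fetch` are hypothetical names, and treating a `None` result as "fall back to the original location" is an assumption rather than something the slides specify.

```python
import random

class ArtifactDirectory:
    """Bookkeeping of which slaves hold which artifact."""

    def __init__(self, rng=None):
        self._holders = {}                       # artifact id -> list of slaves
        self._rng = rng or random.Random()

    def register(self, artifact_id, slave):
        holders = self._holders.setdefault(artifact_id, [])
        if slave not in holders:
            holders.append(slave)

    def has(self, artifact_id, slave):
        return slave in self._holders.get(artifact_id, [])

    def pick_source(self, artifact_id, requester):
        """Randomly pick another slave that already has the artifact."""
        candidates = [s for s in self._holders.get(artifact_id, [])
                      if s != requester]
        return self._rng.choice(candidates) if candidates else None

def fetch(directory, artifact_id, slave):
    """Return the slave to copy from, or None if no slave has the artifact."""
    if directory.has(artifact_id, slave):
        return slave                             # already in the local pool
    source = directory.pick_source(artifact_id, slave)
    if source is not None:
        directory.register(artifact_id, slave)   # this slave can now serve others
    return source

directory = ArtifactDirectory(random.Random(0))
directory.register("A2", "slave4")               # step 1: Slave 4 gets A2
first = fetch(directory, "A2", "slave2")         # step 3: only Slave 4 has it
second = fetch(directory, "A2", "slave1")        # later: Slave 4 or Slave 2
print(first, second)
```

Because every successful fetch adds a new holder, the number of possible sources grows with demand, which is why contention drops at peak times.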
ARTIFACT STEAL TIME (per 10m)
NUM OF STEAL (per 10m)
Performance: Stealing in Distributed Artifact
Caching
Artifact Load-Balancing Viz
3: Auto Scaling and Fault Tolerance
• Used the Auto Scaling group provided by AWS, but it wasn’t easy to ‘select’ which
instances to terminate
• Mesos uses FIFO to assign work, and Auto Scaling also uses FIFO to
terminate
• Example: 10% of slaves working -> remove 10% -> terminates the slaves doing work
Runs submitted (per 10 mins)
Auto Scaling
• CPU and memory demand is volatile
• Seagull tells Mesos to reserve the max amount of memory a
task requires ($m_t$)
• Total memory required to run a set of ($T$) tasks concurrently: $R = \sum_{t \in T} m_t$
Reserved Memory
• Total available memory for slave ‘i’: $M_i$
• Let $S$ denote the set of all slaves in our cluster
• Total memory available: $M = \sum_{i \in S} M_i$
• Gull-load: Ratio of total reserved memory to total memory available: $GL = R / M$
Gull-load
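A minimal sketch of the Gull-load calculation as defined above: total reserved memory divided by total available memory. The function name, the megabyte units, and the example values are assumptions for illustration, and "available memory" is read here as the memory each slave offers to Mesos.

```python
def gull_load(reserved_mb_per_task, available_mb_per_slave):
    """Ratio of total reserved memory to total available memory.

    reserved_mb_per_task: iterable of per-task memory reservations (MB).
    available_mb_per_slave: dict mapping slave name -> available memory (MB).
    """
    total_reserved = sum(reserved_mb_per_task)
    total_available = sum(available_mb_per_slave.values())
    if total_available == 0:
        raise ValueError("cluster has no memory available")
    return total_reserved / total_available

# Example: 30 tasks reserving 4 GB each on two 256 GB slaves.
tasks = [4096] * 30
slaves = {"s1": 262144, "s2": 262144}
print(gull_load(tasks, slaves))
```

Using reservations instead of measured usage keeps the metric stable even when actual CPU and memory demand is volatile.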
GullLoad
Running lots of executors
Gull-load
Auto Scaling flow (invoked every 10 mins):
• Gull-load > 0.5?
• No: calculate Gull-load for each machine, sort on Gull-load, select the slaves with the least Gull-load (10%), terminate those slaves
• Yes: Gull-load < 0.9?
• Yes: do nothing
• No: add 10% extra machines
Gull-load (GL)      Action (# slaves)
0.5 < GL < 0.9      Nothing
GL > 0.9            Add 10%
GL < 0.5            Remove 10%
How Do We Scale Automatically?
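The scaling policy above can be sketched as a single decision function. The thresholds and the 10% step come from the slide; the function name, the rounding of "10%" (integer division with a minimum of one machine), and treating the exact boundary values 0.5 and 0.9 as "nothing" are assumptions.

```python
def scaling_action(cluster_gl, per_slave_gl, low=0.5, high=0.9, percent=10):
    """Decide what one Auto Scaling cycle should do.

    cluster_gl: Gull-load of the whole cluster.
    per_slave_gl: dict mapping slave name -> Gull-load of that slave.
    Returns ("add", n), ("remove", [slave names]) or ("nothing", None).
    """
    count = max(1, len(per_slave_gl) * percent // 100)
    if cluster_gl > high:
        return ("add", count)
    if cluster_gl < low:
        # Terminate the least-loaded slaves so running work is disturbed least.
        by_load = sorted(per_slave_gl, key=lambda s: (per_slave_gl[s], s))
        return ("remove", by_load[:count])
    return ("nothing", None)

loads = {f"s{i}": i / 10 for i in range(10)}     # s0 is idle, s9 is busiest
print(scaling_action(0.3, loads))
print(scaling_action(0.95, loads))
print(scaling_action(0.7, loads))
```

Choosing victims by per-slave Gull-load is what avoids the problem described earlier, where FIFO termination removed exactly the slaves that were doing work.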
• Started with all Reserved instances. Too expensive!
• Shifted to all Spot. Always knew it was risky.
• One fine day, all slaves were gone!
• A mix of On-Demand (25%) and Spot (75%) instances
Reserved, On-Demand, or Spot Instances?
Seagull provides fault tolerance at two levels
• Hardware level: Spreading our machines
geographically (preventive)
• Infrastructure level: Seagull retries upon failure
(corrective)
Fault Tolerance and Reliability
• Equally dividing machines amongst AZs
• us-west-2: a => 60, b => 66, c => 66
• Easy to terminate a slave and recreate it quickly
• In the event of losing Spot instances:
• Our seagull-runs keep running using the On-Demand
instances
• Add On-Demand instances until Spot instances are
available again (manual)
Preventive Fault Tolerance (Reliability)
• Lots of reasons for executors to fail:
• Bad service
• Docker problems (>100 concurrent containers/machine)
• External partners (e.g., Sauce Labs)
• How do we do it:
• Task Manager (inside scheduler) tracks life cycle of each
executor/task
• Fixed number of retries upon failure/timeout
Corrective Fault Tolerance
Test
Set of tests (bundle)
C1
Scheduler C1
Slave1(s1)
Slave S1
Artifact for Scheduler C1
A1
Exec C1
A1
Executor of scheduler C1
Tracks life-cycle for each task
i.e. queued, running, finished
Terminology (Key)
Task
Manager
Yelp Devs C1
Seagull Schedulers
Seagull Cluster
Test
Set of
tests
(bundle)
Key
Mesos Master
S1 (uswest2a) S2 (uswest2b)
S1
User1
S2
Task
Manager
S1 Crashed
Rerun
Bundles?
Corrective Fault Tolerance
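A minimal sketch of the corrective behaviour described above: a task manager that tracks each bundle through queued, running, and finished, and requeues a bundle a fixed number of times when its executor fails or times out. The class layout, the extra "failed" state, and the default retry count are assumptions; the slides only state that the number of retries is fixed.

```python
from collections import deque

class TaskManager:
    """Tracks each bundle's life cycle and requeues failed bundles."""

    def __init__(self, bundles, max_retries=2):
        self.max_retries = max_retries
        self.state = {b: "queued" for b in bundles}
        self.attempts = {b: 0 for b in bundles}
        self.queue = deque(bundles)

    def next_bundle(self):
        """Hand the next queued bundle to an executor, or None if idle."""
        if not self.queue:
            return None
        bundle = self.queue.popleft()
        self.state[bundle] = "running"
        self.attempts[bundle] += 1
        return bundle

    def report(self, bundle, ok):
        """Record the outcome of a run; requeue on failure if retries remain."""
        if ok:
            self.state[bundle] = "finished"
        elif self.attempts[bundle] <= self.max_retries:
            self.state[bundle] = "queued"
            self.queue.append(bundle)
        else:
            self.state[bundle] = "failed"

# Usage: the slave running bundle "b1" crashes once, then the rerun succeeds.
manager = TaskManager(["b1"], max_retries=1)
manager.report(manager.next_bundle(), ok=False)   # e.g. S1 crashed
state_after_crash = manager.state["b1"]
manager.report(manager.next_bundle(), ok=True)    # rerun on another slave
print(state_after_crash, manager.state["b1"], manager.attempts["b1"])
```

Capping the retries keeps a genuinely broken bundle from looping forever while still absorbing transient failures such as a lost Spot instance.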
• How Seagull works and interacts with other systems
• An extremely efficient artifact hosting design
• Custom scaling policy and its use of gull-load
• Fault tolerance at scale using:
• AWS
• Executor retry logic
What Did We Learn?
• Sanitize code for open source
• Explore why Amazon S3 downloads are so slow
• Avoiding the NAT box
• Using multiple buckets
• Breaking our artifact into smaller files
• Improve scaling:
• Ability to use other instance types
• Reduce cost by choosing Spot instance types with minimum GB/$
Future Work
Remember to complete
your evaluations!
Thank you!

 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

(ARC348) Seagull: How Yelp Built A System For Task Execution

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. ARC348 Seagull Osman Sarood Software Engineer @ Yelp A highly Fault-tolerant Distributed System for Concurrent Task Execution
  • 2. Seagull: A Highly Fault-Tolerant Distributed System for Concurrent Task Execution Osman Sarood, Software Engineer, Yelp ARC348 October 2015
  • 4. Monthly Visitors Reviews Mobile Searches Countries
  • 5. How Yelp: • Runs millions of tests a day • Downloads TBs of data in an extremely efficient manner • Scales using our custom metric What’s in it for me?
  • 6. What to Expect • High-level architecture talk • No code • Assumes basic understanding of Apache Mesos
  • 7. A distributed system that allows concurrent task execution: • at large scale • while maintaining high cluster utilization • and is highly fault tolerant and resilient What Is Seagull?
  • 8. The Developer Run millions of tests each day Seagull @ Yelp
  • 9. • How does Seagull work? • Major problem: Artifact downloads • Auto Scaling and fault tolerance What We’ll Cover
  • 10. 1: How Does Seagull Work?
  • 11. • Each day, Seagull runs tests that would take 700 days (serially)! • On average, 350 seagull-runs a day • Each seagull-run has ~ 70,000 tests • Each seagull-run would take 2 days to run serially • Volatile but predictable demand • 30% of load in 3 hours • Predictable peak time (3PM-6PM) What’s the Challenge?
  • 12. • Run 3000 tests concurrently on a 200 machine cluster • 2 days in serial => 14 mins! • Running at such high scale involves: • Downloading 36 TB a day (5 TB/hr peak) • Up to 200 simultaneous downloads for a single large file • 1.5 million Docker containers a day (210K/hr peak) What Do We Do?
  • 14. [Diagram: a Yelp developer's run passes through numbered steps 1 to 8, including 6(a) and 6(b), across the UI, the Prioritizer, Schedulers 1 to 'y', Slaves 1 to 'n' on EC2, S3, DynamoDB, and Elasticsearch] Seagull Overview
  • 15. • Cluster of 7 r3.8xlarges • Builds our largest service (artifact) • Uploads artifacts to Amazon S3 • Discovers tests [Diagram labels: Build artifact, Discover tests, S3, Yelp Developer] Jenkins
  • 16. • Largest service that forms major part of website • Several hundreds of MB in size • Takes ~10 mins to build • Uses lots of memory • Huge setup cost • Build it once and download later Yelp Artifact
  • 17. • Takes the artifact and determines test names • Parses Python code to extract test names • Finishes in ~2 mins • Separate test list for each of the 7 different suites Test Discovery
  • 18. [Seagull Overview diagram repeated] Recap
  • 19. • Schedule longest tests first • Historical test timing data from DynamoDB • Fetch 25 million records/day • Why DynamoDB: • We don’t need to maintain it! • Low cost (just $200/month!) [Diagram: the Prioritizer takes the test list and historical data from DynamoDB] Test Prioritization
  • 20. [Seagull Overview diagram repeated] Recap
  • 21. • Run ~350 seagull-runs/day: • each run ~70,000 tests (~25 million tests/day) • total serial time of 48 hours/run • Challenging to run lots of tests at scale during peak times [Chart: runs submitted per 10 mins, with the peak marked] The Testing Problem
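The talk itself shows no code, so the following is only a minimal sketch of "schedule longest tests first". It assumes the historical timings have already been fetched into a plain dict; the function name `prioritize`, the test names, and the 60-second default for tests with no history are all hypothetical.

```python
def prioritize(test_names, historical_secs, default_secs=60.0):
    """Return tests ordered longest-first by historical duration.

    Tests with no history get default_secs (an assumption of this sketch).
    """
    return sorted(
        test_names,
        key=lambda name: historical_secs.get(name, default_secs),
        reverse=True,
    )


# Hypothetical timing data, standing in for records read from DynamoDB.
history = {"test_search": 120.0, "test_login": 5.0, "test_checkout": 45.0}
ordered = prioritize(
    ["test_login", "test_new", "test_search", "test_checkout"], history
)
print(ordered)  # ['test_search', 'test_new', 'test_checkout', 'test_login']
```

The usual motivation for this ordering is that a long test started late would otherwise become the tail of the whole run.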
  • 22. • Resource management system • mesos-master: On master node • mesos-slave: On every slave • Slaves register resources with the Mesos master • Schedulers subscribe to the Mesos master to consume resources • Master offers resources to schedulers in a fair manner [Diagram: Mesos Master, two slaves, Scheduler 1, Scheduler 2] Apache Mesos
  • 23. Seagull leverages resource management abilities of Apache Mesos • Each run has a Mesos scheduler • Each scheduler distributes work amongst ~600 workers (executors) • 200 r3.8xlarge instances (32 cores/256 GB each) Running Tests in Parallel
  • 24. Test (color coded for different schedulers) Set of tests (bundle) C1 Scheduler C1 Slave1(s1) Slave S1 Terminology (Key)
  • 25. C1 Yelp Devs C2 Seagull Schedulers Seagull Cluster Test Set of tests (bundle) Key Mesos Master Slave1(s1) Slave2 (s2) S1 S2 User1 User2 S1 Parallel Test Execution
  • 26. 2: Key Challenges: Artifact Downloads
  • 27. • Each executor needs to have the artifact before running tests • 18,000 requests per hour at peak • Each request is for a large file (hundreds of MBs) • A single executor (out of 600) taking long to download could delay the entire seagull-run. Why Is Artifact Download Critical?
  • 28. [Seagull Overview diagram repeated] Recap
  • 29. Executor steps: Fetch artifact, Start Service, Run Tests, Report Results (takes 10 mins on average). Services shown: Docker, Amazon S3, Elasticsearch, Amazon DynamoDB Seagull Executor
  • 30. Test Set of tests (bundle) C1 Scheduler C1 Slave1(s1) Slave S1 Artifact for Scheduler C1 A1 Exec C1 A1 Executor of scheduler C1 Terminology (Key)
  • 31. • Scheduler C1 starts and distributes work amongst 600 executors • Each executor (a.k.a. task): • has its own artifact (independent) • runs for ~10 mins on average • Each slave runs 15 executors (C1 uses a total of 40 slaves) • 200 * 15 * 6 = 18000 reqs/hr! (13.5 TB/hour) [Diagram: S3 serving executors of C1, each with its own copy of artifact A1, on Slaves 1 to 40 of the Seagull Cluster] Artifact Handling
  • 32. • Lots of requests took as long as 30 mins! • We choked NAT boxes with tons of requests • Avoiding NAT required a bigger effort • Wanted a quick solution Slow Download Times
  • 33. • Executors from the same scheduler can share artifacts • Disadvantages: • Executors are no longer independent • Locking implementation for downloading artifacts • Still doesn’t scale well [Diagram: S3 serving the Seagull Cluster, Slaves 1 to 40, where executors of C1 share artifact A1 and executors of C2 share artifact A2] Sharing Artifacts
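The request-rate figure on this slide can be checked with a few lines of arithmetic. In this sketch the factor 6 is read as six 10-minute executor rounds per hour, and the per-download size is not stated on the slide: it is only what the 13.5 TB/hour figure implies, assuming decimal units.

```python
slaves = 200               # machines in the cluster
executors_per_slave = 15   # executors running on each slave
rounds_per_hour = 60 // 10 # each executor runs for ~10 mins

requests_per_hour = slaves * executors_per_slave * rounds_per_hour
print(requests_per_hour)   # 18000

# Size per download implied by 13.5 TB/hour (1 TB taken as 1e6 MB).
implied_mb_per_download = 13.5e6 / requests_per_hour
print(implied_mb_per_download)  # 750.0
```

The implied 750 MB per download is consistent with the earlier description of the artifact as several hundreds of MB.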
  • 34. • Artifactcache consisting of 9 r3.8xlarges • Replicate each artifact across each of the 9 artifact caches • Nginx distributes requests • 10 Gbps network bandwidth helped Artifactcache Seagull Cluster Slave 40 Exec C2Exec C1 Slave 1 Exec C1Exec C1 Exec C2 Exec C2 A1 A2A2 A1 Separate Artifactcache
  • 35. Number of active schedulers per 10m Download time (secs) per 10m Artifact Download Metrics
  • 36. • Why not use so much network bandwidth from our Amazon EC2 compute? • The entire cluster serves as the artifactcache • Cache scales as the cluster scales • Bandwidth comparison: • Centralized cache ~ 30 Mbps/executor [9 (# of caches) * 10 (Gbps) / 3000 (# of executors)] • Distributed cache ~ 666 Mbps/executor [200 (# of caches) * 10 (Gbps) / 3000 (# of executors)] Distributed Artifactcache
  • 37. Benefits of distributed artifact caching: • Very scalable • No extra machines to maintain • Significant reduction in out-of-space disk issues • Fast downloads due to less contention [Diagram: a Random Selector in front of the Seagull Cluster; Slaves 1 to 4 each have an Artifact Pool, with artifact A1 shown in the pools] Distributed Artifact Caching
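The bandwidth comparison is simple division, reproduced here as a sketch. The helper name `per_executor_mbps` is hypothetical, and 1 Gbps is taken as 1000 Mbps.

```python
def per_executor_mbps(num_caches, gbps_per_cache, num_executors):
    """Aggregate cache bandwidth shared evenly across executors, in Mbps."""
    return num_caches * gbps_per_cache * 1000 / num_executors


centralized = per_executor_mbps(9, 10, 3000)    # 9 dedicated cache machines
distributed = per_executor_mbps(200, 10, 3000)  # every slave acts as a cache
print(centralized)       # 30.0
print(int(distributed))  # 666
```

Under these numbers the distributed design offers roughly 22 times more bandwidth per executor, and the ratio improves as the cluster grows.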
  • 38. Artifact Download Time (secs) per 10 min Number of Downloads per 10 mins Can we improve download times further? Distributed Artifactcache Performance
  • 39. • At peak times: • Lots of downloads happen • Most artifacts end up being downloaded on 90% of slaves • Once a machine downloads an artifact it should serve other requests for that artifact • Disadvantage: Bookkeeping Stealing Artifact
  • 40. [Diagram: Seagull Cluster with a Random Selector and Slaves 1 to 4, each with an Artifact Pool] 1. Slave 4 gets A2 2. Bundle starts on Slave 2 3. Slave 2 pulls A2 from Slave 4 4. Bundles start on Slave 1 & 3 5a. Slave 3 gets A2 from Slave 3 5b. Slave 1 steals A2 from Slave 2 Stealing Artifact
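The slides do not show how the bookkeeping for stealing works, so the following is a hypothetical sketch rather than Seagull's implementation. It assumes a registry that records which slaves hold each artifact, picks a random holder as the download source, and falls back to S3 when nobody holds the artifact yet. The class name `ArtifactRegistry` and its methods are invented for illustration.

```python
import random


class ArtifactRegistry:
    """Tracks which slaves hold which artifact (the bookkeeping cost)."""

    def __init__(self, rng=None):
        self._holders = {}  # artifact id -> set of slave names
        self._rng = rng or random.Random()

    def add_holder(self, artifact, slave):
        self._holders.setdefault(artifact, set()).add(slave)

    def pick_source(self, artifact, requester):
        """Randomly pick another slave that already has the artifact."""
        candidates = sorted(self._holders.get(artifact, set()) - {requester})
        return self._rng.choice(candidates) if candidates else None

    def fetch(self, artifact, requester):
        """Return where the requester gets the artifact from."""
        if requester in self._holders.get(artifact, set()):
            return requester  # already in the local artifact pool
        source = self.pick_source(artifact, requester) or "S3"
        # After downloading, the requester can serve later requests.
        self.add_holder(artifact, requester)
        return source


registry = ArtifactRegistry(random.Random(0))
first = registry.fetch("A2", "slave4")   # nobody has A2 yet
second = registry.fetch("A2", "slave2")  # only slave4 holds A2
third = registry.fetch("A2", "slave1")   # slave2 or slave4, chosen at random
print(first, second)  # S3 slave4
```

Every slave that finishes a download becomes a possible source, so the number of sources grows with demand, which is the property the slide describes.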
  • 41. ARTIFACT STEAL TIME (per 10m) NUM OF STEAL (per 10m) Performance: Stealing in Distributed Artifact Caching
  • 43. 3: Auto Scaling and Fault Tolerance
  • 44. • Used Auto Scaling group provided by AWS but it wasn’t easy to ‘select’ which instances to terminate • Mesos uses FIFO to assign work whereas Auto Scaling also uses FIFO to terminate • Example: 10% Slave working -> remove 10% -> terminate slaves doing work Runs submitted (per 10 mins) Auto Scaling
  • 45. • CPU and memory demand is volatile • Seagull tells Mesos to reserve the max amount of memory a task requires ($m_t$ for task $t$) • Total memory required to run a set ($T$) of tasks concurrently: $M_{\text{reserved}} = \sum_{t \in T} m_t$ Reserved Memory
  • 46. • Total available memory for slave ‘i’: $a_i$ • Let $S$ denote the set of all slaves in our cluster • Total memory available: $M_{\text{available}} = \sum_{i \in S} a_i$ • Gull-load: Ratio of total reserved memory to total memory available, $\text{GL} = M_{\text{reserved}} / M_{\text{available}}$ Gull-load
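As a worked example of the definition above, here is a minimal sketch that computes Gull-load as total reserved memory divided by total available memory. The function name, the dict inputs, and the sample numbers are assumptions for illustration.

```python
def gull_load(reserved_mb_by_task, available_mb_by_slave):
    """Ratio of total reserved memory to total available memory."""
    total_reserved = sum(reserved_mb_by_task.values())
    total_available = sum(available_mb_by_slave.values())
    if total_available == 0:
        raise ValueError("cluster has no available memory")
    return total_reserved / total_available


# Hypothetical cluster: three tasks reserved on two slaves.
reserved = {"task1": 4096, "task2": 8192, "task3": 4096}
available = {"slave1": 16384, "slave2": 16384}
load = gull_load(reserved, available)
print(load)  # 0.5
```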
  • 47. GullLoad Running lots of executors Gull-load
  • 48. Invoke Auto Scaling (every 10 mins) and calculate Gull-load (GL) for each machine, then: • GL > 0.9: add 10% extra machines • 0.5 < GL < 0.9: do nothing • GL < 0.5: sort on Gull-load, select the 10% of slaves with the least Gull-load, and terminate them How Do We Scale Automatically?
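The scaling policy on this slide can be sketched as a single decision function. The thresholds and the 10% step come from the slide. The function name, the return shape, the rounding of 10% to a whole number of machines, and the behaviour at exactly 0.5 or 0.9 are assumptions of this sketch.

```python
def scaling_action(cluster_gl, gl_by_slave, low=0.5, high=0.9, percent=10):
    """Decide what to do on one auto scaling tick (every 10 mins)."""
    # Integer arithmetic avoids float rounding; at least one machine moves.
    count = max(1, len(gl_by_slave) * percent // 100)
    if cluster_gl > high:
        return ("add", count, [])
    if cluster_gl < low:
        # Terminate the slaves doing the least work, not the oldest ones.
        victims = sorted(gl_by_slave, key=gl_by_slave.get)[:count]
        return ("remove", count, victims)
    return ("nothing", 0, [])


# Hypothetical per-slave Gull-load values for a 10-machine cluster.
per_slave = {f"slave{i}": i / 10 for i in range(10)}
print(scaling_action(0.3, per_slave))   # ('remove', 1, ['slave0'])
print(scaling_action(0.95, per_slave))  # ('add', 1, [])
print(scaling_action(0.7, per_slave))   # ('nothing', 0, [])
```

Choosing termination candidates by lowest Gull-load is what avoids the problem described earlier in the talk, where the default Auto Scaling group could terminate slaves that were busy.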
  • 49. • Started with all Reserved instances. Too expensive! • Shifted to all Spot. Always knew it was risky.. • One fine day, all slaves were gone! • A mix of On-Demand (25%) and Spot (75%) instances Reserved, On-Demand, or Spot Instances?
  • 50. Seagull provides fault tolerance at two levels • Hardware level: Spreading our machines geographically (preventive) • Infrastructure level: Seagull retries upon failure (corrective) Fault Tolerance and Reliability
  • 51. • Equally dividing machines amongst AZs • us-west-2: a => 60, b => 66, c => 66 • Easy to terminate a slave and recreate it quickly • In the event of losing Spot instances: • Our seagull-runs keep running using the On-Demand instances • Add on-demand instances until Spot Instances are available again (manual) Preventive Fault Tolerance (Reliability)
  • 52. • Lots of reasons for executors to fail: • Bad service • Docker problems (>100 concurrent containers/machine) • External partners (e.g., Sauce Labs) • How do we do it: • Task Manager (inside scheduler) tracks life cycle of each executor/task • Fixed number of retries upon failure/timeout Corrective Fault Tolerance
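The retry behaviour can be illustrated with a small state tracker. This is a hypothetical sketch, not Seagull's Task Manager: the states queued, running, and finished appear in the talk, while the `failed` state, the method names, and the default of two retries are assumptions.

```python
class TaskManager:
    """Tracks each task's life cycle and allows a fixed number of retries."""

    def __init__(self, max_retries=2):
        self.max_retries = max_retries
        self.state = {}    # task id -> queued | running | finished | failed
        self.retries = {}  # task id -> retries used so far

    def submit(self, task_id):
        self.state[task_id] = "queued"
        self.retries[task_id] = 0

    def on_start(self, task_id):
        self.state[task_id] = "running"

    def on_finish(self, task_id):
        self.state[task_id] = "finished"

    def on_failure(self, task_id):
        """Requeue on failure or timeout; return False when retries run out."""
        if self.retries[task_id] < self.max_retries:
            self.retries[task_id] += 1
            self.state[task_id] = "queued"
            return True
        self.state[task_id] = "failed"
        return False


tm = TaskManager(max_retries=2)
tm.submit("bundle-1")
outcomes = []
for _ in range(3):
    tm.on_start("bundle-1")
    outcomes.append(tm.on_failure("bundle-1"))
print(outcomes, tm.state["bundle-1"])  # [True, True, False] failed
```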
  • 53. Test Set of tests (bundle) C1 Scheduler C1 Slave1(s1) Slave S1 Artifact for Scheduler C1 A1 Exec C1 A1 Executor of scheduler C1 Task Manager: Tracks life-cycle for each task, i.e. queued, running, finished Terminology (Key)
  • 54. Yelp Devs C1 Seagull Schedulers Seagull Cluster Test Set of tests (bundle) Key Mesos Master S1 (uswest2a) S2 (uswest2b) S1 User1 S2 Task Manager S1 Crashed Rerun Bundles? Corrective Fault Tolerance
  • 55. • How Seagull works and interacts with other systems • An extremely efficient artifact hosting design • Custom scaling policy and its use of gull-load • Fault tolerance at scale using: • AWS • Executor retry logic What Did We Learn?
  • 56. • Sanitize code for open source • Explore why Amazon S3 downloads are so slow • Avoiding NAT box • Using multiple buckets • Breaking our artifact into smaller files • Improve scaling: • Ability to use other instance types • Reduce cost by choosing Spot instance types with minimum GB/$ Future Work