Leveraging In-Memory Key
Value Stores for Large Scale
Operations with Redis and
CFEngine
Mike Svoboda
Staff Systems and Automation Engineer
www.linkedin.com/in/mikesvoboda
msvoboda@linkedin.com
https://github.com/linkedin/sysops-api
My Background with LinkedIn / CFEngine
• Hired at LinkedIn into System Operations in 2010
• When I started, our server count was 300 machines
• Implemented CFEngine automation in 2010
• Since then, we have grown to 100 times that size
• Created our Redis API in 2012 to provide visibility
What is Redis?
• Redis is an in-memory key-value store, similar to Memcached but with additional features
• Offers on-disk persistence (snapshots to disk), so you can use it as a real database instead of just a volatile cache
• Offers simple data structures out of the box and commands to work with them natively
  • dictionaries, lists, sets, sorted sets, etc.
• Highly scalable data store: a single Redis server can satisfy hundreds of thousands of requests per second
• Supports transactions: group commands together so they are executed as a single transaction
What is CFEngine?
CFEngine:
• Is an IT infrastructure automation framework that helps manage infrastructure throughout its lifecycle
• Builds, deploys, and manages systems
• Provides auditing
• Maintains infrastructure by enforcing intended system state for compliance
• Runs on the smallest embedded devices, servers, desktops, mainframes, and big iron
• Easily supports tens of thousands of hosts and provides horizontal scalability
How CFEngine works
CFEngine reduces operational costs
• Using CFEngine automation is more effective than hiring additional headcount
• Stop fighting fires every day
• Allow operations to focus on tomorrow’s problems
• Stay ahead of the curve
• Keeping the lights on is automated
• Respond to outages rapidly
Why LinkedIn chose CFEngine
• Very mature codebase
• Not dependent on an underlying language runtime such as Ruby, Python, or Perl
• Flexible architecture
• Easily scales upwards to support thousands of machines
• Just as simple to support smaller environments
• Zero reported security vulnerabilities
• Lightweight footprint
What CFEngine has done for LinkedIn
Since implementing CFEngine:
• Operations has become extremely agile
• We quickly respond to and resolve outages
• System administration workload has gone down, even with 100x the number of servers
• We have built a new datacenter in minutes with little effort
• Real-time visibility after creating our Redis infrastructure, driven by CFEngine execution
• We can answer any question imaginable about all of our servers in seconds
• We know every action that happens on our machines
How LinkedIn uses CFEngine
Functions we have automated:
• Hardware failure detection
• Account administration
• Privilege escalation
• Software deployment
• O/S configuration management
• Process / service management
• System monitoring

You never need to log into a machine to manage it
Two problems still existed for LinkedIn that automation didn’t address
• The company wanted to be able to answer any question imaginable about production.
• We didn’t want to break production by pushing new automation changes.

To solve both problems, we needed visibility.
Problem #1: The company wants questions answered. STAT!
• Management and engineers want their questions answered immediately, and they ask several times a day, interrupting your work.
LinkedIn was hunting for data
What LinkedIn sysadmins were doing
• Questions about infrastructure were answered by sysadmins SSHing to machines to hunt for data.
• As our scale increased, we used a remote execution tool to parallelize some variant of SSH / DSH
• Thousands of network connections were made to remote machines from a single host to fetch data.
• Did I get results from everything?
• Results had to be parsed after collection
Forcing command execution on remote machines doesn’t scale
• Machines were missed, data wasn’t collected
• Firewalls mangled packets
• SSHD was offline or didn’t spawn on the remote host
• Depended on system accounts being valid
• Network connections to the remote machine failed
• Data collection shouldn’t be complicated
• We were unsure if we were able to collect all of the necessary data.
Problem #2: We didn’t want to break production by pushing new automation changes.
• Ops was hesitant to use automation because they didn’t know where things would break
• When automation was expanded, we didn’t know where systems needed alternative behavior to work correctly (or where they had been modified by developers with root access)
• Ops had to be agile. We had to work fast. The business needed us to modify production multiple times a day, but we had to make changes without breaking it
Automation changes were happening in the blind
• Sysadmins were under pressure from
  • large ticket queues
  • numerous change requests
  • business needs to scale
• Automation changes were being performed without fully understanding the impact before that change was executed
• We realized that this could lead to mistakes, disasters, outages, and pink slips. To keep this from happening, I built our Redis API to provide visibility.
To provide visibility, we had to scale data collection
• We had to build a reliable, extremely fast system that could give us the results of remote command execution from tens of thousands of systems in seconds
• Querying this data could not put load on production systems
• The cache needed to be publicly available to the company via an API so they could answer their own questions
• We needed to quickly add new data into the cache before pushing automation changes, to view production impact.
We built a cache and populated it with data to answer arbitrary questions
• Instead of executing commands remotely, we have CFEngine populate the cache with commonly queried data
• CFEngine executes expensive commands like lshw or dmidecode once and makes the output available for everybody to use
• Data collection becomes a scheduled event that happens once a day. This data collection becomes a cost of doing business
• With the same data being gathered on all machines, it becomes trivial to compare two or more pieces of hardware
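
The insertion side can be sketched roughly as follows. This is a minimal illustration, not the sysops-api code: a plain dictionary stands in for a Redis server, the `hostname#name` key shape is borrowed from the tool output shown later in this deck, and `run_command`, `collect_host_data`, and the sample host name are hypothetical.

```python
import subprocess


def run_command(argv):
    """Run a command once and return its stdout, or None if it fails."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def collect_host_data(hostname, commands, runner=run_command):
    """Build {"hostname#name": output} entries for a single host."""
    entries = {}
    for name, argv in commands.items():
        output = runner(argv)
        if output is not None:  # skip commands that failed on this host
            entries[f"{hostname}#{name}"] = output
    return entries


# Assumption: a dict stands in for the Redis server that would receive the data.
cache = {}

# Canned output so the example runs without lshw or dmidecode installed.
fake_outputs = {"lshw": "description: Computer\n", "dmidecode": "# dmidecode\n"}

cache.update(
    collect_host_data(
        "host01.example.com",
        {"lshw": ["lshw"], "dmidecode": ["dmidecode"]},
        runner=lambda argv: fake_outputs.get(argv[0]),
    )
)
print(sorted(cache))
```

Because every host writes the same set of names, comparing two machines reduces to comparing the values stored under matching keys.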
Architecture of the Cache

Step 1: Rely on CFEngine
execution to drive data
insertion
Step 2: Shard your data

Step 3: Use software load
balancing!
Step 1: CFEngine drives data insertion
Leverage automation to change what you insert
or remove from the cache
The cache is a simple dictionary,
sharded over multiple Redis servers.
Step 2: Extract Sharded Data
• Determine scope. How much data do I need to answer my question?
• For each CFEngine policy server running Redis, search Redis for matching keys in the dictionary
• For each key we find from a search, perform the relevant data extraction
  • Contents
  • Md5sum
  • os.stat()
  • wordcount
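
The search-then-extract loop above might look something like this. It is a sketch under stated assumptions: each shard is a plain dictionary rather than a Redis server, the search term is wrapped in wildcards and matched with `fnmatch`, and "wordcount" is taken to mean whitespace-separated words. The `os.stat()` mode is left out because it needs file metadata that this toy cache does not store. `search_keys` and `extract` are hypothetical names.

```python
import fnmatch
import hashlib


def search_keys(shard, pattern):
    """Return the keys in one shard that match a glob-style search term."""
    return sorted(k for k in shard if fnmatch.fnmatchcase(k, f"*{pattern}*"))


def extract(shards, pattern, mode="contents"):
    """Search every shard, then apply one extraction mode to each match."""
    extractors = {
        "contents": lambda value: value,
        "md5sum": lambda value: hashlib.md5(value.encode("utf-8")).hexdigest(),
        "wordcount": lambda value: len(value.split()),
    }
    apply_mode = extractors[mode]
    results = {}
    for shard in shards:
        for key in search_keys(shard, pattern):
            results[key] = apply_mode(shard[key])
    return results


# Two hypothetical shards, one per policy server.
shards = [
    {
        "a01.example.com#/etc/passwd": "root:x:0:0:root:/root:/bin/bash\n",
        "a01.example.com#/etc/hosts": "127.0.0.1 localhost\n",
    },
    {
        "b01.example.com#/etc/passwd": (
            "root:x:0:0:root:/root:/bin/bash\n"
            "msvoboda:x:1000:1000::/home/msvoboda:/bin/bash\n"
        ),
    },
]

print(extract(shards, "/etc/passwd", mode="wordcount"))
```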
Step 3: Use Software Load Balancing!
• Have clients populate multiple Redis servers on insertion. Pick a Redis server at random on extraction (load balancing)
• If we don’t get a response from our first choice, pick another Redis server at random (failover)
• Find randomized CFEngine policy servers with Redis from each level in the scope
• If the CFEngine policy server responds, push it into a list of machines we need to query for data
• If the CFEngine policy server doesn’t respond, pick another one at random (failover)
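
A rough sketch of the random pick with failover described above. The scope is modeled here as a mapping from a site name to the policy servers at that site, which is an assumption about what "each level in the scope" means. The `is_alive` callback, the function names, and the server names are hypothetical.

```python
import random


def pick_responsive_server(servers, is_alive, rng=random):
    """Try servers in random order and return the first one that responds."""
    candidates = list(servers)
    rng.shuffle(candidates)  # random first choice spreads the load
    for server in candidates:
        if is_alive(server):
            return server
    return None  # nothing at this level responded


def servers_to_query(scope, is_alive, rng=random):
    """Collect one responsive server for each level in the scope."""
    chosen = []
    for servers in scope.values():
        server = pick_responsive_server(servers, is_alive, rng)
        if server is not None:
            chosen.append(server)
    return chosen


# Hypothetical policy servers at two sites, with one of them down.
scope = {"lva1": ["lva1-policy-a", "lva1-policy-b"], "esv4": ["esv4-policy-a"]}
down = {"lva1-policy-a"}

chosen = servers_to_query(scope, lambda server: server not in down)
print(chosen)
```

Since clients populate more than one server on insertion, any responsive server at a level can answer for that level.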
Local Scope
Example: Local cache extraction
$ time extract_sysops_cache.py \
    --search /etc/passwd \
    --contents | grep msvoboda | wc -l
487
real 0m1.813s
user 0m1.484s
sys 0m0.087s
Site (datacenter) Scope
Example: Site cache extraction
$ time extract_sysops_cache.py \
    --site lva1 \
    --search /etc/passwd \
    --contents | grep msvoboda | wc -l
8687
real 0m19.169s
user 0m30.286s
sys 0m1.271s
Global Scope
Example: Global cache extraction
$ time extract_sysops_cache.py \
    --scope global \
    --search /etc/passwd \
    --contents | grep msvoboda | wc -l
27344
real 0m44.827s
user 1m39.532s
sys 0m4.288s
Make it fast!
Become Multithreaded
Make it faster!
Build a Redis pipeline
Cache extraction with a pipeline
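
The two speedups named above can be sketched together. This is an illustration only: `FakeServer` is a hypothetical in-memory stand-in for one Redis server, and its `get_many` method imitates what a pipeline buys you (many requests sent in a single round trip) without using a real Redis client. The thread pool queries all servers at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeServer:
    """Hypothetical in-memory stand-in for one Redis server."""

    def __init__(self, data):
        self.data = data
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1  # one request, one reply
        return self.data.get(key)

    def get_many(self, keys):
        self.round_trips += 1  # every request sent together, pipeline style
        return [self.data.get(key) for key in keys]


def fetch_all(server):
    """Fetch every value from one server in a single round trip."""
    keys = sorted(server.data)
    return dict(zip(keys, server.get_many(keys)))


def extract_parallel(servers, max_workers=8):
    """Query all servers concurrently and merge their results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for part in pool.map(fetch_all, servers):
            merged.update(part)
    return merged


servers = [
    FakeServer({"a01#/etc/hosts": "127.0.0.1 localhost\n", "a01#/etc/motd": "hi\n"}),
    FakeServer({"b01#/etc/hosts": "127.0.0.1 localhost\n"}),
]
result = extract_parallel(servers)
print(len(result), [server.round_trips for server in servers])
```

Fetching the same keys one `get` at a time would cost one round trip per key, which is where the pipeline saves time on large extractions.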
Extracting the Cache for Fun and Profit
[msvoboda@esv4-infra01 ~]$ extract_sysops_cache.py \
    --scope local \
    --search mps*cm.conf \
    --md5sum \
    --prefix-hostnames
esv4-2360-mps01.corp.linkedin.com#/etc/cm.conf 12721673715de3ee6b9dec487529355e
esv4-2360-mps02.corp.linkedin.com#/etc/cm.conf 56b03a16c69e5b246a565dbcda44ba28
esv4-2360-mps03.corp.linkedin.com#/etc/cm.conf 11e20e28ec60ac6c71cbb71b0a6c9b35
esv4-2360-mps04.corp.linkedin.com#/etc/cm.conf 55402eda02e7f5c17dc7535455adc097
Make it fastest!
Compression is significant!
• Less network overhead on cache insertion
• Less network overhead on cache extraction
• More stuff we can put into the cache
• Less network I/O means faster results delivered
• Less CPU usage on extraction
[Charts: seconds for cache insertion; CPU cycles for cache insertion; data size in megabytes of the cache for an entire datacenter; time for a complete cross-country datacenter cache extraction]
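
To see why compression matters for this kind of data, here is a small sketch that compresses a value before it would be inserted and decompresses it after extraction. The deck does not say which compression scheme was used; `zlib` is an assumption chosen because it ships with Python, and `pack`, `unpack`, and the sample data are hypothetical.

```python
import zlib


def pack(text, level=6):
    """Compress a text value before inserting it into the cache."""
    return zlib.compress(text.encode("utf-8"), level)


def unpack(blob):
    """Decompress a value after extracting it from the cache."""
    return zlib.decompress(blob).decode("utf-8")


# Repetitive, passwd-style sample data, similar in shape to config files.
sample = "".join(
    f"user{i}:x:{1000 + i}:{1000 + i}::/home/user{i}:/bin/bash\n" for i in range(200)
)
blob = pack(sample)
print(len(sample.encode("utf-8")), "bytes raw,", len(blob), "bytes compressed")
```

Configuration files and command output tend to be repetitive text, so they usually compress well. That shrinks both what is stored and what crosses the network.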
Drink from the firehose
With the Redis API, you can now be confident in pushing automation changes
• You know what systems will be affected before a change
• You aren’t hit with surprises in production
• You have added visibility
• You don’t have to log into machines to modify or update them
Summary
Before and after implementation of CFEngine & Redis API at LinkedIn

Headcount
• Before: 6 people supporting a few hundred machines
• After: 6 people supporting tens of thousands of machines

Time spent
• Before: Hours to build a single machine
• After: Build complete datacenters in minutes

Productivity
• Before: Hours spent collecting data before change, change itself causing outages
• After: Can focus on building infrastructure, team became proactive to fix future problems, not reactive / firefighting

Ease of scaling server deployment
• Before: Incredibly difficult to respond to change, low visibility into production
• After: Superior administration, rapid response to changing needs, complete system visibility
Open Source
Questions?
msvoboda@linkedin.com
www.linkedin.com/in/mikesvoboda
You can download the code from this
presentation here:
https://github.com/linkedin/sysops-api

LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine

  • 1. Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine Mike Svoboda Staff Systems and Automation Engineer www.linkedin.com/in/mikesvoboda msvoboda@linkedin.com https://github.com/linkedin/sysops-api
  • 2. My Background with LinkedIn / CFEngine  Hired at LinkedIn into System Operations in 2010  When I started, our server count was 300 machines  Implemented CFEngine automation in 2010  Since then, we have grown 100 times that size  Created our Redis API in 2012 to provide visibility
  • 3. What is Redis?  Redis is an in-memory key value store, similar to Memcached with additional features  Offers on disk persistence (snapshots to disk) - You can use this as a real database instead of just a volatile cache  Offers simple data structures out of the box and commands to work with them natively  dictionaries, lists, sets, sorted sets, etc.  Highly scalable data store - A single Redis server can satisfy hundreds of thousands of requests per second  Supports transactions - Group commands together so they are executed as a single transaction.
  • 4. What is CFEngine? CFEngine:  Is an IT infrastructure automation framework that helps manage infrastructure throughout its lifecycle  Builds, deploys, and manages systems  Provides auditing  Maintains infrastructure by enforcing intended system state for compliance  Runs on the smallest embedded devices, servers, desktops, mainframes, and big iron. CFEngine easily supports tens of thousands of hosts. Provides horizontal scalability.
  • 6. CFEngine reduces operational costs  Using CFEngine automation is more effective than hiring additional headcount  Stop fighting fires every day  Allow operations to focus on tomorrow’s problems  Stay ahead of the curve  Keeping the lights on is automated  Respond to outages rapidly
  • 7. Why LinkedIn chose CFEngine  Very mature codebase  Not dependent on underlying virtual machines like Ruby, Python, Perl, etc.  Flexible architecture  Easily scale upwards to support thousands of machines  Just as simple to support smaller environments  Zero reported security vulnerabilities  Lightweight footprint
  • 8. What CFEngine has done for LinkedIn Since implementing CFEngine:  Operations has become extremely agile  Quickly respond and resolve outages  System administration workload has reduced, even with 100x the amount of servers  Have built new datacenter in minutes with little effort  Real time visibility after creating our Redis infrastructure, driven by CFEngine execution  Can answer any question imaginable about all of our servers in seconds  Know every action that happens on our machines
  • 9. How LinkedIn uses CFEngine Functions we have automated:  Hardware failure detection  Account administration  Privilege escalation  Software deployment  O/S configuration management  Process / service management  System monitoring  You never need to log into a machine to manage it
  • 10. Two problems still existed for LinkedIn that automation didn’t address  The company wanted to be able to answer any question imaginable about production.  We didn’t want to break production by pushing new automation changes. To solve both problems, we needed visibility.
  • 11. Problem #1: The company wants questions answered. STAT!  Management and engineers want questions answered immediately, and they ask several times a day, interrupting your work.
  • 13. What LinkedIn sysadmins were doing • Questions about Infrastructure were answered by sysadmins SSHing to machines to hunt for data. • As our scale increased, we used a remote execution tool to parallelize some variant of SSH / DSH  Thousands of network connections were made to remote machines from a single host to fetch data.  Did I get results from everything?  Parse results after collection
  • 14. Forcing command execution on remote machines doesn’t scale  Machines were missed, data wasn’t collected  Firewalls mangled packets  SSHD offline or didn’t spawn on the remote host  Depended on system accounts being valid  Network connections failed to the remote machine  Data collection shouldn’t be complicated  Unsure if we were able to collect all of the necessary data.
  • 15. Problem #2: We didn’t want to break production by pushing new automation changes.  Ops was hesitant to use automation because they didn’t know where things would break  When automation was expanded, we didn’t know where systems needed alternative behavior to work correctly (or where they had been modified by developers with root access)  Ops had to be agile and work fast. The business needed us to modify production multiple times a day, but we had to make changes without breaking it
  • 16. Automation changes were happening in the blind  Sysadmins were under pressure from  large ticket queues  numerous change requests  business needs to scale  Automation changes were being performed without fully understanding the impact before that change was executed  We realized that this could lead to mistakes, disasters, outages, and pink slips. To keep this from happening, I built our Redis API to provide visibility.
  • 17. To provide visibility, we had to scale data collection  We had to build a reliable system that was extremely fast, which could give us results of remote command execution from tens of thousands of systems in seconds  Querying this data could not put load on production systems  The cache needed to be publicly available to the company via an API so they could answer their own questions  We needed to quickly add new data into the cache before pushing automation changes to view production impact.
  • 18. We built a cache and populated it with data to answer arbitrary questions  Instead of executing commands remotely, we have CFEngine populate the cache with commonly queried data  CFEngine executes expensive commands like lshw or dmidecode once and makes the output available for everybody to use  Data collection becomes a scheduled event that happens once a day - This data collection becomes a cost of doing business  With the same data being gathered on all machines, it becomes trivial to compare two or more pieces of hardware
  • 19. Architecture of the Cache Step 1: Rely on CFEngine execution to drive data insertion Step 2: Shard your data Step 3: Use software load balancing!
  • 20. Step 1: CFEngine drives data insertion Leverage automation to change what you insert or remove from the cache
  • 21. The cache is a simple dictionary, sharded over multiple Redis servers.
  • 22. Step 2: Extract Sharded Data  Determine scope. How much data do I need to answer my question?  For each CFEngine policy server running Redis, search Redis for matching keys in the dictionary  For each key we find from a search, perform the relevant data extraction:  Contents  md5sum  os.stat()  wordcount
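The search-then-extract loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the code from sysops-api: `search_shards` and `FakeShard` are hypothetical names, and each shard is assumed to be any object exposing `keys(pattern)` and `get(key)` the way a redis-py client does. The in-memory `FakeShard` stands in for a live Redis server so the sketch runs on its own.

```python
import fnmatch


class FakeShard:
    """In-memory stand-in for one Redis server (an assumption for this sketch)."""

    def __init__(self, data):
        self._data = dict(data)

    def keys(self, pattern):
        # Redis KEYS takes glob-style patterns; fnmatch approximates that here.
        return [k for k in self._data if fnmatch.fnmatchcase(k, pattern)]

    def get(self, key):
        return self._data.get(key)


def search_shards(shards, pattern):
    """Search every shard for matching keys and merge the results into one dict."""
    results = {}
    for shard in shards:
        for key in shard.keys(pattern):
            results[key] = shard.get(key)
    return results


# Made-up hosts and contents, keyed as <hostname>#<filename>.
shards = [
    FakeShard({"host1#/etc/passwd": "root:x:0:0", "host1#/etc/hosts": "127.0.0.1"}),
    FakeShard({"host2#/etc/passwd": "root:x:0:0"}),
]
matches = search_shards(shards, "*#/etc/passwd")
```

Because no single server holds the whole dictionary, the merged result is only complete once every shard in the chosen scope has been searched.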
  • 23. Step 3: Use Software Load Balancing!  Have clients populate multiple Redis servers on insertion - Pick a Redis server at random on extraction (Load balancing)  If we don’t get a response from our first choice, pick another Redis server at random (failover)  Find randomized CFEngine policy servers with Redis from each level in the scope  If the CFEngine policy server responds, push it into a list of machines we need to query for data  If the CFEngine policy server doesn’t respond, pick another one at random (fail over)
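The random pick with failover described on this slide might look something like the sketch below. The function names and hostnames are hypothetical, and `is_alive` is an assumed callback: with a real redis-py client it could wrap a call such as `ping()` or `info()` in a try/except.

```python
import random


def pick_responding_server(servers, is_alive, rng=random):
    """Try servers in random order; return the first that responds, else None."""
    candidates = list(servers)
    rng.shuffle(candidates)  # random order gives load balancing across servers
    for server in candidates:
        if is_alive(server):
            return server  # first responder wins; earlier failures are skipped
    return None


def servers_to_query(cages, is_alive, rng=random):
    """Pick one responding policy server per cage in the scope."""
    chosen = []
    for cage_servers in cages:
        server = pick_responding_server(cage_servers, is_alive, rng)
        if server is not None:
            chosen.append(server)
    return chosen


# Hypothetical hostnames; "mps-a1" is pretended to be down.
down = {"mps-a1"}
cages = [["mps-a1", "mps-a2"], ["mps-b1", "mps-b2"]]
chosen = servers_to_query(cages, lambda s: s not in down)
```

Shuffling the candidate list once covers both behaviours: the first element is the random choice, and the remaining elements are the failover order.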
  • 25. Example: Local cache extraction $ time extract_sysops_cache.py --search /etc/passwd --contents | grep msvoboda | wc -l 487 real 0m1.813s user 0m1.484s sys 0m0.087s
  • 27. Example: Site cache extraction $ time extract_sysops_cache.py --site lva1 --search /etc/passwd --contents | grep msvoboda | wc -l 8687 real 0m19.169s user 0m30.286s sys 0m1.271s
  • 29. Example: Global cache extraction $ time extract_sysops_cache.py --scope global --search /etc/passwd --contents | grep msvoboda | wc -l 27344 real 0m44.827s user 1m39.532s sys 0m4.288s
  • 30. Make it fast! Become Multithreaded
  • 31. Make it faster! Build a Redis pipeline
  • 32. Cache extraction with a pipeline
  • 33. Extracting the Cache for Fun and Profit [msvoboda@esv4-infra01 ~]$ extract_sysops_cache.py --scope local --search mps*cm.conf --md5sum --prefix-hostnames esv4-2360-mps01.corp.linkedin.com#/etc/cm.conf esv4-2360-mps02.corp.linkedin.com#/etc/cm.conf esv4-2360-mps03.corp.linkedin.com#/etc/cm.conf esv4-2360-mps04.corp.linkedin.com#/etc/cm.conf 12721673715de3ee6b9dec487529355e 56b03a16c69e5b246a565dbcda44ba28 11e20e28ec60ac6c71cbb71b0a6c9b35 55402eda02e7f5c17dc7535455adc097
  • 34. Make it fastest! Compression is significant!  Less network overhead on cache insertion  Less network overhead on cache extraction  More stuff we can put into the Cache  Less network I/O = faster results delivered  Less CPU usage on extraction
  • 35. Seconds for cache insertion
  • 36. CPU cycles for cache insertion
  • 37. Data size in megabytes of the cache for an entire datacenter
  • 38. Time for cross country complete datacenter cache extraction
  • 39. Drink from the firehose
  • 40. With Redis API, you can now be confident in pushing automation changes  You know what systems will be affected before a change  You aren’t hit with surprises in production  You have added visibility  You don’t have to log into machines to modify or update
  • 41. Summary: before vs. after implementation of CFEngine & Redis API at LinkedIn
      Headcount. Before: 6 people supporting a few hundred machines. After: 6 people supporting tens of thousands of machines.
      Time spent. Before: hours to build a single machine. After: build complete datacenters in minutes.
      Productivity. Before: hours spent collecting data before change, change itself causing outages. After: can focus on building infrastructure, team became proactive to fix future problems, not reactive / firefighting.
      Ease of scaling server deployment. Before: incredibly difficult to respond to change, low visibility into production. After: superior administration, rapid response to changing needs, complete system visibility.
  • 42. Open Source Questions? msvoboda@linkedin.com www.linkedin.com/in/mikesvoboda You can download the code from this presentation here: https://github.com/linkedin/sysops-api

Editor's Notes

  1. The CFEngine agent runs on each host, using the network when it can to avoid unnecessary traffic, and with a pull-based technology. Once a policy has been deployed, the CFEngine agent keeps all the discovered facts that inform policy locally and decisions about the policy can be made without needing to talk to a master server. This avoids unnecessary communication and enables CFEngine to continue working even if the network becomes unavailable, e.g. for mobile devices.
  2. Can rapidly respond to changing business needs. Quickly respond and resolve outages. Allows systems to be built in a repeatable way. Business can expand rapidly to meet demand. Operations becomes agile. If your systems run CFEngine, they become dependable and reliable. Intended system state is always enforced. You can comfortably delegate escalated privileges (root) to trusted users. Allow engineers to test delta changes before production automation commits.
  3. New datacenters can be built effortlessly. Machines converge to known system state. Allows horizontal scalability.
  4. We could enforce system state, but it was difficult to answer arbitrary questions from thousands of machines. Automation doesn’t provide direct visibility, but gives you the tools to build it. As your size grows / scale increases, it becomes more difficult to get answers from thousands of machines. Automation only allows you to make “reports” when machines match a state, but… Only the automation engineer has access to do this. You can’t extract data for text parsing. Policy has to be written, tested, pushed, results collected.
  5. What software is installed? Are all machines in datacenter X running the same version of openssh? Where are processes running? Do my webservers have Apache online? Who has network connections to machine X? What hardware characteristics are machines built with? How much RAM / storage / CPU does every machine in datacenter X have? What machines around me are connected to the same network switch?
  6. I found myself searching for data 3-4 times a day across thousands of machines. Wasn’t working on solving business problems. My effort was just to make sure I wasn’t going to break things. Needed to be able to quickly and reliably get results so I could push automation changes.
  7. At scale, remote command execution breaks down. Remote command execution isn’t required. Data collection shouldn’t be complicated. Unnecessary to make thousands of network connections. Make it easy to parse data via grep / sed / awk.
  8. Build an in-memory cache with commonly requested data. You don’t know which questions will need answering in the future, so provide as much data as possible: snapshot the state of the machine. Use RedHat’s “sosreport” or Sun / Oracle’s “explorer” as examples of how to snapshot systems to collect data people would want to use: process tables, mount tables, loaded kernel modules, installed software, running processes, executing services, user accounts, uptimes, load averages, etc.
  9. Don’t centralize everything to one machine. Allow your CFEngine policy servers to only respond to queries of the machines they administrate.Don’t build automation frameworks with single policy servers. Provide multiple machines for failover and software load balancing.
  10. Every time CFEngine automation executes on our machines, we populate 4x Redis caches in parallel across our multiple CFEngine policy servers. We collect the output of executed commands and whatever files from the filesystem we’re interested in. Some data is collected every 5 minutes, some every 30 minutes, some every 24 hours. Process tables change rapidly. Hardware does not. Each machine populates 100+ entries into each Redis cache.
  11. The cache is a simple Python dictionary. Every key is unique. The format of the key is <hostname>#<filename>. The value of every key is an array. Array[0] = contents of the file / command executed. Array[1] = md5sum of the file / command contents. Array[2] = os.stat() of the file (does not apply to executed commands). Array[3] = “wordcount” (number of chars, lines, words of contents). Your compute power is at your “clients” that populate the cache. For thousands of machines, you have hundreds of thousands of CPU cores. On data insertion, have every client compute and populate the cache with the above data so you don’t have to compute it from one host on extraction. Extraction just becomes a simple cache dump. Comparing the md5sum of several thousand objects is a simple string comparison. Extend the array with whatever metadata you might possibly be interested in.
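As an illustration of the key and value layout in this note, here is a minimal sketch of how a client might build one cache entry before insertion. `build_cache_entry` is a hypothetical helper, not a function from sysops-api; storing only three fields of `os.stat()` is an assumption made to keep the example short.

```python
import hashlib
import os
import socket
import tempfile


def build_cache_entry(path, hostname=None):
    """Return (key, value) for one file, computed on the client before insertion."""
    hostname = hostname or socket.gethostname()
    with open(path, "rb") as handle:
        raw = handle.read()
    text = raw.decode("utf-8", errors="replace")
    st = os.stat(path)
    key = f"{hostname}#{path}"  # <hostname>#<filename>
    value = [
        text,                                         # [0] contents
        hashlib.md5(raw).hexdigest(),                 # [1] md5sum of the contents
        [st.st_mode, st.st_size, int(st.st_mtime)],   # [2] subset of os.stat()
        [len(text), len(text.splitlines()), len(text.split())],  # [3] chars, lines, words
    ]
    return key, value


# Demonstrate on a throwaway file with a made-up hostname.
with tempfile.NamedTemporaryFile("wb", suffix=".conf", delete=False) as tmp:
    tmp.write(b"alpha beta\ngamma\n")
key, value = build_cache_entry(tmp.name, hostname="host1.example.com")
os.unlink(tmp.name)
```

Because the md5sum is already stored at index 1, comparing a file across thousands of hosts reduces to comparing short strings on extraction.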
  12. Sharding is a database design principle whereby rows of a database table are held on separate physical hosts. Multiple hosts are queried to build the complete working set. One machine does not hold the complete database. With the dataset spread out over several servers, you can exploit more system resources (network, CPU, memory). The tool aggregates data from multiple policy servers running Redis to construct the complete working set. If I’m only interested in Production, don’t query Staging. Determine scope. Local – just query my cage. Site – query all cages for a specific datacenter. Global – query all cages for all datacenters.
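The three scopes can be pictured as progressively larger slices of a site / cage / server hierarchy. The sketch below assumes such a hierarchy is available as a nested dictionary; the `TOPOLOGY` contents, the hostnames, and the `cages_in_scope` function are all hypothetical and only illustrate the idea.

```python
# Hypothetical topology: site -> cage -> policy servers running Redis.
TOPOLOGY = {
    "site-a": {"cage1": ["a1-mps01", "a1-mps02"], "cage2": ["a2-mps01"]},
    "site-b": {"cage1": ["b1-mps01", "b1-mps02"]},
}


def cages_in_scope(topology, scope, site=None, cage=None):
    """Return the cages (each a list of servers) that a query must cover."""
    if scope == "local":
        # Just my own cage.
        return [topology[site][cage]]
    if scope == "site":
        # Every cage in one datacenter.
        return list(topology[site].values())
    if scope == "global":
        # Every cage in every datacenter.
        return [servers for cages in topology.values() for servers in cages.values()]
    raise ValueError(f"unknown scope: {scope!r}")


local = cages_in_scope(TOPOLOGY, "local", site="site-a", cage="cage1")
site = cages_in_scope(TOPOLOGY, "site", site="site-a")
everything = cages_in_scope(TOPOLOGY, "global")
```

One server is then chosen from each returned cage, so the number of Redis servers queried grows with the scope rather than with the number of hosts.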
  13. Test that Redis responds from the randomly chosen MPS. If Redis responds to a server.info call, we know that we can query it for data.
  14. Local extraction (default behavior of the utility). Only query one randomized MPS from my local core. Returns the least amount of data. Least amount of network overhead. Helps users in the company learn how to use the utility and how to query the data that exists in the cache. Least amount of load on the MPS serving Redis queries. Commonly is all that’s needed.
  15. Site extraction. Query one randomized MPS from each cage of the datacenter. The amount of data returned is directly related to the size / number of machines in that datacenter. Moderate network overhead. Could be cross-country network traffic, e.g. extract all hardware failures from Atlanta to Sunnyvale. Useful for auditing all machines in a particular application group / service. Drives multiple MPS from a site.
  16. Global extraction. Query one randomized MPS from each cage of every datacenter. The amount of data returned is immense. Heavy network overhead. Cross-country / “global” traffic from multiple continents. Necessary for the questions that most commonly need to be answered: Where have I experienced hardware failure? What version of the CFEngine RPM is installed everywhere? Drives multiple MPS from every site.
  17. Our data is sharded for global scope queries, but we have to perform the exact same operation on 30 Redis servers to build our working set. Searching for keys: we should perform parallel searches across all MPS to return the list of keys that we need to extract. Extracting the keys: when pulling data from the MPS, we might as well pull data from the 60x MPS in parallel.
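Since the same operation runs against every server and the work is dominated by network waits, a thread pool is one straightforward way to parallelize it. The sketch below is an assumption about how this could be done with Python's standard library, not the project's actual implementation; `fetch_all` and `fetch_one` are hypothetical names, and plain dictionaries stand in for the per-server results.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(shards, fetch_one, max_workers=30):
    """Run fetch_one(shard) for every shard concurrently and merge the dicts."""
    merged = {}
    if not shards:
        return merged
    workers = min(max_workers, len(shards))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order and re-raises any worker exception.
        for partial in pool.map(fetch_one, shards):
            merged.update(partial)
    return merged


# Plain dicts stand in for the data each server would return.
fake_shards = [{"host1#/etc/passwd": "a"}, {"host2#/etc/passwd": "b"}]
merged = fetch_all(fake_shards, lambda shard: dict(shard))
```

Threads suit this workload because each worker spends most of its time blocked on a socket, so the interpreter lock is rarely the limiting factor.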
  18. Pipelines are a subclass of the base Redis class that provide support for buffering multiple commands to the server in a single request. They can be used to dramatically increase the performance of groups of commands by reducing the number of back-and-forth TCP packets between the client and server. The pipeline is similar in concept to a large TCP sliding window. If I need to fetch 1000 objects from the MPS, send it a single pipeline request and have the MPS feed my client 1000 objects at once. This greatly reduces back and forth communication.
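A batched fetch in the style this note describes could be sketched as below. `fetch_with_pipeline` is a hypothetical helper; it relies only on the redis-py calls `pipeline()`, `get()` and `execute()`. The `FakeClient` and `FakePipeline` classes are stand-ins invented for this example so that it runs without a Redis server; with a real deployment you would pass a `redis.Redis` instance instead.

```python
def fetch_with_pipeline(client, keys):
    """Queue one GET per key and send them to the server as a single batch."""
    keys = list(keys)
    pipe = client.pipeline(transaction=False)
    for key in keys:
        pipe.get(key)          # queued locally, nothing is sent yet
    values = pipe.execute()    # one batch; results come back in queue order
    return dict(zip(keys, values))


class FakePipeline:
    """Stand-in that mimics the queue-then-execute behaviour of a pipeline."""

    def __init__(self, client):
        self._client = client
        self._queued = []

    def get(self, key):
        self._queued.append(key)
        return self

    def execute(self):
        self._client.batches += 1
        results = [self._client.data.get(k) for k in self._queued]
        self._queued = []
        return results


class FakeClient:
    """Stand-in for a Redis client that counts how many batches were executed."""

    def __init__(self, data):
        self.data = data
        self.batches = 0

    def pipeline(self, transaction=True):
        return FakePipeline(self)


client = FakeClient({"host1#/etc/passwd": b"a", "host2#/etc/passwd": b"b"})
fetched = fetch_with_pipeline(
    client, ["host1#/etc/passwd", "host2#/etc/passwd", "missing"]
)
```

Passing `transaction=False` asks for plain batching without wrapping the commands in MULTI/EXEC, which is enough for read-only extraction.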
  19. We can insert data into the Redis caches in plain text, but there’s no reason why we can’t compress it on insertion and decompress on extraction. Why use compression? Less network overhead on cache insertion. More CPU horsepower from hundreds of thousands of CPU cores on your end nodes. Less network overhead on cache extraction. More stuff we can shove into the cache. We’re holding these objects in RAM. Space is expensive. If we can reduce the number of bytes we use in memory, we can shove more data into the cache. Less network overhead = less time to extraction. MORE BETTER. Less CPU overhead on client extraction. It may not seem to make sense, but decompressing gigabytes of cached data is actually less CPU overhead on a single machine than processing the additional network packets. Added security benefit of data being in binary form in the cache: it makes modification more complicated, and the data can’t be directly scanned on the network. We evaluated 4 compression algorithms compared to plain text: bz2, zlib, lzma, lz4.
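To get a feel for the trade-off, the sketch below compresses a made-up payload with the three of these algorithms that ship in Python's standard library and checks that each round-trips. lz4 is left out because it needs a third-party package. The sample data and the `compare_codecs` helper are illustrative assumptions, and the sizes it reports say nothing about the results measured in the talk.

```python
import bz2
import lzma
import zlib

CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compare_codecs(payload):
    """Return {codec: compressed size in bytes}, verifying each round trip."""
    sizes = {"plain": len(payload)}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(payload)           # done by the client on insertion
        assert decompress(packed) == payload  # done by the reader on extraction
        sizes[name] = len(packed)
    return sizes


# Repetitive text loosely resembling a passwd file (made-up sample data).
sample = b"".join(
    b"user%04d:x:%d:100::/home/user%04d:/bin/bash\n" % (i, i, i)
    for i in range(2000)
)
sizes = compare_codecs(sample)
```

Running the comparison on real cache contents, and timing it, is the only way to tell which codec suits a given mix of data.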
  20. When fetching data from 30x Redis servers in parallel, the network link on the requesting client becomes the bottleneck. This is why we don’t want to have a single machine as a “front end” for this API. Competing queries from multiple sources would have network starvation unless we went to 10gbit. 30x Redis servers can easily saturate a gigabit network link. Start off in the small local scope to build your grep / sed / awk command set to figure out how to parse your data. Once you have your command constructed, increase scope.