SlideShare uma empresa Scribd logo
1 de 81
Secrets at Planet-Scale:
Engineering the Internal Google Key Management System
(KMS)
QCon San Francisco 2019, Nov 11-13
Anvita Pandit
Google LLC
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
high-availability-google-key-
management/
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Anvita Pandit
- Software engineer in Data Protection
/ Security and Privacy org in Google
for 2 years.
- Engineering Resident.
- DEFCON 2019 Biohacking village:
co-presented “Hacking Race”
workshop with @HerroAnneKim
Not the Google Cloud KMS
Agenda
1. Why use a KMS?
2. Essential product features
3. Walkthrough of encrypted storage use case
4. System specs and architectural decisions
5. Walkthrough of an outage
6. More architecture!
7. Challenge: safe key rotation
The Great
Gmail
Outage of
2014
https://googleblog.blogspot.com/2014/01/todays-outage-for-several-google.html
Why Use a KMS?
Why Use a KMS?
Core motivation: code needs secrets!
Why Use a KMS?
Core motivation: code needs secrets!
Secrets like:
● Database passwords, third party API and OAuth tokens
● Cryptographic keys used for data encryption, signing, etc
Why Use a KMS?
Core motivation: code needs secrets!
Where?
Why Use a KMS?
Core motivation: code needs secrets!
Where?
● In code repository?
https://github.com/search?utf8=%E2%9C%93&q=remove+password&type=Commits&ref=searchresults
Why Use a KMS?
Core motivation: code needs secrets!
Where?
● In code repository?
● On production hard drives?
Why Use a KMS?
Core motivation: code needs secrets!
Where?
● In code repository?
● On production hard drives?
Alternative:
● Use a KMS!
Centralized Key Management
Solves key problems for everybody.
Centralized Key Management
Solves key problems for everybody.
Offers:
● Separate management of key-handling code
Centralized Key Management
Solves key problems for everybody.
Offers:
● Separate management of key-handling code
● Separation of trust
Centralized Key Management
Solves key problems for everybody
Centralized Key Management
Solves key problems for everybody
1. Access control lists (ACLs)
Centralized Key Management
Solves key problems for everybody
1. Access control lists (ACLs)
● Who is allowed to use the key? Who is allowed to make
updates to the key configuration?
Centralized Key Management
Solves key problems for everybody
1. Access control lists (ACLs)
● Who is allowed to use the key? Who is allowed to make
updates to the key configuration?
● Identities are specified with the internal authentication
system (see ALTS)
Centralized Key Management
Solves key problems for everybody.
2. Auditing aka Who touched my keys?
Centralized Key Management
Solves key problems for everybody.
2. Auditing aka Who touched my keys?
● Binary verification
Centralized Key Management
Solves key problems for everybody.
2. Auditing aka Who touched my keys?
● Binary verification
● Logging (but not the secrets!)
Google’s Root of Trust
Storage Systems (Millions)
Data encrypted with data keys (DEKs)
KMS (Tens of Thousands)
Master keys and passwords are stored in KMS
Root KMS (Hundreds)
KMS is protected with a KMS master key in Root KMS
Root KMS master key distributor (Hundreds)
Root KMS master key is distributed in memory
Physical safes (a few)
Root KMS master key is backed up on hardware devices
Google’s Root of Trust
Storage Systems (Millions)
Data encrypted with data keys (DEKs)
KMS (Tens of Thousands)
Master keys and passwords are stored in KMS
Root KMS (Hundreds)
KMS is protected with a KMS master key in Root KMS
Root KMS master key distributor (Hundreds)
Root KMS master key is distributed in memory
Physical safes (a few)
Root KMS master key is backed up on hardware devices
Google’s Root of Trust
Storage Systems (Millions)
Data encrypted with data keys (DEKs)
KMS (Tens of Thousands)
Master keys and passwords are stored in KMS
Root KMS (Hundreds)
KMS is protected with a KMS master key in Root KMS
Root KMS master key distributor (Hundreds)
Root KMS master key is distributed in memory
Physical safes (a few)
Root KMS master key is backed up on hardware devices
Google’s Root of Trust
Storage Systems (Millions)
Data encrypted with data keys (DEKs)
KMS (Tens of Thousands)
Master keys and passwords are stored in KMS
Root KMS (Hundreds)
KMS is protected with a KMS master key in Root KMS
Root KMS master key distributor (Hundreds)
Root KMS master key is distributed in memory
Physical safes (a few)
Root KMS master key is backed up on hardware devices
Google’s Root of Trust
Storage Systems (Millions)
Data encrypted with data keys (DEKs)
KMS (Tens of Thousands)
Master keys and passwords are stored in KMS
Root KMS (Hundreds)
KMS is protected with a KMS master key in Root KMS
Root KMS master key distributor (Hundreds)
Root KMS master key is distributed in memory
Physical safes (a few)
Root KMS master key is backed up on hardware devices
Category Requirement
Availability 5 nines => 99.999% of requests are served
Latency 99% of requests are served < 10 ms
Scalability Planet-scale!
Security Effortless key rotation
Design Requirements
Decisions, decisions
● Not an encryption/decryption service.
Decisions, decisions
● Not an encryption/decryption service.
● Not a traditional database
Decisions, decisions
● Not an encryption/decryption service.
● Not a traditional database
● Key wrapping
● Stateless serving
Key Wrapping
Key Wrapping
● Fewer centrally-managed keys improves availability but
requires more trust in the client
Insight: At the KMS layer, key material is not mutable state.
Immutable key material + key wrapping
==> Stateless server ==> Trivial scaling
Keys in RAM ==> Low latency serving
Stateless Serving
What Could Go Wrong?
The Great
Gmail
Outage of
2014
https://googleblog.blogspot.com/2014/01/todays-outage-for-several-google.html
Each team
maintains their
own KMS
configurations
Each team
maintains their
own KMS
configurations,
all stored in
Google’s
monolithic repo
Source
Repository
(holds
encrypted
configs)
Individual
Team Config
Changes
Config
merge
cron
job
Single
Merged
Config
Update
Data
Pusher
KMS
KMS
KMS
KMS
KMS
KMS
Many KMS
Servers
Each
Local
Config
Client
KMS
Server
Local
Config
Client
Normal Operation
Which get automatically merged
into a combined config file
Which is distributed to all
KMS shards for serving
Sees
incorrect
image of
source repo
😵
Merging
Problem
🐛
Truncated
Config
☠ Client
A bad config pushed globally
means a global outage
All Local
Configs
☠
Lessons Learned
The KMS had become
● a single point of failure
● a startup dependency for services
● often a runtime dependency
==> KMS Must Not Fail Globally
● No more all-at-once global rollout of binaries
and configuration
● Regional failure isolation and client isolation
● Minimize dependencies
KMS Must Not Fail Globally
Google KMS Current Stats:
● No downtime since the Gmail outage in
2014 January: >> 99.9999%
● 99.9% of requests are served < 6 ms
● ~107
requests/sec (~10 M QPS)
● ~104
processes & cores
Challenge: Safe Key Rotation
Make It Easy To Rotate Keys
● Key compromise
○ Also requires access to cipher text
Make It Easy To Rotate Keys
● Key compromise
○ Also requires access to cipher text
● Broken ciphers
○ Access to cipher text is enough
Make It Easy To Rotate Keys
● Key compromise
○ Also requires access to cipher text
● Broken ciphers
○ Access to cipher text is enough
● Rotating keys limits the window of vulnerability
Make It Easy To Rotate Keys
● Key compromise
○ Also requires access to cipher text
● Broken ciphers
○ Access to cipher text is enough
● Rotating keys limits the window of vulnerability
● But rotating keys means there is potential for data loss
Goals
1. KMS users design with rotation in mind
2. Using multiple key versions is no harder than using a
single key
3. Very hard to lose data
Robust Key Rotation at Scale - 0
Robust Key Rotation at Scale - 1
Goal #1: KMS users design with rotation in mind
● Users choose
○ Frequency of rotation: e.g. every 30 days
○ TTL of cipher text: e.g. 30,90,180 days, 2 years, etc.
Robust Key Rotation at Scale - 1
Goal #1: KMS users design with rotation in mind
● Users choose
○ Frequency of rotation: e.g. every 30 days
○ TTL of cipher text: e.g. 30,90,180 days, 2 years, etc.
● KMS guarantees ‘Safety Condition’
○ All ciphertext produced within the TTL can be
deciphered using a keyset in the KMS.
Robust Key Rotation at Scale - 2
Goal #2: Using multiple key versions is no harder than using
a single key
Robust Key Rotation at Scale - 2
Goal #2: Using multiple key versions is no harder than using
a single key
● Tightly integrated with Google's standard cryptographic
libraries: see Tink
Robust Key Rotation at Scale - 2
Goal #2: Using multiple key versions is no harder than using
a single key
● Tightly integrated with Google's standard cryptographic
libraries: see Tink
○ Keys support multiple key versions
○ Each of which can be a different cipher
Time ⇢
Robust Key Rotation at Scale - 3
T0 T1 T2 T3 T4 T5 T6 T7
V1 A P P A A A SFR
V2 A P P A A A
A - Active
P - Primary
SFR - Scheduled for Revocation
Goal #3: Very hard to lose data
Time ⇢
Robust Key Rotation at Scale - 3
T0 T1 T2 T3 T4 T5 T6 T7
V1 A P P A A A SFR
V2 A P P A A A
A - Active
P - Primary
SFR - Scheduled for Revocation
Goal #3: Very hard to lose data
Time ⇢
Robust Key Rotation at Scale - 3
T0 T1 T2 T3 T4 T5 T6 T7
V1 A P P A A A SFR
V2 A P P A A A
A - Active
P - Primary
SFR - Scheduled for Revocation
Goal #3: Very hard to lose data
Time ⇢
Robust Key Rotation at Scale - 3
T0 T1 T2 T3 T4 T5 T6 T7
V1 A P P A A A SFR
V2 A P P A A A
A - Active
P - Primary
SFR - Scheduled for Revocation
Goal #3: Very hard to lose data
Recap: Key Rotation
Recap: Key Rotation
● Presents an availability vs security tradeoff
Recap: Key Rotation
● Presents an availability vs security tradeoff
● KMS
○ Derives the number of key versions to retain
Recap: Key Rotation
● Presents an availability vs security tradeoff
● KMS
○ Derives the number of key versions to retain
○ Adds/Promotes/Demotes/Deletes Key Versions over time
Google KMS - Summary
Implementing encryption at scale required highly available key management.
At Google’s scale this means 5 9s of availability.
To achieve all requirements, we use several strategies:
● Best practices for change management and staged rollouts
● Minimize dependencies and aggressively defend against their unavailability
● Isolate by region & client type
● Combine immutable keys + wrapping to achieve scale
● A declarative API for key rotation
We Are Hiring!
anvita@google.com
■ Google Cloud Encryption at Rest whitepaper:
https://cloud.google.com/security/encryption-at-rest/default-encryption/
■ Google Application Layer Transport Security:
https://cloud.google.com/security/encryption-in-transit/application-layer-transp
ort-security/
+ Infographic https://cloud.withgoogle.com/infrastructure/data-encryption/step-7
■ Tink cryptographic library https://github.com/google/tink
■ Site Reliability Engineering (SRE) handbook:
https://landing.google.com/sre/book.html
Further Reading
Bonus Content
Challenge: Data Integrity
Causes of Bit Errors
■ Corruption in transit as NICs (network cards) twiddle bits.
■ Corruption in memory from broken CPUs
■ Cosmic rays flip bits in DRAM
■ [not an exhaustive list]
○ Crypto provides leverage
○ Key material corruption can render large chunks of data
unusable.
Hardware Faults
Software Mitigations
○ Verify correctness of crypto operations at start of a process
■ During a request, after using the KEK to wrap a DEK and before
responding to the customer, we unwrap the same DEK
■ Storage services
● Read back plain text after writing encrypted data blocks
● Replicate/parity protect at a higher layer
Key Sensitivity Annotations
Key Sensitivity Annotations
Users determine the consequence if their keys were to be compromised using
the CIA triad
Sensitivity Annotations
Users determine the consequence if their keys were to be compromised using
the CIA triad:
● Confidentiality
● Integrity
● Availability
Sensitivity Annotations
● Each consequence has corresponding policy recommendations
● For example, only a verifiably built program can contact a key that could leak
user data.
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
high-availability-google-key-
management/

Mais conteúdo relacionado

Mais de C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

Mais de C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Último

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

Secrets at Planet-scale: Engineering the Internal Google KMS

  • 1. Secrets at Planet-Scale: Engineering the Internal Google Key Management System (KMS) QCon San Francisco 2019, Nov 11-13 Anvita Pandit Google LLC
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ high-availability-google-key- management/
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. Anvita Pandit - Software engineer in Data Protection / Security and Privacy org in Google for 2 years. - Engineering Resident. - DEFCON 2019 Biohacking village: co-presented “Hacking Race” workshop with @HerroAnneKim
  • 5. Not the Google Cloud KMS
  • 6. Agenda 1. Why use a KMS? 2. Essential product features 3. Walkthrough of encrypted storage use case 4. System specs and architectural decisions 5. Walkthrough of an outage 6. More architecture! 7. Challenge: safe key rotation
  • 8.
  • 9. Why Use a KMS?
  • 10. Why Use a KMS? Core motivation: code needs secrets!
  • 11. Why Use a KMS? Core motivation: code needs secrets! Secrets like: ● Database passwords, third party API and OAuth tokens ● Cryptographic keys used for data encryption, signing, etc
  • 12. Why Use a KMS? Core motivation: code needs secrets! Where?
  • 13. Why Use a KMS? Core motivation: code needs secrets! Where? ● In code repository?
  • 15. Why Use a KMS? Core motivation: code needs secrets! Where? ● In code repository? ● On production hard drives?
  • 16. Why Use a KMS? Core motivation: code needs secrets! Where? ● In code repository? ● On production hard drives? Alternative: ● Use a KMS!
  • 17. Centralized Key Management Solves key problems for everybody.
  • 18. Centralized Key Management Solves key problems for everybody. Offers: ● Separate management of key-handling code
  • 19. Centralized Key Management Solves key problems for everybody. Offers: ● Separate management of key-handling code ● Separation of trust
  • 20. Centralized Key Management Solves key problems for everybody
  • 21. Centralized Key Management Solves key problems for everybody 1. Access control lists (ACLs)
  • 22. Centralized Key Management Solves key problems for everybody 1. Access control lists (ACLs) ● Who is allowed to use the key? Who is allowed to make updates to the key configuration?
  • 23. Centralized Key Management Solves key problems for everybody 1. Access control lists (ACLs) ● Who is allowed to use the key? Who is allowed to make updates to the key configuration? ● Identities are specified with the internal authentication system (see ALTS)
  • 24. Centralized Key Management Solves key problems for everybody. 2. Auditing aka Who touched my keys?
  • 25. Centralized Key Management Solves key problems for everybody. 2. Auditing aka Who touched my keys? ● Binary verification
  • 26. Centralized Key Management Solves key problems for everybody. 2. Auditing aka Who touched my keys? ● Binary verification ● Logging (but not the secrets!)
  • 27.
  • 28.
  • 29.
  • 30.
  • 31. Google’s Root of Trust Storage Systems (Millions) Data encrypted with data keys (DEKs) KMS (Tens of Thousands) Master keys and passwords are stored in KMS Root KMS (Hundreds) KMS is protected with a KMS master key in Root KMS Root KMS master key distributor (Hundreds) Root KMS master key is distributed in memory Physical safes (a few) Root KMS master key is backed up on hardware devices
  • 32. Google’s Root of Trust Storage Systems (Millions) Data encrypted with data keys (DEKs) KMS (Tens of Thousands) Master keys and passwords are stored in KMS Root KMS (Hundreds) KMS is protected with a KMS master key in Root KMS Root KMS master key distributor (Hundreds) Root KMS master key is distributed in memory Physical safes (a few) Root KMS master key is backed up on hardware devices
  • 33. Google’s Root of Trust Storage Systems (Millions) Data encrypted with data keys (DEKs) KMS (Tens of Thousands) Master keys and passwords are stored in KMS Root KMS (Hundreds) KMS is protected with a KMS master key in Root KMS Root KMS master key distributor (Hundreds) Root KMS master key is distributed in memory Physical safes (a few) Root KMS master key is backed up on hardware devices
  • 34. Google’s Root of Trust Storage Systems (Millions) Data encrypted with data keys (DEKs) KMS (Tens of Thousands) Master keys and passwords are stored in KMS Root KMS (Hundreds) KMS is protected with a KMS master key in Root KMS Root KMS master key distributor (Hundreds) Root KMS master key is distributed in memory Physical safes (a few) Root KMS master key is backed up on hardware devices
  • 35. Google’s Root of Trust Storage Systems (Millions) Data encrypted with data keys (DEKs) KMS (Tens of Thousands) Master keys and passwords are stored in KMS Root KMS (Hundreds) KMS is protected with a KMS master key in Root KMS Root KMS master key distributor (Hundreds) Root KMS master key is distributed in memory Physical safes (a few) Root KMS master key is backed up on hardware devices
  • 36. Category Requirement Availability 5 nines => 99.999% of requests are served Latency 99% of requests are served < 10 ms Scalability Planet-scale! Security Effortless key rotation Design Requirements
  • 37. Decisions, decisions ● Not an encryption/decryption service.
  • 38. Decisions, decisions ● Not an encryption/decryption service. ● Not a traditional database
  • 39. Decisions, decisions ● Not an encryption/decryption service. ● Not a traditional database ● Key wrapping ● Stateless serving
  • 41. Key Wrapping ● Fewer centrally-managed keys improves availability but requires more trust in the client
  • 42. Insight: At the KMS layer, key material is not mutable state. Immutable key material + key wrapping ==> Stateless server ==> Trivial scaling Keys in RAM ==> Low latency serving Stateless Serving
  • 43. What Could Go Wrong?
  • 45. Each team maintains their own KMS configurations Each team maintains their own KMS configurations, all stored in Google’s monolithic repo Source Repository (holds encrypted configs) Individual Team Config Changes Config merge cron job Single Merged Config Update Data Pusher KMS KMS KMS KMS KMS KMS Many KMS Servers Each Local Config Client KMS Server Local Config Client Normal Operation Which get automatically merged into a combined config file Which is distributed to all KMS shards for serving Sees incorrect image of source repo 😵 Merging Problem 🐛 Truncated Config ☠ Client A bad config pushed globally means a global outage All Local Configs ☠
  • 46. Lessons Learned The KMS had become ● a single point of failure ● a startup dependency for services ● often a runtime dependency ==> KMS Must Not Fail Globally
  • 47. ● No more all-at-once global rollout of binaries and configuration ● Regional failure isolation and client isolation ● Minimize dependencies KMS Must Not Fail Globally
  • 48. Google KMS Current Stats: ● No downtime since the Gmail outage in 2014 January: >> 99.9999% ● 99.9% of requests are served < 6 ms ● ~107 requests/sec (~10 M QPS) ● ~104 processes & cores
  • 50. Make It Easy To Rotate Keys ● Key compromise ○ Also requires access to cipher text
  • 51. Make It Easy To Rotate Keys ● Key compromise ○ Also requires access to cipher text ● Broken ciphers ○ Access to cipher text is enough
  • 52. Make It Easy To Rotate Keys ● Key compromise ○ Also requires access to cipher text ● Broken ciphers ○ Access to cipher text is enough ● Rotating keys limits the window of vulnerability
  • 53. Make It Easy To Rotate Keys ● Key compromise ○ Also requires access to cipher text ● Broken ciphers ○ Access to cipher text is enough ● Rotating keys limits the window of vulnerability ● But rotating keys means there is potential for data loss
  • 54. Goals 1. KMS users design with rotation in mind 2. Using multiple key versions is no harder than using a single key 3. Very hard to lose data Robust Key Rotation at Scale - 0
  • 55. Robust Key Rotation at Scale - 1 Goal #1: KMS users design with rotation in mind ● Users choose ○ Frequency of rotation: e.g. every 30 days ○ TTL of cipher text: e.g. 30,90,180 days, 2 years, etc.
  • 56. Robust Key Rotation at Scale - 1 Goal #1: KMS users design with rotation in mind ● Users choose ○ Frequency of rotation: e.g. every 30 days ○ TTL of cipher text: e.g. 30,90,180 days, 2 years, etc. ● KMS guarantees ‘Safety Condition’ ○ All ciphertext produced within the TTL can be deciphered using a keyset in the KMS.
  • 57. Robust Key Rotation at Scale - 2 Goal #2: Using multiple key versions is no harder than using a single key
  • 58. Robust Key Rotation at Scale - 2 Goal #2: Using multiple key versions is no harder than using a single key ● Tightly integrated with Google's standard cryptographic libraries: see Tink
  • 59. Robust Key Rotation at Scale - 2 Goal #2: Using multiple key versions is no harder than using a single key ● Tightly integrated with Google's standard cryptographic libraries: see Tink ○ Keys support multiple key versions ○ Each of which can be a different cipher
  • 60. Time ⇢ Robust Key Rotation at Scale - 3 T0 T1 T2 T3 T4 T5 T6 T7 V1 A P P A A A SFR V2 A P P A A A A - Active P - Primary SFR - Scheduled for Revocation Goal #3: Very hard to lose data
  • 61. Time ⇢ Robust Key Rotation at Scale - 3 T0 T1 T2 T3 T4 T5 T6 T7 V1 A P P A A A SFR V2 A P P A A A A - Active P - Primary SFR - Scheduled for Revocation Goal #3: Very hard to lose data
  • 62. Time ⇢ Robust Key Rotation at Scale - 3 T0 T1 T2 T3 T4 T5 T6 T7 V1 A P P A A A SFR V2 A P P A A A A - Active P - Primary SFR - Scheduled for Revocation Goal #3: Very hard to lose data
  • 63. Time ⇢ Robust Key Rotation at Scale - 3 T0 T1 T2 T3 T4 T5 T6 T7 V1 A P P A A A SFR V2 A P P A A A A - Active P - Primary SFR - Scheduled for Revocation Goal #3: Very hard to lose data
  • 65. Recap: Key Rotation ● Presents an availability vs security tradeoff
  • 66. Recap: Key Rotation ● Presents an availability vs security tradeoff ● KMS ○ Derives the number of key versions to retain
  • 67. Recap: Key Rotation ● Presents an availability vs security tradeoff ● KMS ○ Derives the number of key versions to retain ○ Adds/Promotes/Demotes/Deletes Key Versions over time
  • 68. Google KMS - Summary Implementing encryption at scale required highly available key management. At Google’s scale this means 5 9s of availability. To achieve all requirements, we use several strategies: ● Best practices for change management and staged rollouts ● Minimize dependencies and aggressively defend against their unavailability ● Isolate by region & client type ● Combine immutable keys + wrapping to achieve scale ● A declarative API for key rotation
  • 70. ■ Google Cloud Encryption at Rest whitepaper: https://cloud.google.com/security/encryption-at-rest/default-encryption/ ■ Google Application Layer Transport Security: https://cloud.google.com/security/encryption-in-transit/application-layer-transp ort-security/ + Infographic https://cloud.withgoogle.com/infrastructure/data-encryption/step-7 ■ Tink cryptographic library https://github.com/google/tink ■ Site Reliability Engineering (SRE) handbook: https://landing.google.com/sre/book.html Further Reading
  • 73. Causes of Bit Errors ■ Corruption in transit as NICs (network cards) twiddle bits. ■ Corruption in memory from broken CPUs ■ Cosmic rays flip bits in DRAM ■ [not an exhaustive list]
  • 74. ○ Crypto provides leverage ○ Key material corruption can render large chunks of data unusable. Hardware Faults
  • 75. Software Mitigations ○ Verify correctness of crypto operations at start of a process ■ During a request, after using the KEK to wrap a DEK and before responding to the customer, we unwrap the same DEK ■ Storage services ● Read back plain text after writing encrypted data blocks ● Replicate/parity protect at a higher layer
  • 77. Key Sensitivity Annotations Users determine the consequence if their keys were to be compromised using the CIA triad
  • 78.
  • 79. Sensitivity Annotations Users determine the consequence if their keys were to be compromised using the CIA triad: ● Confidentiality ● Integrity ● Availability
  • 80. Sensitivity Annotations ● Each consequence has corresponding policy recommendations ● For example, only a verifiably built program can contact a key that could leak user data.
  • 81. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ high-availability-google-key- management/