SlideShare uma empresa Scribd logo
1 de 48
Runtime Analysis in the Cloud:Challenges and Opportunities Wolfgang Grieskamp Software Engineer, Google Corp
About Me	 < 2000: Formal Methods Research Europe (TU Berlin) 2000-2006: Microsoft Research: Languages and Model-Based Testing Tools 2007-2011: Microsoft Windows Interoperability Program: protocol testing and tools  Since 4/2011: Google: Google+ platform and tools DISCLAIMER: This talk does not necessarily represent Google’s opinion or direction.
Content of this talk General blah blah about cloud computing Monitoring the Cloud Testing the Cloud Using the Cloud for Development A formal method guy’s dream… Conclusion
About the Cloud
What is Cloud Computing? From Wikipedia, the free encyclopedia Cloud computing is the delivery of computing as a service rather than a product, whereby shared resources, software and information are provided to computers and other devices as a utility (like the electricity grid) over a network (typically the Internet).
What is Cloud Computing? From Wikipedia
Some Properties of Cloud Computing ,[object Object]
It is device independent
It has a unified access control model
You pay for only what you use
It can scale-up or scale-down on demand
Its failsafe,[object Object]
Cloud Stack
IAAS (Infrastructure As A Service) Basic building blocks Storage Networking Computation Homogenous, easy migration (based on VMs) Distributed Data Centers over Geographical Zones Players: Amazon, GoGrid, Rackspace, Microsoft, Google Estimated revenue 2010: $1b (Source: Economist)
Platform As A Service (PAAS) Basic Building Blocks Operating System  Frameworks and Development Tools Deployment and Monitoring Tools Players: Microsoft, Google, IBM, SAP, … Estimated Revenue 2010: $311m (Source: Economist)
Software As A Service (SAAS) Device and location independent applications, typically running in a browser (Email, Social, Retail, Enterprise Apps, etc.) Many different players Estimated revenue 2010: $11.7b (Source: Economist)
Monitoring the Cloud
Monitoring vs RV vs Testing What’s the difference? A (strict) take: Monitoring collects and presents information for human analysis Runtime verification collects and transforms information for automated analysis which ultimately leads to a verdict Testing does the above things in an isolated, staged or mocked, environment. In particular, stimuli from the environment are simulated.  In practice, boundaries are not so clear.     For this talk RV = Monitoring (adapting to Google conventions)
Anatomy of a Data Center Data Center A Data Center B   …… Controller Server Controller Server Server Server Server … Storage Storage Storage Storage Storage Note: abstracted and simplified
Anatomy of a Server Data Center A Data Center B   …… Controller Server (VM) Controller Server Controller Job Job Job Server Server Server Server … Monitor Monitor Monitor Storage Storage Storage Storage Logs Alert Note: abstracted and simplified
Anatomy of a Service Data Center A Data Center B   …… Controller Service (across Servers) Job Job Server Controller Job Job Job Server Server Server Server … Job Storage Storage Storage Storage Storage Storage Note: abstracted and simplified
Monitoring Types @ Google Black Box Monitoring White Box Monitoring Log Analysis
Black Box Monitoring Job Monitor Frequently send requests and analyze the response  Possible because server jobs are ‘stateless’ and always input enabled If failure rate over a certain time interval exceeds a given ratio, raise an alert and page an engineer Engineers aim for minimizing paging and avoiding false positives
Black Box Monitoring: How its done @ Google There are rule based languages for defining request/responses. Each rule: Synthesizes an HTTP request Analyzes the response using a regular expression Specifies frequency and allowed failure ratio Rules are like tests: a simple trigger and a simple response analysis  Monitors can be also custom code Job Monitor
Black Box Monitoring: Issues? Is the ‘stateless’ hypothesis feasible? Nothing is really stateless -- state is passed as parameters in cookies, continuation tokens, etc. However, as these are health tests, state can be ignored What is the relation to testing? In theory very similar, only that the environment is not mocked.  In practice uses quite different frameworks/languages  What about service/system level monitoring? Its only about one job.  Doesn’t give failure root cause (it only measures a symptom) Job Monitor
ChallengeIntegrate Black-Box Monitoring and Testing Job Monitor Black-box monitoring can be seen as particular way of executing tests end-to-end on the live product such that the impact on performance can be neglected. Frameworks which integrate design and execution of monitoring rules and test cases are promising Mainly an engineering challenge
Job Job ChallengeSystem/Service Level Black-Box Monitoring Monitor Monitor Not commonly done  Main purpose would be failure cause analysis and failure prevention Simple local monitoring already discovers failures  Is there a strong point of doing it at runtime (vs. log analysis)? Only if real-time prevention and potentially repair is important   Monitor
ChallengeProtocolcontract verification Job Monitor At Google, all communication between jobs happens via a single homogeneous RPC mechanism based on a message format definition language (called protocol buffers) Also all (= terra bytes) of data is stored in formats specified by protocol buffers One could formulate data invariant and protocol sequencing contracts over protocol buffers and enforce them at runtime
White-Box Monitoring Server exports collection of probe points (variables) Memory, # RPCs, # Failures, etc. Monitor collects time series of those values and computes functions over them Dashboards prepare information graphically Mostly used for diagnosis by humans Job Monitor
White-Box Monitoring:How its done @ Google Job Monitor Declarative language for time series computations Collects samples from the server by memory scraping Merging of similar data from multiple servers running the same job Rich support for diagram rendering in the browser
White-Box Monitoring:Issues? Design for monitorability/testability? Its already ubiquitous throughout, since software engineers are themselves on-call… Distributed collection/network load? Not really an issue because it’s sample based Relation to testing? Same as with black-box – should be a common framework. Automatic root cause analysis and self-repair? Current systems mostly build for human analysis and repair.   Self-repair would be a big thing. Job Monitor
ChallengeSelf-repair Job Monitor Cloud system are homogenous and operate with redundancy Many VMs with exactly the same properties Self-repair could identify and ‘drain’ faulty parts of the system, apply fallbacks, rollback software updates, etc. One major cause of cloud failures are outages Another major cause are software updates  A semi-automated approach suggesting actions to a human would be already very useful Ever got paged at 2am in the morning?
ChallengeHybrid White-Box Monitoring / RV Foundations Job Monitor The data collected in white-box monitoring represents continuous and often stochastic functions over time.  The triggers for discrete actions (like alerts) are thresholds over integrated values of those functions. Sounds like hybrid systems/automaton. Has anybody looked at it like this in RV community?
Log Analysis Job Logs Collect data from each server’s run containing information like operation flow, exceptions, etc. Store data over a window of time (say for last 24h) Access data from various sources programmatically to analyze issues (post-mortem, performance, etc.) Allows for correlation of system/service wide information
Log Analysis:How its done @ Google Job Logs Very fined grained logging on job side; huge amounts of data collected Logs are stored in Bigtable (Google’s large-scale storage solution) Logs are analyzed using parallel (cloud) computing, e.g. with Sawzall, a declarative language based on map-reduce Logs are most often used for failure cause analysis/debugging
Log Analysis:Issues? Job Logs Amount of data and accessibility? Not really an issue because of highly performing distributed file systems Format of the data? Logs are structured data (at Google, protocol buffers) Encryption? A big issue: if it can’t be decrypted, not much may be diagnosable. If it can be decrypted, the access to this now decrypted data needs to be restricted.
ChallengePrivacy and Encryption Job Monitor Data logged (or otherwise analyzed) during monitoring may contain encrypted proprietary information A problem may not be diagnosable without decryption Decrypted clear text data (in particular if logged) needs to be highly protected Automatic obfuscation and/or anonymization would be highly desirable A protocol may need to be designed for this in the first place
Testing the Cloud
Testing @Google
ChallengeIntegration Testing Job Job Job Storage Two or more components are plugged together with a partially mocked environment These tests are usually very ‘flaky’ (unreliable) because: Difficulty to construct mocked component’s precise behavior (its more than a simple mock in a unit test) Difficulty to synthesize mocked component’s initial state (it may have a complex state) Potential solution: model-based testing
Model-Based Testing in a Nutshell Requirements Feedback   Author Model Feedback   Feedback   Generate Test Suite Issue Expected Outputs (Test Oracle) Inputs (Test Sequences) Verdict Control   Observe   Feedback   Implementation
Technical Document Testing Program of  Windows: A Success Story for MBT 222 protocols/technical documents tested 22,847 pages studied and converted into requirements 36,875 testable requirements identified and converted into test assertions 69% tested using MBT 31% tested using traditional test automation 66,962 person days (250+ years) Hyderabad: 250 test engineers Beijing: 100 test engineers 38
Comparison MBT vs Traditional ,[object Object]
Vendor 2 modeled 85% of all test suites, performing relatively much better than Vendor 139 Grieskamp et. al: Model-based quality assurance of protocol documentation: tools and methodology. Softw. Test., Verif. Reliab. 21(1): 55-71 (2011)
Exploiting the Cloud for Development
Idle Resources	 Peak demand problem: as with other utilities, the cloud must have capacity to deal with peak times: 7am, 7pm, etc. Huge amounts of idle computing resources available in the DCs outside of those peak times Literally hundreds of VMs may be available for a single engineer on a low-priority job base ,[object Object],[object Object]
Consequences for Testing, Program Analysis, etc.  Need to rethink the base assumptions of some of the existing approaches for testing and program analysis for massive coarse-grained parallelism: Early divide-and-conquer ideal, e.g. initial random seed than run-till-end and collect and compare Try different heuristics on the same problem; see which one wins Techniques like SMT and concolic execution can largely benefit from this
A formal method guy’s dream…

Mais conteúdo relacionado

Mais procurados

Benefit From Unit Testing In The Real World
Benefit From Unit Testing In The Real WorldBenefit From Unit Testing In The Real World
Benefit From Unit Testing In The Real WorldDror Helper
 
Control Flow Testing
Control Flow TestingControl Flow Testing
Control Flow TestingHirra Sultan
 
Test driven development and unit testing with examples in C++
Test driven development and unit testing with examples in C++Test driven development and unit testing with examples in C++
Test driven development and unit testing with examples in C++Hong Le Van
 
Assessing Model-Based Testing: An Empirical Study Conducted in Industry
Assessing Model-Based Testing: An Empirical Study Conducted in IndustryAssessing Model-Based Testing: An Empirical Study Conducted in Industry
Assessing Model-Based Testing: An Empirical Study Conducted in IndustryDharmalingam Ganesan
 
SE2_Lec 21_ TDD and Junit
SE2_Lec 21_ TDD and JunitSE2_Lec 21_ TDD and Junit
SE2_Lec 21_ TDD and JunitAmr E. Mohamed
 
Ivv workshop model-based-testing-of-nasa-systems
Ivv workshop model-based-testing-of-nasa-systemsIvv workshop model-based-testing-of-nasa-systems
Ivv workshop model-based-testing-of-nasa-systemsDharmalingam Ganesan
 
SE2_Lec 20_Software Testing
SE2_Lec 20_Software TestingSE2_Lec 20_Software Testing
SE2_Lec 20_Software TestingAmr E. Mohamed
 
White Box Testing
White Box TestingWhite Box Testing
White Box TestingAlisha Roy
 
5 black box and grey box testing
5   black box and grey box testing5   black box and grey box testing
5 black box and grey box testingYisal Khan
 
Model-based Testing of a Software Bus - Applied on Core Flight Executive
Model-based Testing of a Software Bus - Applied on Core Flight ExecutiveModel-based Testing of a Software Bus - Applied on Core Flight Executive
Model-based Testing of a Software Bus - Applied on Core Flight ExecutiveDharmalingam Ganesan
 
Mutation Testing and MuJava
Mutation Testing and MuJavaMutation Testing and MuJava
Mutation Testing and MuJavaKrunal Parmar
 
An Introduction to Unit Testing
An Introduction to Unit TestingAn Introduction to Unit Testing
An Introduction to Unit TestingJoe Tremblay
 
Testing, black ,white and gray box testing
Testing, black ,white and gray box testingTesting, black ,white and gray box testing
Testing, black ,white and gray box testingAamir Shakir
 
Unit 3 Control Flow Testing
Unit 3   Control Flow TestingUnit 3   Control Flow Testing
Unit 3 Control Flow Testingravikhimani
 
Black box testing or behavioral testing
Black box testing or behavioral testingBlack box testing or behavioral testing
Black box testing or behavioral testingSlideshare
 

Mais procurados (20)

Benefit From Unit Testing In The Real World
Benefit From Unit Testing In The Real WorldBenefit From Unit Testing In The Real World
Benefit From Unit Testing In The Real World
 
Control Flow Testing
Control Flow TestingControl Flow Testing
Control Flow Testing
 
Test driven development and unit testing with examples in C++
Test driven development and unit testing with examples in C++Test driven development and unit testing with examples in C++
Test driven development and unit testing with examples in C++
 
Assessing Model-Based Testing: An Empirical Study Conducted in Industry
Assessing Model-Based Testing: An Empirical Study Conducted in IndustryAssessing Model-Based Testing: An Empirical Study Conducted in Industry
Assessing Model-Based Testing: An Empirical Study Conducted in Industry
 
SE2_Lec 21_ TDD and Junit
SE2_Lec 21_ TDD and JunitSE2_Lec 21_ TDD and Junit
SE2_Lec 21_ TDD and Junit
 
Ivv workshop model-based-testing-of-nasa-systems
Ivv workshop model-based-testing-of-nasa-systemsIvv workshop model-based-testing-of-nasa-systems
Ivv workshop model-based-testing-of-nasa-systems
 
SE2_Lec 20_Software Testing
SE2_Lec 20_Software TestingSE2_Lec 20_Software Testing
SE2_Lec 20_Software Testing
 
White Box Testing
White Box TestingWhite Box Testing
White Box Testing
 
5 black box and grey box testing
5   black box and grey box testing5   black box and grey box testing
5 black box and grey box testing
 
Black & White Box testing
Black & White Box testingBlack & White Box testing
Black & White Box testing
 
Model-based Testing of a Software Bus - Applied on Core Flight Executive
Model-based Testing of a Software Bus - Applied on Core Flight ExecutiveModel-based Testing of a Software Bus - Applied on Core Flight Executive
Model-based Testing of a Software Bus - Applied on Core Flight Executive
 
Mutation Testing and MuJava
Mutation Testing and MuJavaMutation Testing and MuJava
Mutation Testing and MuJava
 
An Introduction to Unit Testing
An Introduction to Unit TestingAn Introduction to Unit Testing
An Introduction to Unit Testing
 
10 software testing_technique
10 software testing_technique10 software testing_technique
10 software testing_technique
 
Testing, black ,white and gray box testing
Testing, black ,white and gray box testingTesting, black ,white and gray box testing
Testing, black ,white and gray box testing
 
system verilog
system verilogsystem verilog
system verilog
 
Black box software testing
Black box software testingBlack box software testing
Black box software testing
 
Unit 3 Control Flow Testing
Unit 3   Control Flow TestingUnit 3   Control Flow Testing
Unit 3 Control Flow Testing
 
Unit Testing
Unit TestingUnit Testing
Unit Testing
 
Black box testing or behavioral testing
Black box testing or behavioral testingBlack box testing or behavioral testing
Black box testing or behavioral testing
 

Semelhante a Rv11

Muves3 Elastic Grid Java One2009 Final
Muves3 Elastic Grid Java One2009 FinalMuves3 Elastic Grid Java One2009 Final
Muves3 Elastic Grid Java One2009 FinalElastic Grid, LLC.
 
Continuous Profiling in Production: What, Why and How
Continuous Profiling in Production: What, Why and HowContinuous Profiling in Production: What, Why and How
Continuous Profiling in Production: What, Why and HowSadiq Jaffer
 
Stating the obvious - 121 Test Automation Day, Dublin, 2018
Stating the obvious - 121 Test Automation Day, Dublin, 2018Stating the obvious - 121 Test Automation Day, Dublin, 2018
Stating the obvious - 121 Test Automation Day, Dublin, 2018Giulio Vian
 
Cloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionCloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionJorge Cardoso
 
Performance Analysis of Idle Programs
Performance Analysis of Idle ProgramsPerformance Analysis of Idle Programs
Performance Analysis of Idle Programsgreenwop
 
Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?Adam Goucher
 
Successful Software Projects - What you need to consider
Successful Software Projects - What you need to considerSuccessful Software Projects - What you need to consider
Successful Software Projects - What you need to considerLloydMoore
 
Workshop BI/DWH AGILE TESTING SNS Bank English
Workshop BI/DWH AGILE TESTING SNS Bank EnglishWorkshop BI/DWH AGILE TESTING SNS Bank English
Workshop BI/DWH AGILE TESTING SNS Bank EnglishMarcus Drost
 
Drools & jBPM Info Sheet
Drools & jBPM Info SheetDrools & jBPM Info Sheet
Drools & jBPM Info SheetMark Proctor
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
 
The Magic Of Application Lifecycle Management In Vs Public
The Magic Of Application Lifecycle Management In Vs PublicThe Magic Of Application Lifecycle Management In Vs Public
The Magic Of Application Lifecycle Management In Vs PublicDavid Solivan
 
Stepin evening presented
Stepin evening presentedStepin evening presented
Stepin evening presentedVijayan Reddy
 
Software Testing and Quality Assurance Assignment 2
Software Testing and Quality Assurance Assignment 2Software Testing and Quality Assurance Assignment 2
Software Testing and Quality Assurance Assignment 2Gurpreet singh
 
Unit Testing Essay
Unit Testing EssayUnit Testing Essay
Unit Testing EssayDani Cox
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
#DOAW16 - DevOps@work Roma 2016 - Testing your databases
#DOAW16 - DevOps@work Roma 2016 - Testing your databases#DOAW16 - DevOps@work Roma 2016 - Testing your databases
#DOAW16 - DevOps@work Roma 2016 - Testing your databasesAlessandro Alpi
 
Practical machine learning
Practical machine learningPractical machine learning
Practical machine learningFaizan Javed
 
Software engineering
Software engineeringSoftware engineering
Software engineeringFahe Em
 
Software engineering
Software engineeringSoftware engineering
Software engineeringFahe Em
 

Semelhante a Rv11 (20)

Muves3 Elastic Grid Java One2009 Final
Muves3 Elastic Grid Java One2009 FinalMuves3 Elastic Grid Java One2009 Final
Muves3 Elastic Grid Java One2009 Final
 
Continuous Profiling in Production: What, Why and How
Continuous Profiling in Production: What, Why and HowContinuous Profiling in Production: What, Why and How
Continuous Profiling in Production: What, Why and How
 
Stating the obvious - 121 Test Automation Day, Dublin, 2018
Stating the obvious - 121 Test Automation Day, Dublin, 2018Stating the obvious - 121 Test Automation Day, Dublin, 2018
Stating the obvious - 121 Test Automation Day, Dublin, 2018
 
Cloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionCloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injection
 
Performance Analysis of Idle Programs
Performance Analysis of Idle ProgramsPerformance Analysis of Idle Programs
Performance Analysis of Idle Programs
 
Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?
 
Benchmarking PyCon AU 2011 v0
Benchmarking PyCon AU 2011 v0Benchmarking PyCon AU 2011 v0
Benchmarking PyCon AU 2011 v0
 
Successful Software Projects - What you need to consider
Successful Software Projects - What you need to considerSuccessful Software Projects - What you need to consider
Successful Software Projects - What you need to consider
 
Workshop BI/DWH AGILE TESTING SNS Bank English
Workshop BI/DWH AGILE TESTING SNS Bank EnglishWorkshop BI/DWH AGILE TESTING SNS Bank English
Workshop BI/DWH AGILE TESTING SNS Bank English
 
Drools & jBPM Info Sheet
Drools & jBPM Info SheetDrools & jBPM Info Sheet
Drools & jBPM Info Sheet
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
 
The Magic Of Application Lifecycle Management In Vs Public
The Magic Of Application Lifecycle Management In Vs PublicThe Magic Of Application Lifecycle Management In Vs Public
The Magic Of Application Lifecycle Management In Vs Public
 
Stepin evening presented
Stepin evening presentedStepin evening presented
Stepin evening presented
 
Software Testing and Quality Assurance Assignment 2
Software Testing and Quality Assurance Assignment 2Software Testing and Quality Assurance Assignment 2
Software Testing and Quality Assurance Assignment 2
 
Unit Testing Essay
Unit Testing EssayUnit Testing Essay
Unit Testing Essay
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
#DOAW16 - DevOps@work Roma 2016 - Testing your databases
#DOAW16 - DevOps@work Roma 2016 - Testing your databases#DOAW16 - DevOps@work Roma 2016 - Testing your databases
#DOAW16 - DevOps@work Roma 2016 - Testing your databases
 
Practical machine learning
Practical machine learningPractical machine learning
Practical machine learning
 
Software engineering
Software engineeringSoftware engineering
Software engineering
 
Software engineering
Software engineeringSoftware engineering
Software engineering
 

Último

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 

Último (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

Rv11

  • 1. Runtime Analysis in the Cloud:Challenges and Opportunities Wolfgang Grieskamp Software Engineer, Google Corp
  • 2. About Me < 2000: Formal Methods Research Europe (TU Berlin) 2000-2006: Microsoft Research: Languages and Model-Based Testing Tools 2007-2011: Microsoft Windows Interoperability Program: protocol testing and tools Since 4/2011: Google: Google+ platform and tools DISCLAIMER: This talk does not necessarily represent Google’s opinion or direction.
  • 3. Content of this talk General blah blah about cloud computing Monitoring the Cloud Testing the Cloud Using the Cloud for Development A formal method guy’s dream… Conclusion
  • 5. What is Cloud Computing? From Wikipedia, the free encyclopedia Cloud computing is the delivery of computing as a service rather than a product, whereby shared resources, software and information are provided to computers and other devices as a utility (like the electricity grid) over a network (typically the Internet).
  • 6. What is Cloud Computing? From Wikipedia
  • 7.
  • 8. It is device independent
  • 9. It has a unified access control model
  • 10. You pay for only what you use
  • 11. It can scale-up or scale-down on demand
  • 12.
  • 14. IAAS (Infrastructure As A Service) Basic building blocks Storage Networking Computation Homogenous, easy migration (based on VMs) Distributed Data Centers over Geographical Zones Players: Amazon, GoGrid, Rackspace, Microsoft, Google Estimated revenue 2010: $1b (Source: Economist)
  • 15. Platform As A Service (PAAS) Basic Building Blocks Operating System Frameworks and Development Tools Deployment and Monitoring Tools Players: Microsoft, Google, IBM, SAP, … Estimated Revenue 2010: $311m (Source: Economist)
  • 16. Software As A Service (SAAS) Device and location independent applications, typically running in a browser (Email, Social, Retail, Enterprise Apps, etc.) Many different players Estimated revenue 2010: $11.7b (Source: Economist)
  • 18. Monitoring vs RV vs Testing What’s the difference? A (strict) take: Monitoring collects and presents information for human analysis Runtime verification collects and transforms information for automated analysis which ultimately leads to a verdict Testing does the above things in an isolated, staged or mocked, environment. In particular, stimuli from the environment are simulated.  In practice, boundaries are not so clear. For this talk RV = Monitoring (adapting to Google conventions)
  • 19. Anatomy of a Data Center Data Center A Data Center B …… Controller Server Controller Server Server Server Server … Storage Storage Storage Storage Storage Note: abstracted and simplified
  • 20. Anatomy of a Server Data Center A Data Center B …… Controller Server (VM) Controller Server Controller Job Job Job Server Server Server Server … Monitor Monitor Monitor Storage Storage Storage Storage Logs Alert Note: abstracted and simplified
  • 21. Anatomy of a Service Data Center A Data Center B …… Controller Service (across Servers) Job Job Server Controller Job Job Job Server Server Server Server … Job Storage Storage Storage Storage Storage Storage Note: abstracted and simplified
  • 22. Monitoring Types @ Google Black Box Monitoring White Box Monitoring Log Analysis
  • 23. Black Box Monitoring Job Monitor Frequently send requests and analyze the response Possible because server jobs are ‘stateless’ and always input enabled If failure rate over a certain time interval exceeds a given ratio, raise an alert and page an engineer Engineers aim for minimizing paging and avoiding false positives
  • 24. Black Box Monitoring: How its done @ Google There are rule based languages for defining request/responses. Each rule: Synthesizes an HTTP request Analyzes the response using a regular expression Specifies frequency and allowed failure ratio Rules are like tests: a simple trigger and a simple response analysis Monitors can be also custom code Job Monitor
  • 25. Black Box Monitoring: Issues? Is the ‘stateless’ hypothesis feasible? Nothing is really stateless -- state is passed as parameters in cookies, continuation tokens, etc. However, as these are health tests, state can be ignored What is the relation to testing? In theory very similar, only that the environment is not mocked. In practice uses quite different frameworks/languages What about service/system level monitoring? Its only about one job. Doesn’t give failure root cause (it only measures a symptom) Job Monitor
  • 26. ChallengeIntegrate Black-Box Monitoring and Testing Job Monitor Black-box monitoring can be seen as particular way of executing tests end-to-end on the live product such that the impact on performance can be neglected. Frameworks which integrate design and execution of monitoring rules and test cases are promising Mainly an engineering challenge
  • 27. Job Job ChallengeSystem/Service Level Black-Box Monitoring Monitor Monitor Not commonly done Main purpose would be failure cause analysis and failure prevention Simple local monitoring already discovers failures Is there a strong point of doing it at runtime (vs. log analysis)? Only if real-time prevention and potentially repair is important Monitor
  • 28. ChallengeProtocolcontract verification Job Monitor At Google, all communication between jobs happens via a single homogeneous RPC mechanism based on a message format definition language (called protocol buffers) Also all (= terra bytes) of data is stored in formats specified by protocol buffers One could formulate data invariant and protocol sequencing contracts over protocol buffers and enforce them at runtime
  • 29. White-Box Monitoring Server exports collection of probe points (variables) Memory, # RPCs, # Failures, etc. Monitor collects time series of those values and computes functions over them Dashboards prepare information graphically Mostly used for diagnosis by humans Job Monitor
  • 30. White-Box Monitoring:How its done @ Google Job Monitor Declarative language for time series computations Collects samples from the server by memory scraping Merging of similar data from multiple servers running the same job Rich support for diagram rendering in the browser
  • 31. White-Box Monitoring:Issues? Design for monitorability/testability? Its already ubiquitous throughout, since software engineers are themselves on-call… Distributed collection/network load? Not really an issue because it’s sample based Relation to testing? Same as with black-box – should be a common framework. Automatic root cause analysis and self-repair? Current systems mostly build for human analysis and repair. Self-repair would be a big thing. Job Monitor
  • 32. ChallengeSelf-repair Job Monitor Cloud system are homogenous and operate with redundancy Many VMs with exactly the same properties Self-repair could identify and ‘drain’ faulty parts of the system, apply fallbacks, rollback software updates, etc. One major cause of cloud failures are outages Another major cause are software updates A semi-automated approach suggesting actions to a human would be already very useful Ever got paged at 2am in the morning?
  • 33. ChallengeHybrid White-Box Monitoring / RV Foundations Job Monitor The data collected in white-box monitoring represents continuous and often stochastic functions over time. The triggers for discrete actions (like alerts) are thresholds over integrated values of those functions. Sounds like hybrid systems/automaton. Has anybody looked at it like this in RV community?
  • 34. Log Analysis Job Logs Collect data from each server’s run containing information like operation flow, exceptions, etc. Store data over a window of time (say for last 24h) Access data from various sources programmatically to analyze issues (post-mortem, performance, etc.) Allows for correlation of system/service wide information
  • 35. Log Analysis:How its done @ Google Job Logs Very fined grained logging on job side; huge amounts of data collected Logs are stored in Bigtable (Google’s large-scale storage solution) Logs are analyzed using parallel (cloud) computing, e.g. with Sawzall, a declarative language based on map-reduce Logs are most often used for failure cause analysis/debugging
  • 36. Log Analysis:Issues? Job Logs Amount of data and accessibility? Not really an issue because of highly performing distributed file systems Format of the data? Logs are structured data (at Google, protocol buffers) Encryption? A big issue: if it can’t be decrypted, not much may be diagnosable. If it can be decrypted, the access to this now decrypted data needs to be restricted.
  • 37. ChallengePrivacy and Encryption Job Monitor Data logged (or otherwise analyzed) during monitoring may contain encrypted proprietary information A problem may not be diagnosable without decryption Decrypted clear text data (in particular if logged) needs to be highly protected Automatic obfuscation and/or anonymization would be highly desirable A protocol may need to be designed for this in the first place
  • 40. ChallengeIntegration Testing Job Job Job Storage Two or more components are plugged together with a partially mocked environment These tests are usually very ‘flaky’ (unreliable) because: Difficulty to construct mocked component’s precise behavior (its more than a simple mock in a unit test) Difficulty to synthesize mocked component’s initial state (it may have a complex state) Potential solution: model-based testing
  • 41. Model-Based Testing in a Nutshell Requirements Feedback Author Model Feedback Feedback Generate Test Suite Issue Expected Outputs (Test Oracle) Inputs (Test Sequences) Verdict Control Observe Feedback Implementation
  • 42. Technical Document Testing Program of Windows: A Success Story for MBT 222 protocols/technical documents tested 22,847 pages studied and converted into requirements 36,875 testable requirements identified and converted into test assertions 69% tested using MBT 31% tested using traditional test automation 66,962 person days (250+ years) Hyderabad: 250 test engineers Beijing: 100 test engineers 38
  • 43.
  • 44. Vendor 2 modeled 85% of all test suites, performing relatively much better than Vendor 139 Grieskamp et. al: Model-based quality assurance of protocol documentation: tools and methodology. Softw. Test., Verif. Reliab. 21(1): 55-71 (2011)
  • 45. Exploiting the Cloud for Development
  • 46.
  • 47. Consequences for Testing, Program Analysis, etc. Need to rethink the base assumptions of some of the existing approaches for testing and program analysis for massive coarse-grained parallelism: Early divide-and-conquer ideal, e.g. initial random seed than run-till-end and collect and compare Try different heuristics on the same problem; see which one wins Techniques like SMT and concolic execution can largely benefit from this
  • 48. A formal method guy’s dream…
  • 49. Many Languages, one Subject… Service These languages are all about the same semantic thing! They use the same data structures (messages), protocols, architecture, etc.
  • 50. Unify some of those Languages! Service
  • 51. SSL – Service Specification Language
  • 52. Conclusions The cloud brings nothing really new, but it changes priorities. Some of the problems traditionally researched in RV and testing are non-brainers (at least at Google). Others do wait for a solution. Have ideas? Apply for a Google Research Award (Google it). Deadline is February 1, 2012.

Notas do Editor

  1. Formal method guyMSR: AsmL, Spec#, Spec ExplorerWindows: Improving Microsoft Documentation (Open Documents) as resulting regulatory scrutiny of Microsoft. Using Model-Based Testing on a large scale in a project w/ over 300 person years of test effortSince April: Engineer @ Google. Currently working on G+ platform and tools. Just graduated from what they call a ‘Noogler’ (new Googler) ramping up on Google technology. To be clear: What I’m saying here is my personal viewpoint not necessarily inline with Google’s official opionion or direction. My intend is to not show you some research results or technologies as they come out of Google. Rather I’m trying to extract some challenges and opportunities for runtime verifcation research from my personal viewpoint on how things work @ Google. Other people @Google or in the industry might have totally different viewpoints.Also note that some of the technology used in the Google stack is confidential, others well-known or even open source. When it comes to the confidential parts, I will need to abstract and extract the underlying abstract concepts. So don’t take me literally when talking about the Google stack.
  2. Will generally talk about what cloud computing is, how the market looks like. There is a lot of confusion about cloud computing, and I’ll try to set some scope. This maybe cold coffee for many but perhaps not for all.The largest part of this talk, will talk about monitoring (or RV or RA or however you want to call it) the cloud. This will present how a data center works, and how Google uses monitoring techniques to run it. It will pinpoint where there IMO no issues as the existing technologies work surprisingly well. It will pinpoint where I see gaps and challenges for research.I will shortly talk about testing the cloud. The biggest challenge here is integration testing, for which I suggest MBT. In this context I will also shortly talk about my experience with MBT when working for MS.I will dig into how the cloud can be actually exploited for software development. There are exciting opportunities here which wait to be applied.I will insert a small plug about a personal vision or dream around languages and specificationConclusion
  3. Computation as a serviceIts like a utilityBased on shared resources
  4. The picture you find on Wikipedia. It gives an idea of the components:Applications/SoftwarePlatformInfrastructure (Hardware)We dig into some of this in detail later on
  5. Some properties of cloud computing as a user experiences it.
  6. Let me talk a little bit about the cloud stack from the Market perspective. Most information I present here is extracted from an excellent article in the Economist (including this nice picture)
  7. The three segments of the cloud stackSAAS‘PARSE’Eye-ASS
  8. [Pronouncedeye-ass]Provides the hardware layers (data centers)For Google, one large one is for example in The Dalles Oregon. Size of two football fields, cooling towers four stories high, etc.A very important property is homogeneity of the hardware, often also achieved by using VMs. That makes it possible to migrate jobs in and between DCsThere are a number of big players. The actual size of the DCs (in number of machines) is highly confidential for each player. Guess why? …. A big issuein IAAS is the allocation problem, i.e. how many machines are required to provide certain services. The various players have there own ‘black magic’ to compute this.The revenue numbers are taking from the Economist article. The actual revenue is not disclosed directly by the players, so this is an estimate of the author of this article, and I have no idea about its accuracy… (In particular, this is not a number from Google!) The number looks actual relative small. This maybe related to that IAAS is not usually sold ‘as is’ but the actual end products – PAAS or SAAS – is sold.
  9. Pronounced parseThis is basically the operating system plus frameworks and development tools. Not too many players in this space. For Google, it’s the app engine framework, which allows you to create and place applications in the Google cloud. For Microsoft, its Azure and Visual Studio.Estimated revenue of the whole market: again relative small. For companies like MS, platform is more a strategic investment which pays off indirectly.
  10. Pronounced SARSNow this is how a user actually experiences the cloudThere are many players hereThe business is the largest
  11. After this introduction, let us get a bit more technical and talk about monitoring of the cloud
  12. The notions Monitoring, RV, RA, Testing, are often used to name similar or related things. What are the differences?Here is an attempt to make a definition.However, as in practice the boundaries are not so clear, and in particular at Google, monitoring is often used where other people would say RV, I will identify RV and monitoring.Testing is still a subject by itself though there is a lot of overlap.
  13. Lets take a closer look how a DC actually works.If some sends a request on a domain like google.com, then the first thing is that a regional DNS will resolve this to particular DC closest to the locationIt reaches there some controller which will forward the request in kind of a hierarchy until a particular server (VM) is reached which handles the requestNote that high performing NFS is very important in a DC, so machines share a common storageMachines may be further organized in racks which may have certain replicated resources for shared storage
  14. A server (or VM) usually runs a number of jobs (processes). Certain jobs which highly interact with each other may be arranged to run on the same serverIf it comes to monitoring, usually each (or a number) of jobs has associated a dedicated process which monitors the health of this job,The monitors collect data and can send alerts to user instances in the systen, alert managers.
  15. A service (in contrast to a server) is about actually serving the initial user requestIt usually splits the task in sub-requests which are served by other jobs, often called backends. This is the major source of complexity in managing this kind of software.For certain activities, there maybe hundreds of jobs involved to get one request finally served
  16. Now lets take a closer look at how monitoring works @Google. Black box: checks the health and basic functionalityWhite box: provides access to the internal state of a job collecting time series of dataLog analysis: processes after the fact logged data
  17. Lets look at what are the issues with BB monitoring (if any). Where does it work and where not?
  18. This is one problem. The worlds of test and of BB monitoring are largely disjoint.This maybe partly because of the original different engineering disciplines of operating engineers (called SREs) and software engineersIt would be nice to run some of the monitoring rules already at test time. And it would be nice to run of the test cases at monitoring time.It’s a matter of setting up a framework like Junit to decorate test cases for monitoring and provide other required meta dataNot really rocket science, though
  19. Monitor what actually goes wrong when a request is serviced, following all its way to the topologyActually not important to catch failures; this works alreadyMaybe more important to analysis causes of failures and potentially prevent them before things crash