Site Reliability Engineering - Discover the new era for (Infrastructure|Opera... - Pery Lemke
SRE (Site Reliability Engineer) is a mindset older than DevOps that is only now becoming popular, but do you know what it is?
Originally created at Google and applied at many companies around the world, SRE is a role and a culture that is making IT (Infrastructure|Operations) even more dynamic.
This talk aims to show the impact of SRE on value creation at the companies where it is being applied.
Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to solving IT operations problems. In other words, these are software engineering professionals who take responsibility, in a multidisciplinary way, for managing and automating the technology environment. The main goals are to create ultra-scalable and highly reliable software systems. According to Ben Treynor, founder of Google's Site Reliability Team, SRE is "what happens when a software engineer is tasked with what used to be called operations".
SRE (service reliability engineer) on big DevOps platform running on the clou... - DevClub_lv
SRE (service reliability engineer). The talk is to explain the SRE philosophy and the principles of production engineering and operations in clouds.
(Language – English)
Pavlo is the ADOP (Accenture DevOps Platform) Service Reliability Team Lead and an SRE practitioner. He has more than 18 years of IT experience in Ops and Dev.
This talk explains a proven approach to assessing SRE practices for an organization. The approach uses a 9-pillar model and a 7-step transformation blueprint to determine the current state of SRE practices and to set a roadmap for improving them towards industry best practices.
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa - Keet Sugathadasa
When it comes to Site Reliability Engineering (SRE), the resources available online are largely limited to the books published by Google itself. They share useful case studies that help explain what SRE is and how to interpret its concepts, but they do not clearly explain how to build an SRE team for your own organization. The concept of SRE was developed inside Google and later released to the general public as a practice for anyone to follow.
In this presentation I would like to give a brief introduction to SRE and why it is important to any Software Engineering organization. This is based on my experiences and learnings from leading a Site Reliability Engineering team for leading organizations in the US and Norway.
I gave this presentation as a Tech Talk while working as an Associate Technical Lead at Creative Software Sri Lanka.
<p>From <a href="https://en.wikipedia.org/wiki/Site_reliability_engineering" target="_blank">Wikipedia</a>: Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.</p>
<p>Over the past year Acquia has built their own SRE team to help their products and services scale with the demand of our growing number of customers. We wish to share our experience so that others are enabled to do the same and reap the rewards.</p>
<p>This presentation will discuss how the SRE team came about at Acquia, what achievements we have made so far, and the lessons we have learned along the way. We will then show the steps on how to introduce SRE to your workplace so you can deliver more reliable and scalable services to your customers! We will specifically cover:</p>
<ul>
<li>SRE's basic concepts and history from Google</li>
<li>The management support you will need to get started</li>
<li>Introducing the idea of service level objectives and error budgets</li>
<li>Operational Responsibility Assessments as a tool to measure risk</li>
<li>Creating a Launch Readiness Checklist to standardize and improve product launches</li>
<li>Finding ideal candidates for your SRE team</li></ul>
<p>The intended audience is software engineers, system administrators, and managers who want to improve how they do their work and how their products/services perform.</p>
Independently of the DevOps movement but starting from the same problems, Google developed its own strategy, defining a new, specific role called SRE (Site Reliability Engineer). This introduction explains the history and concepts of the methodology and compares it with the DevOps manifesto, to clarify what it means to adopt DevOps, what it means to be an SRE, what the two share, and where they diverge.
According to Google, SRE is what you get when you treat operations as if it’s a software problem. In this video, I briefly explain the term SRE (Site Reliability Engineering) and introduce the key metrics for an SRE team: SLI, SLO, and SLA.
Youtube Channel here: https://www.youtube.com/playlist?list=PLm_COkBtXzFq5uxmamT0tqXo-aKftLC1U
Site Reliability Engineer (SRE), We Keep The Lights On 24/7 - NUS-ISS
There are many phases in the software development cycle, from requirements to development and testing, but at the tail of the process is an often overlooked aspect: deployment and delivery. With the paradigm shift from delivering on-site software to offering software-as-a-service, Site Reliability Engineering is beginning to take a greater role in product delivery.
This session aims to give a glimpse of the work that goes into site reliability engineering (SRE) and the effort that goes into keeping a service running 24/7.
An overview of Google's Site Reliability Engineering with a view toward possible incorporation in the IEEE P2675 DevOps security standard. (Creative Commons with credit.)
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W... - ITSM Academy, Inc.
Presenter: Perry Statham
SRE Squad Leader with IBM Cloud DevOps Services
In this presentation, the IBM DevOps Services SRE team will give a brief introduction to Site Reliability Engineering, then show how they adopted its principles in their existing enterprise organization.
Managing a team and managing a project are closely related. In particular, teams require an effective distribution of responsibilities and roles. Once that is set up, a proper process guides people to make progress. All this fits into a product lifecycle, which is essential to develop the right product, in the right way, and deliver it at the right time.
Service Level Terminology: SLA, SLO & SLI - Knoldus Inc.
Measuring outcomes is always at the top of our mind when approaching goals. While we do have specific targets we may be aiming for, circling back to confirm that the resulting outcome is in fact what you were after is extremely important. Small course corrections are required. Outcomes may be more general but often attract the attention and support of decision-makers earlier.
Key measurements and thresholds to hold us accountable for our efforts as well as communicate expectations across the entire organization needed to be established. Nearly every resource you find regarding site reliability engineering will talk about key metrics used to establish high-level objectives, indicators of the movement toward or away from those objectives, and ultimately what agreements are in place should objectives be unfulfilled.
SLIs will help us know how we are performing against our SLOs and our SLA will outline the consequences (good or bad) of meeting those objectives. Once we have data to observe, we will begin orienting ourselves to it and establish what we believe our SLIs and SLOs to be.
Here’s an outline of the webinar -
~ Learn what an SRE is and isn't.
~ Understand the difference between service-level indicators (SLI), service-level objectives (SLO), and service-level agreements (SLA).
~ Gain an understanding of error budgets and how to calculate reliability cost.
~ Learn how SREs can embed themselves within development teams to increase operational stability
Getting started with Site Reliability Engineering (SRE) - Abeer R
"Getting started with Site Reliability Engineering (SRE): A guide to improving systems reliability at production"
This is an intro guide that shares some common SRE concepts with a non-technical audience. We will look at both the technical and organizational changes that should be adopted to increase operational efficiency, ultimately benefiting global optimizations such as minimizing downtime and improving systems architecture and infrastructure:
- Improving incident response
- Defining error budgets
- Better monitoring of systems
- Getting the best out of systems alerting
- Eliminating manual, repetitive work (toil) through automation
- Designing better on-call shifts/rotations
How to design the role of the Site Reliability Engineer (who effectively works between application development teams and operations support teams)
How Small Team Get Ready for SRE (public version) - Setyo Legowo
How Urbanindo's small engineering team implements Site Reliability Engineering (SRE) in its daily work and why we chose SRE instead of ordinary DevOps.
Shift left - find defects earlier through automated test and deployment - Claudia Ring
Do you know how much time it takes or how that translates into dollars lost every time you fix a defect in development, QA, or Production? The cost of application failures or errors increases exponentially the further into the delivery pipeline they are when found. If application defects are discovered by end users in Production, or errors cause a Production outage, the cost can be thousands per second, in addition to the intangible loss of reputation.
So how do you begin to identify defects earlier in software development and prevent them from becoming major, costly errors later on? Join Al Wagner, IBM Technical Evangelist, as he discusses how to "shift left" and:
Incorporate service virtualization and automated testing into development for a more thorough and accurate representation of application quality
Integrate deployment automation with continuous testing to remove wait times on application promotion
Adopt best practices that have proven successful for IBM customers who are currently shifting left
Driving on from Agile, organisations are looking to dramatically increase the rate at which they deliver new software updates to their customers / business users by embracing DevOps. This presentation will explain the Micro Focus approach to DevOps and how we can help organisations like yours as they move to Continuous Delivery.
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit... - Tori Wieldt
How do you make DevOps magic when you aren’t Google? This talk will help whether you’re still figuring out how to create a site reliability practice at your company or you’re trying to improve the processes and habits of an existing SRE team.
A talk that I gave at the Pricefy HQ as an introduction to Site Reliability Engineering and how it relates to DevOps. As the traditional SRE saying goes, "hope is not a strategy". So you had better be prepared when reality knocks at the door. Enjoy!
GCS (Software Configuration Management) - Lesson 09 - Agile GCS
Aspects of the Agile GCS concept, agile practices related to GCS, and Software Configuration Management patterns
Software Configuration Management course of the Specialization Program in Software Engineering.
GitLab is a Git-based repository manager. Its tools include a wiki, an issue tracker, and a CI/CD pipeline, among others. GitLab is similar to GitHub but, being open source, it can be hosted on your own infrastructure in addition to its cloud version, with both public and private repositories.
GitLab Runner: GitLab lets you use the Runner, an open source project used to run jobs and send the results back to GitLab. In other words, it lets us build without any external installation.
We will see an introduction to how all of this works.
Visual Studio Summit 2015 brought together software developers from all over Brazil, and MVP Ramon Durães opened the event with the talk "Impacto do DevOps nos negócios" (The impact of DevOps on business), discussing the importance of agility, quality, and security in software development to serve consumer 5.0
Presentation that took place during "IQPC SOA Event", September 2008, São Paulo, Brazil.
A case study of a Brazilian telco that implemented SOA using an interesting approach.
Authors: Davi Carvalho (CIO) and Denis Bertoluci (Software Architecture Manager)
Using agile methodologies with VSTS: Scrum and XP, YES WE CAN! (ALM204) - André Dias
A brief introduction to Scrum, its management practices, and the ideas that make it so "controversial", followed by software engineering practices that complement Scrum, using Visual Studio Team System to manage Story Cards, Tasks, Kanban, and Burndown tracking, as well as Extreme Programming practices such as TDD, Refactoring, and Continuous Integration.
Right-sized software: demystifying function points - presented at the III Simpósio de Gestão Pública e TI do Governo de Pernambuco, on December 1, 2010
Similar to An introduction to SRE - Site reliability engineering (20)
2. Glossary
- Development: any activity related to building new features, fixing bugs, and reducing technical debt
- Operations: activities focused on maintaining and configuring servers and infrastructure
- Deploy: the act of releasing a new version of a given piece of software
3. First of all, how did things work before?
Devs:
Goals: develop and release software, with changes and new features
Sysadmins:
Goals: keep systems stable and functional
4. Conflicts of interest arise
- Every new release can potentially break the systems that are running
- Delays and restrictions on releasing new features and bug fixes create opportunity cost
Together, these lead to dysfunctional teams and to direct and indirect costs
6. DevOps culture
A combination of practices uniting development and operations, aiming to shorten the development cycle and promote continuous delivery
9. Managing risk and improving stability
Unstable systems erode user trust and cause a range of losses.
10. Costs and risks of high availability
An extremely stable system is not always the best solution
11. Cost of redundant resources
To achieve high availability, one of the most common strategies is resource redundancy, where we deploy the same application on several different servers
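The trade-off on this slide can be sketched numerically. The snippet below is only an illustration under a strong assumption that is not in the original deck: replicas fail independently of each other, which rarely holds exactly in practice. Under that assumption, n replicas that are each available a fraction `a` of the time are jointly available `1 - (1 - a)^n` of the time, while cost grows linearly with n. The function names and the unit cost are hypothetical.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` copies, ASSUMING independent failures.

    The service is down only when every replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def redundancy_table(single: float, unit_cost: float, max_replicas: int):
    """Return (replicas, availability, total_cost) rows for 1..max_replicas."""
    return [
        (n, combined_availability(single, n), n * unit_cost)
        for n in range(1, max_replicas + 1)
    ]


if __name__ == "__main__":
    # Hypothetical numbers: each server is 99% available and costs 100 per month.
    for n, availability, cost in redundancy_table(0.99, 100.0, 4):
        print(f"{n} replica(s): availability={availability:.6f}, cost={cost:.0f}")
```

Each extra replica adds the same cost but a smaller availability gain than the previous one, which is one way to read the point that extreme stability is not always the best solution.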
12. Opportunity cost
By choosing to increase stability, we give up developing new features and products
13. SLIs, SLOs and SLAs
SLI (Service level indicator): any kind of metric related to availability, such as latency, throughput, and error count.
SLO (Service level objective): the desired target for the defined SLIs, generally used internally.
SLA (Service level agreement): an agreement, generally formalized in a contract and carrying legal obligations.
Example:
SLI: Request latency
SLO: Must be below 300 milliseconds, for the team's internal use
SLA: Must be below 500 milliseconds, with consequences attached (fines or other legal implications)
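The example above can be sketched as a small check that compares a measured SLI against the two thresholds. This is a minimal illustration, not part of the original deck: the class name, function name, and status labels are hypothetical, and a single latency number stands in for whatever aggregate (for instance a percentile) a team would actually measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyTargets:
    """Thresholds taken from the slide's example, in milliseconds."""
    slo_ms: float = 300.0  # internal objective
    sla_ms: float = 500.0  # contractual limit


def evaluate_latency(sli_ms: float, targets: LatencyTargets = LatencyTargets()) -> str:
    """Classify a measured request latency (the SLI) against the SLO and SLA."""
    if sli_ms < targets.slo_ms:
        return "ok"            # within the internal objective
    if sli_ms < targets.sla_ms:
        return "slo_breached"  # internal warning, contract still met
    return "sla_breached"      # contractual consequences may apply


if __name__ == "__main__":
    for measured in (250.0, 420.0, 650.0):
        print(measured, evaluate_latency(measured))
```

Keeping the SLO stricter than the SLA, as in the slide's example, gives the team a warning zone before any contractual consequence applies.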
14. What is a desirable availability?
There are several factors to consider, such as:
- How critical the service is
- The risks involved in system failures
- Is the service directly tied to the company's revenue?
- Are there competitors in the market? What availability do they offer?
18. Error budgets
Once the desired availability is defined, we can set our error budgets (roughly, a budget for failures) and use them to make better-informed decisions. E.g.:
- With a 99.9% SLA, we can afford about 8.76h of downtime per year, or about 2.2h per quarter.
- If by mid-year we have already had 7h of downtime, we are close to exceeding our objectives, so we should put more work into stability
[Figure: balance between new features and stability]
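The arithmetic behind an error budget can be sketched in a few lines. This is a minimal sketch assuming a 365-day year and counting only full downtime; the function names and the "burning faster than time elapses" rule are hypothetical illustrations, not something the deck prescribes. For a 99.9% target the yearly budget works out to about 8.76 hours.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def error_budget_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Allowed downtime, in hours, for an availability target such as 0.999."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_hours


def budget_remaining(target: float, downtime_hours: float,
                     period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of budget left after `downtime_hours` of downtime (negative if overspent)."""
    return error_budget_hours(target, period_hours) - downtime_hours


def should_prioritize_stability(target: float, downtime_hours: float,
                                elapsed_fraction: float) -> bool:
    """True when the budget is being consumed faster than the period is elapsing."""
    consumed_fraction = downtime_hours / error_budget_hours(target)
    return consumed_fraction > elapsed_fraction


if __name__ == "__main__":
    print(f"Yearly budget at 99.9%: {error_budget_hours(0.999):.2f} h")
    print(f"Remaining after 7 h:    {budget_remaining(0.999, 7.0):.2f} h")
    print("Prioritize stability at mid-year?",
          should_prioritize_stability(0.999, 7.0, elapsed_fraction=0.5))
```

With 7 hours already spent by mid-year, roughly 80% of the budget is gone with half the period remaining, so the check favors stability work over new features.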
21. But what about the SRE/DevOps role?
- People in this role generally look after the infrastructure and, thanks to their more specialized background, help multiple teams with automation, observability tooling, and continuous delivery
- It is important to remember that managing system reliability is everyone's duty. This avoids the separation between development and operations mentioned earlier.
22. “Hope is not a strategy.”
- Traditional SRE saying