ING is a global financial services provider serving over 35 million customers across more than 40 countries. The organization aims to improve the reliability of its services through a Site Reliability Engineering (SRE) approach. ING's SRE team enables engineering teams by delivering tooling, facilitating processes such as post-mortem reviews, and providing consulting and education. The SRE team focuses on reducing the time to repair through engineering improvements such as adopting Prometheus for monitoring and Mattermost for ChatOps. After two years, the lessons learned include: never compromise on mindset when hiring SREs, assign a product owner, test the SRE approach in a pilot phase, and make tooling easy for others to use and adopt.
3. ING is a global financial services provider serving more than 35 million customers. In the
Netherlands we are the banking sector market leader with over 8 million retail customers
Customers: 35 million private, corporate and institutional customers
Countries: more than 40, in Europe, Asia, Australia, North and South America
Employees: 52,000 worldwide, 12,416 in NL
Market leaders Benelux, Growth markets, Commercial Banking, Challengers
4. Mobile Banking is used by 3.5 million customers, who generate 4.4 million logins per day.
Internet Banking is used by 6.1 million customers, who jointly log in 1.4 million times a day.
17,400 machines are spread over 2 data centers and use 14 PB of storage.
6. Site Reliability Engineering, as pioneered by Google, is
work historically done by operations teams, carried out by
engineers whose aim is to automate the toil within their
organization.
By design, it is crucial that SRE teams are focused on
engineering. There is a 50% cap on operational work (tickets,
on-call, manual tasks) and at least 50% of SRE time should
be spent on engineering.
Site Reliability Engineering (SRE) is what happens when you ask a software
engineer to design an operations team
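The 50% cap on operational work described above can be checked with simple arithmetic over logged hours. The following is a minimal sketch, not ING's tooling: the `TimeEntry` type, the category labels and the sample week are all hypothetical, and the only assumption taken from the slide is that tickets, on-call and manual tasks count as operational work.

```python
from dataclasses import dataclass

# Categories the slide lists as operational work.
OPERATIONAL = {"tickets", "on-call", "manual tasks"}


@dataclass
class TimeEntry:
    category: str  # hypothetical label, e.g. "tickets" or "engineering"
    hours: float


def operational_share(entries):
    """Return the fraction of logged hours spent on operational work."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    ops = sum(e.hours for e in entries if e.category in OPERATIONAL)
    return ops / total


def exceeds_cap(entries, cap=0.5):
    """True when operational work takes more than the cap (50% by default)."""
    return operational_share(entries) > cap


# Hypothetical week for one engineer: 18 of 40 hours are operational.
week = [
    TimeEntry("tickets", 8),
    TimeEntry("on-call", 6),
    TimeEntry("manual tasks", 4),
    TimeEntry("engineering", 22),
]
print(f"operational share: {operational_share(week):.0%}")
print("over cap" if exceeds_cap(week) else "within cap")
```

For this sample week the operational share is 18 / 40 = 45%, which stays within the cap.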
7. Within ING we have a number of challenges related to our reliability that we
want to solve through SRE
Teams are not in control of monitoring solutions and
cannot fix them when they break.
It takes too long for an alert to reach the right team: on
average, 69 minutes pass before an engineer starts
working on an incident resolution.
We do not learn enough from mistakes made – we have
yet to become a learning organization.
We prove we are in control with documents, not by
checking the actual state of our code in production.
Teams are not always aware of their services’
performance and cannot take full responsibility for run.
Our centralized monitoring solutions sometimes
encounter scalability and availability issues.
Our centralized alerting solution is unreliable and
does not send alerts directly to BizDevOps teams.
The same incidents occur multiple times and
we do not follow up on incidents enough.
Our engineers spend more time on completing
documents than coding.
Teams do not always measure availability
from a white box monitoring perspective.
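To illustrate what measuring availability from a white box perspective can look like, here is a minimal sketch that derives availability from request counters exposed by the service itself. The counter names (`total`, `errors`) and the sample values are hypothetical; the one assumption is that the counters are cumulative, so a window is measured as the difference between two snapshots.

```python
def availability(start, end):
    """Availability over a window, from two snapshots of cumulative counters.

    Each snapshot is a dict with hypothetical keys:
      "total"  - requests handled since process start
      "errors" - requests that failed since process start
    Returns None when no requests were served in the window.
    """
    total = end["total"] - start["total"]
    errors = end["errors"] - start["errors"]
    if total <= 0:
        return None
    return 1 - errors / total


# Hypothetical snapshots taken at the start and end of a window.
before = {"total": 1_000_000, "errors": 500}
after = {"total": 1_250_000, "errors": 750}

result = availability(before, after)
print(f"availability: {result:.3%}")
```

In this example 250 of 250,000 requests failed during the window, giving an availability of 99.9%.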
8. We have adopted the Spotify model and work in Tribes composed of
BizDevOps squads: our SRE team is positioned centrally within NL as a silo
[Figure: organization diagram; labels: SRE (enable & support), CL, PO]
9. Our SRE team enables engineering teams through delivery of tooling,
facilitation, consulting and education
[Figure: hierarchy with the levels Product, Development, Capacity Planning, Testing + Release Procedures, Postmortem/Root Cause Analysis, Incident Response, Monitoring]
We facilitate BizDevOps squads during post mortems
and consult whenever our help is needed in fixing or
identifying reliability issues.
We build tooling to enable BizDevOps squads. At the moment
we focus on Prometheus (alerting, white box monitoring and
traffic modeling) and Mattermost (ChatOps).
We educate others about SRE during demos and
we develop training materials.
We facilitate the creation of more SRE teams and
ask them to join our SRE community meetings
with the other NL-based SRE teams.
We are not on call: BizDevOps teams are
responsible for their own build and run.
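As an illustration of the white box monitoring mentioned above, the sketch below shows how a service might count events and render them in the Prometheus text exposition format, which Prometheus reads when it scrapes a target. This is a standard-library-only approximation for illustration: real services would normally use an official Prometheus client library, the `Counter` class here is a simplified stand-in for it, and the metric name `logins_total` is hypothetical.

```python
from collections import defaultdict


class Counter:
    """Simplified stand-in for a client-library counter (illustration only)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)  # one value per label combination

    def inc(self, amount=1, **labels):
        """Increase the counter for the given label combination."""
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self):
        """Render the counter in the Prometheus text exposition format.

        Label values are not escaped here; a real client library handles that.
        """
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {value:g}")
        return "\n".join(lines) + "\n"


# Hypothetical metric: login attempts by outcome.
logins = Counter("logins_total", "Login attempts by outcome.")
logins.inc(outcome="success")
logins.inc(outcome="success")
logins.inc(outcome="failure")
print(logins.render(), end="")
```

Serving this text over HTTP is what lets Prometheus pull the metrics, so the team owning the service also owns what is measured.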
10. We aim to reduce our time to repair through engineering by improving our
monitoring with Prometheus and introducing ChatOps with Mattermost
[Figure: Prometheus & Alertmanager; arrows labeled: pull metrics, queries, push alerts]
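One way the "push alerts" step could reach a ChatOps channel is a small relay that turns an Alertmanager webhook notification into a Mattermost incoming-webhook message. The slides do not say how ING connected the two, so the following is only a sketch under stated assumptions: it relies on the `alerts`, `labels`, `annotations` and `status` fields of the Alertmanager webhook payload and on the `text` field of a Mattermost incoming webhook, and the alert name and summary are hypothetical. It only builds the message body and makes no network call.

```python
import json


def to_mattermost(payload):
    """Convert an Alertmanager webhook payload into a Mattermost message body."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        summary = annotations.get("summary", "")
        line = f"**[{status}] {name}**"
        if summary:
            line += f": {summary}"
        lines.append(line)
    # Mattermost incoming webhooks accept a JSON object with a "text" field.
    return {"text": "\n".join(lines)}


# Hypothetical notification, trimmed to the fields used above.
notification = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighLoginErrorRate", "severity": "critical"},
            "annotations": {"summary": "Login error rate above 5% for 10 minutes"},
        }
    ],
}

body = to_mattermost(notification)
print(json.dumps(body))
```

Posting `body` as JSON to a team's own webhook URL would deliver the alert straight to that team's channel, without routing it through a central party first.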
11. And now for the E in SRE: Introducing the Reliability Toolkit
[Figure: Reliability Toolkit; labels: Tools, Metrics, NLA, Client libraries in engineering frameworks, CollectD, Model Builder, Alertmanager, SMS, E-mail, ChatOps]
15. Our learnings after two years of SRE at ING
People
Process
Technology
▪ Never compromise on mindset in hiring SREs.
▪ Assign a PO to protect team focus on engineering and to spread the SRE love.
▪ Consider what mix works well for you in terms of new and existing hires, or think about
possibilities of SRE internships.
▪ Test if SRE works for you by doing a pilot phase.
▪ Have a vision on your definition of SRE as a team, define a roadmap together.
▪ Learn from others through online resources, at conferences or company visits.
▪ Prepare to spend time explaining and promoting SRE and your tooling.
▪ Beer o’clock is great for team bonding.
▪ Make it attractive for others to use your tooling: take away pain from teams,
incorporate your tooling in widely used frameworks, find quick wins.
▪ Productization takes time, a lot of time. Don’t underestimate this.
▪ Consider scalability and ownership in your tooling strategy.