The speaker discusses how Lookout has scaled its engineering organization and technical infrastructure over time. In 2011, Lookout had problems with unreliable deployments and a monolithic codebase. It introduced new tools like JIRA, Jenkins, and Git/Gerrit to improve its workflow. It also automated deployments and now has a much higher success rate. As Lookout has grown, it has moved to a more distributed architecture with over 100 microservices running on different technologies. Scaling organizational knowledge and coordinating many independent services will be ongoing challenges.
2. Hello everybody, welcome to Lookout! I'm excited to be up here talking about
one of my favorite subjects, scaling.
Not just scaling in a technical sense, but scaling *everything*. Scaling
people, scaling projects, scaling services, scaling hardware, everything
needs to scale up as your company grows, and I'm going to talk about what
we've been doing here.
First, I should talk about ->
4. Who I am.
- I've spoken a lot before about continuous deployment and automation,
generally via Jenkins. As part of the Jenkins community, I help run the
project infrastructure and pitch in as the marketing events coordinator,
cheerleader, blogger, and anything else that Kohsuke (the founder) doesn't
want to do.
Prior to Lookout I worked almost entirely on consumer web applications, not
in a controllers-and-views sense, but rather building out backend services
and APIs to help handle growth.
At Lookout, I've worked a lot on the Platform and Infrastructure team, before
being promoted, or demoted depending on how you look at it, to the
Engineering Lead for ->
6. The Lookout for Business team
I could easily talk for over 30 minutes about some of the challenges that
building business products presents, but suffice it to say, it's chock full
of tough problems to be solved.
Not many companies grow to the point where they're building out multiple
product lines and revenue streams, but at Lookout we've now got Consumer,
Data Platform, and now Business projects underway.
It's pretty exciting, but not what I want to talk about.
Let's start by ->
10. 2011
In the olden days, we did things pretty differently, in almost all aspects. I
joined as the sixth member of the server engineering team, a group that has
20-30 engineers today.
-> Coming in with a background in continuous deployment, the first thing that
caught my eye was
12. Our release process was like running a gauntlet every couple weeks, and maybe
we'd ship at the end of those two weeks, maybe not. It was terribly
error-prone and really wasn't that great.
James ran the numbers for me at one point, and during this time-period we
were experiencing a "successful" deployment rate of ->
14. This means that 1/3 of the time when we would try to deploy code into
production, something would go wrong and we would have to rollback the
deploy and find out what went wrong.
Unfortunately, since it took us two or more weeks to get the release out, we
had on average ->
16. 68 commits per deployment, so one or more commits out of 68 could have caused
the failure.
After a rollback, we'd have to sift through all those commits and find the
bug, fix it and then re-deploy.
Because of this ->
18. About 2/3rds of our deployments slipped their planned deployment dates. As an
engineering organization, we couldn't really tell the product owner when
changes were going to be live for customers with *any* confidence!
19.
20. There were a myriad of reasons for these problems, including:
- lack of test automation (tests existed, but they weren't running reliably;
we were using Bitten with practically zero developer feedback)
- a painful deployment process
To make things more difficult, all our back-end application code was in a ->
22. monolithic Rails application.
It served its purpose while the company was bootstrapping itself, but it was
starting to show its age and prove challenging with more and more developers
interacting with the repository.
23.
24. The team was at an interesting junction during this time: the problems
with the way things were done were readily acknowledged, but the bandwidth
and buy-in to fix them were difficult to come by.
I think every startup that grows from 20 to 100 people goes through this
phase when it is in denial of its own growing pains.
As more people joined the team, we pushed past the denial though and started working
on ->
28. The Burgess Challenge. One night over beers with James and Dave, the
server team lead, James asked if we could fix our release process and get
from two-ish week deployments to *daily* deployments, in ->
30. 60 days. This was right at the end of the year; with the Thanksgiving and
Christmas breaks coming up, we had some slack in the product pipeline and
decided to take the project on, and enter 2012 a different engineering org
than the one we had left behind in 2011.
We started the process by bringing in some new tooling, starting with ->
34. JIRA. While I could rant on how much I hate JIRA, I think it's a better tool
than Pivotal Tracker was for us. Pivotal Tracker worked well when the team
and the backlog were much smaller, and less inter-dependent than they were in
late 2011.
Another tool we introduced was ->
36. Jenkins
- Talk about the amount of work just to get tests passing *consistently* in
Jenkins
- Big change in developer feedback on test runs compared to previously.
We also moved our code from Subversion into ->
38. Git and Gerrit. Gerrit being a fantastic Git-based code-review tool. At the
time the security team was already using GitHub:Firewall for their work. We
discussed at great length whether the vanilla GitHub branch, pull request, merge
process would be sufficient for our needs and whether or not a "second tool"
like Gerrit would provide any value.
I could, and have in the past, given entire presentations on the benefits of
the Gerrit-based workflow, so I'll try to condense as much as possible into
this slide of our new code workflow ->
39.
40. describe the new workflow, comparing it to the previous SVN based one (giant
commits, loose reviews, etc)
41.
42. With Jenkins in the mix, our fancy Gerrit workflow had the added value of
ensuring all our commits passed tests before even entering the main tree.
We were doing a much better job of consistently getting higher quality code
into the repository, but we still couldn't get it to production easily.
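As a minimal sketch of what that gate can look like on the Jenkins side: a
Gerrit-triggered job runs a task like the one below against each proposed
change, and only a green build lets the change merge. The Rake task and suite
names are invented for illustration, not our actual job configuration.

    # Hypothetical Rakefile fragment for a pre-tested commit job: any failing
    # suite raises, the build goes red, and the change never reaches the main
    # tree. Suite names are illustrative only.
    task :pretested_commit do
      %w[spec:units spec:integration].each do |suite|
        Rake::Task[suite].invoke
      end
    end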
Next on the fix-it-list was ->
44. The release process itself.
At the time our release process was a mix of manual steps and Capistrano
tasks (see the sketch after this list)
- Automation through Jenkins
- Consistency with stages (no more update_faithful)
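A minimal sketch, assuming a Capistrano 2-era setup, of what "consistency
with stages" can look like; the health-check endpoint, port, and role names
are hypothetical rather than our actual production configuration.

    # Hypothetical Capistrano (v2-style) task: every deploy runs the same
    # smoke test on every app host, so "did it work?" stops being a judgment
    # call. Endpoint and role are made up for this sketch.
    namespace :deploy do
      task :smoke_test, :roles => :app do
        run "curl --fail --silent http://localhost:8080/healthz"
      end
    end

    # Hook the smoke test in after every deploy; a failure stops the run.
    after "deploy", "deploy:smoke_test"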
We've managed to change the entire engineering organization such that ->
52. If you're going to use a pre-tested commit workflow with an active
engineering organization such as ours, make sure to plan ahead and have
plenty of hardware, or virtualized hardware, for Jenkins.
We've started to invest in OpenStack infrastructure and the jclouds plugin
for provisioning hosts to run all our jobs on.
With over 100 build slaves now, we had to also make sure we had ->
54. Automated the management of those build slaves, nobody has time to hand-craft
hundreds of machines and ensure that they're consistent. Additionally, we
didn't want to waste developer time playing the "it's probably the machine's
fault" game every time a test failed.
57. Not much to say here; every company is going to be different, but you
can't just ignore that there are social and cultural challenges in taking a
small engineering team and growing it to 100+ people.
61. With regards to scaling the technical stack, I'm not going to spend too much
time on this since the other people here tonight will speak to this in more
detail than I probably should get into, but there are some major highlights
from a server engineering standpoint
Starting with the databases ->
63. Global Derpbase woes
Moving more and more data out of non-sharded tables (see the sketch below)
Experimenting with various connection pooling mechanisms (worth mentioning?)
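As a toy illustration of the idea behind moving data out of non-sharded
tables: route each account's rows to a fixed shard with a stable function.
The shard count and naming below are invented, not our actual topology.

    # Toy shard-routing sketch: a stable mapping from account id to shard.
    # Four shards and this naming scheme are illustrative only.
    SHARD_COUNT = 4

    def shard_for(account_id)
      "shard_#{account_id % SHARD_COUNT}"
    end

    shard_for(68_123)  # => "shard_3"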
65. Diagnosing a big ball of mud
Migrating code onto the first service (Pushcart)
Slowly extracting more and more code from monorails, a project which is ongoing
67. I never thought this would have a big impact on scaling the technical stack,
but modernizing our front-end applications has helped tremendously.
The JavaScript community has changed a great deal since the company was
founded, the ecosystem is much more mature, and the web in general has
changed.
By rebuilding front-end code as single-page JavaScript applications (read:
Backbone, etc.), we are able to reduce complexity on the backend by turning
everything into more or less JSON API services.
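Purely as an illustration of "more or less JSON API services", a minimal
Sinatra-style endpoint is sketched below; the route and payload are made up
for the example, and the real services are not this trivial.

    # Illustrative-only JSON endpoint of the kind a single-page Backbone app
    # would talk to; the /api/devices route and its payload are invented.
    require "sinatra"
    require "json"

    get "/api/devices" do
      content_type :json
      # A real service would read from its datastore; stubbed for the sketch.
      [{ id: 1, platform: "android", state: "protected" }].to_json
    end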
70. The future at Lookout is going to be very interesting, both technically and
otherwise.
On the technical side of things, we're seeing more of a ->
72. Diversified technical portfolio. Before the year is out, we'll have services
running in Java, Ruby, and even Node.
To support more varied services, we're getting much more friendly ->
74. with the JVM, either via JRuby or other JVM-based languages. More things are
being developed for and deployed on top of the JVM, which offers some
interesting opportunities to change our workflow further with things like
(see the sketch after this list):
- Remote debugging
- Live profiling
- Better parallelism
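A hedged sketch of the "better parallelism" point: under JRuby, plain Ruby
threads map onto real JVM threads, so work can actually run in parallel
across cores rather than behind a global interpreter lock. The numbers are
arbitrary.

    # Minimal illustration: on JRuby these threads are native JVM threads and
    # run concurrently; on MRI they would be serialized by the GIL.
    threads = (1..4).map do |n|
      Thread.new { n * n }
    end
    puts threads.map(&:value).inspect  # => [1, 4, 9, 16]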
75.
76. With an increasingly diverse technical stack and a stratified services
architecture, we're going to be faced with the technical and organizational
challenges of operating ->
78. 100 services at once.
When the team which owns a service is across the office, or across the country,
what does that mean for clearly expressing service dependencies, contracts,
and interactions on an ongoing basis?
With all these services floating around, how do we maintain our ->
80. Institutional knowledge amongst the engineering team
Growth means the size of our infrastructure exceeds any single engineer's
capacity to understand each component in detail.
81.
82. We're not alone in this adventure; we have much to learn from companies
like Amazon or Netflix, who have traveled this path before.
I wish I could say that the hard work is over, and that it's just smooth
sailing and printing money from here on out, but that's not true.
There's still a lot of hard work to be done, and difficult problems to talk
about as we move into a much more service-oriented, and multi-product
architecture.
I'd like to ->