Velocity 2013 - How Edmunds learned from failure, began opening communications between silos, and built a DevOps culture over beer and whiteboards.
(HINT: Download to see the presenter's notes for what may not make sense without a speaker!)
- The automotive resource of the Internet - Originally in print, then Gopher in 1994, Web in 1996
- Our environment is highly distributed. When you visit Edmunds.com you’re interacting with one or more of our 30 web apps spread out across a couple hundred hosts. - The website itself is built on Apache Tomcat, Solr, MongoDB, and Oracle Coherence. - Internally, you’ll also find ActiveMQ, Oracle, and some lingering WebLogic services we’ll soon be doing away with. - We rely heavily on a mix of different tools to build and support all this: chef, jenkins, CloudStack, AppDynamics, Splunk, to name a few. - But I’m getting ahead of myself, because how we got to this architecture is part of the tale of how Edmunds came to embrace a DevOps mindset.
- So then where does our story start? - Let me be up front: WE STUMBLED. WE PERFECTED THE FACEPALM. - The specifics of our situations when the shit hit the fan may have felt unique, but they’re not. - We learned from our mistakes with the intent of getting better. - Let’s talk facepalms...
- This may be familiar... - In 2005, we had 30 servers. In 2006, we burst up to 300 and held steady for a few years with slow growth. - In 2009, we saw a radical jump in server deployment - We grew in the number of servers, but not in the number of admins - We had Kickstart, but that’s only good at bootstrap time - BladeLogic + AnthillPro seemed a good solution, but there were major issues - Growth is painful
- One very specific breakdown in our history stands out to me. - 2007 - Edmunds 2.0: Introducing a CMS for the business - All content was locked to a monthly release cycle - Six months of functional testing, without any performance validation. - Two months before launch, performance testing uncovered scalability issues. - Ops response: double the application infrastructure and throw a hardware cache appliance at it. - The breakdown in relationships between Dev/Ops led to major business costs. - Fast forward to 2009; remember that big jump in the number of servers we were deploying?
- 2010 Edmunds Redesign: Complete rewrite of all website code + modular breakout of applications. - Good collaboration between Dev/Ops to understand requirements on all sides. - But QA + BETA were built brick-by-brick, and not easily reproducible. - Armed with BladeLogic + AnthillPro, build/deploy was more automated, but the two weren’t coupled together! - The production environment took 3 months to build while BETA served the new website. - We started to realize that the real challenge wasn’t technology but culture.
We wanted to stop working like this...
and start building like this.
We really wanted to get out of here.
- And go here - This is the Daily Pint! Let me buy you a beer! - This is where the wildest of ideas are born - Disagreements are worked through with positive jest and jeers - It is where we talked it over
- Then we’d take it here! - THE MOST UNDERRATED TOOL YOU ALREADY HAVE. - Floor-to-ceiling whiteboards where we worked out our ideas. - We talked through gaps in handoffs, failure rates due to manual builds, linking tools together - “self-service”, automated testing, and much much more. - What happened there was no “ops”, no “dev”. We were technologists working to solve problems with no boundaries of roles in the way. - Our proposal: tear down silos. - We did just that!
- So who and how did this happen? - TechLeads who spent too much time in war rooms started chewing on the problem together. - Identified gaps in provisioning/config management and app deployment tools. - Scott McNealy was right about hardware/software dependencies. - Two teams, Production Engineering & Automation Engineering, set about providing tools that bridged the divide. - (ProdEng = Ops) + (AutoEng = Dev) == How we really started gaining inroads. (NOT IDEAL!) - Members of both these teams shed traditional views on what they were supposed to do and just did it. - The result was improved relationships, better tooling, and a clearer perspective on how future projects could work.
- So we started linking all our tools together! - “Your tools don’t make your culture, but they do have an impact on the people who do.”
- We now talk about the data that our tools provide us - You can talk from your gut, but you better back it up with data - We pushed ownership and accountability by leveraging what we found in the data. - The metrics were clearly pointing out our failures, allowing us to learn how to prevent them in the future.
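As a sketch of what “backing it up with data” can look like in practice (the records and field names below are invented for illustration, not Edmunds’ actual tooling or schema), tallying failure rates per deployment stage turns gut feelings into numbers:

```python
from collections import Counter

# Hypothetical deployment records; in practice this kind of data would
# come from the build/deploy toolchain and ticketing systems.
deploys = [
    {"stage": "qa",   "result": "success"},
    {"stage": "qa",   "result": "failure"},
    {"stage": "beta", "result": "failure"},
    {"stage": "beta", "result": "success"},
    {"stage": "prod", "result": "success"},
    {"stage": "beta", "result": "failure"},
]

# Tally total deploys and failures per stage, then compute a failure
# rate so the conversation starts from metrics, not opinions.
totals = Counter(d["stage"] for d in deploys)
failures = Counter(d["stage"] for d in deploys if d["result"] == "failure")
rates = {stage: failures[stage] / totals[stage] for stage in totals}

# Worst offenders first -- these are the handoffs worth fixing.
for stage, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {rate:.0%} failure rate over {totals[stage]} deploys")
```

The same aggregation scales from a toy list to a feed out of Splunk or a ticketing API; the point is that ownership follows naturally once the failures have numbers attached.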
- Armed with a tighter toolchain and a new way of working together, we were once again about to be put to the test. - Edmunds began investing resources into “the cloud”. - Heavily virtualized since 2010, but no clear “cloud” offerings - Two teams, one objective: make edmunds.com work on $x cloud platform - Why two? DIVERSIFY.
- This was our first shot at a “new” project armed with our new practices + tooling - These were uncharted waters, even though we’d been virtualized for a few years. “Cloud” is a different beast. - But with familiar tooling + improved communications, these teams produced successful results that were easily measured. - Environment build time down to less than a week. - Done with 95% of the same toolset for both cloud platforms.
- We’ve all spent our careers as firefighters. - Street cred with co-workers, bosses, executives as cool-headed during a mess - So what about when there are fewer - or different - kinds of fires? - More accountable individuals + more “self-service” + fewer fires == increased capacity for business acumen. - This is the business value that what we call DevOps is leading us to.
- To go from this to this... - By investing in addressing systemic issues around communication + partnerships, we increase our capacity to take on other challenges - No big secret, it’s been talked about by Damon Edwards, John Willis - Covered beautifully in “The Phoenix Project” - Technologists in the age of the Internet are no longer back-office workers keeping the lights on - We help shape the direction of our companies, with a direct impact on revenue in ways our field now sees change yearly. - We needed to change the way we work together to free ourselves for “bigger things”. - An exciting time to be working in our field!
- Okay, back to our cloud initiatives... - With this additional capacity, here’s a few things we learned to give value to our company - Cloud isn’t free; server sprawl can be expensive and lack of education with “self-service” becomes a major issue. - How much does it cost to operate your environment? It’s tough to calculate! - Licensing by host or CPUs is costly at scale, so look for alternatives to those things you pay a premium for. - Managing operating costs starts with understanding where the money is going!
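To make “understanding where the money is going” concrete, here is a toy cost model (every number below is invented for illustration, not an Edmunds figure) showing why per-host licensing hurts at scale while a flat-rate alternative does not:

```python
# Hypothetical monthly cost model for a server fleet.
# All dollar amounts are made up -- plug in your own.
compute_per_host = 120.0    # VM/instance cost per host per month
licensed_per_host = 85.0    # commercial tool licensed per host
flat_alternative = 4000.0   # flat-rate alternative (e.g. support contract)

def monthly_cost(num_hosts, per_host_license=True):
    """Total monthly cost: compute plus either per-host or flat licensing."""
    compute = num_hosts * compute_per_host
    if per_host_license:
        licensing = num_hosts * licensed_per_host
    else:
        licensing = flat_alternative
    return compute + licensing

# Per-host licensing grows linearly with server sprawl;
# the flat-rate alternative becomes a rounding error.
for n in (30, 300, 3000):
    print(f"{n:>5} hosts: per-host ${monthly_cost(n, True):>10,.0f}"
          f" vs flat ${monthly_cost(n, False):>10,.0f}")
```

Even a back-of-the-envelope model like this makes the “look for alternatives to the things you pay a premium for” argument easy to take to the budget conversation.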
- A great growing experience the last few years @ Edmunds. - No rose-tinted glasses to suggest we’ve solved all our problems! BUT WE’VE SOLVED SOME BIG ONES! - And today we work a helluva lot more like this! - So, let’s take on the challenge of showing some metrics of success from adopting a DevOps culture...
- Application availability has increased. Not the holy metric of “four 9’s”, but a bump all the same! - The number of high-severity INCs has dropped 50% year-over-year - The number of TKTs filed has dropped 50% year-over-year --- Self-service is slick! - The MTTR of pre-production issues has drastically reduced from 5 days to 2 days, and even faster than that in most situations. - The time it takes us to build runways has gone down from 3 months to 1 week! - With deeper inspection of our cost-per-host, we expect to begin shaving off overall operating costs drastically for next year’s budget. - Team morale? Well...
We got out of here.
And into here, so it’s pretty good.
- Always more to be done! You’re never “finished” growing. - Devs on-call! (You build it, you run it!) - Reducing infrastructure footprint == reducing operating costs - More RESTful applications - Other cloud offerings?