Autotrader Case Study: Migrating from Home-Grown Testing to Best-in-Class Tooling

  1. HISTORY: Cox Automotive and Cox Enterprises. 1898: The foundation of Cox Enterprises starts with Governor James M. Cox's purchase of the Dayton Daily News. 1968: The first step in the evolution of Cox Automotive begins with the acquisition of Manheim Auto Auction, which now has more than 100 locations worldwide. 1999: Autotrader.com is launched as an online vehicle classifieds site. 2010: Autotrader.com acquires Kelley Blue Book along with vAuto and HomeNet. 2011: Autotrader.com acquires VinSolutions and the Autotrader Group forms. 2012: Manheim expands, acquiring DealShield, NextGear Capital, and Ready Auto Transport. 2014: Cox Automotive forms as a division of Cox Enterprises, Inc. Transforming the way the world buys, owns, sells and uses cars.
  2. HOW WE’RE ORGANIZED: Service, Sales, Marketing, Inventory, Mobility, Operations
  3. WHY WE MADE THE SWITCH
  4. (image-only slide)
  5. (image-only slide)
  6. (image-only slide)
  7. (image-only slide)
  8. (image-only slide)
  9. HOW WE EVALUATED OUR OPTIONS
  10. (image-only slide)
  11. ENGINEERING NEEDS: Compatibility, Reliability, Speed / Performance, Customization
  12. DEVELOPMENT COMPATIBILITY
  13. ARCHITECTURE: Optimizely Web, Optimizely Edge, Optimizely FullStack; Controlling Web/Edge placement; Feature Flags and Continuous Deployment
  14. OUR CUSTOMIZATIONS: React Component; Force Experiments & Feature Flags with Feature Variables; Automated Testing; Adobe Analytics & Launch Integration
  15. BONUS ROUND: Optimizely Program Management; Team Collaboration; Idea Backlog; Test Planning and Scheduling; Feature Variables; More Flexibility
  16. CLOSING THOUGHTS: ANALYTICS/STATISTICS, ENGINEERING

Editor's Notes

  1. [SETH] Hello! We thought we'd start by introducing ourselves. My name is Seth Stuck, and I lead R&D Analytics and experimentation for emerging AI and machine learning technologies deployed across more than 20 Cox Automotive brands such as Autotrader, Kelley Blue Book, and Manheim. [SCOTT] And I'm Scott Povlot, Principal Technical Architect for Autotrader. I specialize in creating performant, reliable, and user-friendly websites that do what the user expects. [SETH] Scott and I worked together on Autotrader's implementation of Optimizely, and we wanted to share with you today some of what we learned along the way as we began the daunting journey of migrating our organization off of a home-grown system it had been using for years. We plan to focus on two primary things: 1) the problems we needed our testing and feature-flagging platform to solve, and 2) some things we learned along the way that might prove helpful to other teams facing similar decisions or initiatives in the near future. But first, let's give you some background...
  3. Let me start by giving you some history of how these brands fit into our parent company. Cox Automotive is a division of Cox Enterprises, which has a 120-year history. What started with the purchase of one newspaper in Dayton, Ohio in 1898 now covers the world of telecom and automotive. The first step into automotive was the acquisition of Manheim Auctions in 1968, which now has more than 100 locations worldwide. The next step in the automotive track was the launch of Autotrader.com in 1999. Over the years, Cox Automotive has added multiple other brands, including Kelley Blue Book, Dealer.com, Dealertrack, vAuto, and so on.
  4. In all, Cox Automotive now houses over 20 brands with a variety of testing & experimentation needs on both the consumer and the client side. This is a visualization of our various domestic brands and how they support the automotive world. For example, Autotrader and Kelley Blue Book participate in the marketing space with a variety of car listing and ad products. [CLICK] The darker blue sections are the brands currently participating in our Testing & Experimentation Program. As you can see, we've expanded the program beyond Autotrader and Kelley Blue Book to include several other CAI brands, and we expect to continue adding more in the coming years.
  5. To go a little deeper on Autotrader, since that will be the focus of our discussion today... Autotrader originally came about as a magazine in 1973. Cox Enterprises bought the magazine in 1988 and ran it in partnership with Landmark Publishing. In 1997, it converted to a digital format as AutoConnect, and in 1999 it became Autotrader.com. We dropped the ".com" from the Autotrader brand in 2015. Here's a view of the Autotrader homepage, and as you can see, the primary purpose of the site is to search for cars for sale. But you can also sell your current vehicle alongside dealer cars, research cars, and obtain vehicle values, of course powered by Kelley Blue Book.
  6. Now, to the topic of testing at Autotrader… What we're going to focus on today is a discussion specific to server-side testing and feature flagging, and our major considerations related to those capabilities. What's interesting here is that our Product and Engineering teams were honestly ahead of the curve several years ago when they built a home-grown solution for leveraging feature flags to support server-side testing. This tool was called "Launch Control," and it allowed us to put product changes behind feature flags and selectively serve those flags to specified allocations of traffic. Our analysts were then able to compare the relative performance of key metrics between the different experiences. It was a real game-changer at the time. Over the years, however, the industry caught up to and surpassed this proprietary tooling, and the team had a choice to make: re-invest in the home-grown solution or completely retool.
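To make the idea of "product changes behind feature flags, served to specified allocations of traffic" concrete, here is a greatly simplified, hypothetical sketch of the pattern; the names and shapes below are invented for illustration and are not Launch Control's actual API.

```typescript
import { createHash } from "crypto";

// Hypothetical shape of what "putting a product change behind a feature flag"
// looks like in application code.
interface FlagService {
  // True if this user falls inside the flag's configured traffic allocation.
  isEnabled(flagKey: string, userId: string): boolean;
}

function renderListingPage(flags: FlagService, userId: string): string {
  // The new experience ships dark and is turned on for a slice of traffic,
  // while analysts compare key metrics between the two experiences.
  if (flags.isEnabled("new-listing-gallery", userId)) {
    return "new gallery experience";
  }
  return "existing experience";
}

// Trivial stand-in so the sketch runs: enable the flag for roughly 25% of users
// by hashing the user ID (a real tool manages allocations centrally).
const demoFlags: FlagService = {
  isEnabled: (flagKey, userId) =>
    createHash("md5").update(`${flagKey}:${userId}`).digest().readUInt32BE(0) % 100 < 25,
};

console.log(renderListingPage(demoFlags, "user-123"));
```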
  7. So as we took a step back to evaluate our existing tooling, we realized there were several significant deficiencies… For example, we had accumulated a large variety of proprietary "rules" required to keep our tests clean and valid. Case in point, we were often running into issues where control audiences were getting re-used… or started at a different time from the test group. This was a consequence of the way Launch Control had evolved. It had started out, as the tool's name suggests, as a resource for Engineers to manage launches behind feature flags – and the mechanics of running a test were a secondary concern. So as we matured in our use of server-side testing, we began uncovering this and other issues… And we've seen this elsewhere with home-grown solutions – where you ultimately end up relying on proprietary knowledge of how to use the tool to get the tool to work properly. This of course introduces risk of human error, or even continuity problems as the people who hold that proprietary knowledge move into other roles.
  8. As another example, our home-grown tool also only allowed us to run 10 concurrent "experiments" at a time. This was something we could have modified, but it would have required re-investment in the tool – at a cost that wouldn't have been much cheaper than buying third party – and in exchange for solving only one of the multiple issues we were having. The result was that we had to manually manage a host of rules about which types of tests could be run when, which ultimately slowed down and limited our ability to experiment at scale. With all of our launches and tests backed up on the tarmac, so to speak, the business fell behind schedule – which then resulted in frustrations related to testing and launches that were largely a function of the limitations of a tool no one was really responsible for maintaining or evolving day-to-day. Scott will share a little more on that later...
  9. One of the basic things you expect your test tooling to do is bucket users into various test groups and keep them "locked in" to their respective test group. We discovered that we had a small subset of power users who – because they were consistently coming to the site over long periods of time – had been in many prior experiments. As a result, our internal tool was having a hard time maintaining all of the various tokens these users had accumulated. So it kept dropping the most recent one… which caused these power users to be re-bucketed in ongoing experiments each time they came back to the site. Our analytics team was able to work around this on the backend to ensure our analyses were still valid, but it added unnecessary weight and complication to test results that didn't need to be there. We had other issues with mutual exclusivity, too, where we were unable to target and exclude users who'd already been bucketed into ongoing experiments. Again, just sort of a limitation of trying to retrofit a feature-flagging tool for robust experimentation.
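For context on why dedicated tooling handles this better, one common approach is to derive the assignment deterministically from the user ID rather than storing a per-experiment token, which also gives a cheap way to enforce mutual exclusivity. This is a simplified sketch of that idea, not how Launch Control or Optimizely actually implement it; all names are illustrative.

```typescript
import { createHash } from "crypto";

// Hash a (salt, userId) pair into an integer bucket in [0, buckets).
function bucket(salt: string, userId: string, buckets: number): number {
  const hash = createHash("md5").update(`${salt}:${userId}`).digest();
  return hash.readUInt32BE(0) % buckets;
}

// Sticky assignment: the variant is a pure function of (experiment, user), so a
// returning power user gets the same variant on every visit -- there is no
// per-user token to accumulate or drop.
function assignVariant(experimentKey: string, userId: string, variants: string[]): string {
  return variants[bucket(experimentKey, userId, variants.length)];
}

// Simple mutual exclusivity: hash each user into one of N "layers" and run at
// most one experiment per layer, so no user is bucketed into two of them.
function layerFor(userId: string, layerCount: number): number {
  return bucket("exclusion-group", userId, layerCount);
}

const variant = assignVariant("srp-redesign", "user-123", ["control", "treatment"]);
const layer = layerFor("user-123", 4); // eligible only for the experiment running in this layer
console.log(variant, layer);
```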
  10. And then there were the issues with ramping… While we could adjust the percentage of traffic in a given variant, we found significant issues in the way the tool would bucket new users as the traffic allocation changed over the course of the ramp – resulting in an over-representation of return users in the control group as we ramped. As you can imagine, this introduced all kinds of bias into our results and left the business feeling confused.
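As a point of comparison, here is a hypothetical sketch of how a stable, hash-based allocation avoids that ramping bias: raising the traffic percentage only admits users whose buckets fall in the newly opened range, and nobody who was already bucketed ever changes sides.

```typescript
import { createHash } from "crypto";

// Stable bucket in [0, 10000) for a given (experiment, user) pair.
function rampBucket(experimentKey: string, userId: string): number {
  const hash = createHash("md5").update(`${experimentKey}:${userId}`).digest();
  return hash.readUInt32BE(0) % 10000;
}

// A user is in the experiment whenever their bucket is below the current
// allocation. Ramping from 10% to 50% only opens buckets [1000, 5000); users
// admitted at 10% keep their assignment, so return users do not pile up in control.
function inExperiment(experimentKey: string, userId: string, allocationPercent: number): boolean {
  return rampBucket(experimentKey, userId) < allocationPercent * 100;
}

const userId = "user-123";
console.log(inExperiment("srp-redesign", userId, 10)); // decision at a 10% allocation
console.log(inExperiment("srp-redesign", userId, 50)); // never flips from true to false as the ramp widens
```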
  11. And then there are the analytics bells and whistles that can really set a testing program up for success – features like multi-armed bandit testing, dynamic audience randomization, a stats engine, and the ability to dive deep on test results with audience segments. These are features our engineering team probably could have built proprietary versions of in-house, but these capabilities aren't directly related to Cox Auto's core competency and would have carried opportunity costs that made them cost-prohibitive to prioritize. As a result, as was the case with the Engineering rules I mentioned earlier, analysts – too – had to mentally retain a host of proprietary rules and workarounds just to ensure our test results weren't invalidated. This all ultimately resulted in huge but hidden sunk costs within the organization. With so many inefficiencies sprinkled throughout the process, we struggled to scale our testing capabilities – or to launch products with confidence as quickly as the business wanted.
  12. So, for these reasons (and a few others) we came to the conclusion that we needed to re-evaluate our tooling. After all that, you can probably understand why we ended up with a bias to buy vs. build our next-gen solution. We needed something more robust, and we needed it supported by a team whose core competency was experimentation. So, how did we pick which 3rd-party tool/partner to work with? We ultimately did all the basic RFP things I imagine most of your companies do when evaluating build vs. buy options – including identifying business requirements across our agile partners, comparing pros and cons across various solutions relative to our needs, price vs. feature comparisons, etc. But the thing that was perhaps a little different and special…
  13. …was that we set up a robust, rapid, one-day proof of concept exercise. In that one day, we had group sessions with partners from product, engineering, architecture, UX, and analytics, and we also did breakout sessions with each of those functions separately. We'd done a ton of pre-work to design POC exercises that would highlight each group's key concerns and use cases, which allowed each function to dive right into an MVP implementation, put the tooling through its paces, and ultimately determine whether the tool would meet their needs. So we basically set up a series of experiments to test our experimentation tool… to determine whether it was a "winner" for our testing program. That's about as meta as it gets… With that, I'm going to hand off to Scott now, who will outline some of the specific features our Engineering and Architecture groups were looking for, how we validated that these features would meet our needs, and how we ultimately iterated our way through the implementation of Optimizely Full Stack and replaced our home-grown tooling. Scott?
  14. Our Experimentation and Feature Flag capability needed to fit within our existing application architecture and toolset.
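To make "fit within our existing architecture and toolset" a bit more concrete, here is a minimal server-side sketch using the Optimizely JavaScript SDK's decide API. The SDK key, flag key, variable key, attributes, and rendering helpers are placeholders, and method names vary between SDK versions, so treat this as an illustrative sketch rather than a snippet from Autotrader's codebase.

```typescript
import { createInstance } from "@optimizely/optimizely-sdk";

// Placeholder helpers standing in for real page-rendering code.
function renderNewSearchResults(resultsPerPage: unknown): string {
  return `new search results (${String(resultsPerPage)} per page)`;
}
function renderLegacySearchResults(): string {
  return "legacy search results";
}

// Initialize the client once at application startup (the SDK key is a placeholder).
const optimizelyClient = createInstance({ sdkKey: "YOUR_SDK_KEY" });

async function renderSearchPage(userId: string): Promise<string> {
  if (!optimizelyClient) return renderLegacySearchResults();
  await optimizelyClient.onReady();

  // Flag key, variable key, and attributes are illustrative, not real configuration.
  const user = optimizelyClient.createUserContext(userId, { device: "desktop" });
  if (!user) return renderLegacySearchResults();

  const decision = user.decide("srp_redesign");
  if (decision.enabled) {
    // Feature variables let the experience be tuned without a redeploy.
    return renderNewSearchResults(decision.variables["results_per_page"]);
  }
  return renderLegacySearchResults();
}

renderSearchPage("user-123").then(console.log);
```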
  15. Web performance is a key metric for our website, as it likely is for most of you. Our major concern with Optimizely Web and FullStack was how it would affect our site speed. We were early adopters of Optimizely Edge, which provided a 1-second improvement over Web in most cases.