Hugh Williams will discuss building Cassini, a new search engine at eBay that processes over 250 million search queries and serves more than 2 billion page views each day. Hugh will trace the genesis and building of Cassini, as well as highlight and demonstrate the key features of this new search platform. He will discuss some of the challenges in scaling arguably the world’s largest real-time search problem, including the unique considerations associated with e-commerce and eBay’s domain, and how Hadoop and HBase are used to solve these problems.
7. 97 million active buyers and sellers worldwide
250 million queries each day to our search engine
200+ million items live in more than 50,000 categories
8. 9 petabytes of data in our Hadoop and Teradata clusters
2 billion page views each day
75 billion database calls each day
9. Huge Opportunity: Taking the “e” out of ecommerce
[Chart: Yesterday (2008 = $325B): 4% online, 96% offline. Today: 6% online, 37% web-influenced offline, the rest offline. Tomorrow: 2013 = $10T. Sources: Forrester, Euromonitor, and Economist Intelligence Unit]
17. Project Cassini at eBay
Our most ambitious core engineering project
► Entirely new codebase
► World-class, from a world-class team
► Platform for ranking innovation
► Uses all data by default
► Flexible
► Automated
► Four major tracks, 100+ engineers
► Complete in less than 18 months
19. A Short Primer on Indexing
When a user types a query, it isn’t practical to exhaustively scan 200+ million items
Instead, we create an inverted index, and use it to rank the items and find the best matches
An inverted index is similar to the index in the back of a book:
A set of searchable terms
For each term, a list of locations
20. An Inverted Index
cat → 3 postings: documents 1, 2, 7
[Figure: an eight-document collection; the documents containing “cat” include “cat on the mat”, “fat cat”, and “wild cat”]
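The index-in-a-book analogy above can be sketched in a few lines of Python. This is a toy illustration, not Cassini code; the document ids and text are invented to echo the slide’s example:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical miniature collection echoing the slide's example.
docs = {1: "cat on the mat", 2: "fat cat", 7: "wild cat"}
index = build_inverted_index(docs)
print(index["cat"])  # → [1, 2, 7]
```

A real engine would also store positions within each document (for phrase matching) and per-term statistics (for ranking), but the term-to-postings shape is the same.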
22. Larger index than Voyager
Descriptions, seller data, other metadata, …
Much more history in our indexes
More computationally expensive work at index-time (and less at query-time)
Ability to rescore or reclassify the entire site inventory
23. Hadoop:
Distributed indexing – platform for hourly index refreshes
Fault tolerance through HDFS replication
Better utilization of hardware – can generate different index types with one cluster
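The map/reduce shape of distributed index construction can be sketched as follows. This is an in-memory toy, not the actual Hadoop job; the real pipeline runs the map and reduce phases across a cluster, and the document data is invented:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    # Mapper: emit a (term, doc_id) pair for every term occurrence.
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(term, doc_ids):
    # Reducer: merge occurrences into a deduplicated, sorted postings list.
    return term, sorted(set(doc_ids))

def run_job(docs):
    intermediate = []
    for doc_id, text in docs.items():          # "map" over all documents
        intermediate.extend(map_phase(doc_id, text))
    intermediate.sort(key=itemgetter(0))       # the "shuffle" step
    return dict(
        reduce_phase(term, [doc_id for _, doc_id in group])
        for term, group in groupby(intermediate, key=itemgetter(0))
    )

index = run_job({1: "cat on the mat", 2: "fat cat"})
print(index["cat"])  # → [1, 2]
```

The framework supplies the sort/shuffle and fault tolerance; the indexing code only has to provide the map and reduce functions.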
24. HBase:
Column-oriented data store on top of HDFS
Used to store eBay’s items
Bulk and incremental item writes
Fast item reads for index construction
Fast item reads and writes for item annotation
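The access patterns listed above can be illustrated with a toy in-memory stand-in for a column-oriented item store. The `ItemStore` class, row keys, and column names here are invented for illustration and are not the HBase client API:

```python
# Toy in-memory stand-in for a column-oriented item store (not real HBase).

class ItemStore:
    def __init__(self):
        self.rows = {}  # row key -> {column: value}

    def put(self, row_key, columns):
        # Incremental write: merge new columns into an existing row.
        # No fixed schema - any column can be added at any time.
        self.rows.setdefault(row_key, {}).update(columns)

    def bulk_put(self, items):
        # Bulk write: load many rows at once (e.g. a full rescore).
        for row_key, columns in items.items():
            self.put(row_key, columns)

    def get(self, row_key, columns=None):
        # Fast point read, the pattern item annotation relies on.
        row = self.rows.get(row_key, {})
        if columns is None:
            return dict(row)
        return {c: row[c] for c in columns if c in row}

    def scan(self, prefix=""):
        # Ordered scan over row keys, used when building a full index.
        for key in sorted(self.rows):
            if key.startswith(prefix):
                yield key, self.rows[key]

store = ItemStore()
store.put("item:42", {"meta:title": "wild cat figurine"})
store.put("item:42", {"rank:score": "0.87"})  # later annotation
print(store.get("item:42", ["meta:title"]))  # → {'meta:title': 'wild cat figurine'}
```

The schemaless row/column model is the point: an annotation pipeline can add new columns (like a rank score) without any schema migration.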
25. Everyone is still learning
Some issues only appear at scale
Production cluster configuration is challenging
Hardware issues
Tuning cluster configuration to our workloads
HBase stability
Monitoring health of HBase
Managing workflows – many-step map/reduce jobs
Editor’s Notes
Great to be here – it’s a privilege to speak to you all. Today, I’m going to talk to you about eBay, our new search engine Cassini, and how Hadoop and HBase are used in search. Highlight title – and mention that I work on Marketplaces (ebay.com, and its sister sites all over the world). Let me begin by giving you a brief overview of eBay…
We’re 16 years old. Here is a shot of the original site – called AuctionWeb – that eBay’s founder, Pierre Omidyar, launched over Labor Day weekend in 1995 … as an “experiment.” I’ve circled some text on this page – not sure if you can read it – but it says “There are always SEVERAL HUNDRED auctions underway, so you’re bound to find something interesting.” “Several hundred” … those were our humble beginnings, though pretty impressive at the time. The only thing that’s remained the same since 1995 is that eBay has always connected buyers and sellers.
In 2010, we sold $62 billion in merchandise.
We’re one of the Web’s largest properties … and the pace of change is being driven largely by our customers and their new, increasingly sophisticated shopping expectations … <read slide>
We are fast becoming a data company, where our engineers use data every day to inform what they do. And we have a lot of data, as you can imagine from our 97 million users, 200+ million listings, 250 million search queries, and 2 billion page views each day.
Before I move on to talk about search, I want to let you know that it’s becoming more interesting at eBay: customers are changing how they shop, and we’re at the center of this revolution. Nearly half of all offline purchases have an online component. The offline and online worlds are merging … and this is THE NEW RETAIL landscape. And it’s being driven by consumers who are using their smartphones and mobile devices to change the way they shop. eBay and mobile commerce are at the center of this shift – more change is going to happen in commerce in the next year or two than in the past ten.
I’ve set the context on eBay. Now, I want to introduce you to project Cassini, our most ambitious engineering project at eBay. We are completely rewriting our search engine, and Hadoop and HBase are key to this rewrite. But first, let me tell you something about our current search engine, Voyager.
Voyager is named after the NASA space probes launched in 1977 to explore the outer planets.
It’s been driving the search experience on eBay since the early 2000s. Improvements to Voyager have been critical to improving the buyer experience and driving our sellers’ businesses.
However, Voyager is behind the times: a lot has happened in search since 2002. Our Best Match ranking function uses only tens of factors. It only searches item titles by default – we don’t rank using the great information that’s in the descriptions and elsewhere. Search is very literal – it finds almost exactly what you type; it doesn’t always understand what you mean.
Voyager is a challenge to manage and run as an engineering team. It’s very manual, so deployments of software and data take time. Troubleshooting is slow. We decided in late 2010 that Voyager needed to be replaced, and that began project Cassini.
Cassini is named after the spacecraft launched in 1997, a nod to it being many years ahead of Voyager.
<read and click>
We’re probably the only major web property that’s completely rewriting its search engine from scratch. You can see many of the features of Cassini, and I’ll just talk about a couple briefly. First, it will use all data by default – all that great data in descriptions, information in images, data about our buyers and sellers, and the signals that come from 2 billion page views each day will be used in Cassini to compute its Best Match. Our users are going to see world-class results, and it’ll be a much more powerful tool to connect buyers and sellers. Second, automation is key. There’ll be no more manual operation of the search engine – rolling out code and data, monitoring, alerting, remediation, and more are fully automated. Third, it’s a major engineering undertaking: we have over 100 engineers working across four parallel tracks to deliver Cassini in less than 18 months from start to finish.
We’ve hit a few major internal milestones, and internal users can already use Cassini if they’d like. <read slide>
To understand how Hadoop and HBase play a role in Cassini, let me explain some of the fundamentals of building a search engine. <first point> Exhaustively scanning 200 million items would take about half an hour, even if we could process a document every 10 milliseconds and had 1,000 machines working concurrently. <second point> An inverted index is an auxiliary data structure that allows fast calculation of the best-matching search results. A typical query takes ten milliseconds using the same 1,000 machines and an inverted index. <third point> Walk through using the index in the back of a book…
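The reason queries are cheap with an inverted index can be sketched concretely: instead of scanning every item, the engine merges the short, sorted postings lists of the query terms. A toy illustration with invented postings data:

```python
# Toy conjunctive query evaluation: intersect sorted postings lists
# rather than scanning the whole collection. Data below is invented.

def intersect(a, b):
    """Linear merge of two sorted postings lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

index = {"wild": [4, 7, 9], "cat": [1, 2, 4, 7]}
print(intersect(index["wild"], index["cat"]))  # → [4, 7]
```

The merge touches only the postings of the query terms, which is why query cost scales with the lengths of those lists rather than with the 200+ million items in the collection.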
It isn’t possible to create an index for over 200 million items on a single machine – we can’t keep in memory the terms and all of their positions in the documents. What we do at scale is distributed index construction; it is classic map/reduce (and has been since well before the phrase was coined). We build an inverted index for a small part of the document collection on one machine, and do the same on hundreds of other machines. We merge the small inverted indexes into larger inverted indexes that are distributed to our query-serving grid. This is a technical graphic from our team; it shows the seven high-level stages of creating all the index pieces we need in Cassini.
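The merge of per-machine indexes described here can be sketched as follows. This is a toy illustration; the shard contents are invented:

```python
# Toy sketch of the merge step: small per-machine inverted indexes are
# combined term by term into one larger index.

def merge_indexes(shards):
    merged = {}
    for shard in shards:
        for term, postings in shard.items():
            merged.setdefault(term, []).extend(postings)
    # Keep each postings list sorted so downstream merges stay cheap.
    return {term: sorted(postings) for term, postings in merged.items()}

shard_a = {"cat": [1, 2], "mat": [1]}
shard_b = {"cat": [7], "wild": [7]}
print(merge_indexes([shard_a, shard_b])["cat"])  # → [1, 2, 7]
```

Because each shard indexes a disjoint slice of the collection, merging is just concatenating postings per term; a production merge would stream from disk rather than hold everything in memory.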
Let’s talk about why Cassini indexing is more challenging than in Voyager, and why we changed the architecture dramatically to include Hadoop and HBase. First reason: Voyager completed pool = 14 days; Cassini = 90 days. Second reason: we refresh indexes on an hourly basis – this helps improve ranking, for example by updating item and seller information. Third reason: full power to our ranking team to make fast-twitch changes.
Hadoop is the platform for our index construction and index maintenance in Cassini. It’s ideal because it gives us fault tolerance and smart utilization of our hardware – without Hadoop, we’d probably have small pools of machines that run custom code for different stages of our index construction. Our Hadoop clusters for analytics are much larger, but this is our major use of Hadoop in driving a customer experience. It’s pretty large scale too: while we have over 200 million active items at any time, we also maintain a “completed index” that is over 1 billion items.
We use HBase to store eBay’s items for index construction and maintenance. HBase, as you know, is a column-oriented data store built on top of HDFS that is tightly integrated with the Hadoop map/reduce framework. It has no schema, which is great for us – it means what we store can evolve. HBase supports fast item lookups and scans, both of which are necessary for index construction. Incremental writes are what we normally do: about 10 million items enter eBay each day, and we need them in the searchable index within a couple of minutes. Bulk writes are necessary when our ranking team wants to rescore all our items.
We’ve got running Hadoop at scale mostly down, but we have challenges with HBase. First issue: Ops and Dev are both new to HBase – lots of learning through failures. Second issue: we test using a mini Hadoop cluster plus local HBase. Third issue: getting the hardware tuned just right. Fourth issue: HBase stability – unstable region servers and HBase master, regions stuck in transition, etc. Fifth issue: monitoring – a lot of the time we don’t recognize there are issues until jobs begin to fail. Sixth issue: workflow – our index chains have around 20 stages. But it’s not all doom and gloom: we’ve recently had a couple of weeks of stability, and we’re getting more confident each week … Before I finish today, I want to show you a couple of pictures of our data center that houses Cassini …
This is our new data center that we opened in Salt Lake City, Utah in May last year. It’s one of the most efficient data centers ever built, making clever use of power and cooling technologies.
And here are the machines inside the data center that run Cassini.
Before I conclude, I want to let you know that we’re hiring in the search team, and right across all the teams that use and maintain Hadoop and HBase. If you’re a Hadoop or HBase committer, I’d especially love to talk to you … And with that, I want to thank you all for listening, and I hope you enjoy a great conference.