Hugh Williams will discuss building Cassini, a new search engine at eBay that processes over 250 million search queries and serves more than 2 billion page views each day. Hugh will trace the genesis and building of Cassini, as well as highlight and demonstrate the key features of this new search platform. He will discuss some of the challenges in scaling arguably the world’s largest real-time search problem, including the unique considerations associated with e-commerce and eBay’s domain, and how Hadoop and HBase are used to solve these problems.
7. 97 million active buyers and sellers worldwide
250 million queries each day to our search engine
200+ million items live in more than 50,000 categories
8. 9 petabytes of data in our Hadoop and Teradata clusters
2 billion page views each day
75 billion database calls each day
9. Huge Opportunity: Taking the “e” out of ecommerce
[Chart: Yesterday (2008 = $325B): 4% online, 96% offline. Today: 6% online, 37% web-influenced offline, the rest offline. Tomorrow: 2013 = $10T. Sources: Forrester, Euromonitor, and Economist Intelligence Unit]
17. Project Cassini at eBay
Our most ambitious core engineering project
► Entirely new codebase
► World-class, from a world-class team
► Platform for ranking innovation
► Uses all data by default
► Flexible
► Automated
► Four major tracks, 100+ engineers
► Complete in less than 18 months
19. A Short Primer on Indexing
When a user types a query, it isn’t practical to exhaustively scan 200+ million items
Instead, we create an inverted index, and use it to rank the items and find the best matches
An inverted index is similar to the index in the back of a book:
A set of searchable terms
For each term, a list of locations
20. An Inverted Index
cat → 3 postings: documents 1, 2, 7
[Figure: an eight-document collection; the documents containing “cat” include “cat on the mat”, “fat cat”, and “wild cat”]
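The index-in-a-book analogy above can be sketched in a few lines of Python. This is a toy illustration, not Cassini code; the document ids and text are invented to echo the slide’s example:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical miniature collection echoing the slide's example.
docs = {1: "cat on the mat", 2: "fat cat", 7: "wild cat"}
index = build_inverted_index(docs)
print(index["cat"])  # → [1, 2, 7]
```

A real engine would also store positions within each document (for phrase matching) and per-term statistics (for ranking), but the term-to-postings shape is the same.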
22. Larger index than Voyager
Descriptions, seller data, other metadata, …
Much more history in our indexes
More computationally expensive work at index-time (and less at query-time)
Ability to rescore or reclassify the entire site inventory
23. Hadoop:
Distributed indexing – platform for hourly index refreshes
Fault tolerance through HDFS replication
Better utilization of hardware – can generate different index types with one cluster
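The map/reduce shape of distributed index construction can be sketched as follows. This is an in-memory toy, not the actual Hadoop job; the real pipeline runs the map and reduce phases across a cluster, and the document data is invented:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    # Mapper: emit a (term, doc_id) pair for every term occurrence.
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(term, doc_ids):
    # Reducer: merge occurrences into a deduplicated, sorted postings list.
    return term, sorted(set(doc_ids))

def run_job(docs):
    intermediate = []
    for doc_id, text in docs.items():          # "map" over all documents
        intermediate.extend(map_phase(doc_id, text))
    intermediate.sort(key=itemgetter(0))       # the "shuffle" step
    return dict(
        reduce_phase(term, [doc_id for _, doc_id in group])
        for term, group in groupby(intermediate, key=itemgetter(0))
    )

index = run_job({1: "cat on the mat", 2: "fat cat"})
print(index["cat"])  # → [1, 2]
```

The framework supplies the sort/shuffle and fault tolerance; the indexing code only has to provide the map and reduce functions.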
24. HBase:
Column-oriented data store on top of HDFS
Used to store eBay’s items
Bulk and incremental item writes
Fast item reads for index construction
Fast item reads and writes for item annotation
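The access patterns listed above can be illustrated with a toy in-memory stand-in for a column-oriented item store. The `ItemStore` class, row keys, and column names here are invented for illustration and are not the HBase client API:

```python
# Toy in-memory stand-in for a column-oriented item store (not real HBase).

class ItemStore:
    def __init__(self):
        self.rows = {}  # row key -> {column: value}

    def put(self, row_key, columns):
        # Incremental write: merge new columns into an existing row.
        # No fixed schema - any column can be added at any time.
        self.rows.setdefault(row_key, {}).update(columns)

    def bulk_put(self, items):
        # Bulk write: load many rows at once (e.g. a full rescore).
        for row_key, columns in items.items():
            self.put(row_key, columns)

    def get(self, row_key, columns=None):
        # Fast point read, the pattern item annotation relies on.
        row = self.rows.get(row_key, {})
        if columns is None:
            return dict(row)
        return {c: row[c] for c in columns if c in row}

    def scan(self, prefix=""):
        # Ordered scan over row keys, used when building a full index.
        for key in sorted(self.rows):
            if key.startswith(prefix):
                yield key, self.rows[key]

store = ItemStore()
store.put("item:42", {"meta:title": "wild cat figurine"})
store.put("item:42", {"rank:score": "0.87"})  # later annotation
print(store.get("item:42", ["meta:title"]))  # → {'meta:title': 'wild cat figurine'}
```

The schemaless row/column model is the point: an annotation pipeline can add new columns (like a rank score) without any schema migration.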
25. Everyone is still learning
Some issues only appear at scale
Production cluster configuration is challenging
Hardware issues
Tuning cluster configuration to our workloads
HBase stability
Monitoring health of HBase
Managing workflows – many-step map/reduce jobs
Editor’s Notes
Great to be here – it’s a privilege to speak to you all. Today, I’m going to talk to you about eBay, our new search engine Cassini, and how Hadoop and HBase are used in search. Highlight title – and mention that I work on Marketplaces (ebay.com, and its sister sites all over the world). Let me begin by giving you a brief overview of eBay…
We’re 16 years old. Here is a shot of the original site – called AuctionWeb – that eBay’s founder, Pierre Omidyar, launched over Labor Day weekend in 1995 … as an “experiment.” I’ve circled some text on this page – not sure if you can read it – but it says “There are always SEVERAL HUNDRED auctions underway, so you’re bound to find something interesting.” “Several hundred” … those were our humble beginnings, though pretty impressive at the time. The only thing that’s remained the same since 1995 is that eBay has always connected buyers and sellers.
In 2010, we sold $62 billion in merchandise.
We’re one of the Web’s largest properties … and the pace of change is being driven largely by our customers and their new, increasingly sophisticated shopping expectations … <read slide>
We are fast becoming a data company, where our engineers use data every day to inform what they do. And we have a lot of data, as you can imagine from our 97 million users, 200+ million listings, 250 million search queries, and 2 billion page views each day.
Before I move on to talk about search, I want to let you know that it’s becoming more interesting at eBay: customers are changing how they shop, and we’re at the center of this revolution. Nearly half of all offline purchases have an online component. The offline and online worlds are merging … and this is THE NEW RETAIL landscape. And it’s being driven by consumers who are using their smartphones and mobile devices to change the way they shop. eBay and mobile commerce are at the center of this shift – more change is going to happen in commerce in the next year or two than in the past ten.
I’ve set the context on eBay. Now, I want to introduce you to project Cassini, our most ambitious engineering project at eBay. We are completely rewriting our search engine, and Hadoop and HBase are key to this rewrite. But first, let me tell you something about our current search engine, Voyager.
Voyager is named after the NASA space probes launched in 1977 to explore the outer planets.
It’s been driving the search experience on eBay since the early 2000s. Improvements to Voyager have been critical to improving the buyer experience and driving our sellers’ businesses.
However, Voyager is behind the times: a lot has happened in search since 2002. Our Best Match ranking function uses only tens of factors. It only searches item titles by default – we don’t rank using the great information that’s in the descriptions and elsewhere. Search is very literal – it finds almost exactly what you type; it doesn’t always understand what you mean.
Voyager is a challenge to manage and run as an engineering team. It’s very manual, so deployments of software and data take time. Troubleshooting is slow. We decided in late 2010 that Voyager needed to be replaced, and that began project Cassini.
Cassini is named after the spacecraft launched in 1997, a nod to it being many years ahead of Voyager.
<read and click>
We’re probably the only major web property that’s completely rewriting its search engine from scratch. You can see many of the features of Cassini, and I’ll just talk about a couple briefly. First, it will use all data by default – all that great data in descriptions, information in images, data about our buyers and sellers, and the signals that come from 2 billion page views each day will be used in Cassini to compute its Best Match. Our users are going to see world-class results, and it’ll be a much more powerful tool to connect buyers and sellers. Second, automation is key. There’ll be no more manual operation of the search engine – rolling out code and data, monitoring, alerting, remediation, and more are fully automated. Third, it’s a major engineering undertaking: we have over 100 engineers working across four parallel tracks to deliver Cassini in less than 18 months from start to finish.
We’ve hit a few major internal milestones, and internal users can already use Cassini if they’d like. <read slide>
To understand how Hadoop and HBase play a role in Cassini, let me explain some of the fundamentals of building a search engine. <first point> Exhaustively scanning 200 million items would take about half an hour, even if we could process a document every 10 milliseconds and had 1,000 machines working concurrently. <second point> An inverted index is an auxiliary data structure that allows fast calculation of the best-matching search results. A typical query takes ten milliseconds using the same 1,000 machines and an inverted index. <third point> Walk through using the index in the back of a book…
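The reason queries are cheap with an inverted index can be sketched concretely: instead of scanning every item, the engine merges the short, sorted postings lists of the query terms. A toy illustration with invented postings data:

```python
# Toy conjunctive query evaluation: intersect sorted postings lists
# rather than scanning the whole collection. Data below is invented.

def intersect(a, b):
    """Linear merge of two sorted postings lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

index = {"wild": [4, 7, 9], "cat": [1, 2, 4, 7]}
print(intersect(index["wild"], index["cat"]))  # → [4, 7]
```

The merge touches only the postings of the query terms, which is why query cost scales with the lengths of those lists rather than with the 200+ million items in the collection.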
It isn’t possible to create an index for over 200 million items on a single machine – we can’t keep in memory the terms and all of their positions in the documents. What we do at scale is distributed index construction; it is classic map/reduce (and has been since well before the phrase was coined). We build an inverted index for a small part of the document collection on one machine, and do the same on hundreds of other machines. We merge the small inverted indexes into larger inverted indexes that are distributed to our query-serving grid. This is a technical graphic from our team; it shows the seven high-level stages of creating all the index pieces we need in Cassini.
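The merge of per-machine indexes described here can be sketched as follows. This is a toy illustration; the shard contents are invented:

```python
# Toy sketch of the merge step: small per-machine inverted indexes are
# combined term by term into one larger index.

def merge_indexes(shards):
    merged = {}
    for shard in shards:
        for term, postings in shard.items():
            merged.setdefault(term, []).extend(postings)
    # Keep each postings list sorted so downstream merges stay cheap.
    return {term: sorted(postings) for term, postings in merged.items()}

shard_a = {"cat": [1, 2], "mat": [1]}
shard_b = {"cat": [7], "wild": [7]}
print(merge_indexes([shard_a, shard_b])["cat"])  # → [1, 2, 7]
```

Because each shard indexes a disjoint slice of the collection, merging is just concatenating postings per term; a production merge would stream from disk rather than hold everything in memory.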
Let’s talk about why Cassini indexing is more challenging than in Voyager, and why we changed the architecture dramatically to include Hadoop and HBase. First reason: Voyager completed pool = 14 days; Cassini = 90 days. Second reason: we refresh indexes on an hourly basis – this helps improve ranking, for example by updating item and seller information. Third reason: full power to our ranking team to make fast-twitch changes.
Hadoop is the platform for our index construction and index maintenance in Cassini. It’s ideal because it gives us fault tolerance and smart utilization of our hardware – without Hadoop, we’d probably have small pools of machines that run custom code for different stages of our index construction. Our Hadoop clusters for analytics are much larger, but this is our major use of Hadoop in driving a customer experience. It’s pretty large scale too: while we have over 200 million active items at any time, we also maintain a “completed index” that is over 1 billion items.
We use HBase to store eBay’s items for index construction and maintenance. HBase, as you know, is a column-oriented data store built on top of HDFS that is tightly integrated with the Hadoop map/reduce framework. It has no schema, which is great for us – it means what we store can evolve. HBase supports fast item lookups and scans, both of which are necessary for index construction. Incremental writes are what we normally do: about 10 million items enter eBay each day, and we need them in the searchable index within a couple of minutes. Bulk writes are necessary when our ranking team wants to rescore all our items.
We’ve got running Hadoop at scale mostly down, but we have challenges with HBase. First issue: Ops and Dev are both new to HBase – lots of learning through failures. Second issue: we test using a mini Hadoop cluster plus local HBase. Third issue: getting the hardware tuned just right. Fourth issue: HBase stability – unstable region servers and HBase master, regions stuck in transition, etc. Fifth issue: monitoring – a lot of the time we don’t recognize there are issues until jobs begin to fail. Sixth issue: workflow – our index chains have around 20 stages. But it’s not all doom and gloom: we’ve recently had a couple of weeks of stability, and we’re getting more confident each week … Before I finish today, I want to show you a couple of pictures of our data center that houses Cassini …
This is our new data center that we opened in Salt Lake City, Utah in May last year. It’s one of the most efficient data centers ever built, making clever use of power and cooling technologies.
And here are the machines inside the data center that run Cassini.
Before I conclude, I want to let you know that we’re hiring in the search team, and right across all the teams that use and maintain Hadoop and HBase. If you’re a Hadoop or HBase committer, I’d especially love to talk to you … And with that, I want to thank you all for listening, and I hope you enjoy a great conference.