Yahoo uses Apache Hadoop extensively to power many of its products and services. Hadoop allows Yahoo to gain insights from massive amounts of data, including user data from services like Flickr and Yahoo Mail. Yahoo has contributed over 70% of the code to the Apache Hadoop project to date. Hadoop is critical to Yahoo's business by enabling personalization, spam filtering, content optimization, and other data-driven features. Yahoo runs Hadoop on tens of thousands of servers storing over 100 petabytes of data. The company continues working to enhance Hadoop's scalability, flexibility, and performance to make it more suitable for enterprise use.
1. YAHOO & HADOOP
USING AND IMPROVING APACHE HADOOP AT YAHOO!
Eric Baldeschwieler
VP, Hadoop Software
2. AGENDA
• Brief Overview
• Hadoop @ Yahoo!
• Hadoop Momentum
• The Future of Hadoop
3. WHAT’S happening
- Big Data is here!
- unstructured data
- petabyte scale
- operationally critical
Flickr : sub_lime79
4. TURNING DATA INTO INSIGHTS
• machine learning
• logic regression
• time series
• content clustering
• algorithms
• ad inventory modeling
• user interest prediction
• factorization models
Flickr : NASA Goddard Photo and Video
6. HADOOP: POWERING YAHOO!
science + big data + insight = personal relevance = VALUE
Flickr : DDFic
7. WHAT IS HADOOP?
Commodity
• Computers
• Network
Focus on
• Simplicity
• Redundancy
• Scale
• Availability
Stack
• Pig, Hive – Programming Languages
• MapReduce – Computation
• HDFS – Storage
Transforms commodity equipment into a service that:
• HDFS – Stores petabytes of data reliably
• Map-Reduce – Allows huge distributed computations
Key Attributes
• Redundant and reliable – Doesn’t stop or lose data even as hardware fails
• Easy to program – Our rocket scientists use it directly!
• Very powerful – Allows the development of big data algorithms & tools
• Batch processing centric
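The Map-Reduce model named on this slide can be illustrated with a minimal, single-process sketch. This is not Hadoop code and involves no cluster; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only to show the three steps a MapReduce framework coordinates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as a framework would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in one process (no cluster involved)."""
    emitted = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(emitted)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample = ["big data is here", "Big Data at petabyte scale"]
    print(word_count(sample))
```

In a real deployment the map and reduce functions would run on many machines over data stored in HDFS; only the shape of the computation is the same here.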
8. WHAT HADOOP ISN’T
• A replacement for relational and data warehouse systems
• A transactional / online / serving system
• A low latency or streaming solution
9. HADOOP IN THE ENTERPRISE
[Diagram: how Hadoop fits alongside existing enterprise systems]
• Business Intelligence Applications
• HADOOP CLUSTER(S), RDBMS, EDW, Data Marts
• Structured Data: Transactions, Interactions
• Semi-Structured or Un-Structured Data: Web Logs, Server Logs, Social Media, etc…
• Business Applications
11. HADOOP @ YAHOO!
“Where Science meets Data”
HADOOP CLUSTERS
• Tens of thousands of servers
• 10s of Petabytes
PRODUCTS
• Data Analytics
• Content Optimization
• Content Enrichment
• Yahoo! Mail Anti-Spam
• Advertising Products
• Ad Optimization
• Ad Selection
• Big Data Processing & ETL
APPLIED SCIENCE
• User Interest Prediction
• Ad inventory prediction
• Machine learning - search ranking
• Machine learning - ad targeting
• Machine learning - spam filtering
12. FROM PROJECT TO CORE PLATFORM
[Chart: Thousands of Servers (0 to 90) and Petabytes (0 to 250) by year, 2006 to 2010, with phases labeled Research, Science Impact, and Daily Production]
• 40K+ Servers
• 170 PB Storage
• 5M+ Monthly Jobs
• “Behind every click”
13. HADOOP POWERS THE YAHOO! NETWORK
• advertising optimization
• data analytics
• machine learning
• search ranking
• advertising data systems
• Yahoo! Mail anti-spam
• audience, ad and search pipelines
• ad selection
• Yahoo! Homepage
• Content Optimization
• ad inventory prediction
• user interest prediction
14. CASE STUDY: YAHOO! HOMEPAGE
Personalized for each visitor
Result: twice the engagement
• Recommended links: +79% clicks vs. randomly selected
• News Interests: +160% clicks vs. one size fits all
• Top Searches: +43% clicks vs. editor selected
15. CASE STUDY: YAHOO! HOMEPAGE
• Serving Maps
• Users - Interests
• Five Minute Production
• Weekly Categorization models
SCIENCE HADOOP CLUSTER
» Machine learning to build ever better categorization models
• Input: USER BEHAVIOR
• Output: CATEGORIZATION MODELS (weekly)
PRODUCTION HADOOP CLUSTER
» Identify user interests using Categorization models
• Input: USER BEHAVIOR
• Output: SERVING MAPS (every 5 minutes)
SERVING SYSTEMS
• Build customized home pages with latest data (thousands / second)
• ENGAGED USERS
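The two cadences in this case study (a weekly model build, a five-minute serving-map refresh, and a fast lookup at serving time) can be sketched as three small functions. This is an illustrative toy under stated assumptions, not Yahoo!'s implementation: the event shape, the word-voting "model", and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical event shape: (user_id, text of the item the user clicked).
Event = Tuple[str, str]


def train_categorization_model(labeled_items: List[Tuple[str, str]]) -> Dict[str, str]:
    """Weekly job (stand-in for the science cluster): learn word -> category.

    Each word is assigned the category it co-occurs with most often. A real
    model would be far richer; this only illustrates the slow cadence.
    """
    votes: Dict[str, Counter] = {}
    for text, category in labeled_items:
        for word in text.lower().split():
            votes.setdefault(word, Counter())[category] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}


def build_serving_map(events: List[Event], model: Dict[str, str]) -> Dict[str, str]:
    """Five-minute job (stand-in for the production cluster): user -> top interest."""
    interests: Dict[str, Counter] = {}
    for user_id, text in events:
        for word in text.lower().split():
            category = model.get(word)
            if category is not None:
                interests.setdefault(user_id, Counter())[category] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in interests.items()}


def serve_homepage(user_id: str, serving_map: Dict[str, str], default: str = "top news") -> str:
    """Serving path: a dictionary lookup, with no model evaluation per request."""
    return serving_map.get(user_id, default)


if __name__ == "__main__":
    model = train_categorization_model(
        [("playoff game score", "sports"), ("stock market rally", "finance")]
    )
    serving_map = build_serving_map(
        [("u1", "game score tonight"), ("u2", "market rally")], model
    )
    print(serve_homepage("u1", serving_map), serve_homepage("u3", serving_map))
```

The point of the split is that the expensive work happens in batch, so the serving systems only perform precomputed lookups when building pages.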
16. CASE STUDY: YAHOO! MAIL
Enabling quick response in the spam arms race
• 450M mail boxes
• 5B+ deliveries/day
SCIENCE
• Antispam models retrained every few hours on Hadoop
PRODUCTION
• 40% less spam than Hotmail and 55% less spam than Gmail
17. YAHOO! & APACHE HADOOP
Yahoo! has contributed 70+% of Apache Hadoop code to date
Hadoop is not our business, but Hadoop is key to our business
• Yahoo! benefits from open source ecosystem around Hadoop
• Hadoop drives revenue at Yahoo! by making our core products better
We need Hadoop to be rock solid
• We invest heavily in core Hadoop development
• We focus on scalability, reliability, availability
We fix bugs before you see them
• We run very large clusters
• We have a large QA effort
• We run a huge variety of workloads
We are good Apache Hadoop citizens
• We contribute our work to Apache
• We share the exact code we run
22. MAKING HADOOP ENTERPRISE-READY
WHAT’S NEXT
Hadoop is far from “done”
• Current implementation is showing its age
• Need to address several deficiencies in scalability, flexibility, ease of use & performance
Yahoo! is working on Next Generation of Hadoop
• MapReduce: Rewrite to improve performance; pluggable support for new programming models
• HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS
Apache should remain the hub of Hadoop ecosystem
• Yahoo! contributes all Hadoop changes back to Apache Hadoop
• Everyone benefits from shared neutral foundation
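The "flush & sync" item for logging applications can be illustrated by analogy with an ordinary local file. The sketch below uses only Python's standard library on the local filesystem; it does not call any HDFS API, and `append_durably` is a hypothetical helper. It shows why a logger wants both steps: flushing hands buffered bytes to the operating system, and syncing asks the operating system to commit them to the storage device.

```python
import os
import tempfile


def append_durably(path: str, record: str) -> None:
    """Append one log record and push it to stable storage before returning."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(record + "\n")
        log.flush()              # flush: move bytes from the process buffer to the OS
        os.fsync(log.fileno())   # sync: ask the OS to commit them to the device


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "app.log")
        append_durably(log_path, "job started")
        append_durably(log_path, "job finished")
        with open(log_path, encoding="utf-8") as reader:
            print(reader.read().splitlines())
```

Without such guarantees, a record that an application believes it has logged may be lost if the writer fails before buffered data reaches storage.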