The document discusses YapMap, a visual search technology focused on threaded conversations, built on Hadoop to handle data at massive scale. The presentation covers YapMap's approach to crawling forums and message boards to build a searchable index, its distributed processing pipeline in Hadoop that reconstructs threads from individual posts and generates pre-indexed sub-threads, and how it presents search results with contextual threads and posts.
Searching Conversations with Hadoop
1. Searching Conversations using Hadoop: More than Just Analytics
   Jacques Nadeau, CTO
   jacques@yapmap.com
   @intjesus
   June 13, 2012
2. Agenda
   ✓ What is YapMap?
   • Fitting Hadoop into your architecture
   • YapMap Approach
     – Crawling
     – Processing
     – Index Generation
     – Results
   • Operations, Getting Started & Questions
3. What is YapMap?
   • A visual search technology
   • Focused on threaded conversations
   • Built to provide better context and ranking
   • Built on the Hadoop ecosystem for massive scale
   • Two self-funded guys
   • Motoyap.com largest implementation at 650MM automotive docs (www.motoyap.com)
4. Why do this?
   • Discussion forums and mailing lists are the primary home for many hobbies
   • Threaded search sucks
     – No context in the middle of the conversation
5. How does it work?
   (Diagram: Post 1 through Post 6 laid out as a thread)
6. Conceptual data model
   (Diagram: a Thread containing Posts 1–3, a Sub-thread containing Posts 4–6, and an individual post)
   • Single thread scattered across many web pages
   • Posts don't necessarily arrive in order
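The two bullets above are the core of the data model problem: posts arrive out of order and scattered across pages, and long threads must be cut into sub-threads. A minimal sketch of that reassembly step, with a hypothetical split size (the deck does not state the real one) and invented field names:

```python
# Conceptual sketch, not YapMap's actual code: reorder out-of-order
# posts by their position in the thread, then split the ordered thread
# into fixed-size sub-threads.

SUB_THREAD_SIZE = 3  # hypothetical split size for illustration


def build_sub_threads(posts, size=SUB_THREAD_SIZE):
    """Order posts by position, then split into sub-threads of `size`."""
    ordered = sorted(posts, key=lambda p: p["position"])
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Posts 1-6 from the diagram, arriving out of order:
posts = [{"position": n, "body": f"post {n}"} for n in (4, 1, 6, 2, 5, 3)]
subs = build_sub_threads(posts)
# subs[0] holds posts 1-3, subs[1] holds posts 4-6
```

The real pipeline does this incrementally as posts trickle in (see the processing-pipeline slides later); this sketch only shows the ordering-and-splitting idea on a complete batch.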
8. Agenda
   • What is YapMap?
   ✓ Fitting Hadoop into your architecture
   • YapMap Approach
     – Crawling
     – Processing
     – Index Generation
     – Results
   • Operations, Getting Started & Questions
9. Evolution of Hadoop
   Hadoop Today:
   • Batch analysis system
   • Lacks enterprise features
   • Limited applications, primarily BI & analytics
   • Clusters focused on point use cases
   Hadoop Tomorrow:
   • Real-time enterprise application platform
   • Strong enterprise features (e.g. HA, stability, compat)
   • BI, Email/Collaboration, Marketing DW, etc.
   • Shared resource supporting a large number of use cases
11. General architecture
    (Diagram) Crawler (RabbitMQ) → MapReduce Processing Pipeline → Indexing Engine → Results Presentation
    Backing infrastructure: HBase, Riak, HDFS/MapRfs, Zookeeper, MySQL
12. Hadoop doesn't solve all problems

                          MySQL                     HBase                           Riak
    Primary use           Business management       Storage of crawl data,          Storage of components directly
                          information               processing pipeline             related to presentation
    Key features that     Transactions, SQL, JPA    Consistency, redundancy,        Predictable low latency, full
    drove selection                                 memory-to-persistence ratio     uptime, max one IOP per object
    Average object size   Small                     20k                             2k
    Object count          <1 million                500 million                     1 billion
    System count          2                         10                              8
    Memory footprint      <1gb                      120gb                           240gb
    Dataset size          10mb                      10tb                            2tb

    We also evaluated Voldemort and Cassandra.
13. How we use Hadoop
    • Zookeeper (alternatives: Corosync, Accord, JGroups)
      – Distributed locks
      – Cluster membership coordination
      – Index distribution coordination
    • HBase (alternatives: Teradata, Exadata, sharded MySQL, Cassandra)
      – Primary data store
      – Crawl caching
      – Data merging
      – Processing pipeline
    • MapReduce (alternatives: MPI, JPPF, Clustered EJB)
      – Index generation
    • MapRfs/HDFS (alternatives: Gluster, SAN/NAS, Lustre)
      – Index storage
    • Mahout (alternatives: Carrot2, Lingpipe, Lexalytics)
      – Cluster identification
15. YapMap crawling challenges
    • Depth versus breadth
    • Crawls must be throttled to avoid overloading
    • Avoid duplicate crawling
    • Save progress of long-running crawls
    • Need an elastic and fully distributed approach to crawling
    • Crawler death managed
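The throttling bullet above is the politeness constraint every forum crawler faces. One common way to implement it, sketched here as an assumption (the deck does not show YapMap's mechanism), is a per-domain rate gate that tracks the earliest time the next fetch is allowed:

```python
# Minimal sketch of per-domain crawl throttling; an illustration of the
# "crawls must be throttled" bullet, not YapMap's actual implementation.
import time


class DomainThrottle:
    """Allow at most one fetch per `delay` seconds per domain."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.next_ok = {}  # domain -> earliest allowed fetch time

    def acquire(self, domain, now=None):
        """Return True if a fetch may proceed now, else False."""
        now = time.monotonic() if now is None else now
        if now >= self.next_ok.get(domain, 0.0):
            self.next_ok[domain] = now + self.delay
            return True
        return False


t = DomainThrottle(delay=2.0)
t.acquire("forum.example.com", now=0.0)  # True: first fetch allowed
t.acquire("forum.example.com", now=1.0)  # False: still inside the delay
t.acquire("forum.example.com", now=2.5)  # True: delay has elapsed
```

In a distributed crawler the same idea has to be coordinated across machines, which is one reason the next slide shows a per-domain lock held in Zookeeper.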
16. Crawler overview
    1. New crawl job arrives (RabbitMQ)
    2. Crawler checks document cache (HBase)
    3. Crawler acquires domain lock (Zookeeper)
    4. Crawler retrieves posts (using external assets as necessary)
    5. Crawler outputs append to DFS
    6. Crawler generates more crawl tasks
    After achieving time and/or quantity thresholds, the crawl pauses, checkpoints in HBase, and resubmits to the RabbitMQ queue.
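The pause/checkpoint/resubmit cycle at the end of the diagram is what makes long crawls elastic: no crawler owns a job for longer than a threshold. A sketch of that control flow, with plain dicts and lists standing in for HBase and RabbitMQ (all names here are hypothetical):

```python
# Sketch of the checkpoint-and-resubmit cycle from the crawler diagram.
# `checkpoint_store` stands in for HBase and `queue` for RabbitMQ; the
# `fetch` callback returns newly discovered URLs for a fetched page.

def run_crawl(job, fetch, checkpoint_store, queue, max_pages=100):
    """Crawl until the quantity threshold, then checkpoint and resubmit."""
    pages = 0
    frontier = list(job["frontier"])
    while frontier and pages < max_pages:
        url = frontier.pop()
        frontier.extend(fetch(url))  # fetch returns newly discovered URLs
        pages += 1
    if frontier:  # threshold hit with work remaining:
        checkpoint_store[job["id"]] = frontier  # checkpoint (in "HBase")
        queue.append({"id": job["id"], "frontier": frontier})  # resubmit
    return pages
```

Because the remaining frontier is both checkpointed and resubmitted, any crawler that picks the job back up can continue where the last one stopped, which also covers the "crawler death managed" bullet from the previous slide.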
18. Processing pipeline challenges
    • Independent posts => complete threads
    • Split long threads into multiple sub-threads
    • Fully parallel processing pipeline
    • Accommodate out-of-order data
19. Processing pipeline using HBase
    • Multiple steps with checkpoints to manage failures
    • Idempotent operations at each stage of the process
    • Utilize optimistic locking to do coordinated merges
    • Use regular cleanup scans to pick up lost tasks
    • Control batch size of messages to control throughput versus latency
    • Out-of-order input assumed
    (Pipeline diagram: Posts from Crawler → Message Batch → Build thread parts → Merge + split threads → Process & pre-index sub-threads → Indexing / RT Indexing, backed by HBase and Riak)
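The "idempotent operations" and "optimistic locking" bullets work together: a merge is retried on version conflict, and because it is a set union, re-delivered posts are harmless. A minimal sketch of that pattern, loosely modeled on HBase's checkAndPut but using an invented in-memory store:

```python
# Sketch of optimistic-locking merges, an illustration of the slide's
# bullets rather than YapMap's code. The store imitates the shape of
# HBase's checkAndPut: a write succeeds only if the version matches.

class VersionedStore:
    """Tiny versioned KV store with a compare-and-swap primitive."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def get(self, key):
        return self.data.get(key, (0, []))

    def check_and_put(self, key, expected_version, value):
        version, _ = self.data.get(key, (0, []))
        if version != expected_version:
            return False  # a concurrent writer merged first; caller retries
        self.data[key] = (version + 1, value)
        return True


def merge_posts(store, thread_id, new_posts):
    """Idempotently merge posts into a thread; retry on CAS failure."""
    while True:
        version, posts = store.get(thread_id)
        merged = sorted(set(posts) | set(new_posts))
        if store.check_and_put(thread_id, version, merged):
            return merged
```

Since the merge is a union, running the same stage twice (after a checkpointed failure, or when a cleanup scan re-dispatches a lost task) converges to the same thread state, which is exactly what the idempotence bullet requires.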
21. Index generation challenges
    • Shard size control
    • Index ordering
    • Maintain inverted and un-inverted data in parallel
    • Minimize merging costs
    • Support multi-grain indexing and scoring
22. Index shards loosely based on HBase regions
    • HBase primary key order is same as index order
    • Shards sized based on parallelization requirements
      – Typically ~5gb each
    • Shards are based on snapshots of splits for data locality
    (Diagram: pre-index docs in regions R1, R2, R3 mapping to Shard 1, Shard 2, Shard 3)
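One way to read this slide: since region order already matches index order, shard planning reduces to packing consecutive regions until the ~5gb target is reached. A sketch of that packing under those assumptions (the function and its signature are invented for illustration):

```python
# Sketch of "shards sized based on parallelization requirements": pack
# consecutive HBase regions, kept in primary-key order (which the slide
# says matches index order), into shards of roughly `target` bytes.
# Hypothetical helper, not YapMap's actual planner.

def plan_shards(region_sizes, target=5 * 1024**3):
    """Group consecutive regions into shards of about `target` bytes."""
    shards, current, current_size = [], [], 0
    for i, size in enumerate(region_sizes):
        current.append(i)
        current_size += size
        if current_size >= target:
            shards.append(current)
            current, current_size = [], 0
    if current:
        shards.append(current)
    return shards
```

Keeping regions consecutive preserves the key-order property, and taking sizes from a snapshot of the splits (rather than live regions) is what gives the data locality the last bullet mentions.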
23. MapReduce for Index Generation
    (Diagram: IndexedTableInputFormat feeds the Map phase; a Term Distribution Partitioner using split statistics routes terms across the map/reduce barrier; a FileAndPutOutputCommitter writes the results)
    Outputs: Term: Posting Lists, inverted data, un-inverted data characteristics, indices & dictionaries, stored in DFS and HBase
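The data flow on this slide is the classic inverted-index MapReduce: map emits (term, doc) pairs, a partitioner routes each term to a reducer, and reduce builds the posting lists. A tiny in-memory imitation of that flow (the real job uses IndexedTableInputFormat and custom partitioner/committer classes; everything below is a stand-in):

```python
# In-memory imitation of the index-generation job's data flow, for
# illustration only: map emits (term, doc_id), a partitioner routes
# terms to reducers, and reduce builds sorted posting lists.
from collections import defaultdict


def map_phase(docs):
    for doc_id, text in docs.items():
        for term in text.lower().split():
            yield term, doc_id


def partition(term, num_reducers):
    # Stand-in for the term-distribution partitioner on the slide.
    return hash(term) % num_reducers


def reduce_phase(pairs):
    postings = defaultdict(list)  # term -> sorted, de-duplicated posting list
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)


def run_job(docs, num_reducers=2):
    buckets = defaultdict(list)
    for term, doc_id in map_phase(docs):
        buckets[partition(term, num_reducers)].append((term, doc_id))
    index = {}
    for pairs in buckets.values():  # each bucket is one reducer's input
        index.update(reduce_phase(pairs))
    return index


docs = {1: "hadoop search", 2: "search threads", 3: "hadoop threads"}
index = run_job(docs)
# index["hadoop"] == [1, 3], index["search"] == [1, 2]
```

The slide's term-distribution statistics serve the same purpose as the hash here, but with a smarter goal: balancing reducer load when term frequencies are highly skewed.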
25. Presentation Layer Challenges
    • Distributed search tree
    • High performance index loading and serving
    • No SPOF
    • Effective memory management & allocation
    • Automatic cluster management
    • Smart index distribution
26. Results Presentation Layer
    Results Server:
    1. Request
    2. Query Zookeeper for active servers
    3. Fan-out request, consolidate responses
    4. Retrieve assets (Riak)
    5. Response
    Index Server / Shard Daemons:
    1. Load shard profile & configure memory
    2. Parallel load and integrate shard (HBase, DFS)
    3. Register new shard availability (Zookeeper)
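Step 3 of the query path, "fan-out request, consolidate responses", is a scatter-gather: every active shard answers for its slice of the index, and the results server merges the per-shard hits. A minimal sketch with shard servers modeled as plain functions (all names and the (score, doc_id) hit shape are assumptions for illustration):

```python
# Sketch of scatter-gather consolidation: query every active shard
# server and keep the k best-scoring hits overall. In the real system
# the shard list comes from Zookeeper and the calls run in parallel.
import heapq


def fan_out(query, shard_servers, k=3):
    """Send `query` to each shard and keep the k best-scoring hits."""
    hits = []
    for search in shard_servers:  # sequential here; parallel in production
        hits.extend(search(query))  # each hit is a (score, doc_id) pair
    return heapq.nlargest(k, hits)


shard_a = lambda q: [(0.9, "t17"), (0.4, "t02")]
shard_b = lambda q: [(0.7, "t33")]
fan_out("exhaust leak", [shard_a, shard_b], k=2)
# → [(0.9, "t17"), (0.7, "t33")]
```

Consolidating on score alone is the simplest policy; the deck's emphasis on context suggests the real consolidation also considers thread and sub-thread structure, but that detail is not shown.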
27. Agenda
    • What is YapMap?
    • Fitting Hadoop into your architecture
    • YapMap Approach
      – Crawling
      – Processing
      – Index Generation
      – Results
    ✓ Operations, Getting Started & Questions
28. Operations
    • Hardware
      – Supermicro with 8-core low power chips, low power DDR3
      – WD Black 2TB drives
      – DDR Infiniband using IPoIB for index loading performance
    • Software
      – Started on Cloudera, switched to MapR's M3 distribution of Hadoop
    • GC was painful, now manageable
      – HBase now supports MSLAB for writes and off-heap block cache to support larger memory usage
      – Shard servers utilize large pages to minimize fragmentation
      – Shard servers do immediate large allocations to minimize GC problems
29. Getting Started
    • Amazon Elastic MapReduce
      – The Common Crawl dataset is a great data set to start with
    • Cheap old-gen cluster if you want to run things like HBase
      – We built an effective 6-node Hadoop/HBase cluster for $1500 (Craigslist, eBay)
      – Mailing lists are littered with performance and interconnectivity challenges when using cloud computing resources to do Hadoop stuff
30. Questions
    • Why not Lucene/Solr/ElasticSearch/Katta/etc?
      – Not built to work well with Hadoop and HBase (Blur.io is first to tackle this head on)
      – Data locality between threads and posts to do document-at-once scoring
    • Why not store indices directly in HBase?
      – Single cell storage would be the only way to do it efficiently
      – No such thing as a single cell no-read append (HBASE-5993)
      – No single cell partial read
    • Why use Riak for the presentation side?
      – Hadoop SPOF
      – Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node failure (HBASE-2357)
      – Riak has more predictable latency
    • Why did you switch to MapR?
      – Index load performance was substantially faster
      – Snapshots in the trial copy were nice for those 30 days
      – Less impact on HBase performance