Large Scale Modeling and Data Insights

Large Scale Modeling

Overview

Ferris Jumah

Prediction Analytics Innovation Summit 2013

November 15th, 2013

Large Scale Modeling

•  What does large scale modeling mean to you?

“Building models that consume and process data sets so large that it is difficult to use current modeling tools and methods”

LinkedIn News

•  Anytime a user lands on their homepage, a few items from our news product are recommended to them
•  This is powered by a large scale recommendation engine
•  For every user, at LinkedIn Scale

3M+ Company Pages
2 new Members per second
184M+ Monthly Unique Visitors
2.5B+ Monthly PageViews

The World’s Largest Professional Network
259,000,000+

Use It All

•  Use all of the data you have
•  Why not store, process, and model all of it?
•  “The accuracy & nature of answers you get on large data sets can be completely different from what you see on small samples”
•  Not using it is losing competitive edge

Norvig, The Unreasonable Effectiveness of Data, 2013

Classic Justification

More Data Beats Better Algorithms

Banko and Brill, 2001

More Data Beats Better Algorithms

•  As data set size increases, your specific model and the tuning matters a lot less
•  Can worry less about sample size, biases, and generalizing
•  Spend your time on
    •  Exploratory Analysis
    •  Feature Engineering

Exploratory Analysis

•  With large amounts of data, insights and hypotheses present themselves
•  Group By And Count
•  With large amounts of data, you can worry less about the distribution being reflective of the population
•  Summary Statistics
•  Simple Correlations
•  Constantly Visualize
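The basic moves above (group by and count, summary statistics, simple correlations) can be sketched in a few lines of plain Python. This is only an illustration: the member records and field names below are made up, and at real scale the same operations would run on a distributed system rather than in memory.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical member records; field names and values are assumptions.
members = [
    {"industry": "Software", "connections": 420, "endorsements": 35},
    {"industry": "Software", "connections": 510, "endorsements": 48},
    {"industry": "Finance", "connections": 150, "endorsements": 9},
    {"industry": "Finance", "connections": 230, "endorsements": 14},
    {"industry": "Education", "connections": 90, "endorsements": 4},
]

# Group by and count: number of members per industry.
counts_by_industry = Counter(m["industry"] for m in members)

# Summary statistics for one numeric field.
connections = [m["connections"] for m in members]
summary = {
    "mean": mean(connections),
    "median": median(connections),
    "min": min(connections),
    "max": max(connections),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Simple correlation between two numeric fields.
endorsements = [m["endorsements"] for m in members]
r = pearson(connections, endorsements)

print(counts_by_industry.most_common())
print(summary)
print(round(r, 3))
```

Plotting the grouped counts or the two numeric fields against each other would be the natural next step for the "constantly visualize" point.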

Exploratory Analysis Across LinkedIn Members

•  Grouped by name letter length and title and counted
•  Noticed that name length is heavily correlated with industry
•  Able to start bootstrapping models
•  Quickly validate or invalidate a model hypothesis
•  Generalized the results into development of the title standardization models used today

Go Deep

•  Massive datasets lend themselves well to very granular demographic slicing or bucketing
•  Get a very strong sense for customer segments
•  Reduce the size of your data without losing too much information
•  Notice very specific trends that you can be confident are real
•  Personalize deeply
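One way to picture the slicing and bucketing above: collapse raw events into fine-grained segments and keep only a couple of aggregates per segment. The sketch below assumes made-up events of the form (age, industry, clicked) and a hypothetical `age_bucket` helper; it is not a description of how LinkedIn segments members.

```python
from collections import defaultdict

# Hypothetical raw events: (member_age, industry, clicked). All values are made up.
events = [
    (23, "Software", 1), (27, "Software", 0), (34, "Software", 1),
    (38, "Finance", 0), (41, "Finance", 1), (45, "Finance", 1),
    (29, "Finance", 0), (52, "Software", 0),
]


def age_bucket(age, width=10):
    """Map an age to the label of its bucket, e.g. 27 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Reduce every (age bucket, industry) segment to two numbers.
segments = defaultdict(lambda: {"events": 0, "clicks": 0})
for age, industry, clicked in events:
    key = (age_bucket(age), industry)
    segments[key]["events"] += 1
    segments[key]["clicks"] += clicked

# Per-segment click rate, computed from the reduced data only.
click_rate = {key: s["clicks"] / s["events"] for key, s in segments.items()}

for key in sorted(click_rate):
    print(key, click_rate[key])
```

With enough data, each segment still holds many events, which is what makes very specific trends trustworthy.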

Go Deep

Say LinkedIn wants to sell me something…

Keep Going

•  When operating with massive sets, combine several
•  Tells you more than each would individually

Large Datasets Allow More Creativity with Features

Mapping LinkedIn Skills, +1 to Edge Weight When Listed Concurrently
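The edge-weight rule on this slide can be read as building a skill co-occurrence graph: every time two skills appear on the same profile, the edge between them gains 1. A minimal sketch under that reading follows; the profiles are invented and the `neighbors` helper is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical member profiles, each a list of skills. The data is illustrative.
profiles = [
    ["Python", "Machine Learning", "Hadoop"],
    ["Python", "Machine Learning"],
    ["Hadoop", "Java"],
    ["Python", "Hadoop"],
]

# Undirected edge weights keyed by an alphabetically ordered skill pair.
edge_weights = Counter()
for skills in profiles:
    # Sort and de-duplicate so (a, b) and (b, a) count as the same edge.
    for a, b in combinations(sorted(set(skills)), 2):
        edge_weights[(a, b)] += 1  # +1 when the two skills are listed together


def neighbors(skill):
    """Skills connected to `skill`, strongest edge first, ties broken by name."""
    linked = []
    for (a, b), weight in edge_weights.items():
        if skill == a:
            linked.append((b, weight))
        elif skill == b:
            linked.append((a, weight))
    return sorted(linked, key=lambda item: (-item[1], item[0]))


print(edge_weights)
print(neighbors("Python"))
```

The resulting weights can serve directly as features, for example to suggest related skills or to cluster skills into groups.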

Can Your Infrastructure Hang?

First question…

Online or Offline?

If the problem domain can be scoped into an offline system, it usually should be

Appropriate When

•  Data is best modeled in transient data streams rather than persistent relations
•  Data relevance or freshness fades fast
•  Too much data to store (infra, latency etc) and must be tossed
•  News, Advertising, Gaming (A.I.), Stock Markets

Online or Offline?

Benefits

•  Instant Gratification
–  Immediate integration of data into modeling outcomes
–  Yahoo invented S4 to process user feedback in real-time to optimize search advertising ranking algorithms
•  Mine more
–  In some systems it’s only possible to use all of your data in an online setting because there is simply too much
•  Highly relevant now (matters for news)
•  Personalized + Real time = Great User Experience

Online or Offline?

Challenges

•  YOLO (You Only Learn Once).
•  Specific expertise
•  Evaluate/Interpret is Harder
–  YOLO makes it difficult to evaluate why a model is performing poorly, and inherently related, why a result is what it is
•  Difficult to maintain
–  Data changing, adapting to new features, latency, evaluation
•  Infrastructure that can support it. Supporting real time learning is a whole different ballgame
Big Data Tech is Young

Google Trends: Hadoop & NOSQL

LinkedIn Open Source Data Tech

Developing Bleeding Edge Tech is Great

…What About Using It?

It can be a pain to use…

As a user

High-level infrastructure needs

AB testing platform
Data/schema viewer
Workflow manager
Access
Modeling algorithms implementation

Is the system set up to iterate and test new models as fast as possible?
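Iterating quickly also means deciding quickly whether a new model actually beat the old one. One common way to read an AB test, shown here only as a sketch, is a two-proportion z-test on click-through rates. The talk does not say which test LinkedIn's platform uses, and the counts below are invented.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference in rates, B minus A."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both arms share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: arm A is the current model, arm B the new one.
z, p_value = two_proportion_z_test(
    successes_a=1000, n_a=20000,
    successes_b=1150, n_b=20000,
)
print(round(z, 2), p_value)
```

A platform that computes this automatically for every experiment removes one manual step from each model iteration.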

High-level LinkedIn Data Flow

Evaluating Models

CROWDSOURCE!!!

Is this real?

Are we using feedback?

Summary

•  Large-scale modeling
•  Isn’t easy but takes advantage of the large amounts of data we are storing
•  Sees noticeable increases in solution quality
•  More data beats better algorithms
•  Spend more time on exploratory analysis and feature engineering
•  Benefits from large scale data
•  Build infrastructure that lets you iterate and AB test as fast as possible
