This document discusses large-scale modeling and data analysis. It defines large-scale modeling as building models that process datasets too large for current tools and methods to handle easily. It gives examples of large-scale recommendation models at LinkedIn and discusses how more data allows better accuracy, deeper insights through exploration, and more flexible feature engineering. Challenges include building infrastructure that can handle the data volume and managing the complexities of online versus offline modeling.
Large Scale Modeling and Data Insights
1. Large Scale Modeling Overview
Ferris Jumah
Prediction Analytics Innovation Summit 2013
November 15th, 2013
2. Large Scale Modeling
• What does large scale modeling mean to you?
“Building models that consume and process data sets so large that it is difficult to use current modeling tools and methods”
4. LinkedIn News
• Anytime a user lands on their homepage, a few items from our news product are recommended to them
• This is powered by a large scale recommendation engine
• For every user, at LinkedIn Scale
5. The World’s Largest Professional Network: 259,000,000+
• 3M+ Company Pages
• 2 new Members per second
• 184M+ Monthly Unique Visitors
• 2.5B+ Monthly PageViews
6. Use It All
• Use all of the data you have
• Why not store, process, and model all of it?
• “The accuracy & nature of answers you get on large data sets can be completely different from what you see on small samples”
• Not using it is losing competitive edge
9. More Data Beats Better Algorithms
• As data set size increases, your specific model and its tuning matter a lot less
• Can worry less about sample size, biases, and generalizing
• Spend your time on
– Exploratory Analysis
– Feature Engineering
10. Exploratory Analysis
• With large amounts of data, insights and hypotheses present themselves
• Group By And Count
• With large amounts of data, you can worry less about the distribution being reflective of the population
• Summary Statistics
• Simple Correlations
• Constantly Visualize
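The three techniques named on this slide (group by and count, summary statistics, simple correlations) can be sketched with the Python standard library alone. This is a minimal illustration, not the tooling used in the talk: the records, column layout, and function names are all hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean, median, pstdev

# Hypothetical toy records: (industry, name_length, connections).
records = [
    ("software", 5, 320),
    ("software", 6, 410),
    ("finance", 7, 150),
    ("finance", 8, 180),
    ("software", 4, 290),
    ("healthcare", 9, 90),
]


def group_by_count(rows, key_index):
    """Count how many rows fall under each value of one column."""
    return Counter(row[key_index] for row in rows)


def summarize(values):
    """Return basic summary statistics for a numeric column."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": pstdev(values),  # population standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


counts = group_by_count(records, 0)
name_lengths = [r[1] for r in records]
connections = [r[2] for r in records]
stats = summarize(connections)
corr = pearson(name_lengths, connections)
```

On real data the same three calls would typically run inside a distributed query engine; the point here is only how little code a first exploratory pass needs.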
12. Exploratory Analysis Across LinkedIn Members
• Grouped by name letter length and title and counted
• Noticed that name length is heavily correlated with industry
• Able to start bootstrapping models
• Quickly validate or invalidate a model hypothesis
• Generalized the results into development of the title standardization models used today
13. Go Deep
• Massive datasets lend themselves well to very granular demographic slicing or bucketing
• Get a very strong sense for customer segments
• Reduce the size of your data without losing too much information
• Notice very specific trends that you can be confident are real
• Personalize deeply
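Bucketing as described above can be sketched as collapsing raw rows into per-segment totals, which shrinks the data while keeping the rates that matter for each segment. The event tuples, the ten-year age width, and the function names below are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical member events: (age, region, clicked), with clicked as 0 or 1.
events = [
    (23, "EMEA", 1),
    (27, "EMEA", 0),
    (34, "NA", 1),
    (38, "NA", 1),
    (41, "NA", 0),
    (52, "APAC", 0),
]


def age_bucket(age, width=10):
    """Map an age to a bucket label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def bucketize(rows):
    """Collapse raw rows into per-segment [impressions, clicks] totals."""
    segments = defaultdict(lambda: [0, 0])
    for age, region, clicked in rows:
        key = (age_bucket(age), region)
        segments[key][0] += 1
        segments[key][1] += clicked
    return dict(segments)


def click_rates(segments):
    """Compute the click-through rate for each segment."""
    return {key: clicks / shown for key, (shown, clicks) in segments.items()}


segments = bucketize(events)
rates = click_rates(segments)
```

With only a handful of rows per segment the rates are noisy; the slide's argument is that at massive scale even narrow segments hold enough rows for their trends to be trusted.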
14. Go Deep
Say LinkedIn wants to sell me something…
17. Keep Going
• When operating with massive sets, combine several
• Tells you more than each would individually
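Combining data sets can be sketched as a join on a shared key followed by an aggregate that neither set could answer alone. The two dictionaries, their field names, and the helper functions here are hypothetical stand-ins, not LinkedIn data or APIs.

```python
# Hypothetical datasets keyed by member id.
profiles = {
    1: {"industry": "software"},
    2: {"industry": "finance"},
    3: {"industry": "software"},
}
activity = {
    1: {"articles_read": 12},
    2: {"articles_read": 3},
    4: {"articles_read": 7},
}


def inner_join(left, right):
    """Merge the records of ids that are present in both datasets."""
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}


def reads_by_industry(rows):
    """Total articles read per industry, a question that needs both datasets."""
    totals = {}
    for row in rows.values():
        totals[row["industry"]] = totals.get(row["industry"], 0) + row["articles_read"]
    return totals


joined = inner_join(profiles, activity)
totals = reads_by_industry(joined)
```

Members 3 and 4 drop out of the inner join because each appears in only one source; whether to keep such rows is a design choice that depends on the question being asked.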
27. Online or Offline?
If the problem domain can be scoped into an offline system, it usually should be
Appropriate When
• Data is best modeled in transient data streams rather than persistent relations
• Data relevance or freshness fades fast
• Too much data to store (infra, latency etc.) and must be tossed
• News, Advertising, Gaming (A.I.), Stock Markets
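When a stream carries too much data to store, summary statistics can still be kept up to date as each value arrives and is then discarded. The sketch below uses Welford's online algorithm for mean and variance. The slides do not name a specific technique, so this is one illustrative choice, and the class name `RunningStats` and the sample values are hypothetical.

```python
class RunningStats:
    """Running mean and population variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one observation into the statistics; x is not retained."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.n if self.n else 0.0


running = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    running.update(value)
```

Memory use stays constant no matter how many values pass through, which is the property that makes this kind of update suitable for data that must be tossed.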
28. Online or Offline?
Benefits
• Instant Gratification
– Immediate integration of data into modeling outcomes
– Yahoo invented S4 to process user feedback in real-time to optimize search advertising ranking algorithms
• Mine more
– In some systems it’s only possible to use all of your data in an online setting because there is simply too much
• Highly relevant now (matters for news)
• Personalized + Real time = Great User Experience
29. Online or Offline?
Challenges
• YOLO (You Only Learn Once).
• Specific expertise
• Evaluate/Interpret is Harder
– YOLO makes it difficult to evaluate why a model is performing poorly, and inherently related, why a result is what it is
• Difficult to maintain
– Data changing, adapting to new features, latency, evaluation
• Infrastructure that can support it. Supporting real time learning is a whole different ballgame
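The "You Only Learn Once" constraint can be made concrete with a model that updates on each example exactly once and never stores it. The sketch below is a plain logistic regression trained by stochastic gradient descent. It is an assumption chosen for illustration, not the system described in the talk, and the synthetic `click_stream` generator is hypothetical.

```python
import math
import random


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Predicted probability that the label is 1."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Take one gradient step on a single (features, label) pair."""
        error = self.predict_proba(x) - y
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error


def click_stream(n, seed=0):
    """Hypothetical stream: label is 1 when the first feature exceeds the second."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.random(), rng.random()]
        yield x, 1 if x[0] > x[1] else 0


model = OnlineLogisticRegression(n_features=2)
for features, label in click_stream(5000):
    model.learn_one(features, label)  # each example is seen once, then dropped
```

Because past examples are gone, there is nothing to replay when the model misbehaves, which is the evaluation and interpretation difficulty the slide describes.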
40. Summary
• Large-scale modeling
– Isn’t easy but takes advantage of the large amounts of data we are storing
– Sees noticeable increases in solution quality
• More data beats better algorithms
• Spend more time on exploratory analysis and feature engineering
• Benefits from large scale data
• Build infrastructure that lets you iterate and AB test as fast as possible
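Fast A/B iteration depends on a quick way to judge whether a difference between two variants is likely to be real. One common choice, assumed here and not taken from the talk, is a two-sided two-proportion z-test. The function name and the conversion counts below are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical experiment: variant B converts 260 of 10,000 versus 200 of 10,000.
z_score, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
```

The normal approximation behind this test improves as sample sizes grow, so it suits the large-traffic setting the deck describes; small experiments may call for an exact test instead.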