Large Scale Modeling Overview
Ferris Jumah
Predictive Analytics Innovation Summit 2013
November 15th, 2013
  
Large Scale Modeling
•  What does large scale modeling mean to you?
“Building models that consume and process data sets so large that it is difficult to use current modeling tools and methods”
  
LinkedIn News
•  Anytime a user lands on their homepage, a few items from our news product are recommended to them
•  This is powered by a large scale recommendation engine
•  For every user, at LinkedIn Scale
  
3M+ Company Pages
2 new Members per second
184M+ Monthly Unique Visitors
2.5B+ Monthly PageViews
The World’s Largest Professional Network: 259,000,000+
  
Use It All
•  Use all of the data you have
•  Why not store, process, and model all of it?
•  “The accuracy & nature of answers you get on large data sets can be completely different from what you see on small samples”
•  Not using it is losing competitive edge
Norvig, The Unreasonable Effectiveness of Data, 2013
  
Classic Justification
More Data Beats Better Algorithms
Banko and Brill, 2001
  
More Data Beats Better Algorithms
•  As data set size increases, your specific model and its tuning matter a lot less
•  Can worry less about sample size, biases, and generalizing
•  Spend your time on
   •  Exploratory Analysis
   •  Feature Engineering
  
Exploratory Analysis
•  With large amounts of data, insights and hypotheses present themselves
•  Group By And Count
•  With large amounts of data, you can worry less about the distribution being reflective of the population
•  Summary Statistics
•  Simple Correlations
•  Constantly Visualize
	
  
	
  
	
  
Exploratory Analysis Across LinkedIn Members
•  Grouped by name letter length and title and counted
•  Noticed that name length is heavily correlated with industry
•  Able to start bootstrapping models
•  Quickly validate or invalidate a model hypothesis
•  Generalized the results into development of the title standardization models used today
	
  
	
  
	
  
Go Deep
•  Massive datasets lend themselves well to very granular demographic slicing or bucketing
•  Get a very strong sense for customer segments
•  Reduce the size of your data without losing too much information
•  Notice very specific trends that you can be confident are real
•  Personalize deeply
  
Go Deep
Say LinkedIn wants to sell me something…
  
Keep Going
•  When operating with massive sets, combine several
•  Tells you more than each would individually
  
Pitfalls Still Apply
Simpson’s paradox
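Simpson's paradox is easiest to see with numbers: one variant can win inside every segment and still lose once the segments are pooled. The counts below are illustrative values assumed for this sketch, and the segment and variant names are hypothetical; none of them come from the talk.

```python
# Illustrative counts (assumed for this sketch, not from the talk):
# (successes, trials) for two model variants, split by user segment.
results = {
    "A": {"new_users": (81, 87), "returning_users": (192, 263)},
    "B": {"new_users": (234, 270), "returning_users": (55, 80)},
}

def rate(successes, trials):
    """Success rate for one (successes, trials) pair."""
    return successes / trials

def overall(variant):
    """Success rate for a variant after pooling all of its segments."""
    successes = sum(s for s, _ in results[variant].values())
    trials = sum(t for _, t in results[variant].values())
    return rate(successes, trials)

for segment in ("new_users", "returning_users"):
    a = rate(*results["A"][segment])
    b = rate(*results["B"][segment])
    print(f"{segment}: A={a:.3f} B={b:.3f}")
print(f"overall: A={overall('A'):.3f} B={overall('B'):.3f}")
```

Variant A has the higher rate in both segments, yet B has the higher pooled rate because the two variants saw very different mixes of segments. More data does not remove this effect; only slicing by the confounding segment reveals it.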
  
Large Datasets Allow More Creativity with Features
  
Mapping LinkedIn Skills, +1 to Edge Weight When Listed Concurrently
Feature Engineering
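The skills mapping above amounts to a co-occurrence graph: every time two skills appear on the same profile, the edge between them gains one unit of weight. The following is a minimal sketch under that reading; `build_skill_graph` and the sample profiles are hypothetical and stand in for whatever LinkedIn actually runs.

```python
from collections import Counter
from itertools import combinations

def build_skill_graph(profiles):
    """Return edge weights: +1 for each profile that lists both skills."""
    edges = Counter()
    for skills in profiles:
        # Sort and deduplicate so (a, b) and (b, a) map to the same edge.
        for a, b in combinations(sorted(set(skills)), 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical skill lists, one per member profile.
profiles = [
    ["python", "hadoop", "statistics"],
    ["python", "statistics"],
    ["hadoop", "pig"],
]
graph = build_skill_graph(profiles)
for (a, b), weight in graph.most_common():
    print(a, b, weight)
```

The resulting edge weights can then serve as features, for example as a similarity signal between skills.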
  
Can Your Infrastructure Hang?
First question…
Online or Offline?
  
If the problem domain can be scoped into an offline system, it usually should be
Online Is Appropriate When
•  Data is best modeled in transient data streams rather than persistent relations
•  Data relevance or freshness fades fast
•  There is too much data to store (infra, latency, etc.), so it must be tossed
•  News, Advertising, Gaming (A.I.), Stock Markets
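When data arrives as a stream and cannot be stored, statistics have to be updated one event at a time. One common way to do this is Welford's online algorithm, sketched below; it is offered as an example of the streaming style, not as something the talk prescribes, and the `RunningStats` class and latency values are hypothetical.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance without storing the stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; returns 0.0 when fewer than two values were seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for click_latency_ms in (120, 80, 100, 140, 60):  # hypothetical event stream
    stats.update(click_latency_ms)  # each event is seen once, then discarded
print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the property that matters when the raw events must be tossed.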
  
Online or Offline?
Benefits
•  Instant Gratification
   –  Immediate integration of data into modeling outcomes
   –  Yahoo invented S4 to process user feedback in real-time to optimize search advertising ranking algorithms
•  Mine more
   –  In some systems it’s only possible to use all of your data in an online setting because there is simply too much
•  Highly relevant now (matters for news)
•  Personalized + Real time = Great User Experience
  
Online or Offline?
Challenges
•  YOLO (You Only Learn Once)
•  Specific expertise
•  Evaluate/Interpret is Harder
   –  YOLO makes it difficult to evaluate why a model is performing poorly and, relatedly, why a result is what it is
•  Difficult to maintain
   –  Data changing, adapting to new features, latency, evaluation
•  Infrastructure that can support it. Supporting real time learning is a whole different ballgame
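The "You Only Learn Once" constraint can be made concrete with a model that updates on each example as it arrives and never revisits it. Below is a minimal sketch using logistic regression trained by stochastic gradient descent; the class name, the single feature, and the synthetic click stream are all assumptions made for illustration.

```python
import math

class OnlineLogisticRegression:
    """Logistic regression trained by SGD, one example at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Predicted probability that the label is 1."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Update on a single (features, label) pair; the pair is not stored."""
        error = self.predict_proba(x) - y  # gradient of log loss w.r.t. z
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error

# Hypothetical click stream: feature = [article matches member industry],
# label = 1 if the member clicked.
stream = [([1.0], 1), ([0.0], 0)] * 200
model = OnlineLogisticRegression(n_features=1)
for features, clicked in stream:
    model.learn_one(features, clicked)
print(model.predict_proba([1.0]), model.predict_proba([0.0]))
```

Because no example is kept, there is nothing to replay when the model misbehaves, which is one reason evaluation and interpretation are harder in the online setting.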
  
Big Data Tech is Young
Google Trends: Hadoop & NoSQL

LinkedIn Open Source Data Tech

Developing Bleeding Edge Tech is Great
…What About Using It?

It can be a pain to use…
As a user
  
High-level infrastructure needs
•  AB testing platform
•  Data/schema viewer
•  Workflow manager
•  Access
•  Modeling algorithms implementation
Is the system set up to iterate and test new models as fast as possible?
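One small building block of an AB testing platform is stable assignment of users to variants, so that a new model can be put in front of a fixed slice of traffic quickly. The sketch below uses a hash of the experiment name and user id; this is a common pattern assumed here for illustration, not a description of LinkedIn's platform, and `assign_variant` and the experiment name are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for a given experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as an integer and spread users evenly over variants.
    return variants[int(digest, 16) % len(variants)]

# Hypothetical experiment name and user ids.
assignments = [assign_variant(uid, "news-ranker-v2") for uid in range(10_000)]
share = assignments.count("treatment") / len(assignments)
print(f"treatment share: {share:.3f}")
```

Because the assignment depends only on the experiment name and the user id, no lookup table is needed and a user sees the same variant on every visit.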
  	
  
High-level LinkedIn Data Flow

Evaluating Models
CROWDSOURCE!!!
Is this real?
Are we using feedback?
  
Summary
•  Large-scale modeling
   •  Isn’t easy but takes advantage of the large amounts of data we are storing
   •  Sees noticeable increases in solution quality
•  More data beats better algorithms
   •  Spend more time on exploratory analysis and feature engineering
•  Benefits from large scale data
   •  Build infrastructure that lets you iterate and AB test as fast as possible
  
	
  
rumah@linkedin.com	
  
	
  

 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 

Large Scale Modeling and Data Insights

  • 1. Large Scale Modeling Overview • Ferris Jumah • Prediction Analytics Innovation Summit 2013 • November 15th, 2013
  • 2. Large Scale Modeling • What does large scale modeling mean to you? “Building models that consume and process data sets so large that it is difficult to use current modeling tools and methods”
  • 4. LinkedIn News • Anytime a user lands on their homepage, a few items from our news product are recommended to them • This is powered by a large scale recommendation engine • For every user, at LinkedIn Scale
  • 5. The World’s Largest Professional Network: 259,000,000+ • 3M+ Company Pages • 2 new Members per second • 184M+ Monthly Unique Visitors • 2.5B+ Monthly PageViews
  • 6. Use It All • Use all of the data you have • Why not store, process, and model all of it? • “The accuracy & nature of answers you get on large data sets can be completely different from what you see on small samples” • Not using it is losing competitive edge
  • 7. Classic Justification: Norvig, The Unreasonable Effectiveness of Data, 2013
  • 8. More Data Beats Better Algorithms (Banko and Brill, 2001)
  • 9. More Data Beats Better Algorithms • As data set size increases, your specific model and the tuning matters a lot less • Can worry less about sample size, biases, and generalizing • Spend your time on • Exploratory Analysis • Feature Engineering
  • 10. Exploratory Analysis • With large amounts of data, insights and hypotheses present themselves • Group By And Count • With large amounts of data, you can worry less about the distribution being reflective of the population • Summary Statistics • Simple Correlations • Constantly Visualize
  • 11. Exploratory Analysis Across LinkedIn Members
  • 12. Exploratory Analysis Across LinkedIn Members • Grouped by name letter length and title and counted • Noticed that name length is heavily correlated with industry • Able to start bootstrapping models • Quickly validate or invalidate a model hypothesis • Generalized the results into development of the title standardization models used today
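The "group by and count" step from the exploratory analysis slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the `members` records, their field names, and the helpers `group_by_and_count` and `top_share` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; the real member data and schema are not shown in the talk.
members = [
    {"title": "Software Engineer", "industry": "Internet"},
    {"title": "Software Engineer", "industry": "Internet"},
    {"title": "Software Engineer", "industry": "Finance"},
    {"title": "Account Executive", "industry": "Finance"},
    {"title": "Account Executive", "industry": "Finance"},
]


def group_by_and_count(records, group_key, count_key):
    """Group records by group_key and count the values of count_key per group."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[group_key]][record[count_key]] += 1
    return counts


def top_share(counter):
    """Return the most common value and the fraction of the group it covers."""
    value, n = counter.most_common(1)[0]
    return value, n / sum(counter.values())


counts = group_by_and_count(members, "title", "industry")
for title, industry_counts in counts.items():
    print(title, dict(industry_counts), top_share(industry_counts))
```

A group whose top share is close to 1.0 suggests one field strongly predicts the other, which is the kind of quick signal the slides describe using to validate or invalidate a model hypothesis.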
  • 13. Go Deep • Massive datasets lend themselves well to very granular demographic slicing or bucketing • Get a very strong sense for customer segments • Reduce the size of your data without losing too much information • Notice very specific trends that you can be confident are real • Personalize deeply
  • 14. Go Deep • Say LinkedIn wants to sell me something…
  • 15.
  • 16.
  • 17. Keep Going • When operating with massive sets, combine several • Tells you more than each would individually
  • 18.
  • 19.
  • 20.
  • 23. Large Datasets Allow More Creativity with Features
  • 24. Mapping LinkedIn Skills, +1 to Edge Weight When Listed Concurrently
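The edge-weighting rule on the skills slide (add 1 to an edge each time two skills are listed together) can be sketched as a co-occurrence count. This is a minimal sketch assuming each profile is a plain list of skill names; the `profiles` data and the `build_skill_graph` helper are hypothetical and are not taken from the talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical skill lists, one per member profile.
profiles = [
    ["Python", "Hadoop", "Machine Learning"],
    ["Python", "Machine Learning"],
    ["Hadoop", "Pig"],
]


def build_skill_graph(skill_lists):
    """Return a Counter mapping an undirected skill pair to its edge weight."""
    edge_weights = Counter()
    for skills in skill_lists:
        # Deduplicate and sort so each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(skills)), 2):
            edge_weights[(a, b)] += 1  # +1 each time the two skills are listed together
    return edge_weights


graph = build_skill_graph(profiles)
for (a, b), weight in graph.most_common():
    print(f"{a} -- {b}: {weight}")
```

Storing each pair in sorted order keeps the graph undirected, so ("Python", "Hadoop") and ("Hadoop", "Python") accumulate into the same edge.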
  • 26. Can Your Infrastructure Hang? First question…
  • 27. Online or Offline? If the problem domain can be scoped into an offline system, it usually should be. Online is appropriate when • Data is best modeled in transient data streams rather than persistent relations • Data relevance or freshness fades fast • There is too much data to store (infra, latency, etc.) and it must be tossed • News, Advertising, Gaming (A.I.), Stock Markets
  • 28. Online or Offline? Benefits • Instant Gratification – Immediate integration of data into modeling outcomes – Yahoo invented S4 to process user feedback in real-time to optimize search advertising ranking algorithms • Mine more – In some systems it’s only possible to use all of your data in an online setting because there is simply too much • Highly relevant now (matters for news) • Personalized + Real time = Great User Experience
  • 29. Online or Offline? Challenges • YOLO (You Only Learn Once) • Specific expertise • Evaluating and interpreting is harder – YOLO makes it difficult to evaluate why a model is performing poorly and, relatedly, why a result is what it is • Difficult to maintain – Data changing, adapting to new features, latency, evaluation • Infrastructure that can support it. Supporting real time learning is a whole different ballgame
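The "You Only Learn Once" constraint from the online-versus-offline slides can be illustrated with a learner that updates on each event as it arrives and never revisits it. This is a minimal sketch using logistic regression with stochastic gradient descent, chosen here only for illustration; the talk does not name a specific algorithm, and the `OnlineLogisticRegression` class and the toy `stream` are hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Update on a single (x, y) pair; the example can then be discarded."""
        error = self.predict_proba(x) - y  # gradient of the log loss w.r.t. z
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


# Hypothetical stream: feature 0 signals a click (label 1), feature 1 does not.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 50

model = OnlineLogisticRegression(n_features=2)
for x, y in stream:
    model.learn_one(x, y)  # each event is seen exactly once, in arrival order

print(model.predict_proba([1.0, 0.0]), model.predict_proba([0.0, 1.0]))
```

Because no event is stored, there is no training set to replay when the model misbehaves, which is one way to read the evaluation and interpretation challenge listed on the slide.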
  • 30. Big Data Tech is Young
  • 31. Google Trends: Hadoop & NoSQL
  • 32. LinkedIn Open Source Data Tech
  • 33. Developing Bleeding Edge Tech is Great… What About Using It?
  • 34. It can be a pain to use… As a user
  • 35. High-level infrastructure needs • AB testing platform • Data/schema viewer • Workflow manager • Access • Modeling algorithms implementation
  • 36. Is the system set up to iterate and test new models as fast as possible?
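One small piece of an AB testing platform, assigning each member to a model variant, can be sketched with a hash so that the assignment is stable across requests without storing any state. This is a common general technique and an assumption here, not a description of LinkedIn's platform; the `assign_variant` helper and the experiment name `news-ranker-v2` are hypothetical.

```python
import hashlib


def assign_variant(member_id, experiment, variants=("control", "treatment")):
    """Deterministically map a member to a variant of the named experiment."""
    key = f"{experiment}:{member_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer bucket index.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Hypothetical member IDs 0..9999 assigned to a hypothetical experiment.
assignments = [assign_variant(member_id, "news-ranker-v2") for member_id in range(10000)]
print(assignments.count("treatment") / len(assignments))
```

Including the experiment name in the hashed key means the same member can land in different buckets for different experiments, so a new model can be tested without reusing the split from a previous test.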
  • 39. Evaluating Models • CROWDSOURCE!!! • Is this real? • Are we using feedback?
  • 40. Summary • Large-scale modeling • Isn’t easy but takes advantage of the large amounts of data we are storing • Sees noticeable increases in solution quality • More data beats better algorithms • Spend more time on exploratory analysis and feature engineering • Benefits from large scale data • Build infrastructure that lets you iterate and AB test as fast as possible