Hadoop & Cloud @ Netflix:
Taming the Social Data Firehose

06/13/2012

Mohammad Sabah
Senior Data Scientist (@mohammad_sabah)

Algorithms

Everything is personalized

Data / User
§  Plays
§  Behavior
§  Geo-Information
§  Time
§  Ratings
§  Searches

Big Data @Netflix
§  25M+ subscribers
§  Ratings: 4M/day
§  Searches: 3M/day
§  Plays: 30M/day
§  Impressions
§  Device info
§  Metadata
§  Social

Interesting Tidbit
§  2B hours streamed in Q4 2011
§  75% select movies based on recommendations
§  Moral: We need to scale algorithms.

Technology




Modeling
§  Markov Chains
§  Collaborative Filtering
§  Large-scale Matching
§  LSA
§  Clustering
§  Row Selection
§  Query Categorization
§  Auto-tagging
§  Sentiment Analysis

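Markov Chain: Example I

[Figure: state-transition diagram over three states (Bull Market, Bear Market, Recession). Edge labels: 0.90, 0.08, 0.80, 0.15, 0.02, 0.25, 0.25, 0.05, 0.50.]

Markov Chain: Example II

[Figure: state-transition diagram; state names not recoverable. Edge labels: 0.8, 0.3, 0.3, 0.4, 0.3, 0.2, 0.7.]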




Markov Chain: Formal Definition

§  A Markov chain describes a discrete-time stochastic process over a set of states
                S = {s1, s2, …, sn}
    according to a transition probability matrix P = {Pij}
  §  Pij = probability of moving to state j when at state i
§  Uses temporal ordering to estimate relatedness
§  The future depends only on today, not on the past
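
The definition above can be made concrete with a small sketch. It uses the three states from Example I; which edge label belongs to which transition is not recoverable from the diagram, so the assignment below is an assumption, chosen so that every row of P sums to 1. The function names are illustrative only.

```python
# Hypothetical edge assignment: the diagram's layout is lost, so which label
# sits on which edge is assumed here, chosen so that each row sums to 1.
STATES = ["Bull Market", "Bear Market", "Recession"]
P = [
    [0.90, 0.08, 0.02],  # from Bull Market
    [0.15, 0.80, 0.05],  # from Bear Market
    [0.25, 0.25, 0.50],  # from Recession
]


def is_row_stochastic(matrix, tol=1e-9):
    """Each row of a transition matrix must be a probability distribution."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def step(dist, matrix):
    """Advance a distribution over states by one time step: dist' = dist * P."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def after_steps(dist, matrix, k):
    """Apply `step` k times; only the current distribution is ever needed."""
    for _ in range(k):
        dist = step(dist, matrix)
    return dist


if __name__ == "__main__":
    start = [1.0, 0.0, 0.0]  # start in Bull Market with certainty
    for name, p in zip(STATES, after_steps(start, P, 2)):
        print(f"{name}: {p:.4f}")
```

Note that `after_steps` never looks at earlier distributions, which is the "future depends only on today" property in code form.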


The Math
§  Time Series Aggregation
   <u1, m1, t1>, <u1, m2, t2>, <u2, m3, t3>, …

   <u1> => <m1, t1>, <m2, t2>, <m3, t3>, …
§  Co-occurrence
   n(   ) = 24,000   n(   ) = 30,000

§  Transition Probability
   p(   ) = 0.8

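
The arguments of n( ) and p( ) on this slide were not preserved in the extraction. One reading that is consistent with the numbers, assuming the first count is the number of users who played title B after title A and the second is the number of users who played A at all (A and B are placeholder names, not from the slide):

```latex
p(A \to B) \;=\; \frac{n(A \to B)}{n(A)} \;=\; \frac{24{,}000}{30{,}000} \;=\; 0.8
```
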
Baseline Implementation & Inefficiencies

Baseline implementation
§  RDBMS/DW-based
§  Stored procedures
§  Once a week (weekend)

Inefficiencies
§  SQL Limitation
§  Expensive Copy
§  Does not exploit inherent parallelism
§  Does not scale well (region, models)
§  4B+ rows – run out of memory/space
§  Convoluted Joins (maintenance nightmare!)

MapReduce Implementation - I
   §  Exploits the inherent parallelism in the algorithm.
   §  Scale: 25M * 50K (* 50K) ~ 100B+ keys
   §  Time Series Aggregation

Input:    U1, T1, M1
          U2, T2, M3
          U3, T3, M1
          U1, T3, M5
Split:    U1, T1, M1  |  U1, T3, M5  |  U2, T2, M3
Map:      U1=><T1,M1>  |  U1=><T3,M5>  |  U2=><T2,M3>
Shuffle:  U1 => <T3, M5>, U1 => <T1, M1>, …
          U2 => <T2, M3>, …
Reduce:   U1 => <T1, M1>, <T3, M5>, …
          U2 => <T2, M3>, …
Result:   U1=><T1,M1>,…
          U2=><T2,M3>,…
          U3=><T3,M4> …
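
The stages above can be imitated in a few lines of plain Python. This is a single-process sketch of the map, shuffle, and reduce steps, not the Hadoop job from the talk; the function names and the integer timestamps are assumptions made for illustration.

```python
from collections import defaultdict


def map_play(record):
    """Map: key each (user, time, movie) play event by its user."""
    user, time, movie = record
    return user, (time, movie)


def shuffle(pairs):
    """Shuffle: group mapped values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_user(user, plays):
    """Reduce: order one user's plays by time to form a time series."""
    return user, sorted(plays)


def aggregate(records):
    """Run map, shuffle, and reduce in sequence over all records."""
    grouped = shuffle(map_play(r) for r in records)
    return dict(reduce_user(u, plays) for u, plays in grouped.items())


if __name__ == "__main__":
    # Hypothetical events; integer timestamps stand in for T1, T2, T3.
    events = [("U1", 1, "M1"), ("U2", 2, "M3"), ("U3", 3, "M1"), ("U1", 3, "M5")]
    for user, series in sorted(aggregate(events).items()):
        print(user, "=>", series)
```

Because each user's series is built independently of every other user's, the reduce step can run in parallel across users.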

MapReduce Implementation - II

   §  Transition Probability Matrix

Input:    U1 => T1,M1,…
          U2 => T2,M1,…
          U3 => T3,M3,…
Split:    U1=>T1,M1,…  |  U2=>T2,M1,…  |  U3=>T3,M3,…
Map:      M1,M2=>1  M1,M3=>1  |  M1,M3=>1  M2,M3=>1  |  M2,M3=>1  M1,M3=>1
Shuffle:  M1,M3=>1  M1,M3=>1  M1,M3=>1
          M2,M3=>1  M2,M3=>1
Reduce:   M1,M3=>3
          M2,M3=>2
Result:   M1,M2=>.2
          M1,M3=>.3
          M2,M3=>.5
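
A minimal single-process sketch of this second job follows. It counts consecutive title pairs per user and then normalizes the counts into probabilities. The slide does not show which denominator produces its result values, so dividing by the total outgoing count of the first title is an assumption, as are all function names.

```python
from collections import Counter, defaultdict


def map_transitions(series):
    """Map: emit ((prev, next), 1) for each consecutive pair in one user's
    time-ordered list of titles."""
    return [((a, b), 1) for a, b in zip(series, series[1:])]


def reduce_counts(pairs):
    """Reduce: sum the 1s emitted for each (prev, next) key."""
    counts = Counter()
    for key, one in pairs:
        counts[key] += one
    return counts


def to_probabilities(counts):
    """Normalize each row so that P[a][b] = n(a -> b) / sum_x n(a -> x).
    The denominator is an assumption; the slide does not show it."""
    row_totals = Counter()
    for (a, _), n in counts.items():
        row_totals[a] += n
    matrix = defaultdict(dict)
    for (a, b), n in counts.items():
        matrix[a][b] = n / row_totals[a]
    return dict(matrix)


def transition_matrix(user_series):
    """Chain the map and reduce steps over every user's time series."""
    emitted = [kv for s in user_series.values() for kv in map_transitions(s)]
    return to_probabilities(reduce_counts(emitted))
```

For example, `transition_matrix({"U1": ["M1", "M2", "M3"], "U2": ["M1", "M3"]})` yields one row per title that has at least one successor, stored sparsely as nested dictionaries rather than as a dense N * N array.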



In a Nutshell
§  You end up with an N * N matrix

                    0     0.3   …   0.7

                    0.3   0     …   0.7

                    …

                    0.2   0.1   …   0




But…there is a catch!




Solution!

§  Odds Ratio



§  Optimizations
  §  Decay
  §  Reward
  §  In-Window
  §  Noise


Markov Chain Migration Summary

 RDBMS/DW                                              | Hadoop
 Limited by SQL syntax and semantics                   | Can be arbitrarily complex
 Expensive data copy from data source to data center   | Data copy avoided
 Does not scale to new models and regions              | Scales beautifully.
 Maintenance nightmare (stored procedures +            | Easy to maintain (written in a high-level
 convoluted joins)                                     | language, e.g. Java, Pig)
 Resource constraints                                  | No special handling needed.

Other Algorithms & Challenges



                       Entity         Forms
                       Star Trek      strtrek, startrek, start trek, star trek, star treck
                       South Park     southpark, sothpark, south parl, souh park
                       Doctor Who     docter who, doctor wh, docot who, doctor who:
                       Prison Break   prision break, prison brake, prison breal
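
As a rough illustration of the challenge in this table, the sketch below maps a misspelled query to the closest canonical title using Python's standard `difflib` module. The slide does not say how these forms are actually resolved, so this is only one simple approach; the helper names and the 0.6 similarity cutoff are assumptions.

```python
import difflib

# Canonical titles from the table above.
ENTITIES = ["Star Trek", "South Park", "Doctor Who", "Prison Break"]


def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def resolve(query, entities=ENTITIES, cutoff=0.6):
    """Return the entity whose normalized name is closest to the query,
    or None if nothing scores at least `cutoff` (a hypothetical threshold)."""
    by_key = {normalize(name): name for name in entities}
    matches = difflib.get_close_matches(
        normalize(query), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[matches[0]] if matches else None
```

A call such as `resolve("sothpark")` returns the canonical title when one is similar enough, and `None` otherwise. At the query volumes quoted earlier, a pairwise comparison like this would need to be batched or indexed rather than run per query.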




§  Think Parallel!
§  Optimize
§  ML + Hadoop
§  Visualize
§  Experiment
§  Bucket Test
§  Iterative Processing


Big Data + Hadoop + Machine Learning => Great Customer Experience!

I HAD AN IDEA
I BUILT IT
I PUSHED IT TO TEST
THE TEST WAS POSITIVE
I PUSHED IT LIVE!

We’re hiring!

@mohammad_sabah
msabah@netflix.com

Mais conteúdo relacionado

Semelhante a Hadoop and Cloud at Netflix

Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012Ted Dunning
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time togetherTed Dunning
 
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012Chris Richardson
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time TogetherMapR Technologies
 
SEGMENTATION OF POLARIMETRIC SAR DATA WITH A MULTI-TEXTURE PRODUCT MODEL
SEGMENTATION OF POLARIMETRIC SAR DATA WITH A MULTI-TEXTURE PRODUCT MODELSEGMENTATION OF POLARIMETRIC SAR DATA WITH A MULTI-TEXTURE PRODUCT MODEL
SEGMENTATION OF POLARIMETRIC SAR DATA WITH A MULTI-TEXTURE PRODUCT MODELgrssieee
 
Time series data mining techniques
Time series data mining techniquesTime series data mining techniques
Time series data mining techniquesShanmukha S. Potti
 
Nearest Neighbor Customer Insight
Nearest Neighbor Customer InsightNearest Neighbor Customer Insight
Nearest Neighbor Customer InsightMapR Technologies
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
What's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutMapR Technologies
 
Introducing MERLIN_3.0.pptx
Introducing MERLIN_3.0.pptxIntroducing MERLIN_3.0.pptx
Introducing MERLIN_3.0.pptxssuser716de5
 
Molecular models, threads and you
Molecular models, threads and youMolecular models, threads and you
Molecular models, threads and youJiahao Chen
 
Holistic modelling of mineral processing plants a practical approach
Holistic modelling of mineral processing plants   a practical approachHolistic modelling of mineral processing plants   a practical approach
Holistic modelling of mineral processing plants a practical approachBasdew Rooplal
 
meng.ppt
meng.pptmeng.ppt
meng.pptaozcan1
 
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on MulticoreMap-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicoreillidan2004
 
Scaled Eigen Appearance and Likelihood Prunning for Large Scale Video Duplica...
Scaled Eigen Appearance and Likelihood Prunning for Large Scale Video Duplica...Scaled Eigen Appearance and Likelihood Prunning for Large Scale Video Duplica...
Scaled Eigen Appearance and Likelihood Prunning for Large Scale Video Duplica...United States Air Force Academy
 

Semelhante a Hadoop and Cloud at Netflix (20)

Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012
 
Ffst
FfstFfst
Ffst
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Shuronr
ShuronrShuronr
Shuronr
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time together
 
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
SEGMENTATION OF POLARIMETRIC SAR DATA WITH A MULTI-TEXTURE PRODUCT MODEL
SEGMENTATION OF POLARIMETRIC SAR DATA WITH A MULTI-TEXTURE PRODUCT MODELSEGMENTATION OF POLARIMETRIC SAR DATA WITH A MULTI-TEXTURE PRODUCT MODEL
SEGMENTATION OF POLARIMETRIC SAR DATA WITH A MULTI-TEXTURE PRODUCT MODEL
 
Time series data mining techniques
Time series data mining techniquesTime series data mining techniques
Time series data mining techniques
 
Nearest Neighbor Customer Insight
Nearest Neighbor Customer InsightNearest Neighbor Customer Insight
Nearest Neighbor Customer Insight
 
Crocotta R&D - Virtual Universe
Crocotta R&D - Virtual UniverseCrocotta R&D - Virtual Universe
Crocotta R&D - Virtual Universe
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
What's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache Mahout
 
Introducing MERLIN_3.0.pptx
Introducing MERLIN_3.0.pptxIntroducing MERLIN_3.0.pptx
Introducing MERLIN_3.0.pptx
 
Molecular models, threads and you
Molecular models, threads and youMolecular models, threads and you
Molecular models, threads and you
 
Hidden Markov Model
Hidden Markov Model Hidden Markov Model
Hidden Markov Model
 
Holistic modelling of mineral processing plants a practical approach
Holistic modelling of mineral processing plants   a practical approachHolistic modelling of mineral processing plants   a practical approach
Holistic modelling of mineral processing plants a practical approach
 
meng.ppt
meng.pptmeng.ppt
meng.ppt
 
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on MulticoreMap-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore
 
Scaled Eigen Appearance and Likelihood Prunning for Large Scale Video Duplica...
Scaled Eigen Appearance and Likelihood Prunning for Large Scale Video Duplica...Scaled Eigen Appearance and Likelihood Prunning for Large Scale Video Duplica...
Scaled Eigen Appearance and Likelihood Prunning for Large Scale Video Duplica...
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Último (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Hadoop and Cloud at Netflix

  • 1. Hadoop & Cloud @ Netflix: Taming the Social Data Firehose. 06/13/2012. Mohammad Sabah, Senior Data Scientist (@mohammad_sabah)
  • 2.
  • 3. Algorithms: Everything is personalized
  • 4. Data / User
      § Plays
      § Behavior
      § Geo-Information
      § Time
      § Ratings
      § Searches
  • 5. Big Data @Netflix
      § 25M+ subscribers
      § Ratings: 4M/day
      § Searches: 3M/day
      § Plays: 30M/day
      § Impressions
      § Device info
      § Metadata
      § Social
  • 6. Interesting Tidbit
      § 2B hours streamed in Q4 2011
      § 75% select movies based on recommendations
      § Moral: We need to scale algorithms.
  • 7.
  • 9. Modeling
      § Markov Chains
      § Collaborative Filtering
      § Large-scale Matching
      § LSA
      § Clustering
      § Row Selection
      § Query Categorization
      § Auto-tagging
      § Sentiment Analysis
  • 10. Markov Chain: Example I [Figure: state diagram with states Bull Market, Bear Market, and Recession; edge labels 0.90, 0.08, 0.80, 0.15, 0.02, 0.25, 0.25, 0.05, 0.50]
  • 11. Markov Chain: Example II [Figure: state diagram; edge labels 0.8, 0.3, 0.3, 0.4, 0.3, 0.2, 0.7]
  • 12. Markov Chain: Formal Definition
      § A Markov chain describes a discrete-time stochastic process over a set of states S = {s1, s2, …, sn} according to a transition probability matrix P = {Pij}
      § Pij = probability of moving to state j when at state i
      § Uses temporal ordering to estimate relatedness
      § The future depends only on the present state, not on the past
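The definition can be made concrete with a small sketch in plain Python. The matrix below is an assumption: it assigns the edge labels from the "Markov Chain: Example I" slide to the Bull/Bear/Recession states in the one way that makes every row sum to 1, since the transcript does not preserve which label belongs to which edge. The function names are hypothetical.

```python
# Assumed mapping of the Example I edge labels to states (not recoverable
# from the transcript itself); each row must sum to 1.
STATES = ["bull", "bear", "recession"]

# P[i][j] = probability of moving to state j when at state i.
P = [
    [0.90, 0.08, 0.02],  # from bull
    [0.15, 0.80, 0.05],  # from bear
    [0.25, 0.25, 0.50],  # from recession
]


def is_row_stochastic(matrix, tol=1e-9):
    """True if every entry is non-negative and every row sums to 1."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def step(dist, matrix):
    """Advance a distribution over states by one time step (dist times P)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def distribution_after(start, matrix, steps):
    """Distribution after `steps` transitions; depends only on `start` and P."""
    dist = list(start)
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist
```

Starting from `[1.0, 0.0, 0.0]` (certainly in the bull state), one call to `step` returns the first row of `P`, which is the Markov property in miniature: the next distribution is determined by the current state alone.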
  • 13. The Math
      § Time Series Aggregation: <u1, m1, t1>, <u1, m2, t2>, <u2, m3, t3>, … becomes <u1> => <m1, t1>, <m2, t2>, <m3, t3>, …
      § Co-occurrence: n( ) = 24,000; n( ) = 30,000
      § Transition Probability: p( ) = 0.8
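The arguments of n( ) and p( ) did not survive in the transcript. Assuming the first count is the number of users who played some movie B after movie A, and the second is the number of users who played A (the symbols m_A and m_B are introduced here for illustration only), the three numbers on the slide are consistent with the usual count-ratio estimate of a transition probability:

```latex
\[
p(m_B \mid m_A)
  \;=\; \frac{n(m_A \to m_B)}{n(m_A)}
  \;=\; \frac{24{,}000}{30{,}000}
  \;=\; 0.8
\]
```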
  • 14. Baseline Implementation & Inefficiencies
      § RDBMS/DW-based
      § SQL limitations
      § Stored procedures
      § Expensive copy
      § Once a week (weekend)
      § Does not exploit inherent parallelism
      § Does not scale well (region, models)
      § 4B+ rows: runs out of memory/space
      § Convoluted joins (maintenance nightmare!)
  • 15. MapReduce Implementation - I
      § Exploits the inherent parallelism in the algorithm.
      § Scale: 25M * 50K (* 50K) ~ 100B+ keys
      § Time Series Aggregation
      [Figure: MapReduce data flow through the stages Input, Split, Map, Shuffle, Reduce, Result. Input records such as U1, T1, M1; U2, T2, M3; U3, T3, M1; U1, T3, M5 are mapped to pairs such as U1 => <T1, M1> and reduced to per-user lists such as U1 => <T1, M1>, <T3, M5>, …]
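The time series aggregation job can be sketched as an in-memory simulation of the map, shuffle, and reduce stages. This is plain Python, not Hadoop code, and the function names (`map_event`, `shuffle`, `reduce_user`, `run_job`) are hypothetical; timestamps are assumed to be integers so they sort naturally.

```python
from itertools import groupby
from operator import itemgetter


def map_event(record):
    """Map: re-key a (user, time, movie) record by user."""
    user, t, movie = record
    yield user, (t, movie)


def shuffle(pairs):
    """Shuffle: bring all values that share a key together."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_user(user, values):
    """Reduce: emit the user's plays as one time-ordered list."""
    return user, sorted(values)


def run_job(records):
    """Run map, shuffle, and reduce over an in-memory list of records."""
    mapped = [kv for record in records for kv in map_event(record)]
    return dict(reduce_user(key, values) for key, values in shuffle(mapped))
```

Because each user's list is built independently of every other user's, the reduce step can run in parallel across keys, which is the parallelism the slide refers to.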
  • 16. MapReduce Implementation - II
      § Transition Probability Matrix
      [Figure: MapReduce data flow through the stages Input, Split, Map, Shuffle, Reduce, Result. Per-user lists such as U1 => T1,M1,… are mapped to movie-pair counts such as M1,M3 => 1, summed in the reducer (e.g. M1,M3 => 3; M2,M3 => 2), and turned into probabilities in the result (e.g. M1,M2 => .2; M1,M3 => .3; M2,M3 => .5)]
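A minimal sketch of the pair-counting job, again as an in-memory simulation rather than Hadoop code. Two things are assumptions, not taken from the slides: every ordered (earlier, later) pair in a user's history is counted, and counts are normalized per source movie so that each row of the matrix sums to 1, in line with the definition of Pij. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_pairs(plays):
    """Map: emit ((earlier, later), 1) for each ordered pair of distinct movies."""
    movies = [movie for _, movie in sorted(plays)]
    for a, b in combinations(movies, 2):
        if a != b:
            yield (a, b), 1


def reduce_counts(mapped):
    """Reduce: sum the 1s emitted for each movie pair."""
    totals = Counter()
    for key, one in mapped:
        totals[key] += one
    return totals


def normalize(totals):
    """Turn pair counts into per-source transition probabilities (assumed scheme)."""
    out_totals = Counter()
    for (a, _), count in totals.items():
        out_totals[a] += count
    return {(a, b): count / out_totals[a] for (a, b), count in totals.items()}


def transition_matrix(per_user_plays):
    """Chain map, reduce, and normalize over {user: [(time, movie), ...]}."""
    mapped = [kv for plays in per_user_plays.values() for kv in map_pairs(plays)]
    return normalize(reduce_counts(mapped))
```

The input here is the output shape of the aggregation job on the previous slide, so the two jobs chain naturally.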
  • 17. In a Nutshell
      § You end up with an N * N matrix
          0    0.3  …  0.7
          0.3  0    …  0.7
          …
          0.2  0.1  …  0
  • 18. But…there is a catch!
  • 19. Solution!
      § Odds Ratio
      § Optimizations
      § Decay
      § Reward
      § In-Window
      § Noise
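The slides name these techniques without defining them, so the following is only one plausible reading, labeled as an assumption throughout: an odds ratio that compares the transition probability to a movie against that movie's overall play probability (to discount titles that are popular regardless of context), and an exponential decay that down-weights pairs of plays that are far apart in time. The function names and the seven-day half-life are hypothetical.

```python
def odds(p):
    """Odds p / (1 - p) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_j_given_i, p_j):
    """Assumed form: odds of j after i, relative to the baseline odds of j."""
    return odds(p_j_given_i) / odds(p_j)


def decay_weight(gap_days, half_life_days=7.0):
    """Assumed form: weight halves every `half_life_days` between two plays."""
    return 0.5 ** (gap_days / half_life_days)
```

Under this reading, a ratio well above 1 means movie j follows movie i more often than its general popularity would explain.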
  • 20. Markov Chain Migration Summary
      RDBMS/DW | Hadoop
      Limited by SQL syntax and semantics | Can be arbitrarily complex
      Expensive data copy from data source to data center | Data copy avoided
      Does not scale to new models and regions | Scales beautifully.
      Maintenance nightmare (stored procedures + convoluted joins) | Easy to maintain (written in high-level language e.g. Java, Pig)
      Resource constraints | No special handling needed.
  • 21. Other Algorithms & Challenges
      Entity | Forms
      Star Trek | strtrek, startrek, start trek, star trek, star treck
      South Park | southpark, sothpark, south parl, souh park
      Doctor Who | docter who, doctor wh, docot who, doctor who:
      Prison Break | prision break, prison brake, prison breal
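One way to map such misspelled forms back to an entity is approximate string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library as a stand-in; the slides do not say which matching technique was actually used, and the `resolve` helper, the normalization step, and the 0.8 cutoff are all assumptions.

```python
import difflib

ENTITIES = ["Star Trek", "South Park", "Doctor Who", "Prison Break"]


def normalize(text):
    """Lowercase and keep only letters and digits, so spacing is ignored."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


# Lookup from the normalized form back to the canonical entity name.
CANONICAL = {normalize(name): name for name in ENTITIES}


def resolve(query, cutoff=0.8):
    """Return the closest entity name for a query, or None if nothing is close."""
    matches = difflib.get_close_matches(
        normalize(query), list(CANONICAL), n=1, cutoff=cutoff
    )
    return CANONICAL[matches[0]] if matches else None
```

Comparing every query against every entity does not scale to a large catalog, which is presumably where the "Large-scale Matching" item on the Modeling slide comes in.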
  • 22.
      § Think Parallel!
      § Optimize
      § ML + Hadoop
      § Visualize
      § Experiment
      § Bucket Test
      § Iterative Processing
  • 23. Big Data + Hadoop + Machine Learning => Great Customer Experience!
  • 24. I HAD AN IDEA. I BUILT IT. I PUSHED IT TO TEST. THE TEST WAS POSITIVE. I PUSHED IT LIVE! We’re hiring!