Applying Apache Spark to Data Science
Challenges in Media & Entertainment
Peyman Mohajerian
Solution Architect, Databricks
Overview
• Introduction to Spark: Unifying Framework
• Content Personalization: Recommendation, Streaming
• Social Media Analytics: GraphFrames
• Viewership Prediction: Topic Modeling, Sentiment
What is Apache Spark?
Next Generation Big Data Processing Engine
• Started as a research project at UC Berkeley in 2009
• 600,000 lines of code (75% Scala)
• Latest release: Spark 1.6 (December 2015)
• Next release: Spark 2.0
• Open Source License (Apache 2.0)
• Built by 1000+ developers from 200+ companies
Apache Spark: Flexible and Unified
[Diagram: data sources (JSON and others) and live data; Spark Core; DataFrames and ML Pipelines; Spark SQL, Spark Streaming, MLlib, GraphX]
RDDs: Collections of Native JVM Objects
• Compile-time type-safety
• Easy to express certain types of logic
• Lots of existing code + users
• Lower-level control of Spark
• Imperative
DataFrames / SQL: Structured Binary Data (Tungsten)
• Lower memory pressure (GC & space)
• Memory accounting (avoids OOMs)
• Faster sorting / hashing / serialization
• More opportunities for automatic optimization
• Declarative
Spark Job Execution
Why Spark ML
Provide general purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
High-level functionality in MLlib
Learning tasks
• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets
Workflow utilities
• Model import/export
• Pipelines
• DataFrames
• Cross validation
Data utilities
• Feature extraction & selection
• Statistics
• Linear algebra
ML Workflows are complex
[Diagram: several data sources, feature extraction steps, feature transforms 1 to 3, two trained models, an ensemble, and an evaluation step]
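A workflow with this many stages is what a pipeline abstraction is meant to keep manageable. The sketch below is plain Python, not Spark ML's API: `Lowercase`, `VocabularyCounts`, and `Pipeline` are hypothetical classes that only illustrate the idea of fitting stages in order and then replaying them on new data.

```python
# Minimal pipeline sketch in plain Python (hypothetical classes, not Spark ML's API).

class Lowercase:
    """Stateless feature transform: lowercases text and splits it into tokens."""
    def fit(self, rows):
        return self

    def transform(self, rows):
        return [text.lower().split() for text in rows]


class VocabularyCounts:
    """Stateful feature transform: learns a vocabulary, then emits count vectors."""
    def fit(self, rows):
        self.vocab = sorted({tok for tokens in rows for tok in tokens})
        return self

    def transform(self, rows):
        return [[tokens.count(word) for word in self.vocab] for tokens in rows]


class Pipeline:
    """Fits each stage on the output of the previous one, then replays them in order."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = ["Great show", "Bad show"]
pipeline = Pipeline([Lowercase(), VocabularyCounts()]).fit(train)
# Words never seen during fit ("unknown", "title") are simply ignored.
features = pipeline.transform(["great great show", "unknown title"])
```

Because the fitted pipeline is a single object, the same sequence of transforms is applied at training time and at scoring time, which is the main point of the abstraction.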
Iterate on Your Models
[Cycle diagram: Analyze Data, Feature Engineering, Train, Tune, Test]
• Databricks Notebooks, Jupyter, or Zeppelin are great for iterative model development using the REPL
• Spark provides fast, scalable infrastructure so you don’t have to wait for your results
• Subsample during the early model development phase, but when in doubt use more data
• Better feature engineering can produce results as good as or better than tuning the algorithm
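The subsampling advice, together with the cross validation utility listed earlier, can be sketched in a few lines. This is a standard-library illustration under assumed names (`subsample` and `k_fold_indices` are hypothetical helpers), not how Spark samples or cross-validates internally.

```python
import random


def subsample(rows, fraction, seed=42):
    """Return a reproducible random subset holding roughly `fraction` of the rows."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


rows = list(range(1000))
small = subsample(rows, fraction=0.1)          # early iterations: work on ~10% of the data
splits = list(k_fold_indices(len(small), k=3))  # each row is used for testing exactly once
```

Fixing the seed keeps the subsample stable between notebook runs, so a change in a metric reflects a change in the model rather than in the sample.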
The Advanced Analytics Gap
ADVANCED ANALYTICS SOLUTIONS
• Anomaly detection
• Predictive analytics
• Next-gen product R&D
SILOED, UNSTRUCTURED, FAST-GROWING DATA
• Hadoop / data lakes
• Cloud storage
• Data warehouses
Just-in-Time Data Platform
• Your storage: data warehouses, Hadoop / data lakes, cloud storage
• Orchestrated Spark in the cloud
• Integrated workspace: dashboards (reports), notebooks (GitHub, viz, collaboration)
• Enterprise security: access control, auditing, encryption
• BI tools, open source, your custom Spark apps
• Management: scalability, resilience, multi-tenancy
• Interfaces: BI tools & RESTful APIs
• Data integration: universal access without centralization
• Managed services + production jobs
• Powered by Apache Spark
Optimized Model Optimization?
Media & Entertainment Use Cases
• Use cases: Content Personalization, Churn & Cohort Analysis, Social Network Graph, Sentiment Analysis
• Platform: Secure Managed Spark Platform (ETL | Data Cleansing, JIT Data Warehouse, Advanced Analytics, Machine Learning / Graph Analysis)
• Data: Pixel Data, Social Media, Nielsen Rating, Image Data, Video Stream, Viewing Data, Survey Data, Wearable Data, CRM Data, Transactional
Content Personalization: Recommendations
• Broad application: movie streaming, matching sites, mobile apps, music
• Key trends
– Continuous application: near-real-time interactivity
– Best products suited to a user’s preferences, to maximize revenue
– A tailored, personalized view of pertinent data for each individual you serve
[Diagram: events (rating, play, browse, ...) from media channels (devices, ...); event distribution; content serving and recommendation; online learning; offline learning; social and behavioral feed; dashboard; content repo; event analytics]
Content Personalization: Recommendations
• Key considerations
– Cold start: content-based, similarity index
– Rating-based: user-item with ALS; item-item with k-means
– Social graph: PageRank of top influencers (Spark GraphFrames)
– Continuous application: Spark Streaming, real-time model, model serving
[Diagram: continuous application. Input stream (movie ratings, social feed); Structured Streaming; real-time ML model updates; offline ML model; serving system; interactive dashboard; storage (real-time & batch)]
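To make the rating-based, user-item approach concrete, here is a minimal sketch of alternating least squares (ALS) restricted to a single latent factor. It is plain Python for illustration, not MLlib's implementation; the function names, the toy ratings, and the regularization value are all assumptions.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20):
    """Rank-1 ALS. `ratings` maps (user, item) -> rating; returns (user_factors, item_factors)."""
    users = [1.0] * n_users
    items = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed: each user factor is a 1-D ridge regression.
        for u in range(n_users):
            num = sum(r * items[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(items[i] ** 2 for (uu, i) in ratings if uu == u)
            users[u] = num / den
        # Hold user factors fixed: solve for each item factor the same way.
        for i in range(n_items):
            num = sum(r * users[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(users[u] ** 2 for (u, ii) in ratings if ii == i)
            items[i] = num / den
    return users, items


def squared_error(ratings, users, items):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - users[u] * items[i]) ** 2 for (u, i), r in ratings.items())


# Toy data: 3 users, 3 movies, some ratings missing (e.g. user 0 has not rated movie 2).
ratings = {(0, 0): 4, (0, 1): 2, (1, 0): 2, (1, 1): 1, (1, 2): 3, (2, 2): 6}
initial_error = squared_error(ratings, [1.0] * 3, [1.0] * 3)
users, items = als_rank1(ratings, n_users=3, n_items=3)
final_error = squared_error(ratings, users, items)
predicted_unseen = users[0] * items[2]  # score for a movie user 0 has not rated
```

A production recommender would use more latent factors and a distributed solver; the alternating structure (fix one side, solve the other in closed form) is the part this sketch is meant to show.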
Continuous Applications: Structured Streaming
Streaming Version
# Read JSON continuously from S3 (a streaming file source needs an explicit schema;
# logSchema is a placeholder for it)
logsDF = spark.readStream.schema(logSchema).json("s3://logs")
# Transform with DataFrame API and save
(logsDF.select("user", "url", "date")
    .writeStream.format("parquet")
    .option("checkpointLocation", "s3://checkpoints")  # placeholder path; the file sink needs a checkpoint location
    .start("s3://out"))
Batch Version
# Read JSON once from S3
logsDF = spark.read.json("s3://logs")
# Transform with DataFrame API and save
logsDF.select("user", "url", "date").write.parquet("s3://out")
Social Media Analytics: GraphFrames
• Application
– Influencers in a social graph (PageRank), distance measures, co-occurrence, clustering
• Key trends
– Marketing campaigns
– Finding friends, jobs, …
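As a rough illustration of ranking influencers with PageRank, the sketch below runs the classic power iteration over a tiny follower graph. It is plain Python rather than the GraphFrames API, and the names, edges, damping factor, and iteration count are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges: list of (src, dst) pairs meaning `src` follows `dst`. Returns {node: rank}."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share, then shares passed along edges.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes  # nodes without out-links spread rank evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


follows = [("ann", "cat"), ("bob", "cat"), ("dan", "cat"), ("cat", "ann"), ("dan", "bob")]
ranks = pagerank(follows)
top_influencer = max(ranks, key=ranks.get)
```

In this toy graph three accounts follow `cat`, so `cat` ends up with the highest rank, and `ann` ranks second because the most influential account follows her.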
GraphFrames are built on top of Spark DataFrames: vertices and edges are represented as DataFrames, which lets us store arbitrary data with each vertex and edge.
• Shortest path: how fast communication propagates
• Label Propagation Algorithm (LPA): detects communities in a graph
• Motif finding: searches for structural patterns in a graph
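Two of these operations are small enough to sketch directly. The code below is plain Python over an edge list, not GraphFrames itself: `shortest_path_lengths` is a breadth-first search that counts hops, and `reciprocal_pairs` looks for one simple motif, two vertices that link to each other. Both function names and the sample edges are hypothetical.

```python
from collections import deque


def shortest_path_lengths(edges, source):
    """Breadth-first search: number of hops from `source` to every reachable vertex."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, []).append(dst)
    hops = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for nxt in neighbors.get(vertex, []):
            if nxt not in hops:
                hops[nxt] = hops[vertex] + 1
                queue.append(nxt)
    return hops


def reciprocal_pairs(edges):
    """Motif a -> b and b -> a: pairs of vertices that link to each other."""
    edge_set = set(edges)
    return sorted((a, b) for a, b in edge_set if a < b and (b, a) in edge_set)


edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("bob", "ann"), ("ann", "cat")]
hops = shortest_path_lengths(edges, "ann")
mutual = reciprocal_pairs(edges)
```

Hop counts give a simple proxy for how quickly a message starting at one account can reach the others, while the reciprocal-edge motif picks out mutual relationships.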
Viewership Prediction: Topic Modeling
• Application
– Programming decisions (e.g. House of Cards)
• Key trends
– Consumer sentiment in real time, augmenting Nielsen rating data with transcript analytics
– Netflix: meta-tags with information about millions of plays per day to determine what will be a hit, what viewers like, and what keeps them watching
Workflow:
• Text tokenization
• Remove stopwords
• Vector of token counts
• Create LDA model with Online Variational Bayes
• Review topics
• Model tuning: create LDA model with Expectation Maximization
• Visualize results
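The first three steps of this workflow (tokenize, drop stopwords, build token-count vectors) can be sketched with the standard library alone. The stopword list and sample transcripts below are made up for illustration, and the functions are hypothetical stand-ins for the corresponding Spark feature transformers; fitting the LDA model itself is left out.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and", "to"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text and keep runs of letters (and apostrophes) as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def count_vectors(documents):
    """Return (vocabulary, vectors): one count vector per document, aligned to the vocabulary."""
    token_lists = [remove_stopwords(tokenize(doc)) for doc in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


transcripts = ["The senator plots and the senator wins", "A dragon burns the city"]
vocabulary, vectors = count_vectors(transcripts)
```

The resulting count vectors are the kind of input a topic model such as LDA consumes: one row per document, one column per vocabulary term.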
Sentiment Analysis: Logistic Regression
• Application
– Not just Twitter: any type of text, e.g. viewers’ comments on a movie stream
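As a minimal sketch of sentiment classification with logistic regression, the code below trains on four invented viewer comments using bag-of-words counts and stochastic gradient descent. Everything here (vocabulary, comments, learning rate, epoch count) is an assumption for illustration; it is plain Python, not MLlib's LogisticRegression.

```python
import math

VOCAB = ["boring", "great", "loved", "terrible"]  # hypothetical sentiment-bearing words


def featurize(comment):
    """Bag-of-words counts over the fixed vocabulary."""
    tokens = comment.lower().split()
    return [tokens.count(word) for word in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def positive_probability(weights, bias, x):
    """Model's estimated probability that a comment with features `x` is positive."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=200):
    """Stochastic gradient descent on the logistic (log) loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = positive_probability(weights, bias, x) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


comments = ["loved it great cast", "great finale", "boring and terrible", "terrible pacing"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative
weights, bias = train_logistic_regression([featurize(c) for c in comments], labels)

p_good = positive_probability(weights, bias, featurize("great show"))
p_bad = positive_probability(weights, bias, featurize("boring episode"))
```

The learned weights are directly readable: words seen in positive comments get positive weights and push the probability above 0.5, and words seen in negative comments do the opposite.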
Thanks
Peyman Mohajerian
peyman@databricks.com

Media_Entertainment_Veriticals

  • 1. Applying Apache Spark to Data Science Challenges in Media & Entertainment Peyman Mohajerian Solution Architect Databricks
  • 2. Overview • Introduction to Spark- Unifying Framework • Content Personalization: Recommendation, Streaming • Social Media Analytics: GraphFrames • Viewership Prediction: Topic Modeling, Sentiment
  • 3. What is ? Next Generation Big Data Processing Engine
  • 4. • Started as a research project at UC Berkeley in 2009 • 600,000 lines of code (75% Scala) • Last Release Spark 1.6 December 2015 • Next Release Spark 2.0 • Open Source License (Apache 2.0) • Built by 1000+ developers from 200+ companies 4
  • 5. Apache Spark: Flexible and Unified 5 {JSON } Data Sources Spark Core DataFrames ML Pipelines Spark Streaming Spark SQL MLlib GraphX Live Data
  • 6.
  • 7. RDDs DataFrames / SQL Collections of Native JVM Objects Structured Binary Data (Tungsten) • Compile-time type-safety • Easy to express certain types of logic • Lots of existing code + users • Lower level control of Spark • Imperative • Lower memory pressure (gc & space) • Memory accounting (avoids OOMs) • Faster sorting / hashing / serialization • More opportunities for automatic optimization • Declarative
  • 9. 9
  • 10. Why Spark ML Provide general purpose ML algorithms on top of Spark • Let Spark handle the distribution of data and queries; scalability • Leverage its improvements (e.g. DataFrames, Datasets, Tungsten) Advantages of MLlib’s Design: • Simplicity • Scalability • Streamlined end-to-end • Compatibility
  • 11. High-level functionality in MLlib Learning tasks Classification Regression Recommendation Clustering Frequent itemsets 11 Workflow utilities • Model import/export • Pipelines • DataFrames • Cross validation Data utilities • Feature extraction & selection • Statistics • Linear algebra
  • 12. ML Workflows are complex Train model 1 Evaluate Datasource 1 Datasource 2 Datasource 2 Extract features Extract features Feature transform 1 Feature transform 2 Feature transform 3 Train model 2 Ensemble 12
  • 13. Iterate on Your Models: Analyze Data → Feature Engineering → Train → Tune → Test • Databricks Notebooks, Jupyter, or Zeppelin are great for iterative model development using the REPL • Spark provides fast, scalable infrastructure so you don’t have to wait for your results • Subsample during the early model development phase, but when in doubt use more data • Better feature engineering can produce results as good as or better than tuning the algorithm
  • 14. The Advanced Analytics Gap. ADVANCED ANALYTICS SOLUTIONS: ANOMALY DETECTION, PREDICTIVE ANALYTICS, NEXT GEN PRODUCT R&D. SILOED, UNSTRUCTURED, FAST-GROWING DATA: HADOOP / DATA LAKES, CLOUD STORAGE, DATA WAREHOUSES
  • 15. Just-in-Time Data Platform: ORCHESTRATED SPARK IN THE CLOUD. YOUR STORAGE: DATA WAREHOUSES, HADOOP / DATA LAKES, CLOUD STORAGE. INTEGRATED WORKSPACE: DASHBOARDS (Reports), NOTEBOOKS (github, viz, collaboration). ENTERPRISE SECURITY: Access control, auditing, encryption. BI TOOLS, OPEN SOURCE, YOUR CUSTOM SPARK APPS. MANAGEMENT: Scalability, resilience, multi-tenancy. INTERFACES: BI tools & RESTful APIs. DATA INTEGRATION: Universal access without centralization. MANAGED SERVICES, PRODUCTION JOBS. Powered by Apache Spark
  • 17. Media & Entertainment Use Cases. Use cases: Content Personalization, Churn & Cohort Analysis, Social Network Graph, Sentiment Analysis. Secure Managed Spark Platform: ETL | Data Cleansing, JIT Data Warehouse, Advanced Analytics, Machine Learning / Graph Analysis. Data: Pixel Data, Social Media, Nielsen Rating, Image Data, Video Stream, Viewing Data, Survey Data, Wearable Data, CRM Data, Transactional
  • 18. Content Personalization: Recommendations • Broad Application – Movie Streaming, Matching Sites, Mobile App, Music • Key Trends • Continuous Application: near-real-time interactivity • Best products suited to a user’s preferences to maximize revenue • Provide a tailored and personalized view of pertinent data for each individual you serve [Diagram labels: Media Channels (devices, ..); Rating, Play, Browse, ..; Event Distribution; Content Serving; Recommendation ..; Online Learning; Offline Learning; Social Behavioral Feed; Dashboard; Content Repo; Event Analytics]
  • 19. Content Personalization: Recommendations • Key Considerations – Cold Start: Content Based, Similarity Index – Rating Based: User-Item: ALS; Item-Item: K-Means – Social Graph: PageRank of Top Influencers with Spark GraphFrames – Continuous Application: Spark Streaming, Real-time Model, Model Serving [Diagram labels: Input Stream (Movie Rating, Social Feed); Structured Streaming; Real-time ML Model Updates; Off-line ML; Model Serving System; Interactive Dashboard; Continuous Application; Storage; Realtime & Batch]
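The "User-Item: ALS" item on this slide can be illustrated with a small sketch. The version below is plain Python, not Spark's implementation, and it assumes a single latent factor per user and item so that each alternating step has a closed-form solution. The ratings, the regularization value `reg`, and the function names are made up for illustration; Spark MLlib's ALS uses higher ranks and distributes the solves.

```python
def regularized_loss(ratings, user, item, reg):
    """Squared error on observed ratings plus a ridge penalty on the factors."""
    fit = sum((r - user[u] * item[i]) ** 2 for (u, i), r in ratings.items())
    penalty = reg * (sum(x * x for x in user) + sum(x * x for x in item))
    return fit + penalty


def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20):
    """Alternating least squares with one latent factor (illustrative only)."""
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; each user factor is a 1-D ridge regression.
        for u in range(n_users):
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item[i] for i, r in seen)
            den = reg + sum(item[i] ** 2 for i, _ in seen)
            user[u] = num / den
        # Hold user factors fixed; solve each item factor the same way.
        for i in range(n_items):
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user[u] for u, r in seen)
            den = reg + sum(user[u] ** 2 for u, _ in seen)
            item[i] = num / den
    return user, item


# Hypothetical (user, item) -> rating observations; missing pairs are unrated.
ratings = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 5.0, (2, 2): 2.0,
}
user, item = als_rank1(ratings, n_users=3, n_items=3)
# Predicted score for a pair that was never rated (user 0, item 2).
predicted = user[0] * item[2]
```

Because each half-step minimizes the regularized loss exactly for one block of factors, the loss never increases from one iteration to the next, which is the property that makes the alternating scheme attractive.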
  • 21. Structured Streaming
    Batch Version:
      // Read JSON once from S3
      logsDF = spark.read.json("s3://logs")
      // Transform with DataFrame API and save
      logsDF.select("user", "url", "date")
            .write.parquet("s3://out")
    Streaming Version:
      // Read JSON continuously from S3
      logsDF = spark.readStream.json("s3://logs")
      // Transform with DataFrame API and save
      logsDF.select("user", "url", "date")
            .writeStream.parquet("s3://out")
            .start()
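The slide's snippet is shorthand written before Spark 2.0 shipped. As a hedged sketch of how the same batch/streaming pair might look with the released PySpark API: a streaming file source expects an explicit schema, and a streaming file sink is configured through `format`, a `path` option, and a `checkpointLocation` option. The all-string schema, the function names, and the checkpoint argument are assumptions made for illustration, not part of the original deck.

```python
LOG_COLUMNS = ["user", "url", "date"]


def log_schema():
    """Assumed schema: three nullable string columns."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql.types import StringType, StructField, StructType
    return StructType([StructField(c, StringType(), True) for c in LOG_COLUMNS])


def run_batch(spark, src, dst):
    """Read JSON once, keep three columns, write Parquet."""
    logs = spark.read.schema(log_schema()).json(src)
    logs.select(*LOG_COLUMNS).write.mode("append").parquet(dst)


def run_streaming(spark, src, dst, checkpoint):
    """Same transformation, applied continuously as new files arrive."""
    logs = spark.readStream.schema(log_schema()).json(src)
    return (
        logs.select(*LOG_COLUMNS)
        .writeStream
        .format("parquet")
        .option("path", dst)
        .option("checkpointLocation", checkpoint)  # required by file sinks
        .start()
    )
```

The point of the slide survives the extra options: the `select` in the middle is identical in both functions, and only the read and write ends change.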
  • 22. Social Media Analytics: GraphFrames • Application – Influencers in a Social Graph (PageRank), Distance Measure, Co-occurrence, Clustering • Key Trends • Marketing Campaigns • Finding Friends, Jobs, … GraphFrames are built on top of Spark DataFrames; vertices and edges are represented as DataFrames, allowing us to store arbitrary data with each vertex and edge. Shortest Path: how fast communication propagates. Label Propagation Algorithm (LPA): detect communities in a graph. Motif finding: search for structural patterns in a graph.
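To make the "influencers via PageRank" idea concrete, here is a minimal plain-Python sketch of the PageRank iteration on a tiny follower graph. It is not GraphFrames code: GraphFrames provides PageRank as a built-in, and its ranks are scaled differently. The edge list, the damping value, and the names are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical "follows" edges: (follower, followed).
follows = [("ann", "cat"), ("bob", "cat"), ("dan", "cat"), ("cat", "ann")]
ranks = pagerank(follows)
top_influencer = max(ranks, key=ranks.get)
```

In this toy graph the account followed by everyone else ends up with the highest rank, and the account it follows back inherits much of that weight.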
  • 23. Viewership Prediction: Topic Modeling • Application – Programming Decisions (e.g. House of Cards) • Key Trends • Consumer sentiment in real time, augmenting Nielsen rating data with transcript analytics • Netflix: Meta-tags with information about millions of plays per day to determine what will be a hit, what viewers like, and what keeps them watching. Pipeline: Text Tokenization → Remove Stopwords → Vector of Token Counts → Create LDA model with Online Variational Bayes → Review Topics → Model Tuning: Create LDA model with Expectation Maximization → Visualize Results
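The first three stages of the topic-modeling pipeline (tokenization, stopword removal, token-count vectors) can be sketched in a few lines of plain Python. This only illustrates the data shapes an LDA model would consume; the LDA fit itself is left to a library, and the stopword list and sample transcripts are invented for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real pipelines use a much longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def build_vocabulary(docs_tokens):
    """Sorted list of distinct terms; the index is the vector position."""
    return sorted({t for tokens in docs_tokens for t in tokens})


def count_vector(tokens, vocabulary):
    """Bag-of-words counts aligned with the vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


transcripts = [
    "The senator plots a return to power",
    "A return of the dragon and the throne",
]
cleaned = [remove_stopwords(tokenize(t)) for t in transcripts]
vocabulary = build_vocabulary(cleaned)
vectors = [count_vector(tokens, vocabulary) for tokens in cleaned]
```

Each transcript becomes one fixed-length row of counts, which is the input format a topic model expects.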
  • 24. Sentiment Analysis: Logistic Regression • Application – Not just Twitter; any type of text, e.g. viewers’ comments on movie streams
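As a rough illustration of logistic regression for sentiment, the sketch below trains a bag-of-words classifier with stochastic gradient descent in plain Python. It stands in for what a Spark ML pipeline would do at scale; the four labeled comments, the learning rate, and the function names are assumptions made up for this example.

```python
import math
from collections import Counter


def featurize(text, vocabulary):
    """Word counts aligned with the vocabulary; unknown words are ignored."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, learning_rate=0.5, epochs=200):
    """Fit weights and bias by per-example gradient steps on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical viewer comments: 1 = positive, 0 = negative.
comments = [
    ("loved this episode great acting", 1),
    ("great finale loved it", 1),
    ("boring plot terrible pacing", 0),
    ("terrible episode boring acting", 0),
]
vocabulary = sorted({w for text, _ in comments for w in text.lower().split()})
X = [featurize(text, vocabulary) for text, _ in comments]
y = [label for _, label in comments]
weights, bias = train(X, y)


def predict_proba(text):
    """Probability that a comment is positive under the trained model."""
    features = featurize(text, vocabulary)
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
```

Calling `predict_proba` on a new comment returns a score between 0 and 1; words the model never saw during training contribute nothing to the score.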