Warner Bros. processes billions of records globally each day across its web assets, digital content distribution, OTT streaming services, online and mobile games, technical operations, anti-piracy programs, social media, and retail point-of-sale transactions. By combining these datasets with content metadata, Warner Bros. is able to produce consumer insights and affinity models that result in highly accurate audience segments.
Big Data Day LA 2016/ Big Data Track - Warner Bros. Digital Consumer Intelligence at Scale, Brian Kursar, VP Data Strategy & Architecture, Warner Bros
1. Big Data at WB
Brian Kursar – VP Data Strategy and Architecture
Big Data Day LA 2016 – West Los Angeles College
2. COMPANY OVERVIEW
Warner Bros. Entertainment Inc., a Time Warner
Company, is a fully integrated, broad-based
entertainment company.
Warner Bros. is the global leader in the creation,
production, distribution, licensing and marketing of all
forms of entertainment and their related businesses.
We stand at the forefront of every aspect of the
entertainment industry, from feature films to television,
home entertainment, animation, comic books,
interactive entertainment and games, product and
brand licensing, and broadcasting.
4. WARNER BROS.
Warner Bros. Pictures Group
• Crossed the billion-dollar mark for a 15th year running
• Over 7000 Feature Films
DC Entertainment
• Largest English-language publisher of comics in the world, with more than 1,200 titles each year
Warner Bros. Consumer Products
• More than 3,700 active licensees worldwide
Warner Home Entertainment
• Industry-leading 21 percent market share
Warner Bros. Television Group
• #1 the last 10 out of 11 years, and 23 out of 26 years!
• Produced 79 series (69 primetime) for broadcast, first-run syndication and cable in the 2014-15 season
Warner Bros. Studio Tours
• Harry Potter UK Leavesden Tour
• Hollywood Studio Tour
6. OUR STACK
• STORAGE
• PROCESS
• CODE
• DATA DISCOVERY AND SELF SERVICE BI
• REAL-TIME VISUALIZATION
7. Connecting the Dots Isn’t Easy
Harry Potter Franchise
• Theatrical Movies
• Home Entertainment (Blu-ray, Digital Downloads)
• Streaming (Netflix, Hulu, etc.)
• Television
• Video Games
• Consumer Products
• Theme Parks
“I want a 360 Franchise View”
9. Questions that need answers
What is the impact of a trailer drop or a social media comment on…
• Theatrical Box Office Sales
• Franchise Back Catalog Sales
• Consumer Products Sales
• Studio Tour Ticket Sales
11. 1 • FILTER KEYWORDS
Define keyword terms associated with Titles
CREED, #CREEDMOVIE, PAN, #PANMOVIE, …
2 • CLASSIFY TITLES
Tag Filters associated with Titles
CREED OR #CREEDMOVIE = CREED
3 • INDEX DATA
Write Data to a real-time analysis engine
4 • ANALYZE RESULTS
Aggregated data is pushed to Tableau to visualize
19. Concept Based Disambiguation
“Creed”
• Top Talent: “Michael B. Jordan”, “Sylvester Stallone”
• Characters: Apollo, Rocky
• “Boxing”
• “movie”, “film”
• Unique Social IDs: Video ID, Facebook Page, Amazon Product ID
• != Exclude terms like “band”, “assassin's”
26. Social Pipeline
• Social API
• Apply L1 Title Filter
• Pull into Spark RDD
• Apply L2 Concept Title Filter Rules
• Pre-Process Title Ambiguity Score and Rules
27. L2 Concept Filter
tag.movie.title “Creed”
{
(
// Unique Values: Hash Tags, Video IDs, or Anything on CreedMovie FB Page
(fb.all.content CONTAINS_ANY “#Creedmovie,#Creedfilm”
OR
links.url CONTAINS_ANY “Uv554B7YHk4,-HPam119fhM”
OR
fb.topics.username == “creedmovie”)
// Title + Disambiguation concepts
OR
fb.all.content CONTAINS “Creed” NEAR_ANY
(Sylvester, Stallone, Michael B. Jordan, Michael Jordan, Tessa Thompson,
#SylvesterStallone, #MichaelBJordan, #TessaThompson,
Rocky, Apollo, Boxing,
film, movie, theater, movies, warner bros, wb, warner brothers):10
)
// Excluded Terms
AND NOT
fb.all.content CONTAINS_ANY
(assassin, assassins, band, piratebay, torrent, isohunt, megaupload)
}
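The logic of the rule above can be approximated in plain Python. This is only a sketch under stated assumptions: the post fields (`content`, `links`, `page_username`) are hypothetical names, `:10` is assumed to mean "within 10 words", and multi-word concepts such as "Michael B. Jordan" are left out to keep the proximity check simple.

```python
import re

# Values copied from the rule above.
HASHTAGS = {"#creedmovie", "#creedfilm"}
VIDEO_IDS = ("Uv554B7YHk4", "-HPam119fhM")
PAGE_USERNAME = "creedmovie"
EXCLUDED = {"assassin", "assassins", "band", "piratebay", "torrent",
            "isohunt", "megaupload"}
# Single-word concepts only; multi-word names are omitted to keep this short.
CONCEPTS = {"sylvester", "stallone", "#sylvesterstallone", "#michaelbjordan",
            "#tessathompson", "rocky", "apollo", "boxing", "film", "movie",
            "theater", "movies", "wb"}
NEAR_WINDOW = 10  # assumption: ":10" means "within 10 words"


def tokenize(text):
    """Lower-case the text and split it into word and hashtag tokens."""
    return re.findall(r"#?\w+", text.lower())


def title_near_concept(tokens, title="creed"):
    """True if the title occurs within NEAR_WINDOW tokens of any concept."""
    title_pos = [i for i, tok in enumerate(tokens) if tok == title]
    concept_pos = [i for i, tok in enumerate(tokens) if tok in CONCEPTS]
    return any(abs(t - c) <= NEAR_WINDOW for t in title_pos for c in concept_pos)


def matches_creed(post):
    """Apply (unique values OR title near concept) AND NOT excluded terms."""
    tokens = tokenize(post.get("content", ""))
    token_set = set(tokens)
    has_unique = (
        bool(token_set & HASHTAGS)
        or any(vid in url for url in post.get("links", []) for vid in VIDEO_IDS)
        or post.get("page_username") == PAGE_USERNAME
    )
    has_excluded = bool(token_set & EXCLUDED)
    return (has_unique or title_near_concept(tokens)) and not has_excluded
```

With this sketch, a post such as "Creed was great, Stallone deserves an Oscar" matches, while "Assassin's Creed movie trailer" is rejected by the excluded terms even though "movie" sits next to "Creed".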
28. Social Pipeline
• Social API
• Apply L1 Title Filter
• Pull into Spark RDD
• Apply L2 Concept Title Filter Rules
• Write to Elasticsearch
• Write to Parquet
• Visualizations and Ad-Hoc Queries
• Process in Spark for Affinity Models
• Pre-Process Title Ambiguity Score and Rules
33. Thank you
Brian Kursar – VP Data Strategy and Architecture
WB Data and Analytics
@briankursar
For more information on exciting opportunities in Big Data happening at Warner Bros., please come by the Warner Bros. Career Booth here at Big Data Day LA 2016 or check out our careers site at WarnerBrosCareers.com.
Editor's Notes
Warner Bros. Entertainment Inc., a Time Warner Company, is a fully integrated, broad-based entertainment company.
Warner Bros. is the global leader in the creation, production, distribution, licensing and marketing of all forms of entertainment and their related businesses.
We stand at the forefront of every aspect of the entertainment industry, from feature films to television, home video/DVD, animation, comic books, interactive entertainment and games, product and brand licensing, and broadcasting.
Warner Bros. falls under the Time Warner Family of Companies.
Today, the WB Data and Analytics team supports six of the largest groups in Warner Bros.: Warner Bros. Pictures (our Theatrical group), DC Entertainment, Consumer Products, Home Entertainment, Warner Bros. Television, and our two Studio Tour facilities, in Leavesden and here in Hollywood. And across all of these business units, we have data.
And lots of it. Everything you can possibly imagine: web logs, ratings, social, supply chain, point-of-sale data from our retailers and exhibitors, sales data from our digital retailers, data on content piracy, production, and we even have our own Netflix-like OTT offerings, which we are actively working to integrate into our platform. But the real trick lands on our Data Engineering team, which is tasked with connecting this data so that, across our lines of business, we are able to make some sense of it all. This is where the real magic happens.
A year ago we started building our platform. For storage we use Amazon S3 and Parquet. We also store a lot of our data in Teradata and Amazon Redshift. Recently we have been introducing a number of new components such as Kafka and Spark, as well as Elasticsearch. We like Spark for its ability to crunch our massive log files, while Elasticsearch has proven a great tool for a number of use cases. It acts as our NoSQL layer and enables us to do massive refinements across billions of records. It is also great for social media analytics due to its search capability. We use Python and Scala to wrangle our data. For larger jobs we tend to use Scala on Spark, as we find it much easier to debug in Scala than in Python. For self-serve and data discovery we use a few tools: Tableau is used primarily for visualizations, MicroStrategy is heavily used with our Home Entertainment group, and Kibana is a tool we have recently brought in for log analysis. Finally, we recently started using D3 and Angular on Node.js to create big-screen visualizations.
Even with all of these great tools, the fact is that connecting the dots isn’t easy. It requires a lot of domain expertise and some serious data engineering. For instance, if someone wanted to understand how to ask analytical questions across a Franchise, this is not an easy thing to do. Some Franchises have many different facets across many industries. Traditionally this data was residing in several disparate databases. There are multiple theatrical releases, Home Entertainment releases, Streaming & Television deals, Video Games, Consumer Products, Studio Tours and Theme parks.
And what about social? Social has a tremendous impact on sales, but how can we connect it so we can quantify questions like this:
What is the impact of a trailer drop or a social media comment on…
Theatrical Box Office Sales
Franchise Back Catalog Sales
Consumer Products Sales
Studio Tour Ticket Sales
About a year ago, we began working on connecting Social Media Activity on our Theatrical and Home Entertainment Titles.
We took a stab at creating some filters using the title names and hashtags, wrote some general normalization tags, indexed the data to a search index, and then pushed the aggregate results to a Tableau workbook for visualization.
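A minimal Python sketch of this first keyword-based pass is shown here. The keyword table, function names, and sample posts are hypothetical illustrations rather than the actual WB filters; the point is that a plain keyword match tags any post containing the term, whatever the post is really about.

```python
import re
from collections import Counter

# Hypothetical L1 keyword table: each title and the filter terms that tag it.
TITLE_KEYWORDS = {
    "CREED": {"creed", "#creedmovie"},
    "PAN": {"pan", "#panmovie"},
}


def classify_titles(text):
    """Return the titles whose keywords appear as whole tokens in the text."""
    tokens = set(re.findall(r"#?\w+", text.lower()))
    return sorted(t for t, kws in TITLE_KEYWORDS.items() if tokens & kws)


def count_mentions(posts):
    """Aggregate mention counts per title, ready to push to a dashboard."""
    counts = Counter()
    for post in posts:
        counts.update(classify_titles(post))
    return counts


# Made-up sample posts; the second is about bread, yet it is still tagged PAN.
posts = ["Just saw #CreedMovie, loved it", "pan con tomate for breakfast"]
mentions = count_mentions(posts)
```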
Focusing on Pan and Creed, we created a tag cloud doing some basic noun phrase extraction so we could visualize what topics were being discussed around these two WB Titles. And here were the results.
Bread Pudding… Cooking… Video Games… Not exactly what we were expecting. You see,
Creed the movie is not the same as Creed the band, or Assassin's Creed the video game.
Nor is Pan the movie in any way associated with pan, the Spanish word for bread. But once we understood the connection, it made sense why we were seeing Bread Pudding as a topic for our Pan. We quickly had to change our query and classification approach. This started with having to focus on how best to
disambiguate, or essentially remove from our queries all noise not related to the content we were trying to track. But we had another problem.
We started grouping titles into three categories: unique titles such as Batman V Superman, ambiguous titles such as Creed, and very ambiguous titles such as the film “Her”.
In a pre-process step, we used Wikipedia to help us score the titles. Granted, we had a few other tricks up our sleeve, but essentially, any title coming up only once in a Wikipedia search has a high probability of being unique, whereas a title coming up more than once has a high probability of being ambiguous.
Then we created a concept-based disambiguation model.
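The Wikipedia-based scoring heuristic from the pre-process step could be sketched as follows. This assumes the number of search hits per title has already been obtained elsewhere; the function name, the `very_ambiguous_at` cut-off, and the sample hit counts are all hypothetical, since the talk does not say how the "very ambiguous" group was separated from the "ambiguous" one.

```python
def ambiguity_category(search_hits, very_ambiguous_at=10):
    """Map a title's Wikipedia search hit count to an ambiguity category.

    One hit (or none) suggests a unique title; more than one suggests an
    ambiguous one. The `very_ambiguous_at` cut-off is a hypothetical value.
    """
    if search_hits <= 1:
        return "unique"
    if search_hits < very_ambiguous_at:
        return "ambiguous"
    return "very ambiguous"


# Illustrative hit counts only; these are not measured Wikipedia results.
sample_hits = {"Batman V Superman": 1, "Creed": 4, "Her": 25}
categories = {title: ambiguity_category(n) for title, n in sample_hits.items()}
```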
For the title “Creed”, we started collecting additional terms to disambiguate it further. We started with the names of top talent associated with the title:
Stallone, Michael B. Jordan; then, similarly, characters (Rocky, Apollo) and words associated with the type of content. Was it a movie, or a film? Did they see it in a theater? Then additional related themes like boxing. Then we added terms that, if present with the term “Creed”, made it highly likely that the interaction was not relevant: terms like “band” or the word “assassin's”. We also included other unique IDs from social content that we could tie back to Creed. We also added logic to pull in contextual data where the title would not be mentioned.
This post could be talking about any movie in the world as it fails to mention any particular movie.
But if it is found on the Creed movie Facebook page, then we can attribute the comment to the movie Creed. Quick question: what percentage of the comments on the Creed movie page do not contain the word “Creed”?
88% of consumer engagement on the Creed FB page does NOT contain the keyword “Creed”.
Similarly, in this post, the YouTube video is the Creed YouTube trailer. By extracting the unique video ID, we can safely assume that every time this ID is featured in any post, we should classify the post against the movie Creed.
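Extracting a video ID from a shared link and mapping it to a title might look like the sketch below. It assumes the two IDs listed in the Creed filter rule are YouTube video IDs and that links arrive in the usual `youtube.com/watch?v=…` or `youtu.be/…` forms; the function and variable names are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# IDs taken from the Creed filter rule; assumed here to be YouTube video IDs.
VIDEO_ID_TO_TITLE = {"Uv554B7YHk4": "Creed", "-HPam119fhM": "Creed"}


def extract_video_id(url):
    """Return the YouTube video ID found in a URL, or None if there is none."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.endswith("youtu.be"):
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host.endswith("youtube.com"):
        # Watch links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


def title_for_links(links):
    """Return the first title whose known video ID appears in the links."""
    for url in links:
        title = VIDEO_ID_TO_TITLE.get(extract_video_id(url))
        if title:
            return title
    return None
```

A post that never mentions the title but links to `https://youtu.be/-HPam119fhM` would then still be classified against Creed.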
Our social pipeline for this is fairly straightforward. We call a Social API, passing it a general filter.
In a separate job we have generated the disambiguation rules. The data is pulled into a Spark RDD and we apply the pre-processed ambiguity rules.
And this looked quite a bit better than our previous results. So here we were able to
Here we pull these massive files from our P2P data provider and leverage Spark to match across the title. Once we have made our matches, we write out to Elasticsearch and stream the data to our near real-time visualization.