Big Data Science Challenges
Chandan Rajah – Chief Architect, Big Data
crajah@parallelai.com [ @chandanrajah ]
Why Big Data Science?
Value & Vision
• Machine learning (clustering, classification, regression, pattern mining, behaviour analysis, semantic analysis, topic extraction)
• Real time analytics & recommendations
• Central melting pot
• Cost to data benefits
Volume & Variety
• 10 million subscribers; 10 different touch points
• Petabytes of data; structured and unstructured
• Event logs, program data, content metadata, purchase history, etc.
• Too big for traditional data warehouse
Velocity & Veracity
• 140 MB/s (approx. 12 TB/day)
• Too fast; 95% of the data dropped
• Inconsistent data structure
• No single version of truth
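
The velocity figures above can be checked with a little arithmetic. The following is a minimal sketch, assuming decimal units (1 TB = 1,000,000 MB), a constant ingest rate, and that "95% dropped" means 95% of the incoming stream is discarded before storage. The function names are illustrative and do not come from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
MB_PER_TB = 1_000_000           # assumption: decimal (SI) units


def daily_volume_tb(rate_mb_per_s: float) -> float:
    """Convert a sustained ingest rate in MB/s to TB per day."""
    return rate_mb_per_s * SECONDS_PER_DAY / MB_PER_TB


def retained_tb(daily_tb: float, dropped_fraction: float) -> float:
    """Volume left per day after a given fraction of the data is dropped."""
    return daily_tb * (1.0 - dropped_fraction)


ingest = daily_volume_tb(140)     # roughly 12.1 TB/day, matching the slide
kept = retained_tb(ingest, 0.95)  # roughly 0.6 TB/day actually retained
print(f"ingest: {ingest:.2f} TB/day, kept: {kept:.2f} TB/day")
```

Under these assumptions, only about 0.6 TB of the roughly 12 TB arriving each day would be kept.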
Big Data Science Challenges
Data Quality
Feature Extraction
Machine Learning
Visualisation & Verification
Productizing
• Dirty unstructured data with inconsistent labels
• Start but no end events
• Field shifts between extracts
• XML fragmented data; 100k frags
• Data too big to run in R; requires subsampling and effective implementation
• 100s of features; too big for Scala / Scalding tuple
• No clearly identifiable keys
• Algorithm implementation issues (e.g. parallelism, scalability, testability)
• Collaborative filtering, topic modelling, incremental clustering, sentiment analysis
• Real time versus batch algorithm design
• Visualisation tool support
• Automated testing frameworks
• R -> Scala / Scalding not easy
• Disaster recovery & cross data centre
• On the fly analytics; data streams
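
The slides note that data too big to run in R requires subsampling, but do not say which method was used. One common option for data that arrives as a stream of unknown length is reservoir sampling (Algorithm R), which keeps a uniform random sample of fixed size in a single pass. The sketch below is an illustration under that assumption; the function name and parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Uses Algorithm R: one pass, O(k) memory, no need to know the
    stream length in advance.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: draw 1,000 records from a stream of 100,000 event ids.
sample = reservoir_sample(range(100_000), k=1000, seed=42)
print(len(sample))
```

The resulting sample is small enough to load into a single-machine tool for exploration, while the full data set stays in the distributed store.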

Big Data Science Challenges in Media
