SlideShare uma empresa Scribd logo
1 de 28
Data Mining in Hadoop,
Making Sense of it in Mahout!


Hadoop World 2011
Michael Cutler @cotdp
Hello (Hadoop) World!
• Senior Research Engineer

• British Sky Broadcasting

• Lead the Hadoop initiative

• Fostering development
Topics
•   What is Data Mining?
•   Introducing Mahout
•   Using Mahout
•   Demo
•   Summary
•   Q&A
What is Data Mining?
It’s all about discovery...
• Grouping similar data records

• Identifying unusual records

• Detecting relationships between records

• Discovering previously unknown patterns
Trends...
• 1990’s approach;
 “Think carefully first and get it right!”

• 2000’s approach;
 “Think a little first, evolve it later...”

• 2010’s approach;
 “... if we capture everything, sense will come(?)”
Cost of Storage




http://www.mkomo.com/cost-per-gigabyte
Other Reasons...
• Increased generation of data

• Complex interconnected datasets

• You can be lazy about it...

Consequence:
  – More data to process than ever before
Traditional Approach...
• Collate your data into files

• 6pm take your Database offline

• Bulk load the previous 24hrs data

• Run data mining, analytics, reporting overnight

• Bring the database back up for 9am
Modern Approach
• Stream data straight into Hadoop

• No need for downtime

• Analysis updated periodically or real-time

• Scalable approach
Introducing
What is it?
Library of scalable machine learning algorithms;

• Classification

• Clustering

• Collaborative Filtering (Recommendations)

• Frequent Pattern mining ... and many more
How do you use it?
• It’s just a Java library

• Simple to get started

• Easy to extend and enhance

• Powerful command-line tools & examples
Classification
• Labels input data with one or more categories

• Trained with known data
Clustering
• Groups data based on their similarity

• Unsupervised – no training
Collaborative Filtering
• User-based recommendations
  – Analyse user data
  – Build neighbourhoods of users
  – Other people like you, liked <these>

• Item-based recommendations
  – Analyse domain data
  – Build relationships between items
  – If you liked this, what about <these>
Others
• Frequent Pattern mining




• High performance maths & utilities
Mahout is a toolbox
• Understand your data

• Determine what needs to be done

• Build a pipeline to compute results

• Think about performance from the start
Please Note
• Scalability through Map/Reduce jobs

• Like MR it is inherently Batch-driven

• Some are not implemented in MR yet

• Fast-paced development
Using Mahout
Building a Recommender
Objectives:

• Personalised

• Item-based recommendations

• Evolve with the times

• Implicit feedback through measurement
Problems with Recommenders
• “Cold start” problem

• “New stuff” problem

• Tainted profiles

• Stale profile data
When they go wrong...
Basic Strategy
• Pre-compute rarely-changing data

• Cache and serve them using traditional means

• Flag data when it needs refreshed

• Tailor the cache on-the-fly
Demo
Summary
• Mahout is exciting!

• Wide range of applications

• Scalable algorithms

• Scalable community
Questions?
Thank you!


Hadoop World 2011
Michael Cutler @cotdp

Mais conteúdo relacionado

Mais procurados

Hadoop big data online training
Hadoop big data online trainingHadoop big data online training
Hadoop big data online trainingMagnific Trainings
 
Hadoop training and certification
Hadoop training and certificationHadoop training and certification
Hadoop training and certificationMagnific Trainings
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...Dataiku
 
Hadoop course content @ a1 trainingss
Hadoop course content @ a1 trainingssHadoop course content @ a1 trainingss
Hadoop course content @ a1 trainingssA1 Trainings
 

Mais procurados (18)

Hadoop big data online training
Hadoop big data online trainingHadoop big data online training
Hadoop big data online training
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Bigdata slide
Bigdata slideBigdata slide
Bigdata slide
 
Hadoop online training usa
Hadoop online training usaHadoop online training usa
Hadoop online training usa
 
Hadoop training australia
Hadoop training australiaHadoop training australia
Hadoop training australia
 
Hadppo training
Hadppo trainingHadppo training
Hadppo training
 
Hadoop training and certification
Hadoop training and certificationHadoop training and certification
Hadoop training and certification
 
Hadoop
HadoopHadoop
Hadoop
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Big data developer training
Big data developer trainingBig data developer training
Big data developer training
 
MahoutNew
MahoutNewMahoutNew
MahoutNew
 
Spark
SparkSpark
Spark
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
 
Bigdata training
Bigdata trainingBigdata training
Bigdata training
 
Hadoop training in usa
Hadoop training in usaHadoop training in usa
Hadoop training in usa
 
Hadoop online training usa
Hadoop online training usaHadoop online training usa
Hadoop online training usa
 
Hadoop course content @ a1 trainingss
Hadoop course content @ a1 trainingssHadoop course content @ a1 trainingss
Hadoop course content @ a1 trainingss
 
Big data training australia
Big data training australiaBig data training australia
Big data training australia
 

Destaque

Airspace configuration using_air_traffic_complexity_metrics
Airspace configuration using_air_traffic_complexity_metricsAirspace configuration using_air_traffic_complexity_metrics
Airspace configuration using_air_traffic_complexity_metricsxiaofeng007
 
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...Mohamed Zaki
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageOntotext
 
[2014년 3월 25일] mining minds 빅 데이터, 욕망을 읽다
[2014년 3월 25일] mining minds   빅 데이터, 욕망을 읽다[2014년 3월 25일] mining minds   빅 데이터, 욕망을 읽다
[2014년 3월 25일] mining minds 빅 데이터, 욕망을 읽다gilforum
 
Kth daisy 추천솔루션_20130509_v1.0_이호철
Kth daisy 추천솔루션_20130509_v1.0_이호철Kth daisy 추천솔루션_20130509_v1.0_이호철
Kth daisy 추천솔루션_20130509_v1.0_이호철HoChul Lee
 
Dm ml study_roadmap
Dm ml study_roadmapDm ml study_roadmap
Dm ml study_roadmapKang Pilsung
 
Data Mining with R CH1 요약
Data Mining with R CH1 요약Data Mining with R CH1 요약
Data Mining with R CH1 요약Sung Yub Kim
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingOntotext
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoMatthew (정재화)
 
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사uEngine Solutions
 
Text data mining1
Text data mining1Text data mining1
Text data mining1KU Leuven
 
집단지성 프로그래밍 01-데이터마이닝 개요
집단지성 프로그래밍 01-데이터마이닝 개요집단지성 프로그래밍 01-데이터마이닝 개요
집단지성 프로그래밍 01-데이터마이닝 개요Kwang Woo NAM
 
마인즈랩 회사소개서 V2.3_한국어버전
마인즈랩 회사소개서 V2.3_한국어버전마인즈랩 회사소개서 V2.3_한국어버전
마인즈랩 회사소개서 V2.3_한국어버전Taejoon Yoo
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Jeffrey Breen
 
Elastic Web Mining
Elastic Web MiningElastic Web Mining
Elastic Web MiningKen Krugler
 

Destaque (17)

Airspace configuration using_air_traffic_complexity_metrics
Airspace configuration using_air_traffic_complexity_metricsAirspace configuration using_air_traffic_complexity_metrics
Airspace configuration using_air_traffic_complexity_metrics
 
Big data concept
Big data conceptBig data concept
Big data concept
 
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
 
[2014년 3월 25일] mining minds 빅 데이터, 욕망을 읽다
[2014년 3월 25일] mining minds   빅 데이터, 욕망을 읽다[2014년 3월 25일] mining minds   빅 데이터, 욕망을 읽다
[2014년 3월 25일] mining minds 빅 데이터, 욕망을 읽다
 
Kth daisy 추천솔루션_20130509_v1.0_이호철
Kth daisy 추천솔루션_20130509_v1.0_이호철Kth daisy 추천솔루션_20130509_v1.0_이호철
Kth daisy 추천솔루션_20130509_v1.0_이호철
 
Text mining
Text miningText mining
Text mining
 
Dm ml study_roadmap
Dm ml study_roadmapDm ml study_roadmap
Dm ml study_roadmap
 
Data Mining with R CH1 요약
Data Mining with R CH1 요약Data Mining with R CH1 요약
Data Mining with R CH1 요약
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining Processing
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with Tajo
 
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
 
집단지성 프로그래밍 01-데이터마이닝 개요
집단지성 프로그래밍 01-데이터마이닝 개요집단지성 프로그래밍 01-데이터마이닝 개요
집단지성 프로그래밍 01-데이터마이닝 개요
 
마인즈랩 회사소개서 V2.3_한국어버전
마인즈랩 회사소개서 V2.3_한국어버전마인즈랩 회사소개서 V2.3_한국어버전
마인즈랩 회사소개서 V2.3_한국어버전
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
 
Elastic Web Mining
Elastic Web MiningElastic Web Mining
Elastic Web Mining
 

Semelhante a Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Michael Cutler - British Sky Broadcasting

5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game ChangerCaserta
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
How and why you need to build a big data lab
How and why you need to build a big data labHow and why you need to build a big data lab
How and why you need to build a big data labChris Kernaghan
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Big data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup GroupBig data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup GroupSri Kanajan
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoBitly
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedInkbajda
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedDunn Solutions Group
 
Data Scientist Toolbox
Data Scientist ToolboxData Scientist Toolbox
Data Scientist ToolboxAndrei Savu
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamRaymond Tay
 

Semelhante a Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Michael Cutler - British Sky Broadcasting (20)

Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Machine Learning & Apache Mahout
Machine Learning & Apache MahoutMachine Learning & Apache Mahout
Machine Learning & Apache Mahout
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Hands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop EcosystemHands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop Ecosystem
 
Machine learninginspark
Machine learninginsparkMachine learninginspark
Machine learninginspark
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
How and why you need to build a big data lab
How and why you need to build a big data labHow and why you need to build a big data lab
How and why you need to build a big data lab
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Big data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup GroupBig data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup Group
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
 
Cassandra eu
Cassandra euCassandra eu
Cassandra eu
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They Need
 
Data Scientist Toolbox
Data Scientist ToolboxData Scientist Toolbox
Data Scientist Toolbox
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beam
 

Mais de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mais de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Último

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Michael Cutler - British Sky Broadcasting

  • 1. Data Mining in Hadoop, Making Sense of it in Mahout! Hadoop World 2011 Michael Cutler @cotdp
  • 2. Hello (Hadoop) World! • Senior Research Engineer • British Sky Broadcasting • Lead the Hadoop initiative • Fostering development
  • 3. Topics • What is Data Mining? • Introducing Mahout • Using Mahout • Demo • Summary • Q&A
  • 4. What is Data Mining?
  • 5. It’s all about discovery... • Grouping similar data records • Identifying unusual records • Detecting relationships between records • Discovering previously unknown patterns
  • 6. Trends... • 1990’s approach; “Think carefully first and get it right!” • 2000’s approach; “Think a little first, evolve it later...” • 2010’s approach; “... if we capture everything, sense will come(?)”
  • 8. Other Reasons... • Increased generation of data • Complex interconnected datasets • You can be lazy about it... Consequence: – More data to process than ever before
  • 9. Traditional Approach... • Collate your data into files • 6pm take your Database offline • Bulk load the previous 24hrs data • Run data mining, analytics, reporting overnight • Bring the database back up for 9am
  • 10. Modern Approach • Stream data straight into Hadoop • No need for downtime • Analysis updated periodically or real-time • Scalable approach
  • 12. What is it? Library of scalable machine learning algorithms; • Classification • Clustering • Collaborative Filtering (Recommendations) • Frequent Pattern mining ... and many more
  • 13. How do you use it? • It’s just a Java library • Simple to get started • Easy to extend and enhance • Powerful command-line tools & examples
  • 14. Classification • Labels input data with one or more categories • Trained with known data
  • 15. Clustering • Groups data based on their similarity • Unsupervised – no training
  • 16. Collaborative Filtering • User-based recommendations – Analyse user data – Build neighbourhoods of users – Other people like you, liked <these> • Item-based recommendations – Analyse domain data – Build relationships between items – If you liked this, what about <these>
  • 17. Others • Frequent Pattern mining • High performance maths & utilities
  • 18. Mahout is a toolbox • Understand your data • Determine what needs to be done • Build a pipeline to compute results • Think about performance from the start
  • 19. Please Note • Scalability through Map/Reduce jobs • Like MR it is inherently Batch-driven • Some are not implemented in MR yet • Fast-paced development
  • 21. Building a Recommender Objectives: • Personalised • Item-based recommendations • Evolve with the times • Implicit feedback through measurement
  • 22. Problems with Recommenders • “Cold start” problem • “New stuff” problem • Tainted profiles • Stale profile data
  • 23. When they go wrong...
  • 24. Basic Strategy • Pre-compute rarely-changing data • Cache and serve them using traditional means • Flag data when it needs refreshed • Tailor the cache on-the-fly
  • 25. Demo
  • 26. Summary • Mahout is exciting! • Wide range of applications • Scalable algorithms • Scalable community
  • 28. Thank you! Hadoop World 2011 Michael Cutler @cotdp

Notas do Editor

  1. Clustering Outliers AssociationIt’s not new, we’ve been doing it manually for years
  2. So why has it changed?
  3. 1990 ~ $10,0002000 ~ $102010 ~ $0.10Currently 5 cents per GB
  4. - It’s easier than ever before to generate or collect data- Complexity has increased- Storage and processing power is relatively cheap
  5. Call data records, web logs etc.Rinse, RepeatProblem is as the volume of data has grown you need to go about it in a better way
  6. Files,Hbase etc. Dashboards
  7. Collaborative filtering for user-based and item-based recommendations Various clustering algorithms
  8. Two JAR’s “core” and “math”Basic implementations for everythingYou can string together many use-cases just using the examples and CLI
  9. Examples:Detecting spam emailOptical character recognition
  10. You feed in the dataGive it a similarity metricSet a limit on the number of clusters
  11. Colors Blue &amp; Red appear together three timesPurple, Orange and Green appear only twice
  12. How do you recommend to users you know nothing about If nobody has stumbled onto it, how do you recommend it? Outlier behaviour skewing results Tastes can change over time or seasonally
  13. On the face of it, the fact it recommended SAW based on a Kids movie just means that parents are likely to watch SAW
  14. Item-to-item relationships rarely changeHistorical data and trends rarely changeEasy to compute for new items