© 2014 RichRelevance, Inc. All Rights Reserved. Confidential.
Sukhendu Chakraborty
DataMesh Team @ {rr}
Big Data Analytics made easy
using Apache Hive to R Connector
Did You Know?
• Our cloud-based platform supports both real-time processes and analytical use cases, using technologies such as Crunch, Hive, HBase, Avro, Kafka, and R
• Someone clicks on a {rr} recommendation every 21 milliseconds
• Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real time
• In the US, we serve 7,000 requests per second with an average response time of 50 ms
What is R?
• A letter in the English alphabet
• An open-source statistical language for data analytics
– Simple: Easy to install and program
– Popular: One of the most widely used open-source statistical tools
– Powerful: Rich set of packages (> 4000) to perform statistical analysis and plotting
– More info: http://cran.us.r-project.org/
But…
• Performance issues
– Typically single-threaded
– All the data needs to be in memory
– Not scalable
• Need to know the internals to make it perform well
What’s out there
• RHadoop/rmr
– Uses Hadoop MapReduce to distribute data in the Hadoop cluster
– No transparency: Limited data preparation support
• RHIPE
– Similar to RHadoop
– Protobuf dependency
• RHive
– Lets you run Hive queries from R functions
– Users need to know HiveQL (HQL)
– Needs Rserve + rJava
R @ {rr} - so far
[Diagram: {rr} cluster and R client, connected by Hive queries for data access]
• Transparency Layer
• Pluggable Query generation
• R as an analytical platform
– Data cleanup
– Ad-hoc analytics
– Data preparation
– Distributed analytics using Hadoop
– Result summarization and publishing
[Diagram: R Hive connector, targeting Hive (Use Case 1) and MapReduce (Use Case 2)]
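One way to picture the transparency layer and pluggable query generation on this slide is a registry of backend-specific generators behind a single entry point. The sketch below is written in Python rather than R, purely for illustration, and it is not the connector's actual code: `Request`, `register`, `generate`, and the `clicks` table are hypothetical names, and only a Hive generator is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Request:
    """Backend-neutral description of what the analyst wants (hypothetical)."""
    table: str
    columns: List[str]
    where: str = ""


# Registry of pluggable generators: backend name -> function that builds a query.
GENERATORS: Dict[str, Callable[[Request], str]] = {}


def register(backend: str):
    """Decorator that plugs a generator in under the given backend name."""
    def decorator(fn: Callable[[Request], str]) -> Callable[[Request], str]:
        GENERATORS[backend] = fn
        return fn
    return decorator


@register("hive")
def hive_select(req: Request) -> str:
    """Translate the request into a HiveQL SELECT statement."""
    sql = f"SELECT {', '.join(req.columns)} FROM {req.table}"
    if req.where:
        sql += f" WHERE {req.where}"
    return sql


def generate(backend: str, req: Request) -> str:
    """Single entry point: the caller never writes backend-specific text."""
    if backend not in GENERATORS:
        raise ValueError(f"no query generator registered for {backend!r}")
    return GENERATORS[backend](req)


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    print(generate("hive", Request("clicks", ["user_id", "item_id"], "dt = '2014-05-01'")))
```

Under this design, a second backend (for example a MapReduce job builder) could be added by registering another function, without changing the calling code.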
OO programming in R
• S4 class system - classes and objects
• Methods and multiple dispatch
• Object validity checking
• Extensible: setGeneric()
• Quick overview: http://www.r-project.org/conferences/useR-2004/Keynotes/Leisch.pdf
Use Case I:
Rollups in Hive
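The slides for this use case survive only as images, so the following is a rough illustration rather than the connector's behavior. Assuming "rollup" here means a grouped aggregation pushed down to Hive, the Python sketch builds the HiveQL from a list of dimensions and a mapping of measures, so the analyst does not write the query by hand. The function name `rollup_query` and the `clicks` table with its columns are hypothetical.

```python
from typing import Dict, List


def rollup_query(table: str, dimensions: List[str], measures: Dict[str, str]) -> str:
    """Build a HiveQL grouped aggregation.

    `dimensions` are the GROUP BY columns; `measures` maps an output alias
    to an aggregate expression such as "COUNT(*)" or "AVG(price)".
    """
    if not dimensions or not measures:
        raise ValueError("need at least one dimension and one measure")
    dims = ", ".join(dimensions)
    aggs = ", ".join(f"{expr} AS {alias}" for alias, expr in measures.items())
    return f"SELECT {dims}, {aggs} FROM {table} GROUP BY {dims}"


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    print(rollup_query("clicks", ["site", "dt"],
                       {"n_clicks": "COUNT(*)", "avg_price": "AVG(price)"}))
```

Only the small aggregated result would then need to travel back to the R client, which sidesteps the in-memory limit noted earlier in the deck.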
Use Case II:
Distributed Analytics
R @ {rr}
[Diagram: {rr} cluster and R client, connected through the R Hive connector for data access]
Future Work
• Extend the connector to handle other data sources
• Add custom analytical functions
• Asynchronous execution
• Performance tuning
Thank You
Questions?

Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014


Editor's Notes

  1. Nuggets or data points: 1.5 PB is not as big as Yahoo or Facebook, but it is huge from a retail-industry perspective.