Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014
1. © 2014 RichRelevance, Inc. All Rights Reserved. Confidential.
Sukhendu Chakraborty
DataMesh Team @ {rr}
Big Data Analytics made easy
using Apache Hive to R Connector
Did You Know?
Our cloud-based platform supports both real-time processes
and analytical use cases, using technologies such as
Crunch, Hive, HBase, Avro, Kafka, and R
Someone clicks on a {rr} recommendation
every 21 milliseconds
Our data capacity includes a 1.5 PB Hadoop infrastructure,
which enables us to employ 100+ algorithms in real time
In the US, we serve 7,000 requests per second with an average
response time of 50 ms
What is R?
• A letter in the English alphabet
• An open-source statistical language for
data analytics
– Simple: Easy to install and program
– Popular: One of the most widely used
open-source statistical tools
– Powerful: Rich set of packages (> 4000) to
perform statistical analysis and plotting
– More info: http://cran.us.r-project.org/
But…
• Performance issues
– Typically single threaded
– All the data needs to be in memory
– Not scalable
• Need to know the internals to make it
perform well
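To make the in-memory constraint concrete, here is a rough sketch that measures one million doubles and scales the figure to a billion values. The sizes are illustrative assumptions, not numbers from the talk.

```r
# One million doubles occupy roughly 8 MB, all of it held in memory.
n <- 1e6
x <- numeric(n)
bytes <- as.numeric(utils::object.size(x))
mb_per_million <- bytes / 1024^2

# Scaling the per-value cost to a billion values (an illustrative size)
# shows how quickly a single column outgrows one machine's RAM.
estimated_gb_for_billion <- (bytes / n) * 1e9 / 1024^3

cat(sprintf("%.1f MB per million values\n", mb_per_million))
cat(sprintf("~%.1f GB for a billion values\n", estimated_gb_for_billion))
```

The estimate covers a single numeric vector only; intermediate copies made during analysis add to it.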
What’s out there
• RHadoop/rmr
– Uses Hadoop MR to distribute data in the Hadoop cluster
– No transparency: Limited data preparation support
• RHIPE
– Similar to RHadoop
– Protobuf dependency
• RHive
– Lets you run Hive queries from R functions
– Users need to know HQL
– Needs Rserve + rJava
R @ {rr} - so far
[Diagram: {rr} cluster, R client; labels: Hive queries, Data access]
R Hive connector
• Transparency layer
• Pluggable query generation
• R as an analytical platform
– Data cleanup
– Ad-hoc analytics
– Data preparation
– Distributed analytics using Hadoop
– Result summarization and publishing
[Diagram labels: Hive (UC 1), MR (UC 2)]
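One way to picture a transparency layer with pluggable query generation is sketched below. This is not the connector's actual code: the function names (`hive_generator`, `mr_generator`, `fetch_plan`) and the `clicks` table are hypothetical, and the generators only build strings rather than talk to a cluster.

```r
# Hypothetical generator for the Hive use case: builds a HiveQL string.
hive_generator <- function(table, columns, limit = NULL) {
  sql <- paste0("SELECT ", paste(columns, collapse = ", "), " FROM ", table)
  if (!is.null(limit)) {
    sql <- paste0(sql, " LIMIT ", as.integer(limit))
  }
  sql
}

# Hypothetical stand-in for the MR use case: it only describes the job.
mr_generator <- function(table, columns, limit = NULL) {
  paste0("MR job: project [", paste(columns, collapse = ", "), "] from ", table)
}

# The transparency layer: callers name a table and columns in plain R
# and never write a query; the generator argument is the plug-in point.
fetch_plan <- function(table, columns, limit = NULL, generator = hive_generator) {
  generator(table, columns, limit)
}

hive_plan <- fetch_plan("clicks", c("user_id", "item_id"), limit = 10)
mr_plan <- fetch_plan("clicks", c("user_id", "item_id"), generator = mr_generator)

print(hive_plan)
print(mr_plan)
```

Passing the generator as an argument keeps the calling code identical across backends, which is the point of hiding HQL from the analyst.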
OO programming in R
• S4 class system - classes and objects
• Methods and multiple dispatch
• Object validity checking
• Extensible: setGeneric()
• Quick overview: http://www.r-project.org/conferences/useR-2004/Keynotes/Leisch.pdf
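The S4 features listed above can be sketched in a few lines. The `HiveTable` class, the `toQuery` generic, and the `clicks` table are hypothetical names chosen for illustration; only `setClass`, `setGeneric`, `setMethod`, and `new` come from R's methods package.

```r
library(methods)

# Hypothetical S4 class describing a table; validity runs when new() is called.
setClass(
  "HiveTable",
  representation(name = "character", columns = "character"),
  validity = function(object) {
    if (length(object@name) != 1L || !nzchar(object@name)) {
      return("'name' must be a single non-empty string")
    }
    if (length(object@columns) == 0L) {
      return("'columns' must list at least one column")
    }
    TRUE
  }
)

# A new generic; other classes can register their own methods later.
setGeneric("toQuery", function(source, filter) standardGeneric("toQuery"))

# Multiple dispatch: the method is chosen from the classes of both arguments.
setMethod("toQuery", signature("HiveTable", "missing"), function(source, filter) {
  paste0("SELECT ", paste(source@columns, collapse = ", "), " FROM ", source@name)
})

setMethod("toQuery", signature("HiveTable", "character"), function(source, filter) {
  paste0(toQuery(source), " WHERE ", filter)
})

clicks <- new("HiveTable", name = "clicks", columns = c("user_id", "item_id"))
q1 <- toQuery(clicks)
q2 <- toQuery(clicks, "item_id = 42")

# Validity checking rejects a table with no columns.
bad <- tryCatch(
  new("HiveTable", name = "clicks", columns = character(0)),
  error = function(e) "invalid"
)
```

Dispatching on a `missing` second argument lets one generic cover both the filtered and unfiltered call without optional-argument checks inside the method.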
R @ {rr}
[Diagram: {rr} cluster, R client; labels: R Hive connector, Data access]
Future Work
• Extend the connector to handle other data
sources
• Add custom analytical functions
• Asynchronous execution
• Performance tuning
Editor's Notes: Nuggets or Data Points
1.5 PB is not as big as Yahoo or Facebook, but it is huge from a retail industry perspective