SlideShare a Scribd company logo
1 of 27
Download to read offline
HBase at Mendeley
Dan Harvey
Data Mining Engineer
dan.harvey@mendeley.com
Overview
➔ What is Mendeley
➔ Why we chose HBase
➔ How we're using HBase
➔ Challenges
Mendeley helps researchers work smarter
Mendeley extracts
research data..
Install
Mendeley Desktop
Mendeley helps researchers work smarter
..and aggregates research
data in the cloud
Mendeley extracts
research data..
Mendeley helps researchers work smarter
Mendeley in numbers
➔ 600,000+ users
➔ 50+ million user documents
➔ Since January 2009
➔ 30 million unique documents
➔ De-duplicated from user and other imports
➔ 5TB of papers
Data Mining Team
➔ Catalogue
➔ Importing
➔ Web Crawling
➔ De-duplication
➔ Statistics
➔ Related and recommended research
➔ Search
Starting off
➔ Users data in MySQL
➔ Normalised document tables
➔ Quite a few joins..
➔ Stuck with MySQL for data mining
➔ Clustering and de-duplication
➔ Got us to launch the article pages
But..
➔ Re-process everything often
➔ Algorithms with global counts
➔ Modifying algorithms affect everything
➔ Iterating over tables was slow
➔ Could not easily scale processing
➔ Needed to shard for more documents
➔ Daily stats took > 24h to process...
What we needed
➔ Scale to 100s of millions of documents
➔ ~80 million papers
➔ ~120 million books
➔ ~2-3 billion references
➔ More projects using data and processing
➔ Update the data more often
➔ Rapidly prototype and develop
➔ Cost effective
So much choice..
But they mostly miss out good scalable processing.
And many more...
HBase and Hadoop
➔ Scalable storage
➔ Scalable processing
➔ Designed to work with map reduce
➔ Fast scans
➔ Incremental updates
➔ Flexible schema
Where HBase fits in
How we store data
➔ Mostly documents
➔ Column Families for different data
➔ Metadata / raw pdf files
➔ More efficient scans
➔ Protocol Buffers for metadata
➔ Easy to manage 100+ fields
➔ Faster serialisation
Example Schema
Row Column family Qualifier
sha1_hash metadata document
date_added
date_modified
source
content pdf
full_text
entity_extraction
canonical_id version_live
● All data for documents in one table
How we process data
➔ Java Map Reduce
➔ More control over data flows
➔ Allows us to do more complex work
➔ Pig
➔ Don't have to think in map reduce
➔ Twitter's Elephant Bird decodes protocol buffers
➔ Enables rapid prototyping
➔ Less efficient than using java map reduce
➔ Quick example...
Example
➔ Trending keywords over time
➔ For a give keyword, how many documents per year?
➔ Multiple map/reduce tasks
➔ 100s of line of java...
Pig Example
-- Load the document bag
rawDocs = LOAD 'hbase://canonical_documents'
USING HbaseLoader('metadata:document')
AS (protodoc);
-- De-serialise protocol buffer
docs = FOREACH rawDocs GENERATE
DocumentProtobufBytesToTuple(protodoc)AS doc;
-- Get keyword, year tuples
tagYear = FOREACH docs GENERATE
FLATTEN (doc.(year, keywords_bag))
AS keyword, doc::year AS year;
-- Group unique (keyword, year) tuples
yearTag = GROUP tagYear BY (keyword, year);
-- Create (keyword, year, count) tuples
yearTagCount = FOREACH yearTag GENERATE
FLATTEN(group) AS (keyword, year),
COUNT(tagYear) AS count;
-- Group the counts by keyword
tagYearCounts = GROUP yearTagCount BY keyword;
-- Group the counts by keyword
tagYearCounts = FOREACH tagYearCounts GENERATE
group AS keyword,
yearTagCount.(year, count) AS years;
STORE tagYearCounts INTO 'tag_year_counts';
Challenges
➔ MySQL hard to export from
➔ Many joins slow things down
➔ Don't normalise if you don't have to!
➔ HBase needs memory
➔ Stability issues if you give it too little
Challenges: Hardware
➔ Knowing where to start is hard...
➔ 2x quad core Intel cpu
➔ 4x 1TB disks
➔ Memory
➔ Started with 8GB, then 16GB
➔ Upgrading to 24GB soon
➔ Currently 15 nodes
www.mendeley.com

More Related Content

What's hot

Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaSpark Summit
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Joydeep Sen Sarma
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Data Con LA
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack
 
Running Cassandra on Amazon EC2
Running Cassandra on Amazon EC2Running Cassandra on Amazon EC2
Running Cassandra on Amazon EC2Dave Gardner
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...DataStax
 
HBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at XiaomiHBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at XiaomiMichael Stack
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Hadoop at ayasdi
Hadoop at ayasdiHadoop at ayasdi
Hadoop at ayasdiMohit Jaggi
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayDataStax Academy
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalabilityjbellis
 
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...DataStax
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsDataStax
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 

What's hot (20)

Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 
Running Cassandra on Amazon EC2
Running Cassandra on Amazon EC2Running Cassandra on Amazon EC2
Running Cassandra on Amazon EC2
 
Cloud Optimized Big Data
Cloud Optimized Big DataCloud Optimized Big Data
Cloud Optimized Big Data
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
HBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at XiaomiHBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at Xiaomi
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Hadoop at ayasdi
Hadoop at ayasdiHadoop at ayasdi
Hadoop at ayasdi
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalability
 
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
 
Hadoop on-mesos
Hadoop on-mesosHadoop on-mesos
Hadoop on-mesos
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 

Viewers also liked

Google Apps and Plagiarism
Google Apps and PlagiarismGoogle Apps and Plagiarism
Google Apps and PlagiarismJon Corippo
 
ISTC 201 - Plagiarism and Proper Citation
ISTC 201 - Plagiarism and Proper CitationISTC 201 - Plagiarism and Proper Citation
ISTC 201 - Plagiarism and Proper CitationLaksamee Putnam
 
Google analytics ppt
Google analytics pptGoogle analytics ppt
Google analytics pptmaddinpiya
 
5 Fantasy Google Translator
5 Fantasy Google Translator5 Fantasy Google Translator
5 Fantasy Google TranslatorJing-mei Huang
 
Project Voldemort: Big data loading
Project Voldemort: Big data loadingProject Voldemort: Big data loading
Project Voldemort: Big data loadingDan Harvey
 
How to set up campaign in google adwords by Tanuja Talekar
How to set up campaign in google adwords by Tanuja TalekarHow to set up campaign in google adwords by Tanuja Talekar
How to set up campaign in google adwords by Tanuja TalekarTanuja Talekar
 
Scientific writing pro : Office word & Mendeley (dani r firman)
Scientific writing pro : Office word & Mendeley (dani r firman)Scientific writing pro : Office word & Mendeley (dani r firman)
Scientific writing pro : Office word & Mendeley (dani r firman)Dani Firman
 
Webmaster tool by Neha Nayak
Webmaster tool by Neha NayakWebmaster tool by Neha Nayak
Webmaster tool by Neha NayakNeha Nayak
 
Google analytics by Neha Nayak
Google analytics by Neha NayakGoogle analytics by Neha Nayak
Google analytics by Neha NayakNeha Nayak
 
Top 10 Google Analytics Reports
Top 10 Google Analytics ReportsTop 10 Google Analytics Reports
Top 10 Google Analytics ReportsSally Falkow
 
Google Analytics 101 for Business - How to Get Started With Google Analytics
Google Analytics 101 for Business - How to Get Started With Google AnalyticsGoogle Analytics 101 for Business - How to Get Started With Google Analytics
Google Analytics 101 for Business - How to Get Started With Google AnalyticsJeff Sauer
 
An introduction to Google Analytics
An introduction to Google AnalyticsAn introduction to Google Analytics
An introduction to Google AnalyticsJoris Roebben
 
Google Analytics 101 | 2015
Google Analytics 101 |  2015Google Analytics 101 |  2015
Google Analytics 101 | 2015Insivia
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
APAC Big Data Strategy RadhaKrishna Hiremane
APAC Big Data  Strategy RadhaKrishna  HiremaneAPAC Big Data  Strategy RadhaKrishna  Hiremane
APAC Big Data Strategy RadhaKrishna HiremaneIntelAPAC
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And HdfsCloudera, Inc.
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleDataWorks Summit
 

Viewers also liked (20)

thesis-despoina
thesis-despoinathesis-despoina
thesis-despoina
 
Plagcitation fa2012
Plagcitation fa2012Plagcitation fa2012
Plagcitation fa2012
 
Google Apps and Plagiarism
Google Apps and PlagiarismGoogle Apps and Plagiarism
Google Apps and Plagiarism
 
ISTC 201 - Plagiarism and Proper Citation
ISTC 201 - Plagiarism and Proper CitationISTC 201 - Plagiarism and Proper Citation
ISTC 201 - Plagiarism and Proper Citation
 
Google analytics ppt
Google analytics pptGoogle analytics ppt
Google analytics ppt
 
5 Fantasy Google Translator
5 Fantasy Google Translator5 Fantasy Google Translator
5 Fantasy Google Translator
 
Project Voldemort: Big data loading
Project Voldemort: Big data loadingProject Voldemort: Big data loading
Project Voldemort: Big data loading
 
How to set up campaign in google adwords by Tanuja Talekar
How to set up campaign in google adwords by Tanuja TalekarHow to set up campaign in google adwords by Tanuja Talekar
How to set up campaign in google adwords by Tanuja Talekar
 
Scientific writing pro : Office word & Mendeley (dani r firman)
Scientific writing pro : Office word & Mendeley (dani r firman)Scientific writing pro : Office word & Mendeley (dani r firman)
Scientific writing pro : Office word & Mendeley (dani r firman)
 
Webmaster tool by Neha Nayak
Webmaster tool by Neha NayakWebmaster tool by Neha Nayak
Webmaster tool by Neha Nayak
 
Google Analytics Overview
Google Analytics OverviewGoogle Analytics Overview
Google Analytics Overview
 
Google analytics by Neha Nayak
Google analytics by Neha NayakGoogle analytics by Neha Nayak
Google analytics by Neha Nayak
 
Top 10 Google Analytics Reports
Top 10 Google Analytics ReportsTop 10 Google Analytics Reports
Top 10 Google Analytics Reports
 
Google Analytics 101 for Business - How to Get Started With Google Analytics
Google Analytics 101 for Business - How to Get Started With Google AnalyticsGoogle Analytics 101 for Business - How to Get Started With Google Analytics
Google Analytics 101 for Business - How to Get Started With Google Analytics
 
An introduction to Google Analytics
An introduction to Google AnalyticsAn introduction to Google Analytics
An introduction to Google Analytics
 
Google Analytics 101 | 2015
Google Analytics 101 |  2015Google Analytics 101 |  2015
Google Analytics 101 | 2015
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
APAC Big Data Strategy RadhaKrishna Hiremane
APAC Big Data  Strategy RadhaKrishna  HiremaneAPAC Big Data  Strategy RadhaKrishna  Hiremane
APAC Big Data Strategy RadhaKrishna Hiremane
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 

Similar to HBase at Mendeley

NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeNDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeTom Kerkhove
 
NDC Sydney - Analyzing StackExchange with Azure Data Lake
NDC Sydney - Analyzing StackExchange with Azure Data LakeNDC Sydney - Analyzing StackExchange with Azure Data Lake
NDC Sydney - Analyzing StackExchange with Azure Data LakeTom Kerkhove
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache HadoopHortonworks
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataTrieu Nguyen
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014spinningmatt
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series DatabasePramit Choudhary
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connectorDenny Lee
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 

Similar to HBase at Mendeley (20)

NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeNDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
 
NDC Sydney - Analyzing StackExchange with Azure Data Lake
NDC Sydney - Analyzing StackExchange with Azure Data LakeNDC Sydney - Analyzing StackExchange with Azure Data Lake
NDC Sydney - Analyzing StackExchange with Azure Data Lake
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
מיכאל
מיכאלמיכאל
מיכאל
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big data
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connector
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 

Recently uploaded

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Recently uploaded (20)

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

HBase at Mendeley

  • 1. HBase at Mendeley Dan Harvey Data Mining Engineer dan.harvey@mendeley.com
  • 2. Overview ➔ What is Mendeley ➔ Why we chose HBase ➔ How we're using HBase ➔ Challenges
  • 4. Mendeley extracts research data.. Install Mendeley Desktop Mendeley helps researchers work smarter
  • 5. ..and aggregates research data in the cloud Mendeley extracts research data.. Mendeley helps researchers work smarter
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. Mendeley in numbers ➔ 600,000+ users ➔ 50+ million user documents ➔ Since January 2009 ➔ 30 million unique documents ➔ De-duplicated from user and other imports ➔ 5TB of papers
  • 12. Data Mining Team ➔ Catalogue ➔ Importing ➔ Web Crawling ➔ De-duplication ➔ Statistics ➔ Related and recommended research ➔ Search
  • 13. Starting off ➔ Users data in MySQL ➔ Normalised document tables ➔ Quite a few joins.. ➔ Stuck with MySQL for data mining ➔ Clustering and de-duplication ➔ Got us to launch the article pages
  • 14. But.. ➔ Re-process everything often ➔ Algorithms with global counts ➔ Modifying algorithms affect everything ➔ Iterating over tables was slow ➔ Could not easily scale processing ➔ Needed to shard for more documents ➔ Daily stats took > 24h to process...
  • 15. What we needed ➔ Scale to 100s of millions of documents ➔ ~80 million papers ➔ ~120 million books ➔ ~2-3 billion references ➔ More projects using data and processing ➔ Update the data more often ➔ Rapidly prototype and develop ➔ Cost effective
  • 16. So much choice.. But they mostly miss out good scalable processing. And many more...
  • 17. HBase and Hadoop ➔ Scalable storage ➔ Scalable processing ➔ Designed to work with map reduce ➔ Fast scans ➔ Incremental updates ➔ Flexible schema
  • 19. How we store data ➔ Mostly documents ➔ Column Families for different data ➔ Metadata / raw pdf files ➔ More efficient scans ➔ Protocol Buffers for metadata ➔ Easy to manage 100+ fields ➔ Faster serialisation
  • 20. Example Schema Row Column family Qualifier sha1_hash metadata document date_added date_modified source content pdf full_text entity_extraction canonical_id version_live ● All data for documents in one table
  • 21. How we process data ➔ Java Map Reduce ➔ More control over data flows ➔ Allows us to do more complex work ➔ Pig ➔ Don't have to think in map reduce ➔ Twitter's Elephant Bird decodes protocol buffers ➔ Enables rapid prototyping ➔ Less efficient than using java map reduce ➔ Quick example...
  • 22. Example ➔ Trending keywords over time ➔ For a give keyword, how many documents per year? ➔ Multiple map/reduce tasks ➔ 100s of line of java...
  • 23. Pig Example -- Load the document bag rawDocs = LOAD 'hbase://canonical_documents' USING HbaseLoader('metadata:document') AS (protodoc); -- De-serialise protocol buffer docs = FOREACH rawDocs GENERATE DocumentProtobufBytesToTuple(protodoc)AS doc; -- Get keyword, year tuples tagYear = FOREACH docs GENERATE FLATTEN (doc.(year, keywords_bag)) AS keyword, doc::year AS year;
  • 24. -- Group unique (keyword, year) tuples yearTag = GROUP tagYear BY (keyword, year); -- Create (keyword, year, count) tuples yearTagCount = FOREACH yearTag GENERATE FLATTEN(group) AS (keyword, year), COUNT(tagYear) AS count; -- Group the counts by keyword tagYearCounts = GROUP yearTagCount BY keyword; -- Group the counts by keyword tagYearCounts = FOREACH tagYearCounts GENERATE group AS keyword, yearTagCount.(year, count) AS years; STORE tagYearCounts INTO 'tag_year_counts';
  • 25. Challenges ➔ MySQL hard to export from ➔ Many joins slow things down ➔ Don't normalise if you don't have to! ➔ HBase needs memory ➔ Stability issues if you give it too little
  • 26. Challenges: Hardware ➔ Knowing where to start is hard... ➔ 2x quad core Intel cpu ➔ 4x 1TB disks ➔ Memory ➔ Started with 8GB, then 16GB ➔ Upgrading to 24GB soon ➔ Currently 15 nodes