Cloud Elephants and Witches: A Big Data Tale from Mendeley

Kris Jack, PhD
Data Mining Team Lead

Overview

➔
    What's Mendeley?

➔
    The curse that comes with success

➔
    A framework for scaling up (Hadoop + MapReduce)

➔
    Moving to the cloud (AWS)

➔
    Conclusions
What's Mendeley?
What is Mendeley?


...a large data technology startup company

...and it's on a mission to change the way that research is done!
Mendeley          Last.fm

Last.fm works like this:

1) Install "Audioscrobbler"
2) Listen to music
3) Last.fm builds your music profile and recommends music you might also like... and it's the world's biggest open music database
Last.fm                          Mendeley

music libraries                  research libraries
artists                          researchers
songs                            papers
genres                           disciplines
Mendeley provides tools to help users...

➔
    ...organise their research
➔
    ...collaborate with one another
➔
    ...discover new research
The curse that comes with success
In the beginning, there was...

➔
    MySQL:
      ➔
        Normalised tables for storing and serving:
        ➔
          User data
        ➔
          Article data
    ➔
      The system was happy


➔
    With this, we launched
    the article catalogue
    ➔
      Lots of number crunching
    ➔
      Many joins for basic stats
Here's where the curse of success comes in

➔
  More articles came
➔
  More users came


➔
    The system became unhappy


➔
    Keeping data fresh was a burden
    ➔
      Algorithms relied on global counts
    ➔
      Iterating over tables was slow
    ➔
      Needed to shard tables to grow catalogue

➔
    In short, our system didn't scale
1.6 million+ users; the 20 largest userbases:

    University of Cambridge
    Stanford University
    MIT
    University of Michigan
    Harvard University
    University of Oxford
    Sao Paulo University
    Imperial College London
    University of Edinburgh
    Cornell University
    University of California at Berkeley
    RWTH Aachen
    Columbia University
    Georgia Tech
    University of Wisconsin
    UC San Diego
    University of California at LA
    University of Florida
    University of North Carolina
Real-time data on 28m unique papers:

    Thomson Reuters' Web of Knowledge (dating from 1934)
    Mendeley after 16 months: >150 million individual articles (>25TB)

    (Chart; scale label: 50m)
We had serious needs

➔
    Scale up to the millions (billions for some items)
➔
    Keep data fresh
➔
    Support newly planned services
    ➔
        Search
    ➔
        Recommendations
➔
    Business context
    ➔
        Agile development (rapid prototyping)
    ➔
        Cost effective
    ➔
        Going viral
A framework for scaling up
(Hadoop and MapReduce)
What is Hadoop?

The Apache Hadoop project develops open-source
software for reliable, scalable, distributed
computing
                            www.hadoop.apache.org
Hadoop

➔
    Designed to operate on a cluster of computers
    ➔
        1...thousands
    ➔
        Commodity hardware (low cost units)
➔
    Each node offers local computation and storage
➔
    Provides framework for working with petabytes of data

➔
    When learning about Hadoop, you need to learn about:
    ➔
        HDFS
    ➔
        MapReduce
HDFS

➔
    Hadoop Distributed File System
➔
    Based on Google File System
➔
    Replicates data storage (reliability, x3, across racks)
➔
    Designed to handle very large files (stored in large blocks, e.g. 64MB)
➔
    Provides high-throughput
➔
    File access through Java and Thrift APIs, command line and web app

➔
    Name node is a single point of failure (availability issue)
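
The block size and replication figures on this slide translate into simple storage arithmetic. The sketch below is illustrative only: `hdfs_footprint` is a hypothetical helper, not an HDFS API, and the 1GB file is a made-up example. The 64MB block size and x3 replication are the values quoted on the slide.

```python
BLOCK_SIZE_BYTES = 64 * 1024 * 1024  # 64MB block size, as quoted on the slide
REPLICATION = 3                      # x3 replication, as quoted on the slide


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE_BYTES, replication=REPLICATION):
    """Return (number of blocks, raw bytes stored across the cluster).

    Illustrative arithmetic only; this is not an HDFS API.
    """
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division: a partly filled final block still counts as a block.
    blocks = -(-file_size_bytes // block_size)
    # A partly filled block only occupies the bytes it holds, so raw usage
    # is the file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


# Hypothetical 1GB file.
blocks, raw_bytes = hdfs_footprint(1024 ** 3)
print(blocks, raw_bytes)
```

Each block is placed on several nodes, which is why losing one machine does not lose data, and why raw disk needs are roughly three times the logical data size.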
MapReduce


➔
    MapReduce is a programming model
➔
    Allows distributed processing of large data sets
➔
    Based on Google's MapReduce
➔
    Inspired by functional programming
➔
    Take the program to the data, not the data to the program
MapReduce Example:
  Article Readers by Country

Input (HDFS): large file (150M entries), flattened data, stored across nodes

    doc_id1, reader_id1, usa, 2010, …
    doc_id2, reader_id2, austria, 2012, …
    doc_id1, reader_id3, china, 2010, …
    ...

Map (pivot countries by doc id)

    doc_id1, {usa, china, usa, uk, china, china...}
    doc_id2, {austria, austria, china, china, uk …}
    ...

Reduce (calc. document stats)

    doc_id1, usa, 0.27
    doc_id1, china, 0.09
    doc_id1, uk, 0.09
    doc_id2, austria, 0.99
    ...
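
The flow above can be sketched in plain Python on a single machine. This is a minimal sketch, not Mendeley's actual job: the records are made up, the function names are hypothetical, and it assumes the "document stats" are each country's share of a document's readers, which the slide does not spell out. In Hadoop the grouping step between map and reduce is done by the framework across nodes.

```python
from collections import defaultdict

# Hypothetical flattened records: (doc_id, reader_id, country, year)
RECORDS = [
    ("doc_id1", "reader_id1", "usa", 2010),
    ("doc_id2", "reader_id2", "austria", 2012),
    ("doc_id1", "reader_id3", "china", 2010),
    ("doc_id1", "reader_id4", "usa", 2011),
]


def map_record(record):
    """Map: emit a (doc_id, country) pair for one reader record."""
    doc_id, _reader_id, country, _year = record
    yield doc_id, country


def shuffle(pairs):
    """Group mapped values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_doc(doc_id, countries):
    """Reduce: turn one document's country list into per-country reader shares."""
    total = len(countries)
    counts = defaultdict(int)
    for country in countries:
        counts[country] += 1
    for country in sorted(counts):
        yield doc_id, country, counts[country] / total


def readers_by_country(records):
    """Run map, shuffle and reduce in sequence over all records."""
    pairs = (pair for record in records for pair in map_record(record))
    grouped = shuffle(pairs)
    return [row for doc_id in sorted(grouped) for row in reduce_doc(doc_id, grouped[doc_id])]


for row in readers_by_country(RECORDS):
    print(row)
```

Because each reduce call only needs the values for one document, documents can be processed independently on different nodes.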
Hadoop



➔
    HDFS for storing data
➔
    MapReduce for processing data

➔
    Together, bring the program to the data
Hadoop's Users
We make a lot of use of HDFS and MapReduce

➔
    Catalogue Stats
➔
    Recommendations (Mahout)
➔
    Log Analysis (business analytics)
➔
    Top Articles
➔
    … and more

➔
    Quick, reliable and scalable
Beware that these benefits have costs

➔
    Migrating to a new system (data consistency)
➔
    Setup costs
    ➔
        Learn black magic to configure
    ➔
        Hardware for cluster
➔
    Administrative costs
    ➔
        Steep learning curve to administer Hadoop
    ➔
        Still an immature technology
    ➔
        You may need to debug the source code
➔
    Tips
    ➔
        Get involved in the community (e.g. meetups, forums)
    ➔
        Use good commodity hardware
    ➔
        Consider moving to the cloud...
Moving to the cloud (AWS)
What is AWS?

Amazon Web Services (AWS) delivers a set of
services that together form a reliable, scalable,
and inexpensive computing platform “in the
cloud”
                             www.aws.amazon.com
Why move to AWS?

➔
    The cost of running your own cluster can be high
    ➔
        Monetary (e.g. hardware)
    ➔
        Time (e.g. training, setup, administration)
➔
  AWS takes on these problems, renting their
services to you based on your usage
Article Recommendations

➔
    Aim: help researchers to find interesting articles
    ➔
        Combat information deluge
    ➔
        Keep up-to-date with recent movements
➔
    1.6M users
➔
    50M articles
➔
  Batch process for generating regular
recommendations (using Mahout)
Article Recommendations in EMR

➔
    Use Amazon's Elastic Map Reduce (EMR)
➔
    Upload input data (user libraries)
➔
    Upload Mahout jar
➔
    Spin up cluster
➔
    Run the job
    ➔
        You decide the number of nodes (cost vs time)
    ➔
        You decide the spec of the nodes (cost vs quality)
➔
    Retrieve the output
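
The "cost vs time" choice when sizing the cluster can be illustrated with a little arithmetic. This is a rough sketch under loud assumptions: `estimate_job` is a hypothetical helper, the work is assumed to divide evenly across nodes, each node is assumed to be billed per started hour, and the 40 node-hours and 0.50 price are invented figures rather than real AWS prices.

```python
import math


def estimate_job(total_node_hours, nodes, price_per_node_hour):
    """Estimate (wall-clock hours, cost) for a batch job.

    Assumes work divides evenly across nodes and that each node is
    billed per started hour. Both are simplifying assumptions.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    wall_clock_hours = total_node_hours / nodes
    billed_hours_per_node = math.ceil(wall_clock_hours)
    cost = nodes * billed_hours_per_node * price_per_node_hour
    return wall_clock_hours, cost


# Hypothetical job needing 40 node-hours at a made-up price of 0.50 per node-hour.
for nodes in (4, 8, 16):
    hours, cost = estimate_job(40, nodes, 0.50)
    print(nodes, hours, cost)
```

Under these assumptions, doubling the node count halves the run time at the same cost, until rounding up to whole billed hours starts to add paid idle time.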
Catalogue Search

➔
    50 million articles
➔
    50GB index in Solr
➔
    Variable load (over 24 hours)
    ➔
        1AM is quieter (100 q/s), 1PM is busier (150 q/s)
At 1AM, 100 queries/second
At 1PM, 150 queries/second

queries (?, ?, ?...)  ->  AWS elastic load balancer  ->  AWS Instance
                                                     ->  AWS Instance
                                                     ->  AWS Instance

Catalogue Search in Context of Variable Load

➔
    Amazon's Elastic Load Balancer
➔
    Only pay for nodes when you need them
    ➔
        Spin up when load is high
    ➔
        Tear down when load is low
➔
    Cost effective and scalable
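
The spin-up and tear-down decision comes down to how many instances the current load requires. The sketch below is illustrative only: `instances_needed` is a hypothetical helper, not an AWS API, and the per-instance capacity of 60 queries/second is an assumed figure. The 100 q/s and 150 q/s loads are the quiet and busy figures from the Catalogue Search slide.

```python
import math


def instances_needed(queries_per_second, capacity_per_instance):
    """Smallest number of instances that can serve the given load.

    Always keeps at least one instance running.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity must be positive")
    return max(1, math.ceil(queries_per_second / capacity_per_instance))


# 60 q/s per instance is an assumed capacity, chosen only for illustration.
CAPACITY = 60
for label, load in (("1AM", 100), ("1PM", 150)):
    print(label, instances_needed(load, CAPACITY))
```

With these numbers the quiet period needs one instance fewer than the busy period, which is the saving that paying only for nodes in use is meant to capture.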
Problems we've faced

➔
    Lack of control can be an issue
    ➔
        Trade-off between administration and control
➔
    Orchestration issues
    ➔
        We have many services to coordinate
    ➔
        CloudFormation & Elastic Beanstalk
➔
    Migrating live services is hard work
Conclusions

➔
  Mendeley has created one of the world's largest
scientific databases
➔
 Storing and processing this data is a large scale
challenge
➔
  Hadoop, through HDFS and MapReduce, provides a
framework for large scale data processing
➔
 Be aware of administration costs when doing this in
house
Conclusions

➔
  AWS can make scaling up efficient and cost
effective
➔
    Tap into the rich big data community out there
➔
 We plan to make no more substantial
hardware purchases and to use AWS instead
➔
  Scaling up isn't a trivial problem; to save pain,
plan for it from the outset
Conclusions

➔
 Magic elephants that live in clouds can lift the
curses of evil witches
www.mendeley.com

Mais conteúdo relacionado

Mais procurados

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology LandscapeShivanandaVSeeri
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Robert Grossman
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemSteve Loughran
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
 

Mais procurados (20)

Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Hive sq lfor-hadoop
Hive sq lfor-hadoopHive sq lfor-hadoop
Hive sq lfor-hadoop
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Pptx present
Pptx presentPptx present
Pptx present
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed Filesystem
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 

Semelhante a Cloud Elephants and Witches: A Big Data Tale from Mendeley

eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudAmazon Web Services
 
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Kris Jack
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxDanishMahmood23
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableStefan Kupstaitis-Dunkler
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series DatabasePramit Choudhary
 
Mendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchersMendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchersKris Jack
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseStavros Papadopoulos
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardDocker, Inc.
 

Semelhante a Cloud Elephants and Witches: A Big Data Tale from Mendeley (20)

eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Mendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchersMendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchers
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
 

Mais de Kris Jack

Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyKris Jack
 
Machine Learning @ Mendeley
Machine Learning @ MendeleyMachine Learning @ Mendeley
Machine Learning @ MendeleyKris Jack
 
Mendeley Suggest: What will you read next?
Mendeley Suggest: What will you read next?Mendeley Suggest: What will you read next?
Mendeley Suggest: What will you read next?Kris Jack
 
Mendeley Suggest: Engineering a Personalised Article Recommender System
Mendeley Suggest: Engineering a Personalised Article Recommender SystemMendeley Suggest: Engineering a Personalised Article Recommender System
Mendeley Suggest: Engineering a Personalised Article Recommender SystemKris Jack
 
Mendeley's Data and Perspectives on Data Challenges
Mendeley's Data and Perspectives on Data ChallengesMendeley's Data and Perspectives on Data Challenges
Mendeley's Data and Perspectives on Data ChallengesKris Jack
 
Scientific Article Recommendation with Mahout
Scientific Article Recommendation with MahoutScientific Article Recommendation with Mahout
Scientific Article Recommendation with MahoutKris Jack
 
Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Mahout Becomes a Researcher: Large Scale Recommendations at MendeleyMahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Mahout Becomes a Researcher: Large Scale Recommendations at MendeleyKris Jack
 
improving explicit preference entry by visualising data similarities
improving explicit preference entry by visualising data similaritiesimproving explicit preference entry by visualising data similarities
improving explicit preference entry by visualising data similaritiesKris Jack
 
Etude de la pertinence de critères de recherche en recherche d'informations s...
Etude de la pertinence de critères de recherche en recherche d'informations s...Etude de la pertinence de critères de recherche en recherche d'informations s...
Etude de la pertinence de critères de recherche en recherche d'informations s...Kris Jack
 
A Computational Model of Staged Language Acquisition
A Computational Model of Staged Language AcquisitionA Computational Model of Staged Language Acquisition
A Computational Model of Staged Language AcquisitionKris Jack
 
From Syllables to Syntax: Investigating Staged Linguistic Development through...
From Syllables to Syntax: Investigating Staged Linguistic Development through...From Syllables to Syntax: Investigating Staged Linguistic Development through...
From Syllables to Syntax: Investigating Staged Linguistic Development through...Kris Jack
 
A Collaborative Tool for the Computational Modelling of Child Language Acquis...
A Collaborative Tool for the Computational Modelling of Child Language Acquis...A Collaborative Tool for the Computational Modelling of Child Language Acquis...
A Collaborative Tool for the Computational Modelling of Child Language Acquis...Kris Jack
 
Mendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scaleMendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scaleKris Jack
 
Mendeley: Recommendation Systems for Academic Literature
Mendeley: Recommendation Systems for Academic LiteratureMendeley: Recommendation Systems for Academic Literature
Mendeley: Recommendation Systems for Academic LiteratureKris Jack
 
Recommendation Engines for Scientific Literature
Recommendation Engines for Scientific LiteratureRecommendation Engines for Scientific Literature
Recommendation Engines for Scientific LiteratureKris Jack
 

Mais de Kris Jack (15)

Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in Mendeley
 
Machine Learning @ Mendeley
Machine Learning @ MendeleyMachine Learning @ Mendeley
Machine Learning @ Mendeley
 
Mendeley Suggest: What will you read next?
Mendeley Suggest: What will you read next?Mendeley Suggest: What will you read next?
Mendeley Suggest: What will you read next?
 
Mendeley Suggest: Engineering a Personalised Article Recommender System
Mendeley Suggest: Engineering a Personalised Article Recommender SystemMendeley Suggest: Engineering a Personalised Article Recommender System
Mendeley Suggest: Engineering a Personalised Article Recommender System
 
Mendeley's Data and Perspectives on Data Challenges
Mendeley's Data and Perspectives on Data ChallengesMendeley's Data and Perspectives on Data Challenges
Mendeley's Data and Perspectives on Data Challenges
 
Scientific Article Recommendation with Mahout
Scientific Article Recommendation with MahoutScientific Article Recommendation with Mahout
Scientific Article Recommendation with Mahout
 
Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Mahout Becomes a Researcher: Large Scale Recommendations at MendeleyMahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley
 
improving explicit preference entry by visualising data similarities
improving explicit preference entry by visualising data similaritiesimproving explicit preference entry by visualising data similarities
improving explicit preference entry by visualising data similarities
 
Etude de la pertinence de critères de recherche en recherche d'informations s...
Etude de la pertinence de critères de recherche en recherche d'informations s...Etude de la pertinence de critères de recherche en recherche d'informations s...
Etude de la pertinence de critères de recherche en recherche d'informations s...
 
A Computational Model of Staged Language Acquisition
A Computational Model of Staged Language AcquisitionA Computational Model of Staged Language Acquisition
A Computational Model of Staged Language Acquisition
 
From Syllables to Syntax: Investigating Staged Linguistic Development through...
From Syllables to Syntax: Investigating Staged Linguistic Development through...From Syllables to Syntax: Investigating Staged Linguistic Development through...
From Syllables to Syntax: Investigating Staged Linguistic Development through...
 
A Collaborative Tool for the Computational Modelling of Child Language Acquis...
A Collaborative Tool for the Computational Modelling of Child Language Acquis...A Collaborative Tool for the Computational Modelling of Child Language Acquis...
A Collaborative Tool for the Computational Modelling of Child Language Acquis...
 
Mendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scaleMendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scale
 
Mendeley: Recommendation Systems for Academic Literature
Mendeley: Recommendation Systems for Academic LiteratureMendeley: Recommendation Systems for Academic Literature
Mendeley: Recommendation Systems for Academic Literature
 
Recommendation Engines for Scientific Literature
Recommendation Engines for Scientific LiteratureRecommendation Engines for Scientific Literature
Recommendation Engines for Scientific Literature
 

Último

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024

Cloud Elephants and Witches: A Big Data Tale from Mendeley

  • 1. Cloud Elephants and Witches: A Big Data Tale from Mendeley. Kris Jack, PhD, Data Mining Team Lead
  • 2. Overview ➔ What's Mendeley? ➔ The curse that comes with success ➔ A framework for scaling up (Hadoop + MapReduce) ➔ Moving to the cloud (AWS) ➔ Conclusions
  • 4. What is Mendeley? ...a large data technology startup company ...and it's on a mission to change the way that research is done!
  • 5. Mendeley / Last.fm. Last.fm works like this: 1) Install “Audioscrobbler” 2) Listen to music 3) Last.fm builds your music profile and recommends music you might also like... and it’s the world’s biggest open music database
  • 6. Last.fm → Mendeley: music libraries → research libraries; artists → researchers; songs → papers; genres → disciplines
  • 7. Mendeley provides tools to help users... ...organise their research
  • 8. Mendeley provides tools to help users... ...collaborate with one another ...organise their research
  • 9. Mendeley provides tools to help users... ...organise their research ...collaborate with one another ...discover new research
  • 10.
  • 11. Mendeley provides tools to help users... ...organise their research ...collaborate with one another ...discover new research
  • 12. The curse that comes with success
  • 13. In the beginning, there was... ➔ MySQL: ➔ Normalised tables for storing and serving: ➔ User data ➔ Article data ➔ The system was happy ➔ With this, we launched the article catalogue ➔ Lots of number crunching ➔ Many joins for basic stats
  • 14. Here's where the curse of success comes ➔ More articles came ➔ More users came ➔ The system became unhappy ➔ Keeping data fresh was a burden ➔ Algorithms relied on global counts ➔ Iterating over tables was slow ➔ Needed to shard tables to grow catalogue ➔ In short, our system didn't scale
  • 15. 1.6 million+ users; the 20 largest userbases: University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
  • 16. 50m Real-time data on 28m unique papers: Thomson Reuters’ Web of Knowledge (dating from 1934) Mendeley after 16 months: >150 million individual articles, (>25TB)
  • 17. We had serious needs ➔ Scale up to the millions (billions for some items) ➔ Keep data fresh ➔ Support newly planned services ➔ Search ➔ Recommendations ➔ Business context ➔ Agile development (rapid prototyping) ➔ Cost effective ➔ Going viral
  • 18. A framework for scaling up (Hadoop and MapReduce)
  • 19. What is Hadoop? The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing www.hadoop.apache.org
  • 20. Hadoop ➔ Designed to operate on a cluster of computers ➔ From 1 to thousands of nodes ➔ Commodity hardware (low cost units) ➔ Each node offers local computation and storage ➔ Provides a framework for working with petabytes of data ➔ When learning about Hadoop, you need to learn about: ➔ HDFS ➔ MapReduce
  • 21. HDFS ➔ Hadoop Distributed File System ➔ Based on the Google File System ➔ Replicates data storage (reliability, x3, across racks) ➔ Designed to handle very large files (stored in large blocks, e.g. 64MB) ➔ Provides high throughput ➔ File access through Java and Thrift APIs, command line and web app ➔ Name node is a single point of failure (availability issue)
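The storage cost of the replication described on the HDFS slide can be worked out with a little arithmetic. The sketch below is only an illustration: it assumes that files are split into fixed-size blocks of 64MB (the figure quoted on the slide) and that every block is stored three times. The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API.

```python
import math
from typing import Tuple

BLOCK_SIZE_MB = 64  # assumed block size, taken from the slide's "64MB"
REPLICATION = 3     # the "x3" replication mentioned on the slide


def hdfs_footprint(file_size_mb: float) -> Tuple[int, float]:
    """Return (number of blocks, raw storage in MB) for a single file.

    Hypothetical helper: it only does the arithmetic and does not
    talk to a real cluster.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb


if __name__ == "__main__":
    blocks, raw_mb = hdfs_footprint(1024)  # a 1 GB file
    print(f"{blocks} blocks, {raw_mb} MB of raw storage")
```

Under these assumptions, usable capacity is roughly a third of the raw disk space in the cluster, which is worth keeping in mind when budgeting hardware.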
  • 22. MapReduce ➔ MapReduce is a programming model ➔ Allows distributed processing of large data sets ➔ Based on Google's MapReduce ➔ Inspired by functional programming ➔ Take the program to the data, not the data to the program
  • 23. MapReduce Example: Article Readers by Country
      HDFS input: large file (150M entries), flattened data, stored across nodes
        doc_id1, reader_id1, usa, 2010, …
        doc_id2, reader_id2, austria, 2012, …
        doc_id1, reader_id3, china, 2010, …
        ...
      Map (pivot countries by doc id):
        doc_id1, {usa, china, usa, uk, china, china...}
        doc_id2, {austria, austria, china, china, uk …}
        ...
      Reduce (calc. document stats):
        doc_id1, usa, 0.27
        doc_id1, china, 0.09
        doc_id1, uk, 0.09
        doc_id2, austria, 0.99
        ...
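The readers-by-country example can be sketched in plain Python to show the shape of the computation. This is a single-process illustration, not Mendeley's actual job: the sample rows, the function names, and the choice of statistic (each country's share of a document's readers) are assumptions, since the slide does not say how its figures are calculated. In a real Hadoop job the grouping step is performed by the framework between the map and reduce phases.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, str, str, int]  # (doc_id, reader_id, country, year)

# Hypothetical sample of the flattened input file.
ROWS: List[Row] = [
    ("doc_id1", "reader_id1", "usa", 2010),
    ("doc_id2", "reader_id2", "austria", 2012),
    ("doc_id1", "reader_id3", "china", 2010),
    ("doc_id1", "reader_id4", "usa", 2011),
]


def map_phase(rows: Iterable[Row]) -> Iterator[Tuple[str, str]]:
    """Emit one (doc_id, country) pair per input row."""
    for doc_id, _reader_id, country, _year in rows:
        yield doc_id, country


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group countries by doc_id (done by the framework in Hadoop)."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, country in pairs:
        grouped[doc_id].append(country)
    return grouped


def reduce_phase(doc_id: str, countries: List[str]) -> List[Tuple[str, str, float]]:
    """Compute each country's share of the document's readers (assumed statistic)."""
    counts = Counter(countries)
    total = len(countries)
    return [(doc_id, country, n / total) for country, n in sorted(counts.items())]


def readers_by_country(rows: Iterable[Row]) -> List[Tuple[str, str, float]]:
    results: List[Tuple[str, str, float]] = []
    for doc_id, countries in sorted(shuffle(map_phase(rows)).items()):
        results.extend(reduce_phase(doc_id, countries))
    return results


if __name__ == "__main__":
    for doc_id, country, share in readers_by_country(ROWS):
        print(f"{doc_id}, {country}, {share:.2f}")
```

Because each reduce call only needs the countries for one document, the work can be spread across nodes that each hold part of the input.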
  • 24. Hadoop ➔ HDFS for storing data ➔ MapReduce for processing data ➔ Together, bring the program to the data
  • 26. We make a lot of use of HDFS and MapReduce ➔ Catalogue Stats ➔ Recommendations (Mahout) ➔ Log Analysis (business analytics) ➔ Top Articles ➔ … and more ➔ Quick, reliable and scalable
  • 27. Beware that these benefits have costs ➔ Migrating to a new system (data consistency) ➔ Setup costs ➔ Learn black magic to configure ➔ Hardware for cluster ➔ Administrative costs ➔ Steep learning curve to administer Hadoop ➔ Still an immature technology ➔ You may need to debug the source code ➔ Tips ➔ Get involved in the community (e.g. meetups, forums) ➔ Use good commodity hardware ➔ Consider moving to the cloud...
  • 28. Moving to the cloud (AWS)
  • 29. What is AWS? Amazon Web Services (AWS) delivers a set of services that together form a reliable, scalable, and inexpensive computing platform “in the cloud” www.aws.amazon.com
  • 30. Why move to AWS? ➔ The cost of running your own cluster can be high ➔ Monetary (e.g. hardware) ➔ Time (e.g. training, setup, administration) ➔ AWS takes on these problems, renting their services to you based on your usage
  • 31. Article Recommendations ➔ Aim: help researchers to find interesting articles ➔ Combat information deluge ➔ Keep up-to-date with recent movements ➔ 1.6M users ➔ 50M articles ➔ Batch process for generating regular recommendations (using Mahout)
  • 32. Article Recommendations in EMR ➔ Use Amazon's Elastic MapReduce (EMR) ➔ Upload input data (user libraries) ➔ Upload Mahout jar ➔ Spin up cluster ➔ Run the job ➔ You decide the number of nodes (cost vs time) ➔ You decide the spec of the nodes (cost vs quality) ➔ Retrieve the output
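The "cost vs time" trade-off when choosing the number of nodes can be made concrete with a rough estimate. The sketch below rests on loudly stated assumptions: the job parallelises perfectly, each node is billed per started hour, and the workload (100 node-hours) and price (0.50 per node-hour) are made-up figures, not real EMR pricing. The function `job_estimate` is hypothetical.

```python
import math
from typing import Tuple


def job_estimate(total_node_hours: float, nodes: int,
                 price_per_node_hour: float) -> Tuple[float, float]:
    """Return (wall-clock hours, cost) for a batch job.

    Assumes perfect parallelism and billing per started node-hour;
    both are simplifications made for illustration.
    """
    wall_clock_hours = total_node_hours / nodes
    billed_hours = math.ceil(wall_clock_hours)
    cost = nodes * billed_hours * price_per_node_hour
    return wall_clock_hours, cost


if __name__ == "__main__":
    # Hypothetical workload of 100 node-hours at a made-up price of 0.50.
    for nodes in (5, 10, 20, 40):
        hours, cost = job_estimate(100, nodes, 0.5)
        print(f"{nodes:>2} nodes: {hours:5.1f} h, cost {cost:.2f}")
```

Under these assumptions, adding nodes shortens the job at no extra cost until the run time stops dividing evenly into billed hours, at which point the partially used hours start to cost more.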
  • 33. Catalogue Search ➔ 50 million articles ➔ 50GB index in Solr ➔ Variable load (over 24 hours) ➔ 1AM is quieter (100 q/s), 1PM is busier (150 q/s)
  • 34. Catalogue Search in Context of Variable Load: queries (100/s at 1AM, 150/s at 1PM) → AWS Elastic Load Balancer → AWS Instances ➔ Amazon's Elastic Load Balancer ➔ Only pay for nodes when you need them ➔ Spin up when load is high ➔ Tear down when load is low ➔ Cost effective and scalable
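The idea of paying only for the nodes you need can be sketched as a simple sizing rule. This is not the logic of any AWS service; it is a hypothetical calculation that assumes each search instance can serve a fixed number of queries per second (60 here, an invented figure) and uses the quiet and busy loads quoted on the catalogue search slides (100 and 150 queries/second).

```python
import math


def instances_needed(queries_per_second: float,
                     capacity_per_instance: float,
                     min_instances: int = 1) -> int:
    """Smallest number of instances that covers the current load.

    capacity_per_instance is a hypothetical figure; in practice it
    would come from load testing the search nodes.
    """
    required = math.ceil(queries_per_second / capacity_per_instance)
    return max(min_instances, required)


if __name__ == "__main__":
    CAPACITY = 60  # assumed queries/second one instance can serve
    for label, load in (("quiet", 100), ("busy", 150)):
        print(f"{label}: {load} q/s -> {instances_needed(load, CAPACITY)} instances")
```

With these invented numbers, the busy period needs one more instance than the quiet period, and that extra instance is only paid for while it is running.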
  • 35. Problems we've faced ➔ Lack of control can be an issue ➔ Trade-off between administration and control ➔ Orchestration issues ➔ We have many services to coordinate ➔ CloudFormation & Elastic Beanstalk ➔ Migrating live services is hard work
  • 37. Conclusions ➔ Mendeley has created one of the world's largest scientific databases ➔ Storing and processing this data is a large scale challenge ➔ Hadoop, through HDFS and MapReduce, provides a framework for large scale data processing ➔ Be aware of administration costs when doing this in house
  • 38. Conclusions ➔ AWS can make scaling up efficient and cost effective ➔ Tap into the rich big data community out there ➔ We plan to make no more substantial hardware purchases and to use AWS instead ➔ Scaling up isn't a trivial problem; to save pain, plan for it from the outset
  • 39. Conclusions ➔ Magic elephants that live in clouds can lift the curses of evil witches