How Rackspace Queries
Terabytes of Log Data
 Using MapReduce and Hadoop

    Case Study, by Schubert Zhang 2009-04-30
Rackspace
• Rackspace has more than 50K devices and 7 data
  centers.

• The mail system and logging servers are currently in 3 of
  the Rackspace data centers.

• The system stores over 800 million objects (an object = a
  user event such as receiving an email or logging into
  IMAP) within Solr and 9.6 billion (records?) within
  Hadoop, which equals 6.3 TB compressed.

• Several hundred gigabytes of email log data are
  generated each day (apparently about 140 GB after cleanup).
Background on Mailtrust
• Email hosting company
• Founded in 1999, merged with Rackspace in 2007,
  previous name: Webmail.us
• 80K business customers, 700K mailboxes.
• 2 hosted mail products: Noteworthy, MS Exchange
• The Noteworthy System:
   – Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds,
     shared calendaring, Outlook sync, Blackberry sync.
   – ~600 servers, commodity hardware, designed to work around
     frequent failures.
• The MS Exchange System:
   – MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync.
   – ~100 servers, higher-end hardware, SAN & DAS storage.
Problems
•   Hundreds of gigabytes of new data each day streaming in from over 600 hyperactive
    servers.
•   Log processing system.
     –   (1) Flat text files stored on each machine.
           •   Had to be manually searched by engineers logging into each individual machine.
     –   (2) Relational database solution that just couldn't compete. MySQL.
           •   Inserts quickly became the bottleneck.
           •   A lot of index churn.
           •   Data was then broken into Merge Tables based on time so index updates weren't a problem.
           •   Load and operational problems.
     –   (3) Hadoop-based solution that works well and has virtually unlimited potential to scale.
           •   Hadoop
           •   Lucene and Solr.
•   The now-familiar problem: lots and lots of data streaming in.
     –   Where do you store all that data?
     –   How do you do anything useful with it?
     –   How do you retrieve the data you want from that sea of data?

•   Examine mail logs in order to troubleshoot problems for our customers.
•   The query/search should be fast and accurate.
Now the new system
• The advantage of their new system is that they can now
  look at their data any way they want:
   – Nightly MapReduce jobs collect statistics about their mail system
     such as spam counts by domain, bytes transferred and number
     of logins.
   – When they wanted to find out which part of the world their
     customers logged in from, a quick MapReduce job was created
     and they had the answer within a few hours. Not really possible
     in your typical ETL system.
• "Now whenever we think of a complex question about our
  customers’ usage patterns, we can pull the answer from
  our logs within hours via MapReduce. This is powerful
  stuff."
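The nightly statistics jobs can be pictured with a small single-process sketch of the map, shuffle and reduce steps. This is only an illustration, not Rackspace's code: the log line format, the sample data and every function name below are assumptions made for the example, and a real job would run on Hadoop rather than in one Python process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical log format (an assumption): "<timestamp> <event> <email address>"
SAMPLE_LOG = [
    "2009-04-29T01:00:00 spam alice@example.com",
    "2009-04-29T01:00:05 login bob@example.org",
    "2009-04-29T01:00:09 spam carol@example.com",
    "2009-04-29T01:01:00 spam dave@example.org",
    "2009-04-29T01:02:00 login alice@example.com",
]


def map_spam_by_domain(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (domain, 1) for every spam event, skip everything else."""
    parts = line.split()
    if len(parts) != 3:
        return  # malformed line, ignore it
    _, event, address = parts
    if event == "spam" and "@" in address:
        yield address.rsplit("@", 1)[1].lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one domain."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the given lines."""
    pairs = (pair for line in lines for pair in map_spam_by_domain(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

Answering a different question (bytes transferred, logins per region) would only change the map function; the shuffle and reduce steps stay the same.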
The Platform
•   Hadoop MapReduce
•   Hadoop Distributed File System (HDFS)
•   Lucene
•   Solr
•   Tomcat
The Architecture
• Raw logs get streamed from hundreds of mail servers to
  the Hadoop Distributed File System ("HDFS") in real time.

• MapReduce jobs are scheduled to run and index the new
  data using Apache Lucene and Solr.

• Once the indexes have been built, they are compressed
  and stored away in HDFS.

• Each Hadoop datanode runs a Tomcat servlet container,
  which hosts a number of Solr instances that pull and
  merge the new indexes, and provide really fast search
  results to our support team.
The System Evolution
Logging v1.0
• Logs were stored in flat text files on the local disk of
  each mail server and were kept for 14 days.

• Our support techs did not have login access to the
  servers, so in order to search the logs they would have
  to escalate a ticket to our engineers. The engineers
  would then have to ssh into each mail server and grep
  /var/log/maillog.

• Problems: Once we grew much past a dozen servers,
  this manual process of logging into each server became
  too time-consuming for our engineers.
Logging v1.1
•   Sped up the search process by writing a script that would search
    multiple servers via one command run from a centralized server.

•   Searching was still done with grep, now run remotely.

•   Problems: The support techs still had to escalate a ticket to the
    engineers in order to perform a search. As the number of customers
    and servers increased, this began to take too much of our
    engineers' scarce time. Also, storing and searching the logs on a
    live server was negatively affecting the performance of the servers.
    To make matters worse, the engineering team had grown and we
    started running into the problem where two engineers would perform
    a search at the same time, which really slowed things down.
Logging v2.0
•   A web-based tool where the support techs could search the logs.
•   It allowed searching by the sender or recipient's email address, domain name or IP
    address.
•   All of these were indexed fields in a MySQL database. The centralized log server

•   Each day's logs were stored in a separate table, so that we could cleanup old data by
    simply dropping and recreating MySQL tables.
•   Log data was only kept for 3 days in order to keep the MySQL database down to a
    reasonable size.
•   Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the
    data set was very large and these queries would be horribly slow.

•   Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As
    the tables grew, indexing each entry as it was inserted became slow. Within the first
    hours of testing, the inserts began slowing and could not keep up with the rate at
    which data was received. Version 2.0 of the logging system was never used in
    production.
Logging v2.1
•   Fixed the MySQL INSERT bottleneck by queuing up the log entries
    in local text files on the centralized log server and periodically bulk
    loading them into the database. As syslog-ng received logs on its 6
    ports, the data would be streamed to 6 separate text files. Every 10
    minutes a script would rotate those text files and execute a MySQL
    LOAD to load the data into the database. This was orders of magnitude
    faster than inserting the log data one record at a time.

•   Problems: The LOADs would get progressively slower as the
    database grew because MySQL indexing performance decreases as
    the table you are inserting into gets larger. This version was fast
    enough to be released into production, but we knew the system
    would not scale too far without additional work.
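The queue-then-bulk-load idea can be sketched as follows. This is a stand-in, not the original setup: it uses Python's built-in sqlite3 module with one transaction per batch instead of MySQL's LOAD statement, and the table name, column names and tab-separated line format are assumptions made for the example.

```python
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical maillog table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE maillog (sender TEXT, recipient TEXT, ip TEXT)")
    return conn


def parse_rows(lines: Iterable[str]) -> List[Row]:
    """Parse queued log lines (assumed tab-separated), dropping malformed ones."""
    rows: List[Row] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            rows.append((fields[0], fields[1], fields[2]))
    return rows


def bulk_load(conn: sqlite3.Connection, lines: Iterable[str]) -> int:
    """Load one rotated batch in a single transaction; return the row count."""
    rows = parse_rows(lines)
    with conn:  # commits once for the whole batch, rolls back on error
        conn.executemany(
            "INSERT INTO maillog (sender, recipient, ip) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


if __name__ == "__main__":
    db = open_db()
    batch = [
        "a@example.com\tb@example.org\t10.0.0.1\n",
        "c@example.com\td@example.org\t10.0.0.2\n",
    ]
    print(bulk_load(db, batch))
```

In the system described above, the batch would be one of the rotated text files and the load would run every 10 minutes.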
Logging v2.2
•   Introduced Merge Tables in order to speed up loading the log data into the database.
•   Every 10 minutes our script would create a new database table and then load the text
    logs into the empty table.
•   After the data was loaded, the script would modify a set of Merge Tables that
    combined all of the 10-minute tables together.
•   The web search tool was modified to allow searching within the different time ranges.
    Corresponding Merge Tables existed for each of those time ranges, and were
    modified every 10 minutes as new tables were created.

•   Problems: The database LOAD operations would take 2-3 minutes to run. The server
    was now always under a heavy CPU and disk I/O load.
•   Searches were being performed more frequently and were becoming slow. We
    started to see some strange problems such as random errors while trying to create
    new tables or modify the Merge Tables. These errors progressively became more
    frequent, resulting in missing log data. The support team began to lose confidence in
    the system's accuracy.
•   The logging system had no redundancy.

•   We needed a new solution that would be fast, reliable and could scale indefinitely
    with our growth. We needed something truly scalable.
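The Merge Table bookkeeping from v2.2 can be sketched as a helper that builds the statement pointing a time-range Merge Table at the most recent 10-minute tables. MySQL does accept ALTER TABLE ... UNION=(...) for MERGE tables, but the table names, the window size and the helper itself are assumptions made for illustration, not the original script.

```python
import re
from typing import List

# Only allow plain identifiers, since the names are spliced into SQL text.
_IDENT = re.compile(r"^[A-Za-z0-9_]+$")


def _checked(name: str) -> str:
    """Return the name unchanged if it is a safe identifier, else raise."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe table name: {name!r}")
    return name


def merge_union_sql(merge_table: str, chunk_tables: List[str], window: int) -> str:
    """Build the ALTER TABLE statement that makes merge_table cover the
    last `window` chunk tables (oldest first in chunk_tables)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = chunk_tables[-window:]
    if not recent:
        raise ValueError("no chunk tables to merge")
    union = ", ".join(f"`{_checked(t)}`" for t in recent)
    return f"ALTER TABLE `{_checked(merge_table)}` UNION=({union})"


if __name__ == "__main__":
    # Hypothetical 10-minute tables, named by their start time.
    chunks = ["log_0000", "log_0010", "log_0020"]
    print(merge_union_sql("log_last_20m", chunks, window=2))
```

Each time range offered by the search tool would need one such statement every 10 minutes, which hints at why table creation and Merge Table updates became a source of load and errors.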
Logging v3+
• Avoid limiting our abilities to build new features down the
  road.
• For example, we wanted to build a tool that would allow
  our customers to search their logs directly.

• Hadoop scales out its workload horizontally by adding servers
  and distributing the data and MapReduce jobs amongst
  the servers.

• In about 3 months we built a fresh new log processing
  system using Hadoop, Lucene and Solr.

• Put the log search tool in the hands of our customers.
Stu Hood’s Detailed Comments
•   The loading of data is streaming, but the indexing is not. We write to a file in Hadoop until it
    reaches a size below the block size, or until it times out, and then we close and move it to where it
    will be processed.
•   Our processing jobs run every 10 minutes or so, meaning that the logs become available for
    Customer Care after about 15 minutes. We’ve executed around 150K jobs on this cluster with 3 restarts.
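The write-until-full-or-timed-out behaviour described above can be sketched with local files standing in for HDFS. The class name, directory layout and thresholds are assumptions made for the example, and this sketch only checks the timeout when a new line arrives.

```python
import os
import time
from typing import BinaryIO, Callable, Optional


class RollingLogWriter:
    """Append lines to a file in incoming_dir; when the file would exceed
    max_bytes or is older than max_age_s, close it and move it to ready_dir."""

    def __init__(self, incoming_dir: str, ready_dir: str, max_bytes: int,
                 max_age_s: float, clock: Callable[[], float] = time.monotonic):
        self.incoming_dir = incoming_dir
        self.ready_dir = ready_dir
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.clock = clock
        self._fh: Optional[BinaryIO] = None
        self._path = ""
        self._size = 0
        self._opened_at = 0.0
        self._seq = 0
        os.makedirs(incoming_dir, exist_ok=True)
        os.makedirs(ready_dir, exist_ok=True)

    def _open(self) -> None:
        self._seq += 1
        self._path = os.path.join(self.incoming_dir, f"part-{self._seq:06d}.log")
        self._fh = open(self._path, "wb")
        self._size = 0
        self._opened_at = self.clock()

    def roll(self) -> Optional[str]:
        """Close the current file and move it to where it will be processed."""
        if self._fh is None:
            return None
        self._fh.close()
        self._fh = None
        dest = os.path.join(self.ready_dir, os.path.basename(self._path))
        os.replace(self._path, dest)
        return dest

    def write(self, line: str) -> None:
        data = line.encode("utf-8") + b"\n"
        if self._fh is not None:
            too_big = self._size + len(data) > self.max_bytes
            too_old = self.clock() - self._opened_at >= self.max_age_s
            if too_big or too_old:
                self.roll()
        if self._fh is None:
            self._open()
        self._fh.write(data)
        self._size += len(data)
```

Keeping each closed file under the size limit mirrors the point above about staying below the block size; a periodic job would then pick up whatever has landed in the ready directory.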

•   We create the indexes on local disk in our reducer, and compress them into HDFS after they are
    complete.
•   When we pull the index to make it available for search, we decompress it to local disk and merge
    it using the Lucene IndexWriter.addIndexes method before calling /commit on the Solr instance.
    The Nutch project created an IndexReader that can do read-only access on HDFS, but for speed
    reasons, we decided not to take that approach.
•   Since we are indexing to local disk, we use an embedded SolrCore, in the same JVM as the
    reducer.

•   We have 10 Hadoop data nodes with 3.5 TB of hard drive space each, 35 TB in total.
•   We are currently indexing an average of 140 GB per day.

•   The merged indexes are not replicated at all… only one Solr node has a copy of each index, so
    failover involves a brief downtime for queries. If we lose a node, other nodes (consistent hashing)
    become responsible and merge the indexes from the copies we always have in Hadoop.
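The failover behaviour mentioned above relies on consistent hashing: when a node disappears, only the indexes it owned move to other nodes. A minimal sketch of that idea follows; the node names, index names, use of MD5 and number of virtual points per node are assumptions for illustration, not details from the original system.

```python
import bisect
import hashlib
from typing import Iterable, List, Set


class HashRing:
    """A small consistent-hash ring with virtual points per node."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 64):
        self.vnodes = vnodes
        self._members: Set[str] = set(nodes)
        self._hashes: List[int] = []
        self._owners: List[str] = []
        self._rebuild()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def _rebuild(self) -> None:
        # Place vnodes points per node on the ring, sorted by hash value.
        points = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in self._members
            for i in range(self.vnodes)
        )
        self._hashes = [h for h, _ in points]
        self._owners = [n for _, n in points]

    def add(self, node: str) -> None:
        self._members.add(node)
        self._rebuild()

    def remove(self, node: str) -> None:
        self._members.discard(node)
        self._rebuild()

    def owner(self, key: str) -> str:
        """Return the node responsible for key: the first point clockwise."""
        if not self._hashes:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owners[idx]


if __name__ == "__main__":
    ring = HashRing(["solr-1", "solr-2", "solr-3"])
    indexes = [f"index-{i}" for i in range(10)]
    before = {name: ring.owner(name) for name in indexes}
    ring.remove("solr-2")  # simulate losing a node
    moved = [name for name in indexes if ring.owner(name) != before[name]]
    print(moved)  # only indexes that lived on solr-2 change owner
```

After a node is lost, each surviving node would rebuild its newly assigned indexes from the compressed copies kept in Hadoop, as described above.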
Future
• Creating reports or doing ad-hoc queries.
• More MapReduce jobs for whatever analyses
  are wanted.
References
• How Rackspace Now Uses MapReduce
  and Hadoop to Query Terabytes of Data
• MapReduce at Rackspace

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 


Case Study - How Rackspace Query Terabytes Of Data

  • 1. How Rackspace Query Terabytes of Log Data Uses MapReduce, Hadoop Case Study, by Schubert Zhang 2009-04-30
  • 2. Rackspace
      • Rackspace has more than 50K devices and 7 data centers.
      • The mail system and logging servers are currently in 3 of the Rackspace data centers.
      • The system stores over 800 million objects (an object = a user event such as receiving an email or logging into IMAP) within Solr and 9.6 billion (records?) within Hadoop, which equals 6.3 TB compressed.
      • Several hundred gigabytes of email log data are generated each day (apparently about 140 GB after cleanup).
  • 3. Background on Mailtrust
      • Email hosting company
      • Founded in 1999, merged with Rackspace in 2007, previous name: Webmail.us
      • 80K business customers, 700K mailboxes.
      • 2 hosted mail products: Noteworthy, MS Exchange
      • The Noteworthy System:
          – Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds, shared calendaring, Outlook sync, Blackberry sync.
          – ~600 servers, commodity hardware, designed to work around frequent failures.
      • The MS Exchange System:
          – MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync.
          – ~100 servers, higher-end hardware, SAN & DAS storage.
  • 4. Problems
      • Hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers.
      • Log processing system.
          – (1) Flat text files stored on each machine.
              • Had to be manually searched by engineers logging into each individual machine.
          – (2) Relational database solution (MySQL) that just couldn't compete.
              • Inserts quickly became the bottleneck.
              • A lot of index churn.
              • Data was then broken into Merge Tables based on time so index updates weren't a problem.
              • Load and operational problems.
          – (3) Hadoop based solution that works well and has virtually unlimited scalability potential.
              • Hadoop
              • Lucene and Solr.
      • The familiar problem we now face: lots and lots of data streaming in.
          – Where do you store all that data?
          – How do you do anything useful with it?
          – How do you retrieve the data you want from the sea of data?
      • Examine mail logs in order to troubleshoot problems for our customers.
      • The query/search should be fast and accurate.
  • 5. Now the new system
      • The advantage of their new system is that they can now look at their data in any way they want:
          – Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.
          – When they wanted to find out which part of the world their customers logged in from, a quick MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.
      • "Now whenever we think of a complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff."
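The nightly statistics jobs on this slide follow the usual map/reduce shape. As a minimal single-process sketch of "spam counts by domain" (not Rackspace's actual job): the three-field log format, the `spam` verdict token and all function names below are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical log format (an assumption, not the real mail log layout):
#   "<timestamp> <recipient-address> <verdict>", one event per line.


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (domain, 1) for every event flagged as spam."""
    parts = line.split()
    if len(parts) != 3:
        return  # skip malformed lines
    _timestamp, recipient, verdict = parts
    if verdict == "spam" and "@" in recipient:
        yield recipient.rsplit("@", 1)[1].lower(), 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the emitted values per key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def spam_counts_by_domain(lines: Iterable[str]) -> Dict[str, int]:
    """Run map then reduce in one process; Hadoop would distribute both steps."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)
```

On a real cluster the map and reduce functions would run on many nodes with a shuffle between them; the per-record logic stays the same.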
  • 6. The Platform
      • Hadoop MapReduce
      • Hadoop Distributed File System (HDFS)
      • Lucene
      • Solr
      • Tomcat
  • 7. The Architecture
      • Raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System ("HDFS") in real time.
      • MapReduce jobs are scheduled to run and index the new data using Apache Lucene and Solr.
      • Once the indexes have been built, they are compressed and stored away in HDFS.
      • Each Hadoop datanode runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.
  • 8. The System Evolution: Logging v1.0
      • Logs were stored in flat text files on the local disk of each mail server and were kept for 14 days.
      • Our support techs did not have login access to the servers, so in order to search the logs they would have to escalate a ticket to our engineers. The engineers would then have to ssh into each mail server and grep /var/log/maillog.
      • Problems: Once we grew much past a dozen servers, this manual process of logging into each server became too time-consuming for our engineers.
  • 9. Logging v1.1
      • Sped up the search process by writing a script that would search multiple servers via one command run from a centralized server.
      • The search itself was still a remote grep.
      • Problems: The support techs still had to escalate a ticket to the engineers in order to perform a search. As the number of customers and servers increased, this began to take too much of our engineers' scarce time. Also, storing and searching the logs on a live server was negatively affecting the performance of the servers. To make matters worse, the engineering team had grown and we started running into the problem where two engineers would perform a search at the same time, which really slowed things down.
  • 10. Logging v2.0
      • A web-based tool where support techs could search the logs.
      • It allowed searching by the sender or recipient's email address, domain name or IP address.
      • All of these were indexed fields in a MySQL database.
      • The centralized log server
      • Each day's logs were stored in a separate table, so that we could clean up old data by simply dropping and recreating MySQL tables.
      • Log data was only kept for 3 days in order to keep the MySQL database down to a reasonable size.
      • Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the data set was very large and these queries would be horribly slow.
      • Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As the tables grew, indexing each entry as it was inserted became slow. Within the first hours of testing, the inserts began slowing and could not keep up with the rate at which data was received. Version 2.0 of the logging system was never used in production.
  • 11. Logging v2.1
      • Fixed the MySQL INSERT bottleneck by queuing up the log entries in local text files on the centralized log server and periodically bulk loading them into the database.
      • As syslog-ng received logs on its 6 ports, the data would be streamed to 6 separate text files.
      • Every 10 minutes a script would rotate those text files and execute a MySQL LOAD to load the data into the database. This was orders of magnitude faster than inserting the log data one record at a time.
      • Problems: The LOADs would get progressively slower as the database grew because MySQL indexing performance decreases as the table you are inserting into gets larger. This version was fast enough to be released into production, but we knew the system would not scale too far without additional work.
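The rotate-then-bulk-load step of v2.1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the original script: the `.log` suffix, the spool directory layout, the table name and the function name are all hypothetical, and the statement is only built as a string (MySQL's `LOAD DATA INFILE ... INTO TABLE ...`) rather than executed. A real deployment would also need syslog-ng to reopen its output files after the rename.

```python
import os
import time
from typing import List, Optional


def rotate_and_build_loads(spool_dir: str, table: str,
                           now: Optional[float] = None) -> List[str]:
    """Rename each non-empty active spool file and return one bulk-load
    statement per rotated file.

    Assumption: active files end in ".log"; rotated files get a UTC
    timestamp suffix so they are never picked up twice.
    """
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(now))
    statements = []
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".log"):
            continue  # already rotated, or not a spool file
        src = os.path.join(spool_dir, name)
        if os.path.getsize(src) == 0:
            continue  # nothing received on this port in the interval
        dst = os.path.join(spool_dir, f"{name}.{stamp}")
        os.rename(src, dst)
        statements.append(f"LOAD DATA INFILE '{dst}' INTO TABLE {table}")
    return statements
```

Run from a scheduler every 10 minutes, this would produce up to six statements per pass, one for each syslog-ng output file.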
  • 12. Logging v2.2
      • Introduced Merge Tables in order to speed up loading the log data into the database.
      • Every 10 minutes our script would create a new database table and then load the text logs into the empty table.
      • After the data was loaded, the script would modify a set of Merge Tables that combined all of the 10-minute tables together.
      • The web search tool was modified to allow searching within the different time ranges. Corresponding Merge Tables existed for each of those time ranges, and were modified every 10 minutes as new tables were created.
      • Problems: The database LOAD operations would take 2-3 minutes to run. The server was now always under a heavy CPU and disk I/O load.
      • Searches were being performed more frequently and were becoming slow. We started to see some strange problems such as random errors while trying to create new tables or modify the Merge Tables. These errors progressively became more frequent, resulting in missing log data. The support team began to lose confidence in the system's accuracy.
      • The logging system had no redundancy.
      • We needed a new solution that would be fast, reliable and could scale indefinitely with our growth. We needed something truly scalable.
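To make the v2.2 bookkeeping concrete, here is a small sketch of how a script might name the 10-minute tables and repoint a Merge Table at the most recent ones. The table naming scheme, the `maillog_last_30m` name and the helper functions are hypothetical; the only MySQL feature relied on is the MERGE engine's `UNION=(...)` table option, and the statement is built as text, not executed.

```python
from datetime import datetime, timedelta
from typing import List

SLOT = timedelta(minutes=10)  # one table per 10-minute load


def slot_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 10-minute slot."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def table_name(start: datetime) -> str:
    """Hypothetical naming scheme for a 10-minute table."""
    return start.strftime("maillog_%Y%m%d_%H%M")


def recent_tables(now: datetime, count: int) -> List[str]:
    """Names of the last `count` 10-minute tables, oldest first."""
    newest = slot_start(now)
    return [table_name(newest - SLOT * i) for i in reversed(range(count))]


def merge_union_sql(merge_table: str, now: datetime, count: int) -> str:
    """ALTER statement that repoints a Merge Table at the recent tables."""
    union = ",".join(recent_tables(now, count))
    return f"ALTER TABLE {merge_table} UNION=({union})"
```

A "last 30 minutes" search range would then be refreshed with `merge_union_sql("maillog_last_30m", datetime.utcnow(), 3)` after each load; longer ranges only differ in `count`, which is where the growing per-cycle work described on this slide comes from.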
  • 13. Logging v3+
      • Avoid limiting our abilities to build new features down the road.
      • For example, we wanted to build a tool that would allow our customers to search their logs directly.
      • It scales out its workload horizontally by adding servers and distributing the data and MapReduce jobs amongst the servers.
      • In about 3 months we built a fresh new log processing system using Hadoop, Lucene and Solr.
      • Put the log search tool in the hands of our customers.
  • 14. Stu Hood’s Detailed Comments
      • The loading of data is streaming, but the indexing is not. We write to a file in Hadoop until it reaches a size below the block size, or until it times out, and then we close it and move it to where it will be processed.
      • Our processing jobs run every 10 minutes or so, meaning that the logs become available for Customer Care after about 15 minutes. We’ve executed around 150K jobs on this cluster with 3 restarts.
      • We create the indexes on local disk in our reducer, and compress them into HDFS after they are complete.
      • When we pull the index to make it available for search, we decompress it to local disk and merge it using the Lucene IndexWriter.addIndexes method before calling /commit on the Solr instance. The Nutch project created an IndexReader that can do read-only access on HDFS, but for speed reasons, we decided not to take that approach.
      • Since we are indexing to local disk, we use an embedded SolrCore, in the same JVM as the reducer.
      • We have 10 Hadoop data nodes with 3.5 TB of disk each, 35 TB in total.
      • We are currently indexing an average of 140 GB per day.
      • The merged indexes are not replicated at all… only one Solr node has a copy of each index, so failover involves a brief downtime for queries. If we lose a node, other nodes (consistent hashing) become responsible and merge the indexes from the copies we always have in Hadoop.
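The failover behaviour in the last comment relies on consistent hashing: when a node disappears, only the indexes it owned move to other nodes. The comments do not say how the ring is implemented, so the following is a generic sketch; the `HashRing` class, the node and index names, the use of SHA-256 and the virtual-node count are all assumptions.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: Iterable[str], replicas: int = 64) -> None:
        self._replicas = replicas
        self._ring: List[Tuple[int, str]] = []  # sorted (point, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node: str) -> None:
        """Place `replicas` points for the node on the ring."""
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        """Drop every point that belongs to the node."""
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def owner(self, key: str) -> str:
        """Node whose point is the first at or after the key's hash."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

With ten hypothetical Solr nodes, removing one reassigns only the index shards that node owned; every other shard keeps its owner, so the surviving nodes only need to pull and merge the orphaned indexes from HDFS.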
  • 15. Future
      • Creating reports or doing ad-hoc queries.
      • More MapReduce jobs to answer new questions as they come up.
  • 16. References
      • How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
      • MapReduce at Rackspace