Flexible Indexing in Hadoop
         Dmitriy Ryaboy @squarecog
        Analytics Infrastructure @ Twitter
    Hadoop Summit, San Jose, CA June 2012
@JoinTheFlock | Hadoop Summit, June 14 2012   2
Hadoop is great at plowing
through data


       Image source: http://en.wikipedia.org/wiki/File:Snowplow_in_the_morning.jpg
And we do plow
Tens of thousands of jobs per day

100 TB (uncompressed) ingested daily

Many users and diverse use cases




Looking for needles in
haystacks.




With snowplows.

        Image Source: http://en.wikipedia.org/wiki/File:July_1903_-_on_the_Gaisberg,_nr_Salzburg.JPG
A Pig Script
 event_logs = load '/logs/lots_of_data'
                     using ThriftPigLoader('thrift.gen.LogEvent');
 filtered_logs = filter event_logs by event == 'something_rare';


 -- Then do stuff.




90% of the mappers in this job output no data.
We can do better...


Find smaller haystacks.




Use subpartitions!
• tablename/year/month/day/hour/bucket
• Only so many things you can partition by
• Up-front planning required
• Rewrite or duplicate for different query patterns
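
As a rough illustration of the tablename/year/month/day/hour/bucket layout, the sketch below derives a partition path from a timestamp and a key. The class name, the bucket count of 16, and the use of String.hashCode for bucketing are assumptions made for this example only.

```java
import java.time.LocalDateTime;
import java.util.Locale;

// Hypothetical helper: the class name, bucket count, and hashing scheme are
// assumptions for illustration, not part of any library mentioned in the talk.
final class SubpartitionPath {
    static final int NUM_BUCKETS = 16; // assumed bucket count

    // Maps a key to a bucket; floorMod keeps the result non-negative.
    static int bucketFor(String key) {
        return Math.floorMod(key.hashCode(), NUM_BUCKETS);
    }

    // Builds tablename/year/month/day/hour/bucket for one record.
    static String pathFor(String table, LocalDateTime ts, String key) {
        return String.format(Locale.ROOT, "%s/%04d/%02d/%02d/%02d/%d",
                table, ts.getYear(), ts.getMonthValue(),
                ts.getDayOfMonth(), ts.getHour(), bucketFor(key));
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.of(2012, 6, 14, 9, 0);
        System.out.println(pathFor("logs", ts, "something_rare"));
    }
}
```

A query on the bucketing key only has to read one bucket per hour, but a query on any other field still reads everything, which is the limitation the bullets describe.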




Keep the data sorted!
• Painful to maintain
• Only one sort order at a time
• Rewrite or duplicate for different query patterns




Trojan Layouts*
• Identify interesting column groupings
• Use different column groupings per HDFS block replica
• Requires changes to the NameNode (NN)
• ... and increases load on the NN




                             * http://infosys.uni-saarland.de/publications/JQD11.pdf
HBase!
• Good solution in many cases!
• Maintenance overhead
• All data must live in HBase
• Full table scans slower than MR
• Again with the up-front design
  • Secondary Indexes can help




Hive!
• That kind of works, actually.




Hive
Generic Interface for defining indexing behavior.


Reference implementation: “compact” index
 value -> list of HDFS blocks; drop unneeded blocks.


Other indexes available (bitmap in 0.8)


It’ll even update indexes as you add partitions.




WIN!
Done, Right?




Hive
Good news if your data is in Hive!


Bad news if your world is a little bigger.


Indexing is tightly coupled to Hive.


No interoperability with the rest of the Hadoop stack.




Democracy of Tools
• Pig
• Raw Map-Reduce
• Cascading DSLs (Scalding, Cascalog, Py-Cascading)
• Mahout
• Maybe even Hive



      Image Source: http://en.wikipedia.org/wiki/File:20070124_sejm_sala_plenarna.jpg
Design Goals

• Minimal Job/Script modification required
• As low in the stack as possible
 • In fact, pretty sure we could get Hive to use this...
• No unnecessary copies of data
• Allow post-factum indexing
• Graceful degradation
• Flexible on-disk representation


Elephant-Twin
Twitter’s library for creating indexes in Hadoop
https://github.com/twitter/elephant-twin
https://github.com/twitter/elephant-twin-lzo




Block-Level Indexes
For each value, record the block it occurs in


“Block” can be HDFS block (100s of MBs)
Or LZO block (100s of KBs)
Or SequenceFile block
Or RCFile block ...


Ignore irrelevant blocks
Scan relevant blocks using original InputFormat
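
A minimal in-memory sketch of the block-level idea: for each value, remember which blocks contain it, then scan only those blocks. The class and method names are hypothetical, and this is not the Elephant-Twin on-disk format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: names are assumptions, and a real index would live on
// disk (for example in a MapFile) rather than in memory.
final class BlockIndexSketch {
    // value -> sorted ids of the blocks that contain it
    private final Map<String, SortedSet<Integer>> index = new TreeMap<>();

    // Indexing pass: record that `value` occurs somewhere in block `blockId`.
    void add(String value, int blockId) {
        index.computeIfAbsent(value, v -> new TreeSet<>()).add(blockId);
    }

    // Retrieval: only these blocks need to be scanned for `value`.
    List<Integer> relevantBlocks(String value) {
        SortedSet<Integer> blocks = index.get(value);
        return blocks == null ? new ArrayList<>() : new ArrayList<>(blocks);
    }

    public static void main(String[] args) {
        // Each inner array stands for the values seen in one block.
        String[][] blocks = {
            {"click", "view"}, {"view"}, {"view", "something_rare"}, {"click"}
        };
        BlockIndexSketch idx = new BlockIndexSketch();
        for (int b = 0; b < blocks.length; b++) {
            for (String v : blocks[b]) {
                idx.add(v, b);
            }
        }
        System.out.println(idx.relevantBlocks("something_rare"));
    }
}
```

The selected blocks are still read with the original InputFormat, so the index only decides what to skip.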




Record-Level Indexes
For each value, record some representation of the record


Can be value + offset, as in bitmap indexes
Can be transformed projection of records, as in Lucene indexes


Some queries can be answered directly from index.
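
To make the record-level case concrete, here is a small bitmap-index sketch built on java.util.BitSet. A record id stands in for the offset, and the class and method names are assumptions for this example. It shows how a count can be answered from the index without touching the data.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative bitmap index: one bitmap per distinct value, one bit per record.
// Names are hypothetical; record ids stand in for record offsets.
final class BitmapIndexSketch {
    private final Map<String, BitSet> bitmaps = new HashMap<>();

    // Indexing pass: mark record `recordId` as containing `value`.
    void add(int recordId, String value) {
        bitmaps.computeIfAbsent(value, v -> new BitSet()).set(recordId);
    }

    // Number of records with this value, answered from the index alone.
    int count(String value) {
        BitSet bits = bitmaps.get(value);
        return bits == null ? 0 : bits.cardinality();
    }

    // Records matching any of the given values (bitwise OR of their bitmaps).
    BitSet matchingAny(String... values) {
        BitSet result = new BitSet();
        for (String v : values) {
            BitSet bits = bitmaps.get(v);
            if (bits != null) {
                result.or(bits);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] events = {"view", "click", "view", "something_rare", "view"};
        BitmapIndexSketch idx = new BitmapIndexSketch();
        for (int i = 0; i < events.length; i++) {
            idx.add(i, events[i]);
        }
        System.out.println(idx.count("view"));
        System.out.println(idx.matchingAny("click", "something_rare"));
    }
}
```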




Indexing: Data -> InputFormat -> MR job -> Index
Creating an Index
public abstract class AbstractBlockIndexingJob {
    protected abstract List<String> getInput();
    protected abstract String getIndex();
    protected abstract String getInputFormat();
    protected abstract String getValueClass();
    protected abstract String getColumnName();
    protected abstract Job setMapper(Job job);
}

public abstract class AbstractLuceneIndexingJob {
  // Similar.
}




Creating an Index
Mapper transforms the records: emit <DocId, Value>
  Key             Value
  Block Offset    Column Value
  Tweet Id        Text


Block helper:
public abstract class BlockIndexingMapper<KIN, VIN>
    extends Mapper<KIN, VIN, TextLongPairWritable, LongPairWritable> {}


Lucene helper:
public abstract class AbstractIndexingMapper<KIN, VIN, KOUT, VOUT>
    extends Mapper<KIN, VIN, KOUT, VOUT> {
  protected abstract boolean filter(KIN k, VIN v);
  protected abstract KOUT buildOutputKey(KIN k, VIN v);
}

Creating an Index
Reducer writes appropriately processed indexes and metadata.


MapFile block index:
public class MapFileIndexingReducer
    extends Reducer<TextLongPairWritable, LongPairWritable,
                    Text, ListLongPair>

Lucene index:
public abstract class AbstractLuceneIndexingReducer<KIN, VIN>
    extends Reducer<KIN, VIN, NullWritable, NullWritable> {
  protected abstract Document buildDocument(KIN k, VIN v);
}




Creating an Index: Metadata
struct FileIndexDescriptor {
    1: DocType docType
    2: IndexType indexType
    3: i32 indexVersion
    4: string sourcePath
    5: FileChecksum checksum
    6: list<IndexedField> indexedFields
}
struct ETwinIndexDescriptor {
    1: list<FileIndexDescriptor> fileIndexDescriptors
    2: i32 indexPart
    3: optional map<string, string> options
}
Retrieval: MR job (with searchKey) -> IndexedInputFormat -> Index, Data
InputFormat
public class BlockIndexedFileInputFormat<K, V>
    extends FileInputFormat<K, V> {

  // Indexing jobs call this function to set up indexing job related parameters.
  public static void setIndexOptions(Job job,
      String inputformatClass, String valueClass,
      String indexDir, String columnName)

  // Searching jobs call this function to set up searching job related parameters.
  public static void setSearchOptions(Job job,
      String inputformatClass, String valueClass,
      String indexDir, BinaryExpression filter)
}




BinaryExpression
public BinaryExpression(
    Expression lhs, Expression rhs, OpType opType)

public static enum OpType {
    OP_PLUS (" + "),
    OP_MINUS(" - "),
    ...
    OP_EQ(" == "),
    OP_NE(" != "),
    ...
    OP_AND(" and "),
    OP_OR(" or "),
    ...
    TERM_COL(" Column "),
    TERM_CONST(" Constant ");
}
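
One way to picture how such a filter expression could be applied to a block index: an equality leaf looks up the blocks for a constant, "or" takes the union, and "and" takes the intersection. The types below (BlockFilter, EqFilter, AndFilter, OrFilter) are hypothetical stand-ins written for this sketch; they are not the Elephant-Twin BinaryExpression API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical mini expression tree evaluated against a value -> blocks index.
interface BlockFilter {
    // Returns a fresh, mutable set of candidate block ids.
    Set<Integer> blocks(Map<String, Set<Integer>> index);
}

// column == constant: look the constant up in the index.
final class EqFilter implements BlockFilter {
    private final String constant;
    EqFilter(String constant) { this.constant = constant; }
    public Set<Integer> blocks(Map<String, Set<Integer>> index) {
        return new TreeSet<>(index.getOrDefault(constant, Collections.emptySet()));
    }
}

// lhs and rhs: a block must be a candidate for both sides.
final class AndFilter implements BlockFilter {
    private final BlockFilter lhs;
    private final BlockFilter rhs;
    AndFilter(BlockFilter lhs, BlockFilter rhs) { this.lhs = lhs; this.rhs = rhs; }
    public Set<Integer> blocks(Map<String, Set<Integer>> index) {
        Set<Integer> result = lhs.blocks(index);
        result.retainAll(rhs.blocks(index));
        return result;
    }
}

// lhs or rhs: a block is a candidate if either side matches.
final class OrFilter implements BlockFilter {
    private final BlockFilter lhs;
    private final BlockFilter rhs;
    OrFilter(BlockFilter lhs, BlockFilter rhs) { this.lhs = lhs; this.rhs = rhs; }
    public Set<Integer> blocks(Map<String, Set<Integer>> index) {
        Set<Integer> result = lhs.blocks(index);
        result.addAll(rhs.blocks(index));
        return result;
    }
}

final class FilterDemo {
    public static void main(String[] args) {
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("login", new TreeSet<>(Arrays.asList(0, 1, 2)));
        index.put("something_rare", new TreeSet<>(Arrays.asList(2, 7)));
        BlockFilter f = new OrFilter(new EqFilter("login"), new EqFilter("something_rare"));
        System.out.println(f.blocks(index));
    }
}
```

The result is a set of candidate blocks, not an exact answer: the original filter still has to run over the records in those blocks.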



Pig Integration
    event_logs = load '/logs/lots_of_data'
        using ThriftPigLoader('thrift.gen.LogEvent');

    filtered_logs = filter event_logs by event == 'something_rare';
    -- Then do stuff.




Pig Integration
    register elephant-twin-1.0.jar;
    event_logs = load '/logs/lots_of_data'
        using IndexedLZOPigLoader(
            'ThriftPigLoader',
            'thrift.gen.LogEvent',
            '/user/dmitriy/etwin');

    -- Pig will automatically push this down into the Loader and InputFormat
    filtered_logs = filter event_logs by event == 'something_rare';




Optimization: merge neighbors
[Diagram: HDFS Block 1 | HDFS Block 2]




Merge neighbors, share the scan.
(Limit expansion to size of HDFS block)
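
The merging step can be sketched as follows, assuming each matching block is described by a {start, end} byte range and the list is sorted by start. The class name and the span representation are assumptions for this example. Only directly adjacent spans are merged, and a merged span is never allowed to grow past the given limit.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: merges adjacent matching spans so one scan covers them,
// capping each merged span at maxSpanBytes (e.g. the HDFS block size).
final class NeighborMerger {
    // Each span is {start, end} in bytes, end exclusive; input sorted by start.
    static List<long[]> merge(List<long[]> spans, long maxSpanBytes) {
        List<long[]> out = new ArrayList<>();
        for (long[] s : spans) {
            if (!out.isEmpty()) {
                long[] last = out.get(out.size() - 1);
                boolean adjacent = last[1] == s[0];
                boolean fits = s[1] - last[0] <= maxSpanBytes;
                if (adjacent && fits) {
                    last[1] = s[1]; // extend the previous span
                    continue;
                }
            }
            out.add(new long[] {s[0], s[1]}); // copy so the input is untouched
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> spans = new ArrayList<>();
        spans.add(new long[] {0, 10});
        spans.add(new long[] {10, 20});
        spans.add(new long[] {50, 60});
        for (long[] s : merge(spans, 100)) {
            System.out.println(s[0] + "-" + s[1]);
        }
    }
}
```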


Optimization: merge neighbors
[Diagram: HDFS Block 1 | HDFS Block 2]

Scans are faster than random reads. Allow gaps?
Turns out, not that much faster. Better to jump.


Optimization: combine small splits
[Diagram: three matches spread across HDFS Block 1 and HDFS Block 2, combined into one generated split]


Combine small relevant spans into single splits.
Try to take locality into account.
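
A simple greedy version of split combining might look like the sketch below: walk the relevant spans in order and pack them into one split until a target size would be exceeded. The class name and target-size parameter are assumptions, and the sketch deliberately ignores locality, which a real implementation would also weigh.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: packs small {start, end} spans into splits of at most
// targetSplitBytes each. Locality is not considered in this sketch.
final class SplitCombiner {
    static List<List<long[]>> combine(List<long[]> spans, long targetSplitBytes) {
        List<List<long[]>> splits = new ArrayList<>();
        List<long[]> current = new ArrayList<>();
        long currentBytes = 0;
        for (long[] s : spans) {
            long len = s[1] - s[0];
            // Close the current split if adding this span would overflow it.
            if (!current.isEmpty() && currentBytes + len > targetSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(s);
            currentBytes += len;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<long[]> spans = new ArrayList<>();
        spans.add(new long[] {0, 10});
        spans.add(new long[] {40, 50});
        spans.add(new long[] {90, 100});
        System.out.println(combine(spans, 25).size() + " splits");
    }
}
```

A span larger than the target still gets a split of its own, so no relevant data is dropped.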



Applicability
Most keys occur in very few blocks!
Most frequent key only occurs in half the blocks.




Results
Applicable Jobs take 5-10x fewer resources


Ad-hoc jobs particularly likely to benefit


“Real” indexes are still faster, but can be represented using the same abstraction.




Future Work


  • Regex matching on keys
  • Better Pig pushdown support
  • MultiIndexInputFormat
  • Traditional indexes under ETwin
  • Index maintenance (via HCatalog?)




    Image Source:http://en.wikipedia.org/wiki/File:Shasta_dam_under_construction_new_edit.jpg
Questions?
@squarecog


Sounds like fun? We are hiring.



                                  @JoinTheFlock | Hadoop Summit, June 14 2012   39

Mais conteúdo relacionado

Mais procurados

Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handyPraveen Sripati
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonCaserta
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Edureka!
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduceRyan Tabora
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsKognitio
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Cloudera, Inc.
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarCloudera, Inc.
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopCloudera, Inc.
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010Jonathan Seidman
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopEdureka!
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IEdureka!
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
 

Mais procurados (20)

Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
SQL in Hadoop
SQL in HadoopSQL in Hadoop
SQL in Hadoop
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -I
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 

Destaque

Les grands enjeux de la banque de demain
Les grands enjeux de la banque de demainLes grands enjeux de la banque de demain
Les grands enjeux de la banque de demainEmmanuel Fraysse
 
Référentiel Client Unique
Référentiel Client Unique Référentiel Client Unique
Référentiel Client Unique Soft Computing
 
7 astuces pour attirer l'attention d'un influenceur sur Linkedin et sur Twitter
7 astuces pour attirer l'attention d'un influenceur sur Linkedin et sur Twitter7 astuces pour attirer l'attention d'un influenceur sur Linkedin et sur Twitter
7 astuces pour attirer l'attention d'un influenceur sur Linkedin et sur TwitterSocial Media For You
 
Mc5.marketing multicanal
Mc5.marketing multicanalMc5.marketing multicanal
Mc5.marketing multicanallenaignf
 
Hadoop Hbase - Introduction
Hadoop Hbase - IntroductionHadoop Hbase - Introduction
Hadoop Hbase - IntroductionBlandine Larbret
 
MapReduce: Traitement de données distribué à grande échelle simplifié
MapReduce: Traitement de données distribué à grande échelle simplifiéMapReduce: Traitement de données distribué à grande échelle simplifié
MapReduce: Traitement de données distribué à grande échelle simplifiéMathieu Dumoulin
 
Junior Connect : la conquête de l'engagement
Junior Connect : la conquête de l'engagementJunior Connect : la conquête de l'engagement
Junior Connect : la conquête de l'engagementIpsos France
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Introduction aux algorithmes map reduce
Introduction aux algorithmes map reduceIntroduction aux algorithmes map reduce
Introduction aux algorithmes map reduceMathieu Dumoulin
 
Les community managers en France 2012
Les community managers en France 2012 Les community managers en France 2012
Les community managers en France 2012 HelloWork
 
Carnet de témoignages #2 : les community managers dans les entreprises franca...
Carnet de témoignages #2 : les community managers dans les entreprises franca...Carnet de témoignages #2 : les community managers dans les entreprises franca...
Carnet de témoignages #2 : les community managers dans les entreprises franca...HelloWork
 
infographie : les Français et Facebook
infographie : les Français et Facebookinfographie : les Français et Facebook
infographie : les Français et FacebookRaphaël Sougakoff
 

Destaque (14)

Les grands enjeux de la banque de demain
Les grands enjeux de la banque de demainLes grands enjeux de la banque de demain
Les grands enjeux de la banque de demain
 
Référentiel Client Unique





Hadoop Indexing Flexibility

  • 1. Flexible Indexing in Hadoop. Dmitriy Ryaboy (@squarecog), Analytics Infrastructure @ Twitter. Hadoop Summit, San Jose, CA, June 2012.
  • 2. @JoinTheFlock | Hadoop Summit, June 14 2012 2
  • 4. Hadoop is great at plowing through data. Image source: http://en.wikipedia.org/wiki/File:Snowplow_in_the_morning.jpg
  • 5. And we do plow
      • 10s of thousands of jobs per day
      • 100 TB (uncompressed) ingested daily
      • Many users and diverse use cases
  • 7. Looking for needles in haystacks. With snowplows. Image source: http://en.wikipedia.org/wiki/File:July_1903_-_on_the_Gaisberg,_nr_Salzburg.JPG
  • 8. A Pig Script
        event_logs = load '/logs/lots_of_data'
            using ThriftPigLoader('thrift.gen.LogEvent');
        filtered_logs = filter event_logs by event == 'something_rare';
        -- Then do stuff.
      90% of the mappers in this job output no data. We can do better...
  • 9. Find smaller haystacks. Image source: http://en.wikipedia.org/wiki/File:July_1903_-_on_the_Gaisberg,_nr_Salzburg.JPG
  • 14. Use subpartitions!
      • tablename/year/month/day/hour/bucket
      • Only so many things you can partition by
      • Up-front planning required
      • Rewrite or duplicate for different query patterns
  • 18. Keep the data sorted!
      • Painful to maintain
      • Only one sort order at a time
      • Rewrite or duplicate for different query patterns
  • 23. Trojan Layouts*
      • Identify interesting column groupings
      • Use different column groupings per HDFS block replica
      • Requires changes to the NameNode (NN)
      • ... and increases load on the NN
      * http://infosys.uni-saarland.de/publications/JQD11.pdf
  • 30. HBase!
      • Good solution in many cases!
      • Maintenance overhead
      • All data must live in HBase
      • Full table scans slower than MR
      • Again with the up-front design
      • Secondary indexes can help
  • 32. Hive!
      • That kind of works, actually.
  • 33. Hive
      • Generic interface for defining indexing behavior.
      • Reference implementation: “compact” index, value -> list of HDFS blocks; drop unneeded blocks.
      • Other indexes available (bitmap in 0.8).
      • It’ll even update indexes as you add partitions.
  • 34. WIN! Done, right?
  • 35. Hive
      • Good news if your data is in Hive!
      • Bad news if your world is a little bigger.
      • Indexing is tightly coupled to Hive.
      • No interoperability with the rest of the Hadoop stack.
  • 41. Democracy of Tools
      • Pig
      • Raw Map-Reduce
      • Cascading DSLs (Scalding, Cascalog, Py-Cascading)
      • Mahout
      • Maybe even Hive
      Image source: http://en.wikipedia.org/wiki/File:20070124_sejm_sala_plenarna.jpg
  • 50. Design Goals
      • Minimal job/script modification required
      • As low in the stack as possible
      • In fact, pretty sure we could get Hive to use this...
      • No unnecessary copies of data
      • Allow post-factum indexing
      • Graceful degradation
      • Flexible on-disk representation
  • 51. Elephant-Twin: Twitter’s library for creating indexes in Hadoop
      • https://github.com/twitter/elephant-twin
      • https://github.com/twitter/elephant-twin-lzo
  • 52. Block-Level Indexes
      • For each value, record the block it occurs in
      • “Block” can be an HDFS block (100s of MBs), an LZO block (100s of KBs), a SequenceFile block, an RCFile block, ...
      • Ignore irrelevant blocks
      • Scan relevant blocks using the original InputFormat
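
The block-level idea on slide 52 can be sketched in a few lines. This is a toy in-memory illustration, not Elephant-Twin's implementation: the class name `BlockIndexSketch`, its methods, and the integer block ids are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical toy index: column value -> ids of the blocks that contain it. */
final class BlockIndexSketch {
    private final Map<String, TreeSet<Integer>> blocksByValue = new HashMap<>();

    /** Index construction: note that 'value' occurs somewhere in block 'blockId'. */
    void add(String value, int blockId) {
        blocksByValue.computeIfAbsent(value, v -> new TreeSet<>()).add(blockId);
    }

    /** Retrieval: the only blocks a job filtering on 'value' has to scan, in order. */
    List<Integer> blocksFor(String value) {
        TreeSet<Integer> blocks = blocksByValue.get(value);
        if (blocks == null) {
            return Collections.emptyList(); // value never seen: every block is irrelevant
        }
        return new ArrayList<>(blocks);
    }

    public static void main(String[] args) {
        // Four made-up blocks, each listing the event values that occur in it.
        String[][] blocks = {
            {"click", "view"},
            {"view"},
            {"view", "something_rare"},
            {"click"}
        };
        BlockIndexSketch index = new BlockIndexSketch();
        for (int blockId = 0; blockId < blocks.length; blockId++) {
            for (String event : blocks[blockId]) {
                index.add(event, blockId);
            }
        }
        System.out.println(index.blocksFor("something_rare")); // prints [2]
        System.out.println(index.blocksFor("view"));           // prints [0, 1, 2]
    }
}
```

The relevant blocks are still read with the original InputFormat and the filter is still applied to each record; the index only decides which blocks can be skipped.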
  • 53. Record-Level Indexes
      • For each value, record some representation of the record
      • Can be value + offset, as in bitmap indexes
      • Can be a transformed projection of records, as in Lucene indexes
      • Some queries can be answered directly from the index.
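
A minimal sketch of the "value + offset" flavor of record-level index from slide 53, showing why some queries never need to touch the data. The class `RecordIndexSketch` and its methods are hypothetical names for illustration; a real index would live on disk rather than in a map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy index: column value -> byte offsets of the records holding it. */
final class RecordIndexSketch {
    private final Map<String, List<Long>> offsetsByValue = new HashMap<>();

    /** Index construction: the record starting at 'recordOffset' has this value. */
    void add(String value, long recordOffset) {
        offsetsByValue.computeIfAbsent(value, v -> new ArrayList<>()).add(recordOffset);
    }

    /** A count query is answered from the index alone; no data is read. */
    int count(String value) {
        List<Long> offsets = offsetsByValue.get(value);
        return offsets == null ? 0 : offsets.size();
    }

    /** When the records themselves are needed, these are the offsets to seek to. */
    List<Long> offsetsFor(String value) {
        List<Long> offsets = offsetsByValue.get(value);
        if (offsets == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(offsets);
    }

    public static void main(String[] args) {
        RecordIndexSketch index = new RecordIndexSketch();
        index.add("something_rare", 0L);
        index.add("view", 64L);
        index.add("something_rare", 128L);
        System.out.println(index.count("something_rare"));      // prints 2
        System.out.println(index.offsetsFor("something_rare")); // prints [0, 128]
    }
}
```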
  • 54. Indexing [diagram; labels: MR job, Index, InputFormat, Data]
  • 55. Creating an Index
        public abstract class AbstractBlockIndexingJob {
          protected abstract List<String> getInput();
          protected abstract String getIndex();
          protected abstract String getInputFormat();
          protected abstract String getValueClass();
          protected abstract String getColumnName();
          protected abstract Job setMapper(Job job);
        }

        public abstract class AbstractLuceneIndexingJob {
          // Similar.
        }
  • 56. Creating an Index
      Mapper transforms the records: emit <DocId, Value>

        Key           | Value
        Block Offset  | Column Value
        Tweet Id      | Text

      Block helper:
        public abstract class BlockIndexingMapper<KIN, VIN>
            extends Mapper<KIN, VIN, TextLongPairWritable, LongPairWritable> {}

      Lucene helper:
        public abstract class AbstractIndexingMapper<KIN, VIN, KOUT, VOUT>
            extends Mapper<KIN, VIN, KOUT, VOUT> {
          abstract protected boolean filter(KIN k, VIN v);
          abstract protected KOUT buildOutputKey(KIN k, VIN v);
        }
  • 57. Creating an Index
      Reducer writes appropriately processed indexes and metadata.

      MapFile block index:
        public class MapFileIndexingReducer
            extends Reducer<TextLongPairWritable, LongPairWritable, Text, ListLongPair>

      Lucene index:
        public abstract class AbstractLuceneIndexingReducer<KIN, VIN>
            extends Reducer<KIN, VIN, NullWritable, NullWritable> {
          protected abstract Document buildDocument(KIN k, VIN v);
        }
  • 58. Creating an Index: Metadata
        struct FileIndexDescriptor {
          1: DocType docType
          2: IndexType indexType
          3: i32 indexVersion
          4: string sourcePath
          5: FileChecksum checksum
          6: list<IndexedField> indexedFields
        }

        struct ETwinIndexDescriptor {
          1: list<FileIndexDescriptor> fileIndexDescriptors
          2: i32 indexPart
          3: optional map<string, string> options
        }
  • 59. Retrieval [diagram; labels: MR job, searchKey, IndexedInputFormat, Index, Data]
  • 60. InputFormat
        public class BlockIndexedFileInputFormat<K, V> extends FileInputFormat<K, V> {
          // Indexing jobs call this function to set up indexing job related parameters.
          public static void setIndexOptions(Job job, String inputformatClass,
              String valueClass, String indexDir, String columnName)

          // Searching jobs call this function to set up searching job related parameters.
          public static void setSearchOptions(Job job, String inputformatClass,
              String valueClass, String indexDir, BinaryExpression filter)
        }
  • 61. BinaryExpression
        public BinaryExpression(Expression lhs, Expression rhs, OpType opType)

        public static enum OpType {
          OP_PLUS (" + "), OP_MINUS(" - "), ...
          OP_EQ(" == "), OP_NE(" != "), ...
          OP_AND(" and "), OP_OR(" or "), ...
          TERM_COL(" Column "), TERM_CONST(" Constant ");
        }
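
To illustrate how a filter expression tree like the one on slide 61 could be turned into a set of blocks to scan, here is a self-contained sketch. It deliberately does not use the `BinaryExpression` or `OpType` classes from the slide; `FilterToBlocksSketch`, `Expr`, `eq`, `and`, and `or` are hypothetical stand-ins, and the index is assumed to be a plain map from a column value to block ids.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: evaluate a filter tree against a value -> block-ids index. */
final class FilterToBlocksSketch {
    /** A node in the filter tree; returns a fresh, mutable set of candidate blocks. */
    interface Expr {
        Set<Integer> blocks(Map<String, Set<Integer>> index);
    }

    /** column == constant: the blocks recorded for that constant (copied). */
    static Expr eq(String constant) {
        return index -> {
            Set<Integer> hit = index.get(constant);
            return hit == null ? new TreeSet<Integer>() : new TreeSet<Integer>(hit);
        };
    }

    /** lhs and rhs: a block can only match if both sides list it. */
    static Expr and(Expr lhs, Expr rhs) {
        return index -> {
            Set<Integer> result = lhs.blocks(index);
            result.retainAll(rhs.blocks(index));
            return result;
        };
    }

    /** lhs or rhs: a block may match if either side lists it. */
    static Expr or(Expr lhs, Expr rhs) {
        return index -> {
            Set<Integer> result = lhs.blocks(index);
            result.addAll(rhs.blocks(index));
            return result;
        };
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("something_rare", new TreeSet<Integer>(Arrays.asList(2, 7)));
        index.put("something_else", new TreeSet<Integer>(Arrays.asList(7, 9)));

        Expr filter = or(eq("something_rare"), eq("something_else"));
        System.out.println(filter.blocks(index)); // prints [2, 7, 9]
    }
}
```

The resulting set is a superset of the blocks that hold matching records, so the original filter still has to run over every record that is read.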
  • 62. Pig Integration
        event_logs = load '/logs/lots_of_data'
            using ThriftPigLoader('thrift.gen.LogEvent');
        filtered_logs = filter event_logs by event == 'something_rare';
        -- Then do stuff.
  • 63. Pig Integration
        register elephant-twin-1.0.jar
        event_logs = load '/logs/lots_of_data'
            using IndexedLZOPigLoader(
                'ThriftPigLoader',
                'thrift.gen.LogEvent',
                '/user/dmitriy/etwin');
        -- Pig will automatically push this down into the Loader and InputFormat
        filtered_logs = filter event_logs by event == 'something_rare';
  • 65. Optimization: merge neighbors [diagram; labels: HDFS Block 1, HDFS Block 2]
      • Merge neighbors, share the scan.
      • (Limit expansion to the size of an HDFS block.)
  • 66. Optimization: merge neighbors [diagram; labels: HDFS Block 1, HDFS Block 2]
      • Scans are faster than random reads, so should we allow gaps and scan through them?
      • Turns out, not that much faster. Better to jump.
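
The neighbor-merging rule from slides 65 and 66 can be sketched as follows: relevant spans that touch are merged into one scan, gaps are never bridged, and a merged span is capped at a maximum size. The class `MergeNeighborsSketch`, its `Span` type, and the `maxSpanBytes` parameter are assumptions for illustration; the cap stands in for the HDFS block size limit mentioned on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of merging adjacent relevant blocks into shared scans. */
final class MergeNeighborsSketch {
    /** Half-open byte range [start, end) of one relevant block. */
    static final class Span {
        final long start;
        final long end;

        Span(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Merges spans that are directly adjacent (no gap) as long as the merged
     * span stays within maxSpanBytes. Input must be sorted by start offset.
     */
    static List<Span> merge(List<Span> sortedSpans, long maxSpanBytes) {
        List<Span> merged = new ArrayList<>();
        Span current = null;
        for (Span next : sortedSpans) {
            boolean adjacent = current != null && next.start == current.end;
            boolean fits = current != null && next.end - current.start <= maxSpanBytes;
            if (adjacent && fits) {
                current = new Span(current.start, next.end); // share the scan
            } else {
                if (current != null) {
                    merged.add(current);
                }
                current = next; // gap or size cap reached: jump to a new span
            }
        }
        if (current != null) {
            merged.add(current);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Span> relevant = Arrays.asList(
            new Span(0, 100), new Span(100, 200),
            new Span(500, 600), new Span(600, 700), new Span(700, 800));
        System.out.println(merge(relevant, 250));
        // prints [[0, 200), [500, 700), [700, 800)]
    }
}
```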
  • 67. Optimization: combine small splits [diagram; labels: HDFS Block 1, HDFS Block 2, match, match, match, Generated Split]
      • Combine small relevant spans into single splits.
      • Try to take locality into account.
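
A rough sketch of the split-combining idea on slide 67: small relevant spans are grouped by the host that stores them and then packed greedily into splits of at most a target size. Everything here is an assumption for illustration (`CombineSplitsSketch`, `Span`, one preferred host per span, the `targetSplitBytes` parameter); real locality handling has to deal with several replicas per block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of packing small relevant spans into larger splits. */
final class CombineSplitsSketch {
    /** A relevant span, reduced to its size and the one host assumed to hold it. */
    static final class Span {
        final String host;
        final long bytes;

        Span(String host, long bytes) {
            this.host = host;
            this.bytes = bytes;
        }
    }

    /** Groups spans by host, then packs each host's spans greedily up to the target size. */
    static List<List<Span>> combine(List<Span> spans, long targetSplitBytes) {
        // LinkedHashMap keeps hosts in first-seen order so the output is deterministic.
        Map<String, List<Span>> byHost = new LinkedHashMap<>();
        for (Span span : spans) {
            byHost.computeIfAbsent(span.host, h -> new ArrayList<>()).add(span);
        }

        List<List<Span>> splits = new ArrayList<>();
        for (List<Span> hostSpans : byHost.values()) {
            List<Span> current = new ArrayList<>();
            long currentBytes = 0;
            for (Span span : hostSpans) {
                // Close the current split if this span would push it past the target.
                if (!current.isEmpty() && currentBytes + span.bytes > targetSplitBytes) {
                    splits.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(span);
                currentBytes += span.bytes;
            }
            if (!current.isEmpty()) {
                splits.add(current);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
            new Span("hostA", 10), new Span("hostB", 40), new Span("hostA", 20),
            new Span("hostA", 50), new Span("hostB", 10));
        for (List<Span> split : combine(spans, 60)) {
            long total = 0;
            for (Span span : split) {
                total += span.bytes;
            }
            System.out.println(split.get(0).host + ": " + split.size() + " spans, " + total + " bytes");
        }
    }
}
```

With the sample input the five spans end up in three splits (two for hostA, one for hostB), so three mappers are launched instead of five.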
  • 68. Applicability
      • Most keys occur in very few blocks!
      • Most frequent key only occurs in half the blocks.
  • 69. Results
      • Applicable jobs take 5-10x fewer resources
      • Ad-hoc jobs are particularly likely to benefit
      • “Real” indexes are still faster, but can be represented using the same abstraction
  • 75. Future Work
      • Regex matching on keys
      • Better Pig pushdown support
      • MultiIndexInputFormat
      • Traditional indexes under ETwin
      • Index maintenance (via HCatalog?)
      Image source: http://en.wikipedia.org/wiki/File:Shasta_dam_under_construction_new_edit.jpg
  • 76. Questions? @squarecog. Sounds like fun? We are hiring.
