Gera Shegalov @PJUG, Jan 15, 2013
/home/gera: whoami

■ Saarland University
■ 1st intern in Immortal DB @ Microsoft Research
■ JMS, RDBMS HA @ Oracle




■ Hadoop MapReduce / Hadoop Core
■ Founding member of Apache Drill
■ Open enterprise-grade distribution for Hadoop
 ● Easy, dependable and fast
 ● Open source with standards-based extensions


■ MapR is deployed at thousands of companies
 ● From small Internet startups to Fortune 100


■ MapR customers analyze massive amounts of data:
 ● Hundreds of billions of events daily
 ● 90% of the world’s Internet population monthly
 ● $1 trillion in retail purchases annually


■ MapR in the Cloud:
 ● Partnered with Google: Hadoop on Google Compute Engine
 ● Partnered with Amazon: M3/M5 options for Elastic MapReduce
Agenda
■ What?
 ● What exactly does Drill do?


■ Why?
 ● Why do we need Apache Drill?


■ Who?
 ● Who is doing this?


■ How?
 ● How does Drill work inside?


■ Conclusion
 ● How can you help?
 ● Where can you find out more?
Apache Drill Overview

■ Drill overview
  ● Low latency interactive queries
  ● Standard ANSI SQL support
  ● Domain Specific Languages / Your own QL

■ Open-Source
  ● Apache Incubator
  ● Hundreds involved across the US and Europe
  ● Community consensus on API, functionality
Big Data Processing
                      Batch processing    Interactive analysis       Stream processing
Query runtime         Minutes to hours    Milliseconds to minutes    Never-ending
Data volume           TBs to PBs          GBs to PBs                 Continuous stream
Programming model     MapReduce           Queries                    DAG
Users                 Developers          Analysts and developers    Developers
Google project        MapReduce           Dremel
Open source project   Hadoop MapReduce    Apache Drill               Storm and S4
Latency Matters

■ Ad-hoc analysis with interactive tools


■ Real-time dashboards




■ Event/trend detection and analysis
  ●   Network intrusions
  ●   Fraud
  ●   Failures
Nested Query Languages

■ DrQL
  ●   SQL-like query language for nested data

  ●   Compatible with Google BigQuery/Dremel
      ● BigQuery applications should work with Drill



  ●   Designed to support efficient column-based processing
      ● No record assembly during query processing




■ Mongo Query Language
  ●   {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}

■ Other languages/programming models can plug in
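To make the Mongo-style query above concrete, here is a minimal Python sketch that evaluates the same query shape against in-memory documents. It is not Drill or MongoDB code: the function name `run_mongo_style_query`, the sample documents, and the restriction to equality matching plus `$orderby` are all assumptions made for illustration.

```python
from typing import Any


def run_mongo_style_query(docs: list[dict[str, Any]],
                          query: dict[str, Any]) -> list[dict[str, Any]]:
    """Evaluate a tiny subset of the query shape: equality $query plus $orderby."""
    criteria = query.get("$query", {})
    order = query.get("$orderby", {})
    # Keep documents whose fields equal every criterion.
    matched = [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]
    # Sort by the last key first; Python's sort is stable, so the first key dominates.
    # Assumes every matched document carries each sort field.
    for field, direction in reversed(list(order.items())):
        matched.sort(key=lambda d: d[field], reverse=(direction == -1))
    return matched


# Hypothetical documents, not taken from the slides.
docs = [
    {"x": 3, "y": "abc", "z": 2},
    {"x": 3, "y": "abc", "z": 1},
    {"x": 4, "y": "abc", "z": 0},
]
result = run_mongo_style_query(
    docs, {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}}
)
```

A pluggable front end would translate a query like this into the same logical operators (filter, order) that a DrQL query produces.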
Nested Data Model
■ The data model in Dremel is Protocol Buffers
  ●   Nested
  ●   Schema
■ Apache Drill is designed to support multiple data models
  ●   Schema: Protocol Buffers, Apache Avro, …
  ●   Schema-less: JSON, BSON, …
■ Flat records are supported as a special case of nested data
  ●   CSV, TSV, …
Avro IDL:
      enum Gender {
        MALE, FEMALE
      }

      record User {
        string name;
        Gender gender;
        long followers;
      }

JSON:
      {
        "name": "Srivas",
        "gender": "Male",
        "followers": 100
      }
      {
        "name": "Raina",
        "gender": "Female",
        "followers": 200,
        "zip": "94305"
      }
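The schema-less case can be illustrated with a short Python sketch that turns the two JSON records from this slide into per-field columns, filling `None` where a record lacks a field (here, `zip`). The helper name `to_columns` and the choice of `None` as the missing-value marker are assumptions; this is not how ColumnIO or Drill encodes columns.

```python
import json

# The two schema-less records shown on the slide.
records_json = """
[
  {"name": "Srivas", "gender": "Male", "followers": 100},
  {"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}
]
"""


def to_columns(records):
    """Stripe records into columns; fields are discovered from the data itself."""
    fields = []
    for record in records:
        for name in record:
            if name not in fields:  # keep first-seen order
                fields.append(name)
    # A record that lacks a field contributes None to that column.
    return {name: [record.get(name) for record in records] for name in fields}


columns = to_columns(json.loads(records_json))
```

With a schema (Avro, Protocol Buffers) the field list is known up front; without one, it has to be discovered while scanning, as this sketch does.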
Extensibility
■ Nested query languages
  ● Pluggable model

  ● DrQL

  ● Mongo Query Language

  ● Cascading



■ Distributed execution engine
  ● Extensible model (e.g., Dryad)

  ● Low-latency

  ● Fault tolerant



■ Nested data formats
  ● Pluggable model

  ● Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
  ● Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)



■ Scalable data sources
  ● Pluggable model

  ● Hadoop

  ● HBase
Design Principles

■ Flexible
  ● Pluggable query languages
  ● Extensible execution engine
  ● Pluggable data formats
      ● Column-based and row-based
      ● Schema and schema-less
  ● Pluggable data sources
  ● N(ot)O(nly) Hadoop

■ Easy
  ● Unzip and run
  ● Zero configuration
  ● Reverse DNS not needed
  ● IP addresses can change
  ● Clear and concise log messages

■ Dependable
  ● No SPOF
  ● Instant recovery from crashes

■ Fast
  ● Minimum Java core
  ● C/C++ core with Java support
      ● Google C++ style guide
  ● Min latency and max throughput (limited only by hardware)
Architecture
Execution Engine
Operator layer is serialization-aware
   Processes individual records

Execution layer is not serialization-aware
   Processes batches of records (blobs/JSON trees)
   Responsible for communication, dependencies and fault tolerance
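The split between the two layers might be sketched as follows in Python. This is an illustration under stated assumptions, not Drill's engine: batches are modeled as JSON-encoded bytes, `filter_operator` and `execute` are hypothetical names, and communication and fault tolerance are left out entirely.

```python
import json
from typing import Callable, Iterable, Iterator

Batch = bytes  # opaque to the execution layer


def filter_operator(predicate: Callable[[dict], bool]) -> Callable[[Batch], Batch]:
    """Operator layer: decodes a batch and processes individual records."""
    def run(batch: Batch) -> Batch:
        records = json.loads(batch)                 # serialization-aware step
        kept = [r for r in records if predicate(r)]
        return json.dumps(kept).encode()
    return run


def execute(batches: Iterable[Batch],
            operators: list[Callable[[Batch], Batch]]) -> Iterator[Batch]:
    """Execution layer: moves whole batches between operators without decoding them."""
    for batch in batches:
        for operator in operators:
            batch = operator(batch)
        yield batch


# Hypothetical input batch.
batches = [json.dumps([{"ppu": 0.55}, {"ppu": 1.50}]).encode()]
output = list(execute(batches, [filter_operator(lambda r: r["ppu"] < 1.00)]))
```

Only `filter_operator` knows the batch is JSON; `execute` would work unchanged if the operators switched to another encoding.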
DrQL Example
local-logs = donuts.json:

{
     "id": "0003",
     "type": "donut",
     "name": "Old Fashioned",
     "ppu": 0.55,
     "sales": 300,
     "batters":
       {
         "batter":
           [
             { "id": "1001", "type": "Regular" },
             { "id": "1002", "type": "Chocolate" }
           ]
       },
     "topping":
       [
         { "id": "5001", "type": "None" },
         { "id": "5002", "type": "Glazed" },
         { "id": "5003", "type": "Chocolate" },
         { "id": "5004", "type": "Maple" }
       ]
 }

Query:

SELECT
  ppu,
  typeCount =
    COUNT(*) OVER PARTITION BY ppu,
  quantity =
    SUM(sales) OVER PARTITION BY ppu,
  sales =
    SUM(ppu*sales) OVER PARTITION BY ppu
FROM local-logs donuts
WHERE donuts.ppu < 1.00
ORDER BY donuts.ppu DESC;
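What the query computes can be sketched in plain Python. Only the "Old Fashioned" record comes from the slide; the other rows are hypothetical, added so the partitions are not trivial. The sketch also assumes the usual SQL ordering, in which WHERE is applied before the windowed aggregates; the slides do not state DrQL's evaluation order.

```python
from collections import defaultdict

donuts = [
    {"id": "0003", "name": "Old Fashioned", "ppu": 0.55, "sales": 300},  # from the slide
    {"id": "9001", "name": "Hypothetical A", "ppu": 0.55, "sales": 100},  # hypothetical
    {"id": "9002", "name": "Hypothetical B", "ppu": 0.25, "sales": 40},   # hypothetical
    {"id": "9003", "name": "Hypothetical C", "ppu": 1.50, "sales": 10},   # hypothetical
]


def run_query(rows):
    # WHERE donuts.ppu < 1.00
    kept = [r for r in rows if r["ppu"] < 1.00]
    # PARTITION BY ppu
    partitions = defaultdict(list)
    for r in kept:
        partitions[r["ppu"]].append(r)
    # A windowed aggregate emits one output row per input row.
    out = []
    for r in kept:
        part = partitions[r["ppu"]]
        out.append({
            "ppu": r["ppu"],
            "typeCount": len(part),
            "quantity": sum(p["sales"] for p in part),
            "sales": sum(p["ppu"] * p["sales"] for p in part),
        })
    # ORDER BY donuts.ppu DESC
    out.sort(key=lambda o: o["ppu"], reverse=True)
    return out


result = run_query(donuts)
```

Both 0.55 rows report the same partition totals, which is the difference between a windowed aggregate and a GROUP BY that would collapse them into one row.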
Query Components

■ User Query (DrQL) components:
  ● SELECT

  ● FROM

  ● WHERE

  ● GROUP BY

  ● HAVING

  ● (JOIN)




■ Logical operators:
  ● Scan

  ● Filter

  ● Aggregate

  ● (Join)
Logical Plan
Logical Plan Syntax:
Operators & Expressions
        query:[
         {
           op:"sequence",
           do:[
           {
             op: "scan",
             memo: "initial_scan",
             ref: "donuts",
             source: "local-logs",
             selection: {data: "activity"}
           },
           {
             op: "transform",
             transforms: [
               { ref: "donuts.quantity", expr: "donuts.sales"}
             ]
           },
           {
             op: "filter",
             expr: "donuts.ppu < 1.00"
           },
           ---
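A toy interpreter for the three operators visible above (scan, transform, filter) could look like the following Python sketch. It is not the Drill reference interpreter: the expression handling covers only a dotted field reference and a single `<` comparison, the `selection` and `memo` attributes are ignored, and the in-memory `sources` mapping is an assumption standing in for a real storage engine.

```python
def resolve(record, path):
    """Follow a dotted path such as 'donuts.ppu' into nested dicts."""
    value = record
    for part in path.split("."):
        value = value[part]
    return value


def assign(record, path, value):
    """Set a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = path.split(".")
    target = record
    for part in parents:
        target = target.setdefault(part, {})
    target[leaf] = value


def interpret(ops, sources):
    """Run a 'sequence' of operators, feeding each one the previous output."""
    rows = []
    for op in ops:
        kind = op["op"]
        if kind == "scan":
            # Each scanned record is exposed under the operator's ref name.
            rows = [{op["ref"]: dict(r)} for r in sources[op["source"]]]
        elif kind == "transform":
            for row in rows:
                for t in op["transforms"]:
                    assign(row, t["ref"], resolve(row, t["expr"]))
        elif kind == "filter":
            left, operator, right = op["expr"].split()
            if operator != "<":
                raise ValueError("this sketch only supports '<'")
            rows = [r for r in rows if resolve(r, left) < float(right)]
        else:
            raise ValueError(f"unsupported operator: {kind}")
    return rows


plan = [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "transform",
     "transforms": [{"ref": "donuts.quantity", "expr": "donuts.sales"}]},
    {"op": "filter", "expr": "donuts.ppu < 1.00"},
]
# Hypothetical in-memory source.
sources = {"local-logs": [{"ppu": 0.55, "sales": 300}, {"ppu": 1.50, "sales": 10}]}
result = interpret(plan, sources)
```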
Logical Streaming Example

Input stream: 0, 1, 2, 3, 4

{ @id: <refnum>, op: "window-frame",
  input: <input>,
  keys: [
    <name>,...
  ],
  ref: <name>,
  before: 2,
  after: here
}

Output frames (one per input record): 0 | 01 | 012 | 123 | 234
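The frames in this example can be reproduced with a few lines of Python. The sketch assumes that `after: here` means the frame ends at the current record (zero following records) and treats the whole stream as a single key partition; `window_frames` is a hypothetical name.

```python
def window_frames(values, before, after=0):
    """Return, for each position, the slice covering `before` earlier
    records, the current record, and `after` later records."""
    frames = []
    for i in range(len(values)):
        start = max(0, i - before)
        end = min(len(values), i + after + 1)
        frames.append(values[start:end])
    return frames


frames = window_frames([0, 1, 2, 3, 4], before=2)
# frames == [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

The first two frames are shorter because fewer than two earlier records exist yet, matching the 0, 01, 012 ramp-up in the diagram.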
Representing a DAG




          { @id: 19, op: "aggregate",
            input: 18,
            type: <simple|running|repeat>,
            keys: [<name>,...],
            aggregations: [
              {ref: <name>, expr: <aggexpr> },...
            ]
          }
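One way to read the `@id` / `input` convention: each operator is a node, and `input` names the node whose output it consumes. The Python sketch below evaluates such a graph recursively with a cache, so a node shared by several consumers runs once. The `literal` source operator and the `sum_of` attribute are simplifications invented for this example; they are not part of the plan syntax shown above.

```python
def evaluate(node_id, nodes, cache=None):
    """Evaluate the operator with the given id, resolving its input first."""
    cache = {} if cache is None else cache
    if node_id in cache:
        return cache[node_id]
    node = nodes[node_id]
    if node["op"] == "literal":          # hypothetical source operator
        result = node["rows"]
    elif node["op"] == "aggregate":
        rows = evaluate(node["input"], nodes, cache)
        totals = {}
        for row in rows:
            key = tuple(row[k] for k in node["keys"])
            for agg in node["aggregations"]:
                slot = totals.setdefault(key, {})
                slot[agg["ref"]] = slot.get(agg["ref"], 0) + row[agg["sum_of"]]
        # One output row per distinct key, in first-seen order.
        result = [dict(zip(node["keys"], key), **sums) for key, sums in totals.items()]
    else:
        raise ValueError(f"unsupported operator: {node['op']}")
    cache[node_id] = result
    return result


nodes = {
    18: {"op": "literal", "rows": [        # hypothetical rows
        {"ppu": 0.55, "sales": 300},
        {"ppu": 0.55, "sales": 100},
        {"ppu": 0.25, "sales": 40},
    ]},
    19: {"op": "aggregate", "input": 18, "keys": ["ppu"],
         "aggregations": [{"ref": "quantity", "sum_of": "sales"}]},
}
result = evaluate(19, nodes)
```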
Multiple Inputs




                  { @id: 25, op: "cogroup",
                    groupings: [
                      {ref: 23, expr: "id"}, {ref: 24, expr: "id"}
                    ]
                  }
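A cogroup over two inputs can be sketched in Python as below: rows from each input are grouped by the same key expression, and every key maps to one list of rows per input. The sample rows and the function name are assumptions; the slide only specifies that operators 23 and 24 are both grouped on `id`.

```python
from collections import defaultdict


def cogroup(inputs, key):
    """Group rows from several inputs by `key`.

    Returns {key_value: [rows_from_input_0, rows_from_input_1, ...]}.
    """
    grouped = defaultdict(lambda: [[] for _ in inputs])
    for position, rows in enumerate(inputs):
        for row in rows:
            grouped[row[key]][position].append(row)
    return dict(grouped)


# Hypothetical outputs of operators 23 and 24.
left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 1, "b": "z"}]
result = cogroup([left, right], "id")
```

Unlike a join, a key with no match on one side still appears, with an empty list for that input.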
Physical Scan Operators


                         Scan with schema                           Scan without schema
Operator output          Protocol Buffers                           JSON-like (MessagePack)
Supported data formats   ColumnIO (column-based protobuf/Dremel)    JSON
                         RecordIO (row-based protobuf)              HBase
                         CSV
SELECT … FROM …          ColumnIO(proto URI, data URI)              Json(data URI)
                         RecordIO(proto URI, data URI)              HBase(table name)
Hadoop Integration

■   Hadoop data sources
    ●   Hadoop FileSystem API (HDFS/MapR-FS)
    ●   HBase

■   Hadoop data formats
    ●   Apache Avro
    ●   RCFile

■   MapReduce-based tools to create column-based formats

■   Table registry in HCatalog

■   Run long-running services in YARN
Where is Drill now?

■ API Definition


■ Reference Implementation for Logical Plan Interpreter
 ● 1:1 mapping logical/physical op
 ● Single JVM


■ Demo
Contribute!

■ Participate in design discussions: JIRA, mailing list, wiki, Google Docs!


■ Write a parser for your favorite QL / Domain-Specific Language


■ Write Storage Engine API implementations
 ● HDFS, HBase, relational, XML DB


■ Write Physical Operators
 ● scan-hbase, scan-cassandra, scan-mongo
 ● scan-jdbc, scan-odbc, scan-jms (browse topic/queue), scan-*
 ● combined functionality operators: group-aggregate, ...
 ● sort-merge-join, hash-join, index-lookup-join

■ Etc...
Thanks, Q&A

■ Download these slides
  ●   http://www.mapr.com/company/events/pjug-1-15-2013

■ Join the project
  ●   drill-dev-subscribe@incubator.apache.org
  ●   #apachedrill

■ Contact me:
  ●   gshegalov@maprtech.com

■ Join MapR
  ●   jobs@mapr.com

Mais conteúdo relacionado

Mais procurados

Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillMapR Technologies
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...The Hive
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache DrillCharles Givre
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoopdatasalt
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Vince Gonzalez
 

Mais procurados (20)

Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Apache drill
Apache drillApache drill
Apache drill
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache Drill
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 

Destaque

Why Being a Creeper is Awesome
Why Being a Creeper is AwesomeWhy Being a Creeper is Awesome
Why Being a Creeper is Awesomerelak213
 
Materi 2 teori teori belajar
Materi 2 teori teori belajarMateri 2 teori teori belajar
Materi 2 teori teori belajarNhia Item
 
cara membuat fotfolio sains tahun 6
cara membuat fotfolio sains tahun 6cara membuat fotfolio sains tahun 6
cara membuat fotfolio sains tahun 6Muadzam Peace
 
Thermo part 2
Thermo part 2Thermo part 2
Thermo part 2elly_q3a
 
Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...
Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...
Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...Gera Shegalov
 
The Role of Database Systems in the Era of Big Data
The Role  of Database Systems  in the Era of Big DataThe Role  of Database Systems  in the Era of Big Data
The Role of Database Systems in the Era of Big DataGera Shegalov
 
CTL Model Checking in Database Cloud
CTL Model Checking in Database CloudCTL Model Checking in Database Cloud
CTL Model Checking in Database CloudGera Shegalov
 
Materi 1 hakekat psikologi
Materi 1 hakekat psikologiMateri 1 hakekat psikologi
Materi 1 hakekat psikologiNhia Item
 
Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale Gera Shegalov
 
Responsive Web Design – Best Practice Approach
Responsive Web Design – Best Practice ApproachResponsive Web Design – Best Practice Approach
Responsive Web Design – Best Practice Approachlet's dev GmbH & Co. KG
 

Destaque (17)

Why Being a Creeper is Awesome
Why Being a Creeper is AwesomeWhy Being a Creeper is Awesome
Why Being a Creeper is Awesome
 
Materi 2 teori teori belajar
Materi 2 teori teori belajarMateri 2 teori teori belajar
Materi 2 teori teori belajar
 
Ppr1
Ppr1Ppr1
Ppr1
 
Fr
FrFr
Fr
 
cara membuat fotfolio sains tahun 6
cara membuat fotfolio sains tahun 6cara membuat fotfolio sains tahun 6
cara membuat fotfolio sains tahun 6
 
Presentación2
Presentación2Presentación2
Presentación2
 
Thermo part 2
Thermo part 2Thermo part 2
Thermo part 2
 
Regolamento tarsu
Regolamento tarsuRegolamento tarsu
Regolamento tarsu
 
Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...
Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...
Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...
 
The Role of Database Systems in the Era of Big Data
The Role  of Database Systems  in the Era of Big DataThe Role  of Database Systems  in the Era of Big Data
The Role of Database Systems in the Era of Big Data
 
CTL Model Checking in Database Cloud
CTL Model Checking in Database CloudCTL Model Checking in Database Cloud
CTL Model Checking in Database Cloud
 
Usl6
Usl6Usl6
Usl6
 
Place
PlacePlace
Place
 
Materi 1 hakekat psikologi
Materi 1 hakekat psikologiMateri 1 hakekat psikologi
Materi 1 hakekat psikologi
 
Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale
 
Biynees khemjee awah
Biynees khemjee awahBiynees khemjee awah
Biynees khemjee awah
 
Responsive Web Design – Best Practice Approach
Responsive Web Design – Best Practice ApproachResponsive Web Design – Best Practice Approach
Responsive Web Design – Best Practice Approach
 

Semelhante a Apache Drill @ PJUG, Jan 15, 2013

Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19jasonfrantz
 
Sep 2012 HUG: Apache Drill for Interactive Analysis
Sep 2012 HUG: Apache Drill for Interactive Analysis Sep 2012 HUG: Apache Drill for Interactive Analysis
Sep 2012 HUG: Apache Drill for Interactive Analysis Yahoo Developer Network
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Ted Dunning
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaJose Mº Muñoz
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for CassandraEdward Capriolo
 
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"DataStax Academy
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsMapR Technologies
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Data Con LA
 
Rust & Apache Arrow @ RMS
Rust & Apache Arrow @ RMSRust & Apache Arrow @ RMS
Rust & Apache Arrow @ RMSAndy Grove
 
LOADays 2015 - syslog-ng - from log collection to processing and infomation e...
LOADays 2015 - syslog-ng - from log collection to processing and infomation e...LOADays 2015 - syslog-ng - from log collection to processing and infomation e...
LOADays 2015 - syslog-ng - from log collection to processing and infomation e...BalaBit
 
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!Daniel Cousineau
 
Webinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseWebinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseMongoDB
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)javier ramirez
 
MongoDB - A Document NoSQL Database
MongoDB - A Document NoSQL DatabaseMongoDB - A Document NoSQL Database
MongoDB - A Document NoSQL DatabaseRuben Inoto Soto
 

Semelhante a Apache Drill @ PJUG, Jan 15, 2013 (20)

MongoDB 3.0
MongoDB 3.0 MongoDB 3.0
MongoDB 3.0
 
Drill dchug-29 nov2012
Drill dchug-29 nov2012Drill dchug-29 nov2012
Drill dchug-29 nov2012
 
Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19
 
Sep 2012 HUG: Apache Drill for Interactive Analysis
Sep 2012 HUG: Apache Drill for Interactive Analysis Sep 2012 HUG: Apache Drill for Interactive Analysis
Sep 2012 HUG: Apache Drill for Interactive Analysis
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest Córdoba
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for Cassandra
 
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
 
Rust & Apache Arrow @ RMS
Rust & Apache Arrow @ RMSRust & Apache Arrow @ RMS
Rust & Apache Arrow @ RMS
 
LOADays 2015 - syslog-ng - from log collection to processing and infomation e...
LOADays 2015 - syslog-ng - from log collection to processing and infomation e...LOADays 2015 - syslog-ng - from log collection to processing and infomation e...
LOADays 2015 - syslog-ng - from log collection to processing and infomation e...
 
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
 
Webinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseWebinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick Database
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
 
MongoDB - A Document NoSQL Database
MongoDB - A Document NoSQL DatabaseMongoDB - A Document NoSQL Database
MongoDB - A Document NoSQL Database
 

Mais de Gera Shegalov

#SlimScalding - Less Memory is More Capacity
#SlimScalding - Less Memory is More Capacity#SlimScalding - Less Memory is More Capacity
#SlimScalding - Less Memory is More CapacityGera Shegalov
 
Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...
Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...
Integrated Data, Message, and Process Recovery for Failure Masking in Web Ser...Gera Shegalov
 
Logging Last Resource Optimization for Distributed Transactions in Oracle We...
Logging Last Resource Optimization for Distributed Transactions in  Oracle We...Logging Last Resource Optimization for Distributed Transactions in  Oracle We...
Logging Last Resource Optimization for Distributed Transactions in Oracle We...Gera Shegalov
 
Logging Last Resource Optimization for Distributed Transactions in Oracle…
Logging Last Resource Optimization for Distributed Transactions in  Oracle…Logging Last Resource Optimization for Distributed Transactions in  Oracle…
Logging Last Resource Optimization for Distributed Transactions in Oracle…Gera Shegalov
 
Transaction Timestamping in Temporal Databases
Transaction Timestamping in Temporal DatabasesTransaction Timestamping in Temporal Databases
Transaction Timestamping in Temporal DatabasesGera Shegalov
 
Unstoppable Stateful PHP Web Services
Unstoppable Stateful PHP Web ServicesUnstoppable Stateful PHP Web Services
Unstoppable Stateful PHP Web ServicesGera Shegalov
 
Formal Verification of Transactional Interaction Contract
Formal Verification of Transactional Interaction ContractFormal Verification of Transactional Interaction Contract
Formal Verification of Transactional Interaction ContractGera Shegalov
 
Formal Verification of Web Service Interaction Contracts
Formal Verification of Web Service Interaction ContractsFormal Verification of Web Service Interaction Contracts
Formal Verification of Web Service Interaction ContractsGera Shegalov
 

Apache Drill @ PJUG, Jan 15, 2013

  • 1. Gera Shegalov @PJUG, Jan 15, 2013
  • 2. /home/gera: whoami ■ Saarland University ■ 1st intern in Immortal DB @ Microsoft Research ■ JMS, RDBMS HA @ Oracle ■ Hadoop MapReduce / Hadoop Core ■ Founding member of Apache Drill
  • 3. ■ Open enterprise-grade distribution for Hadoop ● Easy, dependable and fast ● Open source with standards-based extensions ■ MapR is deployed at 1000’s of companies ● From small Internet startups to Fortune 100 ■ MapR customers analyze massive amounts of data: ● Hundreds of billions of events daily ● 90% of the world’s Internet population monthly ● $1 trillion in retail purchases annually ■ MapR in the Cloud: ● partnered with Google: Hadoop on Google Compute Engine ● partnered with Amazon: M3/M5 options for Elastic Map Reduce
  • 4. Agenda ■ What? ● What exactly does Drill do? ■ Why? ● Why do we need Apache Drill? ■ Who? ● Who is doing this? ■ How? ● How does Drill work inside? ■ Conclusion ● How can you help? ● Where can you find out more?
  • 5. Apache Drill Overview ■ Drill overview ● Low latency interactive queries ● Standard ANSI SQL support ● Domain Specific Languages / Your own QL ■ Open-Source ● Apache Incubator ● 100’s involved across US and Europe ● Community consensus on API, functionality
  • 6. Big Data Processing
                             Batch processing    Interactive analysis       Stream processing
       Query runtime         Minutes to hours    Milliseconds to minutes    Never-ending
       Data volume           TBs to PBs          GBs to PBs                 Continuous stream
       Programming model     MapReduce           Queries                    DAG
       Users                 Developers          Analysts and developers    Developers
       Google project        MapReduce           Dremel
       Open source project   Hadoop MapReduce    Apache Drill               Storm and S4
  • 7. Latency Matters ■ Ad-hoc analysis with interactive tools ■ Real-time dashboards ■ Event/trend detection and analysis ● Network intrusions ● Fraud ● Failures
  • 8. Nested Query Languages ■ DrQL ● SQL-like query language for nested data ● Compatible with Google BigQuery/Dremel ● BigQuery applications should work with Drill ● Designed to support efficient column-based processing ● No record assembly during query processing ■ Mongo Query Language ● {$query: {x: 3, y: "abc"}, $orderby: {x: 1}} ■ Other languages/programming models can plug in
  • 9. Nested Data Model ■ The data model in Dremel is Protocol Buffers ● Nested ● Schema ■ Apache Drill is designed to support multiple data models ● Schema: Protocol Buffers, Apache Avro, … ● Schema-less: JSON, BSON, … ■ Flat records are supported as a special case of nested data ● CSV, TSV, …
       Avro IDL:
         enum Gender { MALE, FEMALE }
         record User { string name; Gender gender; long followers; }
       JSON:
         { "name": "Srivas", "gender": "Male", "followers": 100 }
         { "name": "Raina", "gender": "Female", "followers": 200, "zip": "94305" }
  • 10. Extensibility ■ Nested query languages ● Pluggable model ● DrQL ● Mongo Query Language ● Cascading ■ Distributed execution engine ● Extensible model (eg, Dryad) ● Low-latency ● Fault tolerant ■ Nested data formats ● Pluggable model ● Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV) ● Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON) ■ Scalable data sources ● Pluggable model ● Hadoop ● HBase
  • 11. Design Principles
       ■ Flexible ● Pluggable query languages ● Extensible execution engine ● Pluggable data formats (column-based and row-based; schema and schema-less) ● Pluggable data sources ● N(ot)O(nly) Hadoop
       ■ Easy ● Unzip and run ● Zero configuration ● Reverse DNS not needed ● IP addresses can change ● Clear and concise log messages
       ■ Dependable ● No SPOF ● Instant recovery from crashes
       ■ Fast ● Minimum Java core ● C/C++ core with Java support ● Google C++ style guide ● Min latency and max throughput (limited only by hardware)
  • 13. Execution Engine ■ Operator layer is serialization-aware ● Processes individual records ■ Execution layer is not serialization-aware ● Processes batches of records (blobs/JSON trees) ● Responsible for communication, dependencies and fault tolerance
  • 14. DrQL Example
       Query:
         SELECT
           ppu,
           typeCount = COUNT(*) OVER PARTITION BY ppu,
           quantity = SUM(sales) OVER PARTITION BY ppu,
           sales = SUM(ppu*sales) OVER PARTITION BY ppu
         FROM local-logs donuts
         WHERE donuts.ppu < 1.00
         ORDER BY donuts.ppu DESC;
       local-logs = donuts.json:
         {
           "id": "0003", "type": "donut", "name": "Old Fashioned", "ppu": 0.55, "sales": 300,
           "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" } ] },
           "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5003", "type": "Chocolate" }, { "id": "5004", "type": "Maple" } ]
         }
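The DrQL query on slide 14 can be mimicked in plain Python to see what the three window aggregates compute. This is only a sketch, assuming that `OVER PARTITION BY ppu` behaves like a standard SQL window (one output row per input row, aggregates taken over all rows sharing the same `ppu`). The function name `donut_report` and every sample record other than donut "0003" are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def donut_report(donuts, max_ppu=1.00):
    """Mimic the slide's query: filter, per-ppu window aggregates, order by ppu DESC."""
    kept = [d for d in donuts if d["ppu"] < max_ppu]       # WHERE donuts.ppu < 1.00
    partitions = defaultdict(list)                          # PARTITION BY ppu
    for d in kept:
        partitions[d["ppu"]].append(d)
    rows = []
    for d in kept:
        part = partitions[d["ppu"]]
        rows.append({
            "ppu": d["ppu"],
            "typeCount": len(part),                                  # COUNT(*)
            "quantity": sum(p["sales"] for p in part),               # SUM(sales)
            "sales": sum(p["ppu"] * p["sales"] for p in part),       # SUM(ppu*sales)
        })
    rows.sort(key=lambda r: r["ppu"], reverse=True)         # ORDER BY ppu DESC
    return rows

# Only the first record comes from the slide; the rest are made-up test data.
sample = [
    {"id": "0003", "ppu": 0.55, "sales": 300},
    {"id": "h1", "ppu": 0.55, "sales": 100},   # hypothetical
    {"id": "h2", "ppu": 0.25, "sales": 40},    # hypothetical
    {"id": "h3", "ppu": 1.50, "sales": 10},    # hypothetical, filtered out
]
report = donut_report(sample)
```

With this sample, both `ppu` 0.55 rows report a `typeCount` of 2 and a `quantity` of 400, and the 1.50 donut is dropped by the filter.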
  • 15. Query Components ■ User Query (DrQL) components: ● SELECT ● FROM ● WHERE ● GROUP BY ● HAVING ● (JOIN) ■ Logical operators: ● Scan ● Filter ● Aggregate ● (Join)
  • 17. Logical Plan Syntax: Operators & Expressions query:[ { op:"sequence", do:[ { op: "scan", memo: "initial_scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "transform", transforms: [ { ref: "donuts.quantity", expr: "donuts.sales"} ] }, { op: "filter", expr: "donuts.ppu < 1.00" }, ---
  • 18. Logical Streaming Example { @id: <refnum>, op: "window-frame", input: <input>, keys: [ <name>,... ], ref: <name>, before: 2, after: here }
       Input stream: 0 1 2 3 4
       Resulting frames: 0 | 01 | 012 | 123 | 234
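The frames shown on slide 18 can be reproduced with a few lines of Python. This is a minimal sketch, assuming that `before: 2` means "include up to two preceding records" and that `after: here` means "stop at the current record" (modelled as `after=0`). The `keys` partitioning is ignored, and the function name `window_frames` is hypothetical.

```python
def window_frames(records, before=2, after=0):
    """Return, for each position i, the records in the range [i - before, i + after]."""
    frames = []
    for i in range(len(records)):
        start = max(0, i - before)                 # clamp at the start of the stream
        end = min(len(records), i + after + 1)     # clamp at the end of the stream
        frames.append(records[start:end])
    return frames

frames = window_frames([0, 1, 2, 3, 4], before=2, after=0)
print(frames)  # [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```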
  • 19. Representing a DAG { @id: 19, op: "aggregate", input: 18, type: <simple|running|repeat>, keys: [<name>,...], aggregations: [ {ref: <name>, expr: <aggexpr> },... ] }
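To illustrate how operators that refer to each other by `@id` form a DAG, here is a toy evaluator. It is a sketch under stated assumptions, not Drill's interpreter: the `pred` and `sum_field` fields are hypothetical stand-ins for the plan's expression strings, only three operator kinds are handled, and the ids 17 and 18 are invented upstream nodes for the slide's operator 19.

```python
def run_dag(ops, sink_id, sources):
    """Evaluate the operator with @id == sink_id, following `input` references."""
    by_id = {op["@id"]: op for op in ops}
    cache = {}  # each operator is evaluated once, even if several consumers share it

    def evaluate(op_id):
        if op_id in cache:
            return cache[op_id]
        op = by_id[op_id]
        kind = op["op"]
        if kind == "scan":
            out = list(sources[op["source"]])
        elif kind == "filter":
            out = [r for r in evaluate(op["input"]) if op["pred"](r)]
        elif kind == "aggregate":
            totals = {}
            for r in evaluate(op["input"]):
                key = tuple(r[name] for name in op["keys"])
                totals[key] = totals.get(key, 0) + r[op["sum_field"]]
            out = [dict(zip(op["keys"], key), total=v) for key, v in totals.items()]
        else:
            raise ValueError(f"unsupported op: {kind}")
        cache[op_id] = out
        return out

    return evaluate(sink_id)

# Hypothetical three-node plan: scan (17) -> filter (18) -> aggregate (19).
plan = [
    {"@id": 17, "op": "scan", "source": "local-logs"},
    {"@id": 18, "op": "filter", "input": 17, "pred": lambda r: r["ppu"] < 1.00},
    {"@id": 19, "op": "aggregate", "input": 18, "keys": ["ppu"], "sum_field": "sales"},
]
data = {"local-logs": [
    {"ppu": 0.55, "sales": 300},
    {"ppu": 0.55, "sales": 100},
    {"ppu": 1.50, "sales": 10},
]}
result = run_dag(plan, 19, data)
```

Caching results by `@id` is what makes this a DAG rather than a tree: two downstream operators can name the same `input` without re-running it.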
  • 20. Multiple Inputs { @id: 25, op: "cogroup", groupings: [ {ref: 23, expr: "id"}, {ref: 24, expr: "id"} ] }
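A cogroup over two inputs can be sketched as follows. The slides do not spell out the operator's semantics, so this assumes the behaviour familiar from Pig's COGROUP: records from both inputs are grouped by the key expression, and each key maps to one bag of records per input. The function name `cogroup` and the sample records are hypothetical.

```python
from collections import defaultdict

def cogroup(left, right, key="id"):
    """Group two record streams by `key`; each key maps to (left_records, right_records)."""
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[rec[key]][0].append(rec)
    for rec in right:
        groups[rec[key]][1].append(rec)
    return dict(groups)

# Hypothetical outputs of operators 23 and 24.
input_23 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
input_24 = [{"id": 1, "b": "z"}]
grouped = cogroup(input_23, input_24)
```

Unlike a join, a key that appears in only one input is kept, with an empty bag on the other side.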
  • 21. Physical Scan Operators
                                Scan with schema                               Scan without schema
       Operator output          Protocol Buffers                               JSON-like (MessagePack)
       Supported data formats   ColumnIO (column-based protobuf/Dremel),       JSON, HBase
                                RecordIO (row-based protobuf), CSV
       SELECT … FROM …          ColumnIO(proto URI, data URI),                 Json(data URI),
                                RecordIO(proto URI, data URI)                  HBase(table name)
  • 22. Hadoop Integration ■ Hadoop data sources ● Hadoop FileSystem API (HDFS/MapR-FS) ● HBase ■ Hadoop data formats ● Apache Avro ● RCFile ■ MapReduce-based tools to create column-based formats ■ Table registry in HCatalog ■ Run long-running services in YARN
  • 23. Where is Drill now? ■ API Definition ■ Reference Implementation for Logical Plan Interpreter ● 1:1 mapping logical/physical op ● Single JVM ■ Demo
  • 24. Contribute! ■ Participate in design discussions: JIRA, ML, Wiki, Google Doc! ■ Write a parser for your favorite QL / Domain-Specific Language ■ Write Storage Engine API implementations ● HDFS, HBase, relational, XML DB ■ Write Physical Operators ● scan-hbase, scan-cassandra, scan-mongo ● scan-jdbc, scan-odbc, scan-jms (browse topic/queue), scan-* ● combined functionality operators: group-aggregate, ... ● sort-merge-join, hash-join, index-lookup-join ■ Etc...
  • 25. Thanks, Q&A ■ Download these slides ● http://www.mapr.com/company/events/pjug-1-15-2013 ■ Join the project ● drill-dev-subscribe@incubator.apache.org ● #apachedrill ■ Contact me: ● gshegalov@maprtech.com ■ Join MapR ● jobs@mapr.com