SlideShare uma empresa Scribd logo
1 de 21
Introduc)on	
  to	
  Apache	
  Drill	
  

          February	
  27,	
  2013	
  
            Tomer	
  Shiran	
  
Who	
  Am	
  I?	
  
•      Tomer	
  Shiran	
  
•      tshiran@maprtech.com	
  
•      Founding	
  Member	
  and	
  CommiGer,	
  Apache	
  Drill	
  
•      Director	
  of	
  Product	
  Management,	
  MapR	
  
       Technologies	
  
	
  
Agenda	
  

•    Apache	
  Drill	
  overview	
  
•    Key	
  features	
  
•    Status	
  and	
  progress	
  
•    How	
  to	
  get	
  involved	
  
Big	
  Data	
  Workloads	
  
•    ETL	
  
•    Data	
  mining	
  
•    Index	
  and	
  model	
  genera)on	
  
•    Clustering,	
  anomaly	
  detec)on	
  and	
  classifica)on	
  
•    Blob	
  store	
  
•    Lightweight	
  OLTP	
  on	
  large	
  datasets	
  
•    Web	
  crawling	
  
•    Stream	
  processing	
  
•    Interac)ve	
  analysis	
  
Interac)ve	
  Queries	
  and	
  Hadoop	
  
                                                        Exis)ng	
  Solu)ons	
  


                                Compile	
  SQL	
  (HiveQL)	
  to	
                Export	
  MapReduce	
  results	
  to	
  
                                     MapReduce	
                                   RDBMS	
  and	
  query	
  RDBMS	
  


                                               Emerging	
  Technologies	
  
                                                                                                        Stinger/Tez                    Impala
     Real-­‐)me	
  	
         PostgreSQL-­‐based	
                    PostgreSQL-­‐based	
           Hive	
  performance	
             Real-­‐)me	
  	
  
interac)ve	
  analysis	
      Hadoop	
  analy)cs	
                    Hadoop	
  analy)cs	
            improvements	
              interac)ve	
  analysis	
  


                             HAWQ                             Phoenix	
                                                        Cascading Lingual
       PostgreSQL-­‐based	
                                                                                                     Compile	
  ANSI	
  SQL	
  to	
  
       Hadoop	
  analy)cs	
                   SQL	
  layer	
  for	
  HBase	
           SQL	
  layer	
  for	
  HBase	
              MapReduce	
  
Example	
  Problem	
  
       •  Jane	
  works	
  as	
  an	
          Transac.on	
  
          analyst	
  at	
  an	
  e-­‐          informa.on	
  
          commerce	
  company	
  
       •  How	
  does	
  she	
  figure	
  
                                                 User	
  	
  
          out	
  good	
  targe)ng	
             profiles	
  
          segments	
  for	
  the	
  next	
  
          marke)ng	
  campaign?	
  
       •  She	
  has	
  some	
  ideas	
           Access	
  
          and	
  lots	
  of	
  data	
              logs	
  
Solving	
  the	
  Problem	
  with	
  Tradi)onal	
  Systems	
  

•  Use	
  an	
  RDBMS	
  
     –  ETL	
  the	
  data	
  from	
  MongoDB	
  and	
  Hadoop	
  into	
  the	
  RDBMS	
  
            •  MongoDB	
  data	
  must	
  be	
  flaGened,	
  schema)zed,	
  filtered	
  and	
  aggregated	
  
            •  Hadoop	
  data	
  must	
  be	
  filtered	
  and	
  aggregated	
  
     –  Query	
  the	
  data	
  using	
  any	
  SQL-­‐based	
  tool	
  
•  Use	
  MapReduce	
  
     –  ETL	
  the	
  data	
  from	
  Oracle	
  and	
  MongoDB	
  into	
  Hadoop	
  
     –  Work	
  with	
  the	
  MapReduce	
  team	
  to	
  generate	
  the	
  desired	
  analyses	
  
•  Use	
  Hive	
  
     –  ETL	
  the	
  data	
  from	
  Oracle	
  and	
  MongoDB	
  into	
  Hadoop	
  
            •  MongoDB	
  data	
  must	
  be	
  flaGened	
  and	
  schema)zed	
  
     –  But	
  HiveQL	
  is	
  limited,	
  queries	
  take	
  too	
  long	
  and	
  BI	
  tool	
  support	
  is	
  
        limited	
  
WWGD	
  

                 Distributed	
                       Interac.ve	
       Batch	
  
                                       NoSQL	
  
                 File	
  System	
                      analysis	
     processing	
  


                      GFS	
           BigTable	
       Dremel	
       MapReduce	
  


                                                                       Hadoop	
  
                     HDFS	
            HBase	
           ???	
  
                                                                      MapReduce	
  




  Build	
  Apache	
  Drill	
  to	
  provide	
  a	
  true	
  open	
  source	
  
     solu)on	
  to	
  interac)ve	
  analysis	
  of	
  Big	
  Data	
  
Apache	
  Drill	
  Overview	
  
•  Interac)ve	
  analysis	
  of	
  Big	
  Data	
  using	
  standard	
  SQL	
  
•  Fast	
  
      –  Low	
  latency	
  
      –  Columnar	
  execu)on	
  
             •  Inspired	
  by	
  Google	
  Dremel/BigQuery	
                                      Interac)ve	
  queries	
  
      –  Complement	
  na)ve	
  interfaces	
  and	
  MapReduce/              Apache	
  Drill	
     Data	
  analyst	
  
         Hive/Pig	
                                                                                Repor)ng	
  
                                                                                                   100	
  ms-­‐20	
  min	
  
•  Open	
  
      –  Community	
  driven	
  open	
  source	
  project	
  
      –  Under	
  Apache	
  Socware	
  Founda)on	
  
•  Modern	
  
      –    Standard	
  ANSI	
  SQL:2003	
  (select/into)	
                                         Data	
  mining	
  
      –    Nested/hierarchical	
  data	
  support	
                          MapReduce	
           Modeling	
  
                                                                                  Hive	
  
      –    Schema	
  is	
  op)onal	
                                               Pig	
  
                                                                                                   Large	
  ETL	
  
                                                                                                   20	
  min-­‐20	
  hr	
  
      –    Supports	
  RDBMS,	
  Hadoop	
  and	
  NoSQL	
  
How	
  Does	
  It	
  Work?	
  
•  Drillbits	
  run	
  on	
  each	
  node,	
  designed	
  to	
  
   maximize	
  data	
  locality	
  
•  Processing	
  is	
  done	
  outside	
  MapReduce	
  
                                                                   SELECT	
  *	
  FROM	
  
   paradigm	
  (YARN	
  is	
  supported)	
                         oracle.transac)ons,	
  
•  Queries	
  can	
  be	
  fed	
  to	
  any	
  Drillbit	
          mongo.users,	
  
                                                                   hdfs.events	
  
•  Coordina)on,	
  query	
  planning,	
  op)miza)on,	
             LIMIT	
  1	
  
   scheduling,	
  and	
  execu)on	
  are	
  distributed	
  
Key	
  Features	
  

•    Full	
  SQL	
  (ANSI	
  SQL:2003)	
  
•    Nested	
  data	
  
•    Schema	
  is	
  op)onal	
  
•    Flexible	
  and	
  extensible	
  architecture	
  
Full	
  SQL	
  (ANSI	
  SQL:2003)	
  
•  Drill	
  supports	
  standard	
  ANSI	
  SQL:2003	
  
     –  Correlated	
  subqueries,	
  analy)c	
  func)ons,	
  …	
  
     –  SQL-­‐like	
  is	
  not	
  enough	
  
•  Use	
  any	
  SQL-­‐based	
  tool	
  with	
  Apache	
  Drill	
  
     –  Tableau,	
  Microstrategy,	
  Excel,	
  SAP	
  Crystal	
  Reports,	
  Toad,	
  SQuirreL,	
  …	
  
     –  Standard	
  ODBC	
  and	
  JDBC	
  drivers	
  
                               Client

                 Tableau

                                                                           Drillbit
               MicroStrategy
                               Drill%ODBC%                    SQL%Query%               Query%
                                             Driver                                              Drillbits
                                 Driver                         Parser                Planner   Drill%Worker
                                                                                                 Drill%Worker
                   Excel


                SAP%Crystal%
                 Reports
Nested	
  Data	
  
                                                                                                         JSON	
  
•  Nested	
  data	
  is	
  becoming	
  prevalent	
                                        {	
  
                                                                                          	
  	
  "name":	
  "Homer",	
  
     –  JSON,	
  BSON,	
  XML,	
  Protocol	
  Buffers,	
  Avro,	
  …	
                     	
  	
  "gender":	
  "Male",	
  
                                                                                          	
  	
  "followers":	
  100	
  
     –  The	
  data	
  source	
  may	
  or	
  may	
  not	
  be	
  aware	
                 	
  	
  children:	
  [	
  
           •  MongoDB	
  supports	
  nested	
  data	
  na)vely	
                          	
  	
  	
  	
  {name:	
  "Bart"},	
  
                                                                                          	
  	
  	
  	
  {name:	
  "Lisa”}	
  
           •  A	
  single	
  HBase	
  value	
  could	
  be	
  a	
  JSON	
  document	
     	
  	
  ]	
  
              (compound	
  nested	
  type)	
                                              }	
  
     –  Google	
  Dremel’s	
  innova)on	
  was	
  efficient	
  columnar	
  
        storage	
  and	
  querying	
  of	
  nested	
  data	
  
•  FlaGening	
  nested	
  data	
  is	
  error-­‐prone	
  and	
  ocen	
                                    Avro	
  
                                                                                          enum	
  Gender	
  {	
  
   impossible	
                                                                           	
  	
  MALE,	
  FEMALE	
  
     –  Think	
  about	
  repeated	
  and	
  op)onal	
  fields	
  at	
  every	
            }	
  
                                                                                          	
  
        level…	
                                                                          record	
  User	
  {	
  

•  Apache	
  Drill	
  supports	
  nested	
  data	
  
                                                                                          	
  	
  string	
  name;	
  
                                                                                          	
  	
  Gender	
  gender;	
  
     –  Extensions	
  to	
  ANSI	
  SQL:2003	
                                            	
  	
  long	
  followers;	
  
                                                                                          }	
  
Schema	
  is	
  Op)onal	
  
•  Many	
  data	
  sources	
  do	
  not	
  have	
  rigid	
  schemas	
  
         –  Schemas	
  change	
  rapidly	
  
         –  Each	
  record	
  may	
  have	
  a	
  different	
  schema	
  
                 •  Sparse	
  and	
  wide	
  rows	
  in	
  HBase	
  and	
  Cassandra,	
  MongoDB	
  
•  Apache	
  Drill	
  supports	
  querying	
  against	
  unknown	
  schemas	
  
         –  Query	
  any	
  HBase,	
  Cassandra	
  or	
  MongoDB	
  table	
  
•  User	
  can	
  define	
  the	
  schema	
  or	
  let	
  the	
  system	
  discover	
  it	
  automa)cally	
  
         –  System	
  of	
  record	
  may	
  already	
  have	
  schema	
  informa)on	
  
                 •  Why	
  manage	
  it	
  in	
  a	
  separate	
  system?	
  
         –  No	
  need	
  to	
  manage	
  schema	
  evolu)on	
  

Row	
  Key	
                       CF	
  contents	
                             CF	
  anchor	
  
"com.cnn.www"	
                    contents:html	
  =	
  "<html>…"	
            anchor:my.look.ca	
  =	
  "CNN.com"	
  
                                                                                anchor:cnnsi.com	
  =	
  "CNN"	
  
"com.foxnews.www"	
                contents:html	
  =	
  "<html>…"	
            anchor:en.wikipedia.org	
  =	
  "Fox	
  News"	
  
                                                                                	
  
…	
                                …	
                                          …	
  
Flexible	
  and	
  Extensible	
  Architecture	
  
•  Apache	
  Drill	
  is	
  designed	
  for	
  extensibility	
  
      –  Well-­‐documented	
  APIs	
  and	
  interfaces	
  
•  Data	
  sources	
  and	
  file	
  formats	
  
      –  Implement	
  a	
  custom	
  scanner	
  to	
  support	
  a	
  new	
  data	
  source	
  or	
  file	
  format	
  
•  Query	
  languages	
  
      –  SQL:2003	
  is	
  the	
  primary	
  language	
  
      –  Implement	
  a	
  custom	
  Parser	
  to	
  support	
  a	
  Domain	
  Specific	
  Language	
  
      –  UDFs	
  and	
  UDTFs	
  
•  Op)mizers	
  
      –  Drill	
  will	
  have	
  a	
  cost-­‐based	
  op)mizer	
  
      –  Clear	
  surrounding	
  APIs	
  support	
  easy	
  op)mizer	
  explora)on	
  
•  Operators	
  
      –  Custom	
  operators	
  can	
  be	
  implemented	
  
             •  Special	
  operators	
  for	
  Mahout	
  (k-­‐means)	
  being	
  designed	
  
      –  Operator	
  push-­‐down	
  to	
  data	
  source	
  (RDBMS)	
  
What	
  About	
  Other	
  SQL-­‐on-­‐Hadoop	
  
Systems?	
  
•  Strengths	
  
    –  Code	
  is	
  more	
  mature	
  than	
  Apache	
  Drill	
  
           •  Already	
  in	
  beta	
  (~2	
  quarters	
  ahead)	
  
    –  Faster	
  than	
  Hive	
  on	
  some	
  queries	
  

•  Weaknesses	
  
    –    Proprietary	
  or	
  semi-­‐open	
  source	
  
    –    Query	
  results	
  must	
  fit	
  in	
  memory	
  (no	
  spooling)	
  
    –    Early	
  row	
  materializa)on	
  (no	
  columnar	
  execu)on)	
  
    –    Some	
  are	
  SQL-­‐like	
  (not	
  SQL)	
  
    –    No	
  support	
  for	
  nested	
  data	
  
    –    Rigid	
  schema	
  is	
  required	
  
    –    Limited	
  flexibility	
  and	
  extensibility	
  
    –    Only	
  support	
  Hadoop	
  and	
  HBase	
  (and	
  no	
  other	
  NoSQL	
  or	
  RDBMS)	
  
Status:	
  In	
  Progress	
  
•    Heavy	
  ac)ve	
  development	
  
      –  6-­‐7	
  companies	
  are	
  contribu)ng	
  
•    Available	
  
      –  Logical	
  plan	
  syntax	
  and	
  interpreter	
  
      –  Reference	
  execu)on	
  engine	
  
•    In	
  progress	
  
      –  SQL	
  interpreter	
  
      –  Storage	
  engine	
  implementa)ons	
  for	
  Accumulo,	
  Cassandra,	
  HBase	
  and	
  various	
  file	
  formats	
  
•    Significant	
  community	
  momentum	
  
      –    Over	
  250	
  people	
  on	
  the	
  Drill	
  mailing	
  list	
  
      –    Over	
  250	
  members	
  of	
  the	
  Bay	
  Area	
  Drill	
  User	
  Group	
  
      –    Drill	
  meetups	
  across	
  the	
  US	
  and	
  Europe	
  
      –    OpenDremel	
  team	
  joined	
  Apache	
  Drill	
  
•    An)cipated	
  schedule:	
  
      –  Prototype:	
  Q1	
  
      –  Alpha:	
  Q2	
  
      –  Beta:	
  Q3	
  
Why	
  Apache	
  Drill	
  Will	
  Be	
  Successful	
  
Resources	
                             Community	
                               Architecture	
  
•  Contributors	
  have	
  strong	
     •  Development	
  done	
  in	
  the	
     •  Full	
  SQL	
  
   backgrounds	
  from	
                   open	
                                 •  New	
  data	
  support	
  
   companies	
  like	
  Oracle,	
       •  Ac)ve	
  contributors	
  from	
        •  Extensible	
  APIs	
  
   IBM	
  Netezza,	
  Informa)ca,	
        mul)ple	
  companies	
                 •  Full	
  Columnar	
  Execu)on	
  
   Clustrix	
  and	
  Pentaho	
         •  Rapidly	
  growing	
                   •  Beyond	
  Hadoop	
  
Interested	
  in	
  Apache	
  Drill?	
  

•  Many	
  op)ons	
  to	
  contribute	
  
   –  Become	
  a	
  full	
  )me	
  Drill	
  engineer	
  @	
  MapR	
  
        •  Email	
  tshiran@maprtech.com	
  	
  
   –  Join	
  the	
  Drill	
  mailing	
  list	
  and	
  start	
  contribu)ng	
  
        •  JIRAs,	
  code,	
  unit	
  tests,	
  documenta)on,	
  …	
  
   –  Shoot	
  me	
  an	
  email	
  and	
  we	
  can	
  discuss	
  
        •  Email	
  tshiran@maprtech.com	
  
QUESTIONS?	
  
Why	
  Not	
  Leverage	
  MapReduce?	
  
•  Scheduling	
  Model	
  
    –  Coarse	
  resource	
  model	
  reduces	
  hardware	
  u)liza)on	
  
    –  Acquisi)on	
  of	
  resources	
  typically	
  takes	
  100’s	
  of	
  millis	
  to	
  seconds	
  
•  Barriers	
  
    –  Map	
  comple)on	
  required	
  before	
  shuffle/reduce	
  
       commencement	
  
    –  All	
  maps	
  must	
  complete	
  before	
  reduce	
  can	
  start	
  
    –  In	
  chained	
  jobs,	
  one	
  job	
  must	
  finish	
  en)rely	
  before	
  the	
  next	
  one	
  
       can	
  start	
  
•  Persistence	
  and	
  Recoverability	
  
    –  Data	
  is	
  persisted	
  to	
  disk	
  between	
  each	
  barrier	
  
    –  Serializa)on	
  and	
  deserializa)on	
  are	
  required	
  between	
  execu)on	
  
       phase	
  

Mais conteúdo relacionado

Mais procurados

Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBradford Stephens
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Gera Shegalov
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Vince Gonzalez
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Scott Leberknight
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 

Mais procurados (20)

Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Apache drill
Apache drillApache drill
Apache drill
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 

Destaque

JBoss Enterprise Data Services (Data Virtualization)
JBoss Enterprise Data Services (Data Virtualization)JBoss Enterprise Data Services (Data Virtualization)
JBoss Enterprise Data Services (Data Virtualization)plarsen67
 
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?inovex GmbH
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill WorkshopCharles Givre
 
Data virtualization, Data Federation & IaaS with Jboss Teiid
Data virtualization, Data Federation & IaaS with Jboss TeiidData virtualization, Data Federation & IaaS with Jboss Teiid
Data virtualization, Data Federation & IaaS with Jboss TeiidAnil Allewar
 
4×4: Big Data in der Cloud
4×4: Big Data in der Cloud4×4: Big Data in der Cloud
4×4: Big Data in der CloudDanny Linden
 
Red Hat JBOSS Data Virtualization
Red Hat JBOSS Data VirtualizationRed Hat JBOSS Data Virtualization
Red Hat JBOSS Data VirtualizationDLT Solutions
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
 
Why Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by DenodoWhy Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by DenodoJusto Hidalgo
 

Destaque (10)

JBoss Enterprise Data Services (Data Virtualization)
JBoss Enterprise Data Services (Data Virtualization)JBoss Enterprise Data Services (Data Virtualization)
JBoss Enterprise Data Services (Data Virtualization)
 
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
 
Data virtualization, Data Federation & IaaS with Jboss Teiid
Data virtualization, Data Federation & IaaS with Jboss TeiidData virtualization, Data Federation & IaaS with Jboss Teiid
Data virtualization, Data Federation & IaaS with Jboss Teiid
 
4×4: Big Data in der Cloud
4×4: Big Data in der Cloud4×4: Big Data in der Cloud
4×4: Big Data in der Cloud
 
Red Hat JBOSS Data Virtualization
Red Hat JBOSS Data VirtualizationRed Hat JBOSS Data Virtualization
Red Hat JBOSS Data Virtualization
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Why Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by DenodoWhy Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by Denodo
 

Semelhante a An introduction to apache drill presentation

Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Krishnan Parasuraman
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoopDataWorks Summit
 
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLBig Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLTugdual Grall
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchMapR Technologies
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseYahoo Developer Network
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Microsoft Openness Mongo DB
Microsoft Openness Mongo DBMicrosoft Openness Mongo DB
Microsoft Openness Mongo DBHeriyadi Janwar
 

Semelhante a An introduction to apache drill presentation (20)

Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Apache drill
Apache drillApache drill
Apache drill
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoop
 
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLBig Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQL
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 March
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Microsoft Openness Mongo DB
Microsoft Openness Mongo DBMicrosoft Openness Mongo DB
Microsoft Openness Mongo DB
 

Mais de MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

Mais de MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Último

ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 

Último (20)

ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 

An introduction to apache drill presentation

  • 1. Introduc)on  to  Apache  Drill   February  27,  2013   Tomer  Shiran  
  • 2. Who  Am  I?   •  Tomer  Shiran   •  tshiran@maprtech.com   •  Founding  Member  and  CommiGer,  Apache  Drill   •  Director  of  Product  Management,  MapR   Technologies    
  • 3. Agenda   •  Apache  Drill  overview   •  Key  features   •  Status  and  progress   •  How  to  get  involved  
  • 4. Big  Data  Workloads   •  ETL   •  Data  mining   •  Index  and  model  genera)on   •  Clustering,  anomaly  detec)on  and  classifica)on   •  Blob  store   •  Lightweight  OLTP  on  large  datasets   •  Web  crawling   •  Stream  processing   •  Interac)ve  analysis  
  • 5. Interac)ve  Queries  and  Hadoop   Exis)ng  Solu)ons   Compile  SQL  (HiveQL)  to   Export  MapReduce  results  to   MapReduce   RDBMS  and  query  RDBMS   Emerging  Technologies   Stinger/Tez Impala Real-­‐)me     PostgreSQL-­‐based   PostgreSQL-­‐based   Hive  performance   Real-­‐)me     interac)ve  analysis   Hadoop  analy)cs   Hadoop  analy)cs   improvements   interac)ve  analysis   HAWQ Phoenix   Cascading Lingual PostgreSQL-­‐based   Compile  ANSI  SQL  to   Hadoop  analy)cs   SQL  layer  for  HBase   SQL  layer  for  HBase   MapReduce  
  • 6. Example  Problem   •  Jane  works  as  an   Transac.on   analyst  at  an  e-­‐ informa.on   commerce  company   •  How  does  she  figure   User     out  good  targe)ng   profiles   segments  for  the  next   marke)ng  campaign?   •  She  has  some  ideas   Access   and  lots  of  data   logs  
  • 7. Solving  the  Problem  with  Tradi)onal  Systems   •  Use  an  RDBMS   –  ETL  the  data  from  MongoDB  and  Hadoop  into  the  RDBMS   •  MongoDB  data  must  be  flaGened,  schema)zed,  filtered  and  aggregated   •  Hadoop  data  must  be  filtered  and  aggregated   –  Query  the  data  using  any  SQL-­‐based  tool   •  Use  MapReduce   –  ETL  the  data  from  Oracle  and  MongoDB  into  Hadoop   –  Work  with  the  MapReduce  team  to  generate  the  desired  analyses   •  Use  Hive   –  ETL  the  data  from  Oracle  and  MongoDB  into  Hadoop   •  MongoDB  data  must  be  flaGened  and  schema)zed   –  But  HiveQL  is  limited,  queries  take  too  long  and  BI  tool  support  is   limited  
  • 8. WWGD   Distributed   Interac.ve   Batch   NoSQL   File  System   analysis   processing   GFS   BigTable   Dremel   MapReduce   Hadoop   HDFS   HBase   ???   MapReduce   Build  Apache  Drill  to  provide  a  true  open  source   solu)on  to  interac)ve  analysis  of  Big  Data  
  • 9. Apache  Drill  Overview   •  Interac)ve  analysis  of  Big  Data  using  standard  SQL   •  Fast   –  Low  latency   –  Columnar  execu)on   •  Inspired  by  Google  Dremel/BigQuery   Interac)ve  queries   –  Complement  na)ve  interfaces  and  MapReduce/ Apache  Drill   Data  analyst   Hive/Pig   Repor)ng   100  ms-­‐20  min   •  Open   –  Community  driven  open  source  project   –  Under  Apache  Socware  Founda)on   •  Modern   –  Standard  ANSI  SQL:2003  (select/into)   Data  mining   –  Nested/hierarchical  data  support   MapReduce   Modeling   Hive   –  Schema  is  op)onal   Pig   Large  ETL   20  min-­‐20  hr   –  Supports  RDBMS,  Hadoop  and  NoSQL  
  • 10. How  Does  It  Work?   •  Drillbits  run  on  each  node,  designed  to   maximize  data  locality   •  Processing  is  done  outside  MapReduce   SELECT  *  FROM   paradigm  (YARN  is  supported)   oracle.transac)ons,   •  Queries  can  be  fed  to  any  Drillbit   mongo.users,   hdfs.events   •  Coordina)on,  query  planning,  op)miza)on,   LIMIT  1   scheduling,  and  execu)on  are  distributed  
  • 11. Key  Features   •  Full  SQL  (ANSI  SQL:2003)   •  Nested  data   •  Schema  is  op)onal   •  Flexible  and  extensible  architecture  
  • 12. Full  SQL  (ANSI  SQL:2003)   •  Drill  supports  standard  ANSI  SQL:2003   –  Correlated  subqueries,  analy)c  func)ons,  …   –  SQL-­‐like  is  not  enough   •  Use  any  SQL-­‐based  tool  with  Apache  Drill   –  Tableau,  Microstrategy,  Excel,  SAP  Crystal  Reports,  Toad,  SQuirreL,  …   –  Standard  ODBC  and  JDBC  drivers   Client Tableau Drillbit MicroStrategy Drill%ODBC% SQL%Query% Query% Driver Drillbits Driver Parser Planner Drill%Worker Drill%Worker Excel SAP%Crystal% Reports
  • 13. Nested  Data   JSON   •  Nested  data  is  becoming  prevalent   {      "name":  "Homer",   –  JSON,  BSON,  XML,  Protocol  Buffers,  Avro,  …      "gender":  "Male",      "followers":  100   –  The  data  source  may  or  may  not  be  aware      children:  [   •  MongoDB  supports  nested  data  na)vely          {name:  "Bart"},          {name:  "Lisa”}   •  A  single  HBase  value  could  be  a  JSON  document      ]   (compound  nested  type)   }   –  Google  Dremel’s  innova)on  was  efficient  columnar   storage  and  querying  of  nested  data   •  FlaGening  nested  data  is  error-­‐prone  and  ocen   Avro   enum  Gender  {   impossible      MALE,  FEMALE   –  Think  about  repeated  and  op)onal  fields  at  every   }     level…   record  User  {   •  Apache  Drill  supports  nested  data      string  name;      Gender  gender;   –  Extensions  to  ANSI  SQL:2003      long  followers;   }  
  • 14. Schema  is  Op)onal   •  Many  data  sources  do  not  have  rigid  schemas   –  Schemas  change  rapidly   –  Each  record  may  have  a  different  schema   •  Sparse  and  wide  rows  in  HBase  and  Cassandra,  MongoDB   •  Apache  Drill  supports  querying  against  unknown  schemas   –  Query  any  HBase,  Cassandra  or  MongoDB  table   •  User  can  define  the  schema  or  let  the  system  discover  it  automa)cally   –  System  of  record  may  already  have  schema  informa)on   •  Why  manage  it  in  a  separate  system?   –  No  need  to  manage  schema  evolu)on   Row  Key   CF  contents   CF  anchor   "com.cnn.www"   contents:html  =  "<html>…"   anchor:my.look.ca  =  "CNN.com"   anchor:cnnsi.com  =  "CNN"   "com.foxnews.www"   contents:html  =  "<html>…"   anchor:en.wikipedia.org  =  "Fox  News"     …   …   …  
  • 15. Flexible  and  Extensible  Architecture   •  Apache  Drill  is  designed  for  extensibility   –  Well-­‐documented  APIs  and  interfaces   •  Data  sources  and  file  formats   –  Implement  a  custom  scanner  to  support  a  new  data  source  or  file  format   •  Query  languages   –  SQL:2003  is  the  primary  language   –  Implement  a  custom  Parser  to  support  a  Domain  Specific  Language   –  UDFs  and  UDTFs   •  Op)mizers   –  Drill  will  have  a  cost-­‐based  op)mizer   –  Clear  surrounding  APIs  support  easy  op)mizer  explora)on   •  Operators   –  Custom  operators  can  be  implemented   •  Special  operators  for  Mahout  (k-­‐means)  being  designed   –  Operator  push-­‐down  to  data  source  (RDBMS)  
  • 16. What  About  Other  SQL-­‐on-­‐Hadoop   Systems?   •  Strengths   –  Code  is  more  mature  than  Apache  Drill   •  Already  in  beta  (~2  quarters  ahead)   –  Faster  than  Hive  on  some  queries   •  Weaknesses   –  Proprietary  or  semi-­‐open  source   –  Query  results  must  fit  in  memory  (no  spooling)   –  Early  row  materializa)on  (no  columnar  execu)on)   –  Some  are  SQL-­‐like  (not  SQL)   –  No  support  for  nested  data   –  Rigid  schema  is  required   –  Limited  flexibility  and  extensibility   –  Only  support  Hadoop  and  HBase  (and  no  other  NoSQL  or  RDBMS)  
  • 17. Status:  In  Progress   •  Heavy  ac)ve  development   –  6-­‐7  companies  are  contribu)ng   •  Available   –  Logical  plan  syntax  and  interpreter   –  Reference  execu)on  engine   •  In  progress   –  SQL  interpreter   –  Storage  engine  implementa)ons  for  Accumulo,  Cassandra,  HBase  and  various  file  formats   •  Significant  community  momentum   –  Over  250  people  on  the  Drill  mailing  list   –  Over  250  members  of  the  Bay  Area  Drill  User  Group   –  Drill  meetups  across  the  US  and  Europe   –  OpenDremel  team  joined  Apache  Drill   •  An)cipated  schedule:   –  Prototype:  Q1   –  Alpha:  Q2   –  Beta:  Q3  
  • 18. Why  Apache  Drill  Will  Be  Successful   Resources   Community   Architecture   •  Contributors  have  strong   •  Development  done  in  the   •  Full  SQL   backgrounds  from   open   •  New  data  support   companies  like  Oracle,   •  Ac)ve  contributors  from   •  Extensible  APIs   IBM  Netezza,  Informa)ca,   mul)ple  companies   •  Full  Columnar  Execu)on   Clustrix  and  Pentaho   •  Rapidly  growing   •  Beyond  Hadoop  
  • 19. Interested  in  Apache  Drill?   •  Many  op)ons  to  contribute   –  Become  a  full  )me  Drill  engineer  @  MapR   •  Email  tshiran@maprtech.com     –  Join  the  Drill  mailing  list  and  start  contribu)ng   •  JIRAs,  code,  unit  tests,  documenta)on,  …   –  Shoot  me  an  email  and  we  can  discuss   •  Email  tshiran@maprtech.com  
  • 21. Why  Not  Leverage  MapReduce?   •  Scheduling  Model   –  Coarse  resource  model  reduces  hardware  u)liza)on   –  Acquisi)on  of  resources  typically  takes  100’s  of  millis  to  seconds   •  Barriers   –  Map  comple)on  required  before  shuffle/reduce   commencement   –  All  maps  must  complete  before  reduce  can  start   –  In  chained  jobs,  one  job  must  finish  en)rely  before  the  next  one   can  start   •  Persistence  and  Recoverability   –  Data  is  persisted  to  disk  between  each  barrier   –  Serializa)on  and  deserializa)on  are  required  between  execu)on   phase