Apache Drill status
Michael Hausenblas, Chief Data Engineer EMEA, MapR
HUG Munich, 2013-04-19
  
Kudos to http://cmx.io/
  
Workloads
• Batch processing (MapReduce)
• Light-weight OLTP (HBase, Cassandra, etc.)
• Stream processing (Storm, S4)
• Search (Solr, Elasticsearch)
• Interactive, ad-hoc query and analysis (?)
  
Impala
Interactive Query at Scale
low-latency
  
Use Case I
• Jane, a marketing analyst
• Determine target segments
• Data from different sources
  
Use Case II
• Logistics – supplier status
• Queries
  – How many shipments from supplier X?
  – How many shipments in region Y?

SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia

{
  "shipment": 100123,
  "supplier": "ACM",
  "timestamp": "2013-02-01",
  "description": "first delivery today"
},
{
  "shipment": 100124,
  "supplier": "BAP",
  "timestamp": "2013-02-02",
  "description": "hope you enjoy it"
}
…

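The two queries on this slide combine a relational supplier table with JSON shipment documents. A minimal Python sketch of that join, using only the rows and documents shown above, might look as follows. The lower-case field names for the supplier table and the function names are assumptions made for illustration; this is not how Drill itself evaluates such a query.

```python
from collections import Counter

# Supplier table from the slide (SUPPLIER_ID, NAME, REGION).
# The lower-case keys are an illustrative choice, not from the deck.
SUPPLIERS = [
    {"supplier_id": "ACM", "name": "ACME Corp", "region": "US"},
    {"supplier_id": "GAL", "name": "GotALot Inc", "region": "US"},
    {"supplier_id": "BAP", "name": "Bits and Pieces Ltd", "region": "Europe"},
    {"supplier_id": "ZUP", "name": "Zu Pli", "region": "Asia"},
]

# The two shipment documents shown on the slide.
SHIPMENTS = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]


def shipments_per_supplier(shipments):
    """Count shipments for each supplier ID."""
    return Counter(doc["supplier"] for doc in shipments)


def shipments_per_region(shipments, suppliers):
    """Join each shipment to its supplier row and count shipments per region."""
    region_of = {row["supplier_id"]: row["region"] for row in suppliers}
    return Counter(region_of[doc["supplier"]] for doc in shipments)


if __name__ == "__main__":
    print(shipments_per_supplier(SHIPMENTS)["ACM"])
    print(shipments_per_region(SHIPMENTS, SUPPLIERS)["Europe"])
```

The point of the sketch is that the join key lives in two differently shaped sources, which is the situation the following slides address.
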
Today’s Solutions
• RDBMS-focused
  – ETL data from MongoDB and Hadoop
  – Query data using SQL
• MapReduce-focused
  – ETL from RDBMS and MongoDB
  – Use Hive, etc.
  
Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable
  
Google’s Dremel*
*) http://research.google.com/pubs/pub36632.html
  
Apache Drill Overview
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Other query languages possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community driven, open, hundreds involved
  
High-level Architecture
• Each node: Drillbit - maximize data locality
• Coordination, query planning, execution, etc. are distributed
• By default Drillbits hold all roles
• Any node can act as endpoint for a query
[Diagram: four nodes, each with a storage process and a Drillbit]
  
High-level Architecture
• ZooKeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: Curator/ZooKeeper above four nodes, each with a storage process, a Drillbit and a share of the distributed cache]
  
High-level Architecture
• Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
• Streaming data communication avoiding SerDe
[Diagram: Curator/ZooKeeper above four nodes, each with a storage process, a Drillbit and a share of the distributed cache]
  
Principled Query Execution
Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
Source query languages: SQL 2003, DrQL, MongoQL, DSL
Diagram labels: parser API, scanner API, topology
Logical plan fragment:
query: [
  {
    @id: "log",
    op: "sequence",
    do: [
      {
        op: "scan",
        source: "logs"
      },
      {
        op: "filter",
        condition: "x > 3"
      },
      …
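To make the logical plan fragment above concrete, here is a minimal Python sketch of an interpreter for a "sequence" of "scan" and "filter" operators. It is an illustration only and not Drill's reference interpreter: the in-memory SOURCES dictionary, the sample records, and the tiny "field operator number" condition parser are all assumptions made for this example.

```python
import operator

# Hypothetical in-memory data; the slide only names the source "logs".
SOURCES = {"logs": [{"x": 1}, {"x": 5}, {"x": 3}, {"x": 8}]}

# The plan fragment from the slide, written as a Python structure.
PLAN = {
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}

# Comparison symbols understood by this sketch (an assumption, not Drill syntax).
COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def parse_condition(text):
    """Turn 'field op number' (e.g. 'x > 3') into a predicate on a record."""
    field, symbol, literal = text.split()
    compare = COMPARATORS[symbol]
    threshold = float(literal)
    return lambda record: compare(record[field], threshold)


def run(plan, sources):
    """Execute the steps of a 'sequence' plan in order and return the records."""
    if plan["op"] != "sequence":
        raise ValueError("this sketch only handles 'sequence' plans")
    records = iter(())
    for step in plan["do"]:
        if step["op"] == "scan":
            records = iter(sources[step["source"]])
        elif step["op"] == "filter":
            records = filter(parse_condition(step["condition"]), records)
        else:
            raise ValueError("unsupported operator: " + step["op"])
    return list(records)


if __name__ == "__main__":
    print(run(PLAN, SOURCES))
```

Each operator consumes the record stream produced by the previous one, which is the sense in which the plan is a sequence.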
  
Drillbit Modules
[Diagram of Drillbit modules: RPC Endpoint; Parser (SQL, HiveQL, Pig); Logical Plan; Optimizer; Physical Plan; Foreman; Scheduler; Operators; Distributed Cache; Storage Engine Interface (DFS Engine, HBase Engine, Mongo)]
  
Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points
  
Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
  – Datameer, Tableau, Excel, SAP Crystal Reports
  – Use standard ODBC/JDBC driver
  
Nested Data
• Nested data becoming prevalent
  – JSON/BSON, XML, ProtoBuf, Avro
  – Some data sources support it natively (MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
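One way to see why flattening is error-prone is to flatten the donut record used in this deck's example by hand. The sketch below is an illustration in Python; the flattened column names (donut_id, batter_id and so on) are assumptions introduced here because the nested and parent records both use "id" and "type".

```python
# The donut record from the deck's example, limited to the fields shown there.
DONUT = {
    "id": "0001",
    "type": "donut",
    "ppu": 0.55,
    "batters": {
        "batter": [
            {"id": "1001", "type": "Regular"},
            {"id": "1002", "type": "Chocolate"},
        ]
    },
}


def flatten_batters(donut):
    """Produce one flat row per nested batter, repeating the parent fields."""
    return [
        {
            "donut_id": donut["id"],
            "donut_type": donut["type"],
            "ppu": donut["ppu"],
            "batter_id": batter["id"],
            "batter_type": batter["type"],
        }
        for batter in donut["batters"]["batter"]
    ]


if __name__ == "__main__":
    rows = flatten_batters(DONUT)
    # Counting rows now counts batters, not donuts.
    print(len(rows))
    # Recovering the donut count needs an explicit de-duplication step.
    print(len({row["donut_id"] for row in rows}))
```

After flattening, parent fields are duplicated on every row and clashing names have to be renamed, so a plain row count or sum over ppu silently gives the wrong answer unless the query de-duplicates first.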
  
Optional Schema
• Many data sources don’t have rigid schemas
  – Schema changes rapidly
  – Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• Schema can be defined by the user or obtained via discovery
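Schema discovery can be pictured as scanning records and noting which fields and value types actually occur. The following Python sketch shows that idea under stated assumptions: the sample records and the function name are hypothetical, and the deck does not describe how Drill performs discovery internally.

```python
def discover_schema(records):
    """Map each field name to the set of Python type names observed for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical records whose fields differ from one record to the next.
RECORDS = [
    {"id": "0001", "ppu": 0.55},
    {"id": "0002", "ppu": 1, "topping": "Sugar"},
]

if __name__ == "__main__":
    for field, types in sorted(discover_schema(RECORDS).items()):
        print(field, sorted(types))
```

A field that appears with more than one type, or in only some records, is exactly the case a fixed up-front schema would reject.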
  
Extensibility Points
• Source query → parser API
• Custom operators, UDF → logical plan
• Serving tree, CF, topology → physical plan/optimizer
• Data sources & formats → scanner API
Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
  
… and Hadoop?
• HDFS can be a data source
• Complementary use cases*
• … use Apache Drill
  – Find record with specified condition
  – Aggregation under dynamic conditions
• … use MapReduce
  – Data mining with multiple iterations
  – ETL
*) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  
Example
https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

data source: donuts.json
{
  "id": "0001",
  "type": "donut",
  "ppu": 0.55,
  "batters":
    {
      "batter":
        [
          { "id": "1001", "type": "Regular" },
          { "id": "1002", "type": "Chocolate" },
          …

logical plan: simple_plan.json
query:[ {
  op:"sequence",
  do:[
    {
      op: "scan",
      ref: "donuts",
      source: "local-logs",
      selection: {data: "activity"}
    },
    {
      op: "filter",
      expr: "donuts.ppu < 2.00"
    },
    …

result: out.json
{
  "sales" : 700.0,
  "typeCount" : 1,
  "quantity" : 700,
  "ppu" : 1.0
}
{
  "sales" : 109.71,
  "typeCount" : 2,
  "quantity" : 159,
  "ppu" : 0.69
}
{
  "sales" : 184.25,
  "typeCount" : 2,
  "quantity" : 335,
  "ppu" : 0.55
}

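The part of the plan that produces the aggregated fields is elided on the slide, so how "sales" is computed is not stated. The three result records are, however, numerically consistent with sales = quantity × ppu. The short Python check below verifies that observation; treat the formula as an inference from the output, not as the documented plan.

```python
import math

# Result records exactly as shown in out.json on the slide.
RESULTS = [
    {"sales": 700.0, "typeCount": 1, "quantity": 700, "ppu": 1.0},
    {"sales": 109.71, "typeCount": 2, "quantity": 159, "ppu": 0.69},
    {"sales": 184.25, "typeCount": 2, "quantity": 335, "ppu": 0.55},
]


def sales_matches_quantity_times_ppu(record, tolerance=0.005):
    """Return True if 'sales' equals quantity * ppu to within half a cent."""
    expected = record["quantity"] * record["ppu"]
    return math.isclose(record["sales"], expected, abs_tol=tolerance)


if __name__ == "__main__":
    for record in RESULTS:
        print(record["ppu"], sales_matches_quantity_times_ppu(record))
```
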
Status
• Heavy development by multiple organizations
• Available
  – Logical plan (ADSP)
  – Reference interpreter
  – Basic SQL parser
  – Basic demo
  – Basic HBase back-end
  
Status
April 2013
• Extend SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution
  
Contributing
• Learn where and how to contribute
  https://cwiki.apache.org/confluence/display/DRILL/Contributing
• Jira, Git, Apache build and test tools
• Preparing for dependencies
  – Hazelcast
  – Netflix Curator
  
Contributing
General contributions appreciated:
• Supersonic (?)
• Test data & test queries
• Use case scenarios (textual desc./SQL queries)
• Documentation
  
Contributing
• Dremel-inspired columnar format
  – Twitter’s Parquet
  – Hive’s ORC file
• Integration with Hive metastore (?)
• DRILL-13 Storage Engine: Define Java Interface
• DRILL-15 Build HBase storage engine implementation
  
Contributing
• DRILL-48 RPC interface for query submission and physical plan execution
• DRILL-53 Setup cluster configuration and membership mgmt system
• Further schedule
  – Alpha Q2
  – Beta Q3
  
Kudos to …
• Julian Hyde, Pentaho
• Lisen Mu
• Tim Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR
  
Engage!
• Follow @ApacheDrill on Twitter
• Sign up for the mailing lists (user | dev)
  http://incubator.apache.org/drill/mailing-lists.html
• Standing G+ hangouts every Tuesday at 18:00 CET
• Keep an eye on http://drill-user.org/
  

Hadoop User Group - Status Apache Drill
