SlideShare uma empresa Scribd logo
1 de 13
Baixar para ler offline
Apache Drill
 Architecture Outlines




     Jason Frantz
whoami
• Jason


• Software Architect at MapR


• Background in Databases, File Systems,
  Distributed Systems


• jfrantz@maprtech.com, jason@apache.org
Data Flow
Big Picture




• Same basic pieces as most databases:
  –   Driver: manage interaction with client
  –   Parser: process textual query language
  –   Compiler / Optimizer: convert logical query into physical plan
  –   Execution Engine: run physical plan across cluster
  –   Storage Handlers: feed user data in/out of execution
Parser
• Converts text-based query language into internal DAG representation
  – Grammar, syntax, basic query validation
  – Generally straightforward to implement


• Initial goal is to support a SQL-like query language for nested data (DrQL)
   – Compatible with Google BigQuery/Dremel
   – Designed to support data sources that have a well-defined schema (e.g.
     protocol buffers) as well as those that don't (e.g. JSON)

• Other potential input styles:
  – MongoDB's query language
  – Hive
  – Pig
Traditional Query Optimizers
• 30+ year history into relational query optimization
   – We have to follow down the same general path


• Converts a logical query plan into a physical one
  – Example: convert logical “JOIN” operator into specific hash join operator
  – Attempts to choose the “best” overall execution plan


• Magic black box of statistics!
  – Optimizers do great with queries that can be easily modeled with
    available statistics
  – Difficulties: lack of statistics, complex schemas, complex queries
  – Database users often work around optimizer using query hints
    ● “force index”
Intermediate Representation
• Intermediate Representation (IR) is common internal API
   – Output from Parser
   – Input/Output from Optimizer
   – Input to Execution Engine


• Textual Representation:
   – Flexibility
     ● Different users can enter at different levels of the IR

     ● Advanced users can skip optimizer entirely


   – Easier to test various pieces
   – Easy to cache
     ● Query optimization can be computationally expensive, so traditional databases go to


       great lengths to reuse execution plans

• Ideally IR would be format used between optimization passes
   – Inspiration: LLVM, SQL Server showplan
Execution Engine
• Execution layer
   – Query is a DAG of operators

• Operator layer
   – Implementation of individual operators and data format
     serialization
Execution Layer
• Query structured as a Directed Acyclic Graph (DAG) representing the data flow
  – Each node is an abstract “operator”
  – Communication between nodes is “blobs” of data
  – Data model described well in Microsoft's Dryad paper (Isard '07)


• Responsible for handling:
  – Operator dependencies
  – Task scheduling
  – Inter-node communication


• Notable features:
  – Speculative execution
  – Pipelining with spill-to-disk as fallback
  – Back pressure
Operator Execution
• Implementation of individual operators
   – Example built-in operators: hash aggregate, filter, json-scan
   – Extensible so new operators are easy to plug in


• Serialization-aware:
   – Each “blob” is a batch of rows in a particular format:
     ● Row-wise, no schema: MessagePack

     ● Row-wise, schema: Protocol Buffers

     ● Columnar, schema: Dremel-style format


   – Different operator implementations for different serializations
Storage Interfaces
• Scanner operators
   – Common APIs to convert user data into formats understood by
     execution operators
   – Example conversions:
     ● JSON → MessagePack

     ● CSV → MessagePack

     ● Dremel: columnar serialization → Protocol Buffers




• Data sources:
  – HDFS
  – NFS
  – HBase / Cassandra
  – MySQL / PostgreSQL / etc
Storage Interfaces
• Scanner Flexibility:
   – Allow in-place filtering (predicate pushdown)
   – Scanners can manage their own caching policies for their
     data

• In-place processing
   – Having a separate “ETL” step is painful
     ● Easiest to process data on demand


   – Query workload gives feedback on scanner access patterns
     ● Database Cracking: adaptively convert storage layout into


       more efficient forms
Design Principles
Flexible                          Easy
•   Pluggable query languages     •   Unzip and run
•   Extensible execution engine   •   Zero configuration
•   Pluggable data formats        •   Reverse DNS not needed
•   Column-based and row-based    •   IP addresses can change
•   Schema and schema-less        •   Clear and concise log
•   Pluggable data sources            messages



Dependable                        Fast
• No SPOF                         • C/C++ core with Java support
• Instant recovery from crashes   • Google C++ style guide
                                  • Min latency and max
                                    throughput (limited only by
                                    hardware)

Mais conteúdo relacionado

Mais procurados

HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGuang Xu
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architectureOmid Vahdaty
 
Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Roopa Tangirala
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Databricks
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streamingMaturin BADO
 
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Alluxio, Inc.
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightDatabricks
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineDataWorks Summit
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Data Con LA
 
Shift: Real World Migration from MongoDB to Cassandra
Shift: Real World Migration from MongoDB to CassandraShift: Real World Migration from MongoDB to Cassandra
Shift: Real World Migration from MongoDB to CassandraDataStax
 
Membase Meetup 2010
Membase Meetup 2010Membase Meetup 2010
Membase Meetup 2010Membase
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Flink Forward
 
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...Michael Stack
 

Mais procurados (20)

HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraph
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architecture
 
Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup)
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streaming
 
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
 
Presto - SQL on anything
Presto  - SQL on anythingPresto  - SQL on anything
Presto - SQL on anything
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
 
Shift: Real World Migration from MongoDB to Cassandra
Shift: Real World Migration from MongoDB to CassandraShift: Real World Migration from MongoDB to Cassandra
Shift: Real World Migration from MongoDB to Cassandra
 
Membase Meetup 2010
Membase Meetup 2010Membase Meetup 2010
Membase Meetup 2010
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
 
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
 

Destaque

Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets robertlz
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Vicente Orjales
 
Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19jasonfrantz
 
[UniteKorea2013] Serialization in Depth
[UniteKorea2013] Serialization in Depth[UniteKorea2013] Serialization in Depth
[UniteKorea2013] Serialization in DepthWilliam Hugo Yang
 
How BigQuery broke my heart
How BigQuery broke my heartHow BigQuery broke my heart
How BigQuery broke my heartGabriel Hamilton
 
Futures Trading Strategies on SGX - India chapter in AFACT in Singapore
Futures Trading Strategies on SGX - India chapter in AFACT in SingaporeFutures Trading Strategies on SGX - India chapter in AFACT in Singapore
Futures Trading Strategies on SGX - India chapter in AFACT in SingaporeQuantInsti
 
Trading system in stock exchange
Trading system in stock exchangeTrading system in stock exchange
Trading system in stock exchangeSumit Behura
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Mechanical trading system based on renko charts
Mechanical trading system based on renko chartsMechanical trading system based on renko charts
Mechanical trading system based on renko chartsRaul Canessa
 
Application design for MiFID II-compliant operations
Application design for MiFID II-compliant operationsApplication design for MiFID II-compliant operations
Application design for MiFID II-compliant operationsLászló Árvai
 
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...QuantInsti
 
How to build a trading system
How to build a trading systemHow to build a trading system
How to build a trading systemFXstreet.com
 
Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Sri Prasanna
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Mutual Exclusion Election (Distributed computing)
Mutual Exclusion Election (Distributed computing)Mutual Exclusion Election (Distributed computing)
Mutual Exclusion Election (Distributed computing)Sri Prasanna
 
EXTENT-2015: MiFID II Projected Impact on Trading Technology
EXTENT-2015: MiFID II Projected Impact on Trading TechnologyEXTENT-2015: MiFID II Projected Impact on Trading Technology
EXTENT-2015: MiFID II Projected Impact on Trading TechnologyIosif Itkin
 

Destaque (20)

Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
 
Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19
 
MapReduceによる大規模データ処理 at Yahoo! JAPAN
MapReduceによる大規模データ処理 at Yahoo! JAPANMapReduceによる大規模データ処理 at Yahoo! JAPAN
MapReduceによる大規模データ処理 at Yahoo! JAPAN
 
[UniteKorea2013] Serialization in Depth
[UniteKorea2013] Serialization in Depth[UniteKorea2013] Serialization in Depth
[UniteKorea2013] Serialization in Depth
 
How BigQuery broke my heart
How BigQuery broke my heartHow BigQuery broke my heart
How BigQuery broke my heart
 
Futures Trading Strategies on SGX - India chapter in AFACT in Singapore
Futures Trading Strategies on SGX - India chapter in AFACT in SingaporeFutures Trading Strategies on SGX - India chapter in AFACT in Singapore
Futures Trading Strategies on SGX - India chapter in AFACT in Singapore
 
Trading system in stock exchange
Trading system in stock exchangeTrading system in stock exchange
Trading system in stock exchange
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Mechanical trading system based on renko charts
Mechanical trading system based on renko chartsMechanical trading system based on renko charts
Mechanical trading system based on renko charts
 
Application design for MiFID II-compliant operations
Application design for MiFID II-compliant operationsApplication design for MiFID II-compliant operations
Application design for MiFID II-compliant operations
 
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
How to build a trading system
How to build a trading systemHow to build a trading system
How to build a trading system
 
Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Mutual Exclusion Election (Distributed computing)
Mutual Exclusion Election (Distributed computing)Mutual Exclusion Election (Distributed computing)
Mutual Exclusion Election (Distributed computing)
 
EXTENT-2015: MiFID II Projected Impact on Trading Technology
EXTENT-2015: MiFID II Projected Impact on Trading TechnologyEXTENT-2015: MiFID II Projected Impact on Trading Technology
EXTENT-2015: MiFID II Projected Impact on Trading Technology
 
MapReduce入門
MapReduce入門MapReduce入門
MapReduce入門
 

Semelhante a Drill architecture 20120913

Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
PostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLPostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLAlexei Krasner
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big DataJoe Alex
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Sep 2012 HUG: Apache Drill for Interactive Analysis
Sep 2012 HUG: Apache Drill for Interactive Analysis Sep 2012 HUG: Apache Drill for Interactive Analysis
Sep 2012 HUG: Apache Drill for Interactive Analysis Yahoo Developer Network
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Ted Dunning
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloudelliando dias
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 

Semelhante a Drill architecture 20120913 (20)

Apache drill
Apache drillApache drill
Apache drill
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
PostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLPostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQL
 
Drop acid
Drop acidDrop acid
Drop acid
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Sep 2012 HUG: Apache Drill for Interactive Analysis
Sep 2012 HUG: Apache Drill for Interactive Analysis Sep 2012 HUG: Apache Drill for Interactive Analysis
Sep 2012 HUG: Apache Drill for Interactive Analysis
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
Drill dchug-29 nov2012
Drill dchug-29 nov2012Drill dchug-29 nov2012
Drill dchug-29 nov2012
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 

Último

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Último (20)

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Drill architecture 20120913

  • 1. Apache Drill Architecture Outlines Jason Frantz
  • 2. whoami • Jason • Software Architect at MapR • Background in Databases, File Systems, Distributed Systems • jfrantz@maprtech.com, jason@apache.org
  • 4. Big Picture • Same basic pieces as most databases: – Driver: manage interaction with client – Parser: process textual query language – Compiler / Optimizer: convert logical query into physical plan – Execution Engine: run physical plan across cluster – Storage Handlers: feed user data in/out of execution
  • 5. Parser • Converts text-based query language into internal DAG representation – Grammar, syntax, basic query validation – Generally straightforward to implement • Initial goal is to support a SQL-like query language for nested data (DrQL) – Compatible with Google BigQuery/Dremel – Designed to support data sources that have a well-defined schema (e.g. protocol buffers) as well as those that don't (e.g. JSON) • Other potential input styles: – MongoDB's query language – Hive – Pig
  • 6. Traditional Query Optimizers • 30+ year history into relational query optimization – We have to follow down the same general path • Converts a logical query plan into a physical one – Example: convert logical “JOIN” operator into specific hash join operator – Attempts to choose the “best” overall execution plan • Magic black box of statistics! – Optimizers do great with queries that can be easily modeled with available statistics – Difficulties: lack of statistics, complex schemas, complex queries – Database users often work around optimizer using query hints ● “force index”
  • 7. Intermediate Representation • Intermediate Representation (IR) is common internal API – Output from Parser – Input/Output from Optimizer – Input to Execution Engine • Textual Representation: – Flexibility ● Different users can enter at different levels of the IR ● Advanced users can skip optimizer entirely – Easier to test various pieces – Easy to cache ● Query optimization can be computationally expensive, so traditional databases go to great lengths to reuse execution plans • Ideally IR would be format used between optimization passes – Inspiration: LLVM, SQL Server showplan
  • 8. Execution Engine • Execution layer – Query is a DAG of operators • Operator layer – Implementation of individual operators and data format serialization
  • 9. Execution Layer • Query structured as a Directed Acyclic Graph (DAG) representing the data flow – Each node is an abstract “operator” – Communication between nodes is “blobs” of data – Data model described well in Microsoft's Dryad paper (Isard '07) • Responsible for handling: – Operator dependencies – Task scheduling – Inter-node communication • Notable features: – Speculative execution – Pipelining with spill-to-disk as fallback – Back pressure
  • 10. Operator Execution • Implementation of individual operators – Example built-in operators: hash aggregate, filter, json-scan – Extensible so new operators are easy to plug in • Serialization-aware: – Each “blob” is a batch of rows in a particular format: ● Row-wise, no schema: MessagePack ● Row-wise, schema: Protocol Buffers ● Columnar, schema: Dremel-style format – Different operator implementations for different serializations
  • 11. Storage Interfaces • Scanner operators – Common APIs to convert user data into formats understood by execution operators – Example conversions: ● JSON → MessagePack ● CSV → MessagePack ● Dremel: columnar serialization → Protocol Buffers • Data sources: – HDFS – NFS – HBase / Cassandra – MySQL / PostgreSQL / etc
  • 12. Storage Interfaces • Scanner Flexibility: – Allow in-place filtering (predicate pushdown) – Scanners can manage their own caching policies for their data • In-place processing – Having a separate “ETL” step is painful ● Easiest to process data on demand – Query workload gives feedback on scanner access patterns ● Database Cracking: adaptively convert storage layout into more efficient forms
  • 13. Design Principles Flexible Easy • Pluggable query languages • Unzip and run • Extensible execution engine • Zero configuration • Pluggable data formats • Reverse DNS not needed • Column-based and row-based • IP addresses can change • Schema and schema-less • Clear and concise log • Pluggable data sources messages Dependable Fast • No SPOF • C/C++ core with Java support • Instant recovery from crashes • Google C++ style guide • Min latency and max throughput (limited only by hardware)