Apache Drill
Design proposal from the OpenDremel team
HLD Version 0.2, 9/sep/2012

Camuel Gilyadov & Constantine Peresypkin,
Email: Camuel@BigDataCraft.com
Intro

• This is a high-level design proposal for the Apache Drill
  project from the OpenDremel team.
• History slides and the usual “about us” material have been
  moved to the end of the deck.
• A slide with all relevant links is also at the end.
Design Tenet #1

• Apache Drill must support multi-tenant semantics
  internally rather than being run entirely inside guest VMs.
• It should be inspired by BigQuery, not only by the
  Dremel/PowerDrill/Tenzing papers.
• It is not practical to set up a dedicated cloud (billed
  hourly) just to run a query for a few seconds.
• The codebase must be clearly divided into a trusted part
  and an untrusted part. The trusted part must be kept to an
  absolute minimum and must be peer-reviewed, secured,
  audited and metered.
Design Tenet #2

• Apache Drill must be modular and customizable in
  many dimensions.
• The schema-on-read concept must be supported. An
  imperatively coded, high-performance data parser must
  be embeddable into the query.
• SQL is no longer enough. It must be easy to add new query
  languages as well as user-defined functions (UDFs)
  implementing deep analytics (such as statistics and
  machine learning).
• Additionally, various data formats must be supported,
  such as column stores, row stores, PAX, RCFile, etc.
Design Tenet #2 (cont.)

• We suggest relaxing the query plan format to an
  arbitrary executable, and the data format to an
  arbitrary opaque BLOB.
• This way new query languages and new data formats
  can be supported easily without changing the backend.
• As an added benefit, the backend becomes a generic,
  lightweight, homogeneous compute-storage cloud.
• Such an approach exhibits good separation of control.
  The cloud operator controls and bills for generic
  infrastructure, and the query engine is left completely in
  the control of the tenant/user.
Design Tenet #3

• Apache Drill requests/queries must be hyper-elastic,
  meaning they can exploit the compute capacity of
  thousands of servers for a short duration of just a few
  seconds. No resources may be kept spinning per user
  between queries or when idle.
• Traditional VMs are too heavyweight for that.
  Container approaches such as OpenVZ/LXC are
  not secure enough in a multi-tenancy context.
• We suggest making sandboxing pluggable and
  supporting ZeroVM (developed for OpenDremel) and
  LXC (which is fine for private clouds) to begin with.
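
The pluggable-sandbox idea above could be sketched as a small registry that picks a sandbox by name and refuses one that is not considered safe for the deployment. This is only an illustration: the `Sandbox` interface, the `Registry` class and the `multiTenantSafe` flag are hypothetical names, and the two implementations are placeholders that do not call ZeroVM or LXC.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a pluggable sandbox registry. All names here are hypothetical. */
final class SandboxRegistryDemo {

    /** One isolation technology that the backend can run untrusted jobs in. */
    interface Sandbox {
        String name();

        /** Whether this deck considers the isolation adequate for multi-tenant use. */
        boolean multiTenantSafe();
    }

    /** Placeholder for a ZeroVM-backed sandbox (no real ZeroVM call is made). */
    static final class ZeroVmSandbox implements Sandbox {
        public String name() { return "zerovm"; }
        public boolean multiTenantSafe() { return true; }
    }

    /** Placeholder for an LXC-backed sandbox, intended for private clouds. */
    static final class LxcSandbox implements Sandbox {
        public String name() { return "lxc"; }
        public boolean multiTenantSafe() { return false; }
    }

    /** Maps sandbox names to implementations and enforces the deployment policy. */
    static final class Registry {
        private final Map<String, Sandbox> byName = new LinkedHashMap<>();

        void register(Sandbox sandbox) {
            byName.put(sandbox.name(), sandbox);
        }

        Sandbox select(String requested, boolean multiTenant) {
            Sandbox sandbox = byName.get(requested);
            if (sandbox == null) {
                throw new IllegalArgumentException("unknown sandbox: " + requested);
            }
            if (multiTenant && !sandbox.multiTenantSafe()) {
                throw new IllegalStateException(requested + " is not allowed in a multi-tenant deployment");
            }
            return sandbox;
        }
    }

    static Registry defaultRegistry() {
        Registry registry = new Registry();
        registry.register(new ZeroVmSandbox());
        registry.register(new LxcSandbox());
        return registry;
    }

    public static void main(String[] args) {
        Registry registry = defaultRegistry();
        System.out.println(registry.select("zerovm", true).name());
        System.out.println(registry.select("lxc", false).name());
    }
}
```

Keeping the policy check in the registry means a new sandbox technology can be added by registering one more implementation, without touching the code that submits jobs.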
Design Tenet #4

• Apache Drill must be efficient.
• Value-per-bit is extremely low with BigData.
• Overhead in the inner loop must be kept to a minimum.
• We found Java inefficient for general number
  crunching (such as data compression). The main
  problem with Java is that GC overhead is unavoidable
  for the whole data corpus being scanned. We went so
  far as to keep all data in byte arrays and auto-generate
  transformation code; it still underperformed, and
  code complexity went through the roof.
Suggested Architecture
[Diagram] Browser / Client -> (Query) -> Single-Tenant Frontend -> (Executable job) -> Multi-Tenant Backend
• Single-Tenant Frontend: runs inside a traditional guest VM;
  a JVM hosting the Query Compiler.
• Multi-Tenant Backend: scale-out object store with
  in-situ compute.
Suggested Frontend Design
• A conventional single-tenant Java web application.
• In charge of:
   – All interaction with the user
   – Query/job submission
   – Query/job progress monitoring
   – Result browsing

[Diagram] Client Tools (CLI, AJAX App) -> Java Servlet (REST Gateway, Query Compiler)
Suggested AJAX

• Which AJAX framework?
• ExtJS?
• Look and feel: just clone a Google app with the
  trademarks and logos replaced?
• Why is a web UI more important for Drill than for
  Hive?
  – Drill is interactive, so at least a basic web UI must be
    provided with each release.
Suggested CLI Design
• Would Bash + curl suffice?
• Or a full-blown Java CLI tool?
Suggested REST-GW Design
• A plain vanilla Java web app with Spring!
Suggested Query Compiler Design #1
• The Query Compiler consists of two component
  libraries with a stable but language-dependent (so
  no reuse, unfortunately) interface between them:

[Diagram] Query Text -> Parsers -> SemanticModelReader -> Planners -> Executable Script
• Parsers report Syntax Errors.
• Planners report Semantic Errors.
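
As a rough illustration of the two-stage split above, the following sketch keeps syntax errors in the parser and semantic errors in the planner. The toy grammar (`SELECT <column> FROM <table>`), the `String[]` stand-in for the semantic model, and the text form of the plan are all assumptions made for this example and are not part of the proposal.

```java
import java.util.Set;

/** Toy two-stage compiler: the parser reports syntax errors, the planner semantic errors. */
final class CompilerStagesDemo {

    static final class SyntaxError extends Exception {
        SyntaxError(String message) { super(message); }
    }

    static final class SemanticError extends Exception {
        SemanticError(String message) { super(message); }
    }

    /** Stage 1: only checks the shape of the text and returns {column, table}. */
    static String[] parse(String queryText) throws SyntaxError {
        String[] tokens = queryText.trim().split("\\s+");
        boolean wellFormed = tokens.length == 4
                && tokens[0].equalsIgnoreCase("SELECT")
                && tokens[2].equalsIgnoreCase("FROM");
        if (!wellFormed) {
            throw new SyntaxError("expected: SELECT <column> FROM <table>");
        }
        return new String[] { tokens[1], tokens[3] };
    }

    /** Stage 2: gives the parsed query meaning by checking it against known columns. */
    static String plan(String[] model, Set<String> knownColumns) throws SemanticError {
        if (!knownColumns.contains(model[0])) {
            throw new SemanticError("unknown column: " + model[0]);
        }
        return "scan(" + model[1] + ").project(" + model[0] + ")";
    }

    /** Runs both stages and labels any failure with the stage that produced it. */
    static String compile(String queryText, Set<String> knownColumns) {
        try {
            return plan(parse(queryText), knownColumns);
        } catch (SyntaxError e) {
            return "syntax error: " + e.getMessage();
        } catch (SemanticError e) {
            return "semantic error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        Set<String> columns = Set.of("url");
        System.out.println(compile("SELECT url FROM logs", columns));
        System.out.println(compile("SELEKT url FROM logs", columns));
        System.out.println(compile("SELECT size FROM logs", columns));
    }
}
```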
Suggested Query Compiler Design #2
• DrqlSemanticModelReader is ready and published
  under …..
• The SemanticModel that the parsers produce closely follows
  the original language. Parsers just parse the query text and
  do not attempt to “give it meaning” or annotate it.
• Simplified example:
   –   List<Expression> getResultColumns();
   –   List<DrqlQuery> getFromClause();
   –   List<ColumnId> getGroupByClause();
   –   etc….
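
A minimal sketch of what such a reader interface might look like in Java follows. The three method names come from the simplified example above; the bodies of `Expression`, `ColumnId` and `DrqlQuery`, the `TableRef` class, and the hand-built model are placeholders invented for illustration, since the slide names these types without defining them.

```java
import java.util.List;

/** Sketch of a read-only semantic model that mirrors the query text without interpreting it. */
final class SemanticModelDemo {

    /** Hypothetical placeholder: an expression kept as its source text. */
    static final class Expression {
        final String text;
        Expression(String text) { this.text = text; }
    }

    /** Hypothetical placeholder: a column identifier. */
    static final class ColumnId {
        final String name;
        ColumnId(String name) { this.name = name; }
    }

    /** Hypothetical placeholder: anything that can appear in a FROM clause. */
    interface DrqlQuery { }

    /** Hypothetical placeholder: a plain table reference. */
    static final class TableRef implements DrqlQuery {
        final String table;
        TableRef(String table) { this.table = table; }
    }

    /** Method names as listed on the slide. */
    interface SemanticModelReader {
        List<Expression> getResultColumns();
        List<DrqlQuery> getFromClause();
        List<ColumnId> getGroupByClause();
    }

    /** Hand-built model for: SELECT country, COUNT(*) FROM visits GROUP BY country */
    static SemanticModelReader exampleModel() {
        return new SemanticModelReader() {
            public List<Expression> getResultColumns() {
                return List.of(new Expression("country"), new Expression("COUNT(*)"));
            }
            public List<DrqlQuery> getFromClause() {
                return List.of(new TableRef("visits"));
            }
            public List<ColumnId> getGroupByClause() {
                return List.of(new ColumnId("country"));
            }
        };
    }

    /** A consumer such as a planner only ever sees the reader interface. */
    static String describe(SemanticModelReader model) {
        return model.getResultColumns().size() + " result columns, "
                + model.getFromClause().size() + " source(s), "
                + model.getGroupByClause().size() + " group-by column(s)";
    }

    public static void main(String[] args) {
        System.out.println(describe(exampleModel()));
    }
}
```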
Suggested Query Compiler Design #3
• What is an Executable Script?
   – A self-contained, serializable, executable object. When executed with
     an appropriate executor, it yields the correct query result on given
     input data of the expected format.
   – Self-contained means no dependencies: everything is included in the
     executable object.
   – In particular, data parsing logic is included.
   – However, data access logic is NOT included.
   – The model for the script is: “Here is your blob of size N, mapped to
     memory starting at address S. You have time T to generate your
     result, up to size R, in memory starting at address D. You will be
     terminated without advance notice for any attempted violation of
     any restriction.”
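
The contract quoted above can be approximated in plain Java to make the limits concrete. In this sketch a read-only `ByteBuffer` stands in for the mapped input blob, a fixed-capacity `ByteBuffer` for the result area, and a `Future` timeout for the time limit. The `Script` interface and `execute` method are hypothetical names, and a JVM thread is not a real sandbox: it cannot be killed forcibly and offers none of the isolation the proposal expects from ZeroVM or LXC.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of the "blob in, bounded result out, limited time" contract. */
final class ScriptContractDemo {

    /** Hypothetical script interface: reads the input blob, writes the result. */
    interface Script {
        void run(ByteBuffer input, ByteBuffer output) throws Exception;
    }

    /** Example script: sums the unsigned byte values of the blob into one long. */
    static final Script SUM_BYTES = (in, out) -> {
        long sum = 0;
        while (in.hasRemaining()) {
            sum += in.get() & 0xFF;
        }
        out.putLong(sum);
    };

    /** Runs the script under the limits; any violation ends in IllegalStateException. */
    static byte[] execute(Script script, byte[] blob, int maxResultBytes, long timeoutMillis) {
        ByteBuffer input = ByteBuffer.wrap(blob).asReadOnlyBuffer();  // blob of size N
        ByteBuffer output = ByteBuffer.allocate(maxResultBytes);      // result area of size R
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> job = pool.submit(() -> {
                script.run(input, output);
                return null;
            });
            job.get(timeoutMillis, TimeUnit.MILLISECONDS);            // time limit T
        } catch (TimeoutException e) {
            throw new IllegalStateException("terminated: time limit exceeded");
        } catch (ExecutionException e) {
            // Includes BufferOverflowException when the script writes more than R bytes.
            throw new IllegalStateException("terminated: " + e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("terminated: interrupted");
        } finally {
            pool.shutdownNow();  // interrupts the worker; it does not force-kill it
        }
        output.flip();
        byte[] result = new byte[output.remaining()];
        output.get(result);
        return result;
    }

    public static void main(String[] args) {
        byte[] result = execute(SUM_BYTES, new byte[] {1, 2, 3}, 8, 1000);
        System.out.println(ByteBuffer.wrap(result).getLong());
    }
}
```

The script never sees where the blob came from, which matches the point that data access logic is not part of the executable script.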
Suggested Query Compiler Design #3 (cont.)
• How is the executable script generated?
   1.   A query object implementing the SemanticModelReader interface is
        provided to the planner by the parser.
   2.   The planner logic examines the semantic model through the
        SemanticModelReader interface and produces a query plan
        object that implements the QueryPlanModelReader interface.
        Query analysis and optimization take place during this stage and, if
        needed, additional QueryPlanModelRewriter
        and/or QueryPlanModelVisitor interfaces could be created for this
        purpose. However, DrQL is a simple language without a large (or any)
        search space, so the value of an optimizer is small. We suggest bypassing
        query rewriting and query optimization altogether for initial releases.
   3.   Once the query plan is generated, the most appropriate code template
        is selected. The template engine then processes the template together
        with the QueryPlanModelReader object to produce the executable script.
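
Step 3 above might look roughly like the sketch below. The interface name `QueryPlanModelReader` comes from the slides, but its two methods, the template names, the `${column}` placeholder syntax and the generated snippets are all assumptions made for this example. Plain string substitution stands in for a real template engine to keep the sketch dependency-free.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of template-based script generation from a query plan. */
final class TemplatePlannerDemo {

    /** Interface name from the slides; both methods are hypothetical. */
    interface QueryPlanModelReader {
        String getTemplateName();
        Map<String, String> getParameters();
    }

    /** Hypothetical code templates, keyed by the kind of plan they implement. */
    private static final Map<String, String> TEMPLATES = new LinkedHashMap<>();
    static {
        TEMPLATES.put("count", "if (has(record, \"${column}\")) result += 1;");
        TEMPLATES.put("sum", "result += get(record, \"${column}\");");
    }

    /** Builds a trivial plan that names a template and one column parameter. */
    static QueryPlanModelReader plan(String templateName, String column) {
        return new QueryPlanModelReader() {
            public String getTemplateName() { return templateName; }
            public Map<String, String> getParameters() { return Map.of("column", column); }
        };
    }

    /** Selects the template for the plan and fills in its placeholders. */
    static String generate(QueryPlanModelReader plan) {
        String template = TEMPLATES.get(plan.getTemplateName());
        if (template == null) {
            throw new IllegalArgumentException("no template for plan: " + plan.getTemplateName());
        }
        String script = template;
        for (Map.Entry<String, String> parameter : plan.getParameters().entrySet()) {
            script = script.replace("${" + parameter.getKey() + "}", parameter.getValue());
        }
        return script;
    }

    public static void main(String[] args) {
        System.out.println(generate(plan("sum", "bytes")));
    }
}
```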
Suggested Backend Design

• TODO
• Executors per se
   – Janino based Java Executor
   – LXC-GCC based C Executor
   – ZeroVM-GCC based C Executor
• Storage platforms with collocated data processing
   – Local files (non distributed)
   – HDFS
   – OpenStack Swift
OpenDremel/Dazo
[Diagram] Query -> Query Compiler (JVM) -> Executable job
• Client side: two separate unfinished jQuery apps and a
  command-line app, with no particular codenames.
• Query compiler: we call it Metaxa (historic reasons).
  BQL parser and an unfinished compiler based on Apache
  Velocity.
• Backend: we call it Zwift (Swift + ZeroVM). Alpha quality.
What is Swift?




“Swift is a highly available, distributed,
eventually consistent object/blob store.
Organizations can use Swift to store
lots of data efficiently, safely, and
cheaply.”
Don’t get it?



Swift is THE open-source
   implementation of
        Amazon S3
What is ZeroVM?




Highly-secure, low-overhead, low-latency container-style
virtualization based on Google Native Client project. The
critical security code is transferred verbatim from Chrome
Browser project and therefore is as secure as Chrome
Browser. More info: http://ZeroVM.org and
http://news.ycombinator.com/item?id=3746222
ZeroVM highlights

1.   Disposable VM per request
2.   HyperElasticity per request
3.   Embeddable into everything
4.   High-performance (x86/ARM)
5.   Erlang inspired clustering
6.   Written in pure C, no deps
Don’t get it?


ZeroVM to Virtualization
        is what
SQLite is to Databases
Links

• https://github.com/ApacheDrill/Brainstorm/wiki/Apache-Drill-Links
• OpenDremel (1st generation design):
    – http://code.google.com/p/dremel/source/browse?repo=dremel
    – http://code.google.com/p/dremel/source/browse?repo=metaxa

• Dazo (2nd generation design):
    – https://github.com/Dazo-org
OpenDremel Story: 2010

• Camuel Gilyadov started a Dremel implementation named
  OpenDremel in summer 2010.
• David Gruzman joined the effort a few months later,
  followed by Constantine Peresypkin.
• There wasn’t a comprehensive design or architecture.
  The goal was to get the hierarchical-columnar transformation
  working smoothly and in strict accordance with the
  Dremel paper. Several working implementations have been
  published by us under the Apache License.
• Hong San was hired as the first full-timer to speed up
  development. The Metaxa milestone was set.
OpenDremel Story: 2011
• The early OpenDremel design was found too naive, mainly due to
  Java underperformance in inner number-crunching loops.
• After fierce brainstorming, the project was restarted from scratch
  under the new name Dazo. With Dazo, the query plan is an arbitrary
  piece of executable native code, with a Java frontend.
• From then on we took inspiration from BigQuery rather than
  from the Dremel paper.
• We decided to use Google NaCl as the sandboxing technology to
  isolate queries as well as meter resource consumption. The new
  sandbox was named ZeroVM.
• For storage we decided to use OpenStack Swift.
OpenDremel Story: 2012

• With four people full-time and several others part-time, we still
  don’t have a fully integrated version, but we are satisfied
  with what we have achieved and convinced that the
  decisions behind Dazo were correct.
• We believe ZeroVM could be a disruptive technology in
  itself, revolutionizing the BigData@Cloud space.
• We are excited by the Apache Drill initiative and hope to be
  useful to it.
• Check the blog: http://BigDataCraft.com
Thanks
Camuel Gilyadov,
Email: Camuel@BigDataCraft.com

 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 

Apache Drill (ver. 0.2)

  • 1. Apache Drill: design proposal from the OpenDremel team
      • HLD Version 0.2, 9 Sep 2012
      • Camuel Gilyadov & Constantine Peresypkin, Email: Camuel@BigDataCraft.com
  • 2. Intro
      • This is a high-level design proposal for the Apache Drill project from the OpenDremel team.
      • The history slides and the usual “about us” material have been moved to the end of the deck.
      • A slide with all relevant links is also at the end.
  • 3. Design Tenet #1
      • Apache Drill must support multi-tenant semantics internally rather than rely on being run entirely inside guest VMs.
      • It should be inspired by BigQuery, not only by the Dremel/PowerDrill/Tenzing papers.
      • It is not practical to set up a dedicated cloud (billed hourly) just to run a query for a few seconds.
      • The codebase must be clearly divided into a trusted part and an untrusted part. The trusted part must be kept to an absolute minimum and must be peer-reviewed, secured, audited and metered.
  • 4. Design Tenet #2
      • Apache Drill must be modular and customizable in many dimensions.
      • The schema-on-read concept must be supported. An imperatively coded, high-performance data parser must be embeddable into the query.
      • SQL is no longer enough. It must be easy to add new query languages as well as user-defined functions (UDFs) implementing deep analytics (such as statistics and machine learning).
      • Additionally, various data formats must be supported, such as column stores, row stores, PAX, RCFile, etc.
  • 5. Design Tenet #2 (cont.)
      • We suggest relaxing the query plan format to an arbitrary executable and the data format to an arbitrary opaque BLOB.
      • This way, new query languages and new data formats can be supported easily without changing the backend.
      • As an added benefit, the backend becomes a generic, lightweight, homogeneous compute-storage cloud.
      • This approach exhibits good separation of control: the cloud operator controls and bills for generic infrastructure, and the query engine is left completely in the control of the tenant/user.
  • 6. Design Tenet #3
      • Apache Drill requests/queries must be hyper-elastic, meaning able to exploit the compute capacity of thousands of servers for a short duration of just a few seconds. No resources may be kept spinning per user between queries or when idle.
      • Traditional VMs are too heavyweight for that. Container approaches such as OpenVZ, LXC, etc. are not secure enough in a multi-tenancy context.
      • We suggest making sandboxing pluggable and supporting ZeroVM (developed for OpenDremel) and LXC (fine for private clouds) to begin with.
  • 7. Design Tenet #4
      • Apache Drill must be efficient.
      • Value per bit is extremely low with BigData.
      • Overhead in the inner loop must be kept to a minimum.
      • We found Java inefficient for general number crunching (such as data compression). The main problem with Java is that GC overhead is unavoidable for the whole data corpus being scanned. We went so far as to keep all data in byte arrays and auto-generate transformation code; it still underperformed, and code complexity went through the roof.
  • 8. Suggested Architecture
      • [Diagram] Browser / Client → (Query) → Single-Tenant Frontend, running inside a traditional guest VM (JVM, Query Compiler) → (Executable job) → Multi-Tenant Backend (scale-out object store and in-situ compute).
  • 9. Suggested Frontend Design
      • Usual Java single-tenant web application.
      • In charge of:
          – All interaction with the user
          – Query/job submission
          – Query/job progress monitoring
          – Result browsing
      • [Diagram] Client tools (CLI, AJAX app); Java servlet (REST gateway, query compiler).
  • 10. Suggested AJAX
      • Which AJAX framework? ExtJS?
      • Look and feel: just clone Google App with the trademarks and logos replaced?
      • Why is a web UI more important for Drill than for Hive?
          – Drill is interactive, so at least a basic web UI must be provided with each release.
  • 11. Suggested CLI Design
      • Would Bash + curl suffice?
      • Or a full-blown Java CLI tool?
  • 12. Suggested REST-GW Design
      • Usual vanilla Java web app with Spring!
  • 13. Suggested Query Compiler Design #1
      • The Query Compiler consists of two component libraries, parsers and planners, with a stable but language-dependent interface between them (so no reuse, unfortunately).
      • [Diagram] Query Text → Parsers → SemanticModelReader → Planners → Executable Script. Parsers report syntax errors; planners report semantic errors.
  • 14. Suggested Query Compiler Design #2
      • DrqlSemanticModelReader is ready and published under …..
      • The SemanticModel that the parsers produce closely follows the original language. Parsers just parse the query text and do not attempt to “give it meaning” or annotate it.
      • Simplified example:
          – List<Expression> getResultColumns();
          – List<DrqlQuery> getFromClause();
          – List<ColumnId> getGroupByClause();
          – etc.
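The accessor list on slide 14 could be read as a small read-only Java interface. The sketch below is only an illustration under assumptions: `Expression` and `ColumnId` are named on the slide but not defined there, so they are empty placeholders here, `ParsedQuery` is a hypothetical implementation, and treating the FROM clause as a list of nested queries is a guess based on the `List<DrqlQuery>` return type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical placeholder: the slide names Expression but does not define it.
final class Expression {
    final String text;
    Expression(String text) { this.text = text; }
}

// Hypothetical placeholder: the slide names ColumnId but does not define it.
final class ColumnId {
    final String name;
    ColumnId(String name) { this.name = name; }
}

// Read-only view of one parsed query, exposing the three accessors from the slide.
interface DrqlQuery {
    List<Expression> getResultColumns();
    List<DrqlQuery> getFromClause();      // assumed: nested queries feeding this one
    List<ColumnId> getGroupByClause();
}

// Hypothetical immutable implementation a parser might return.
// It stores what was written in the query and attaches no meaning to it.
final class ParsedQuery implements DrqlQuery {
    private final List<Expression> resultColumns;
    private final List<DrqlQuery> fromClause;
    private final List<ColumnId> groupByClause;

    ParsedQuery(List<Expression> resultColumns,
                List<DrqlQuery> fromClause,
                List<ColumnId> groupByClause) {
        // Defensive copies keep the model read-only for the planner.
        this.resultColumns = Collections.unmodifiableList(new ArrayList<>(resultColumns));
        this.fromClause = Collections.unmodifiableList(new ArrayList<>(fromClause));
        this.groupByClause = Collections.unmodifiableList(new ArrayList<>(groupByClause));
    }

    @Override public List<Expression> getResultColumns() { return resultColumns; }
    @Override public List<DrqlQuery> getFromClause() { return fromClause; }
    @Override public List<ColumnId> getGroupByClause() { return groupByClause; }
}
```

Keeping the model immutable is a design choice of this sketch, not something the slides state; it matches the "reader" naming, where the planner only inspects the model.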
  • 15. Suggested Query Compiler Design #3
      • What is an Executable Script?
          – A self-contained, serializable, executable object. When executed with an appropriate executor, it yields the correct query result on given input data of the expected format.
          – Self-contained means no dependencies: everything is included in the executable object.
          – In particular, the data parsing logic is included.
          – However, the data access logic is NOT included.
          – The model for the script is: “here is your blob of size N mapped to memory starting from address S, you have time T to generate your result up to size R in memory starting from address D. You will be terminated without advance notice for any attempted violation of any restriction”
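The contract quoted on slide 15 (input blob of size N, time budget T, output of at most R bytes, termination on violation) can be illustrated with a minimal Java sketch. Everything named here (`ExecutableScript`, `BoundedExecutor`, the parameter names) is hypothetical. It shows only the shape of the contract: a plain JVM thread is not a sandbox, and the deck proposes ZeroVM or LXC for real isolation.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical script contract: read the input blob, write the result to output.
interface ExecutableScript {
    void run(ByteBuffer input, ByteBuffer output) throws Exception;
}

// Hypothetical executor that enforces the size and time limits from the slide.
final class BoundedExecutor {
    static byte[] execute(ExecutableScript script, byte[] blob,
                          int maxResultBytes, long timeoutMillis) {
        // "Blob of size N": the script gets a read-only view of the input.
        ByteBuffer input = ByteBuffer.wrap(blob).asReadOnlyBuffer();
        // "Result up to size R": writes past the capacity throw BufferOverflowException.
        ByteBuffer output = ByteBuffer.allocate(maxResultBytes);

        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> job = pool.submit(() -> {
                script.run(input, output);
                return null;
            });
            // "Time T": give up waiting once the budget is spent.
            job.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("script exceeded its time limit", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("script violated a restriction", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("executor was interrupted", e);
        } finally {
            // Interrupts the worker; a real sandbox would kill the process instead.
            pool.shutdownNow();
        }

        output.flip();
        byte[] result = new byte[output.remaining()];
        output.get(result);
        return result;
    }
}
```

Note that data access stays outside the script, as the slide requires: the caller loads the blob and hands it over, and the script never opens files or sockets itself.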
  • 16. Suggested Query Compiler Design #3
      • How is the executable script generated?
          1. The parser provides the planner with a query object implementing the SemanticModelReader interface.
          2. The planner logic examines the semantic model through the SemanticModelReader interface and produces a query plan object that implements the QueryPlanModelReader interface. Query analysis and optimization take place during this stage; if needed, additional QueryPlanModelRewriter and/or QueryPlanModelVisitor interfaces could be created for this purpose. However, DrQL is a simple language without a large (or any) search space, so the value of an optimizer is small. We suggest bypassing query rewriting and query optimization altogether for initial releases.
          3. Once the query plan is generated, the most appropriate code template script is selected. Then the template engine processes the template coupled with the QueryPlanModelReader object to produce executable
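The three steps on slide 16 can be sketched as a tiny pipeline. This is an illustration under assumptions, not the team's code: both interfaces are reduced to the bare minimum and their methods are invented for the example, the planner passes the query through without optimization (as the slide suggests for initial releases), and plain `String.replace` stands in for a real template engine. The template text and the `full_scan` name are made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reduced, hypothetical version of the parser's output (step 1).
interface SemanticModelReader {
    List<String> getResultColumns();
}

// Reduced, hypothetical version of the planner's output (step 2).
interface QueryPlanModelReader {
    String getTemplateName();
    List<String> getProjectedColumns();
}

// Step 2: no rewriting or optimization, just a direct mapping to a plan.
final class PassThroughPlanner {
    static QueryPlanModelReader plan(SemanticModelReader query) {
        final List<String> columns = query.getResultColumns();
        return new QueryPlanModelReader() {
            @Override public String getTemplateName() { return "full_scan"; }
            @Override public List<String> getProjectedColumns() { return columns; }
        };
    }
}

// Step 3: select a code template by name and fill in its placeholders.
final class TemplateEngine {
    private static final Map<String, String> TEMPLATES = new HashMap<>();
    static {
        // Made-up template that emits a fragment of C source text.
        TEMPLATES.put("full_scan",
            "/* scan projecting: ${columns} */\n#define COLUMN_COUNT ${count}\n");
    }

    static String render(QueryPlanModelReader plan) {
        String template = TEMPLATES.get(plan.getTemplateName());
        if (template == null) {
            throw new IllegalArgumentException("unknown template: " + plan.getTemplateName());
        }
        List<String> columns = plan.getProjectedColumns();
        return template
            .replace("${columns}", String.join(", ", columns))
            .replace("${count}", Integer.toString(columns.size()));
    }
}
```

In this sketch the rendered text would still have to be compiled by one of the executors before it can run; that step is left out.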
  • 17. Suggested Backend Design
      • TODO
      • Executors per se
          – Janino-based Java executor
          – LXC-GCC-based C executor
          – ZeroVM-GCC-based C executor
      • Storage platforms with collocated data processing
          – Local files (non-distributed)
          – HDFS
          – OpenStack Swift
  • 18. OpenDremel/Dazo
      • Clients: two separate unfinished jQuery apps and a command-line app, with no particular codenames.
      • Frontend: we call it Metaxa (historic reasons). BQL parser and an unfinished compiler based on Apache Velocity.
      • Backend: we call it Zwift (Swift + ZeroVM). Alpha quality.
      • [Diagram] Query → JVM (Query Compiler) → Executable job.
  • 19. What is Swift?
      • “Swift is a highly available, distributed, eventually consistent object/blob store. Organizations can use Swift to store lots of data efficiently, safely, and cheaply.”
  • 20. Don’t get it?
      • Swift is THE open-source implementation of Amazon S3
  • 21. What is ZeroVM?
      • Highly secure, low-overhead, low-latency container-style virtualization based on the Google Native Client project. The critical security code is transferred verbatim from the Chrome Browser project and is therefore as secure as Chrome Browser.
      • More info: http://ZeroVM.org and http://news.ycombinator.com/item?id=3746222
  • 22. ZeroVM highlights
          1. Disposable VM per request
          2. Hyper-elasticity per request
          3. Embeddable into everything
          4. High performance (x86/ARM)
          5. Erlang-inspired clustering
          6. Written in pure C, no deps
  • 23. Don’t get it?
      • ZeroVM is to virtualization what SQLite is to databases
  • 24. Links
      • https://github.com/ApacheDrill/Brainstorm/wiki/Apache-Drill-Links
      • OpenDremel (1st generation design):
          – http://code.google.com/p/dremel/source/browse?repo=dremel
          – http://code.google.com/p/dremel/source/browse?repo=metaxa
      • Dazo (2nd generation design):
          – https://github.com/Dazo-org
  • 25. OpenDremel Story: 2010
      • Camuel Gilyadov started a Dremel implementation, named OpenDremel, in summer 2010.
      • David Gruzman joined the effort a few months later, followed by Constantine Peresypkin.
      • There wasn’t a comprehensive design or architecture. The goal was to get the hierarchical-columnar transformation working smoothly and in strict accordance with the Dremel paper. We published several working implementations under the Apache License.
      • Hong San was hired as the first full-timer to speed up development. The Metaxa milestone was set.
  • 26. OpenDremel Story: 2011
      • The early OpenDremel design was found to be too naive, mainly due to Java underperformance in inner number-crunching loops.
      • After fierce brainstorming, the project was restarted from scratch under the new name Dazo. With Dazo, the query plan is an arbitrary piece of executable native code, with a Java frontend.
      • From then on we took inspiration from BigQuery rather than from the Dremel paper.
      • We decided to use Google NaCl as the sandboxing technology to isolate queries as well as to meter resource consumption. The new sandbox was named ZeroVM.
      • For storage we decided to use OpenStack Swift.
  • 27. OpenDremel Story: 2012
      • With four people full-time and several others part-time, we still don’t have a fully integrated version, but we are satisfied with what we have achieved and convinced that the decisions behind Dazo were correct.
      • We believe ZeroVM could be a disruptive technology in itself, revolutionizing the BigData@Cloud space.
      • We are excited by the Apache Drill initiative and hope to be useful to it.
      • Check the blog: http://BigDataCraft.com