Using realtime SQL2003 to query
JSON on Hadoop with Apache Drill
               January 28, 2013
                    Jacques Nadeau
     Apache Drill Contributor @ MapR Technologies
Me
• Apache Drill and HBase Contributor
• Sponsored by MapR Technologies to lead Apache Drill
  contributions
• About MapR Technologies
   – Enterprise-grade, high-performance distribution for
     Hadoop
   – Open source plus standards-based extensions
   – Large number of Fortune 100 customers, and startups too
   – Free distribution for unlimited nodes
   – Partnered to provide the distribution on Google Compute
     Engine and Amazon Elastic MapReduce
Jane works as an analyst at an ecommerce website.

How does she figure out good targeting segments for the next
marketing campaign?

She has some ideas and lots of data:
• Transaction information
• User profiles
• Access logs
Let’s try using existing options
•   Use Oracle
     – Write flattening MongoDB query for export and generate giant CSV. Work with MapReduce
       team to build a MapReduce job that provides export. Contact DBA to import data exports. Use
       Oracle SQL to determine answers.
•   Use Hive
     – Pull up Hive. Start writing queries. Realize that Hive/Mongo interconnector doesn’t support
       nested data. Realize that Hive doesn’t have JDBC/ODBC storage handler. Query data from
       Oracle and copy to Hadoop. Query flattened Mongo data and copy into Hadoop. Write HiveQL
       query. Wait 30 minutes for result. Repeat until desired outcome. Try to avoid frustration
       along the way with the flattened Mongo data, the partial Oracle extraction, and the lack of
       major portions of SQL syntax.
•   Use Data Virtualization Solution
     – Write SQL query against virtualization interface. Realize that you still need to ETL Mongo data
       since it isn’t natively supported. Query runs slowly since virtualization solution doesn’t run
       locally against Hadoop data and fails to effectively distribute your query.
•   Use MapReduce
     – Work with Engineering to define a specification for needs. Use Sqoop to set up regular ETL
       from Oracle. Define a custom MapReduce job to import Mongo data.
     – Look at output and realize different analyses should be done, repeat cycle (or learn Java)
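The "flattening" step that the Oracle and Hive options depend on can be sketched roughly as follows. This is a minimal illustration in Python, not a MongoDB export tool: the `flatten` and `to_csv` helpers and the sample user documents are hypothetical. It shows how nested fields become dotted column names and how lists get squeezed into a single cell, which is where structure is lost.

```python
import csv
import io


def flatten(doc, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-documents, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # Lists have no natural flat form; join them into one cell.
            flat[name] = ";".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def to_csv(docs):
    """Write flattened documents as CSV text with a shared header."""
    rows = [flatten(doc) for doc in docs]
    # The header is the union of all column names, in sorted order.
    columns = sorted({col for row in rows for col in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical user-profile documents with nested and missing fields.
users = [
    {"id": 1, "name": "Ann",
     "address": {"city": "Boston", "zip": "02110"}, "tags": ["a", "b"]},
    {"id": 2, "name": "Bob", "address": {"city": "Austin"}},
]
csv_text = to_csv(users)
print(csv_text)
```

Every document has to be padded out to the same set of columns, so a new nested field anywhere in the data changes the export schema for everyone downstream.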
Why are things so hard?
• Slow
   – Virtualization solutions don’t support data locality and pushdown
   – MapReduce sacrifices performance to support long running jobs, recoverability, and
     ultimate flexibility
• Old
   – Most systems assume flat data with well-defined static schemas
• Hard
   – Write queries in multiple languages (Does anybody know MongoQL, CQL, HiveQL and
     SQL?)
   – Analysts often need custom development help
• Error Prone
   – ETL leads to data synchronization issues
   – Lack of query transparency leads to incorrect assumptions and bad business conclusions
• Expensive
   – Commercial solutions are very expensive
   – Typically provide poor compatibility with newer NoSQL technologies
Open Source Mantra: WWGD?

                 Distributed    Datastore    Interactive    Batch
                 File System                 analysis       processing

   Google        GFS            BigTable     Dremel         MapReduce

   Open source   HDFS           HBase        ?              Hadoop MapReduce

Build Apache Drill to provide a true open source
   solution to interactive analysis of Big Data
Apache Drill Overview
• Drill overview
   –   Low latency interactive queries
   –   Standard ANSI SQL2003 support
   –   Domain Specific Languages / Your own QL
   –   Inspired by, compatible with Google BigQuery/Dremel
   –   Supports Nested/Hierarchical Data Formats
   –   Supports RDBMS, Hadoop and NoSQL alike

• Open-Source and Flexible
   – Apache Incubator
   – Hundreds of people involved across the US and Europe
   – Community consensus on API, functionality
Why do we need another tool?

   Query type                           Typical latency
   Point queries                        0 – 100 ms
   Interactive queries                  100 ms – 3 minutes
   Data analyst & reporting queries     3 minutes – 20 minutes
   Data mining and major ETL            20 minutes – 20 hours

Tools shown along this spectrum, from lowest to highest latency:
per-system interfaces; Apache Drill; MapReduce, Hive and Pig.
Why not improve Hive or Pig?
•   Different Goals
•   SQL should be first class concern
•   MapReduce severely hampers processing model and performance
     –   Startup cost is high
     –   Map:Reduce recoverability and barrier disadvantages
     –   Job:Job recoverability and barrier disadvantages (chained jobs)
•   Need to build from in-memory representation
     –   Two canonical in-memory formats (row-based and columnar)
     –   Support much larger memory sizes
     –   Smaller memory footprint per record
     –   Avoid serialization/deserialization and object creation costs between nodes and operations
•   Performance of interactive queries is critical
     –   Evaluation and Operator code generation & compilation
•   First class recognition of nested types without metadata requirement
     –   Schema Discovery and standard schema representation
•   Clear delineation between important stages
     –   Support for multiple optimizers and researcher experimentation
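The difference between the two in-memory layouts named above (row-based and columnar) can be illustrated with a toy sketch. This is plain Python under stated assumptions, not Drill's actual in-memory format: the `to_columnar` helper and the sample events are hypothetical. Each record is a row; the columnar form stores one list per field, so an aggregate over one field touches only that list.

```python
def to_columnar(rows):
    """Turn a list of row dicts into a dict of equal-length column lists."""
    columns = {}
    for index, row in enumerate(rows):
        for name, value in row.items():
            # Back-fill None for rows seen before this column first appeared.
            columns.setdefault(name, [None] * index).append(value)
        # Pad columns that this row did not mention.
        for values in columns.values():
            if len(values) <= index:
                values.append(None)
    return columns


# Hypothetical row-based records; the second one carries an extra field.
events = [
    {"user": "ann", "amount": 30},
    {"user": "bob", "amount": 12, "coupon": "X1"},
    {"user": "ann", "amount": 8},
]

cols = to_columnar(events)

# An aggregate over one field reads a single contiguous list.
total = sum(cols["amount"])
print(cols)
print(total)
```

Note that the column set is discovered from the records themselves, with no schema declared up front; fields missing from a record are filled with `None`.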
How does it work?
• Drillbits run on each node to minimize
  network transfer
• Queries can be fed to any Drillbit.
• Coordination, query planning,
  optimization, scheduling, and
  execution are distributed

Example query:
   SELECT * FROM
     oracle.transactions,
     mongo.users,
     hdfs.events
   LIMIT 1
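To make the semantics of the example query concrete: a comma-separated FROM list is a cross join, and LIMIT 1 keeps a single combined row. The sketch below emulates that over in-memory data. It is a toy model in Python, not Drill code; the `sources` dictionary, its sample rows, and the `select_star` helper are all hypothetical stand-ins for the three systems named in the query.

```python
from itertools import islice, product

# Hypothetical stand-ins for the three sources named in the query.
sources = {
    "oracle.transactions": [{"txn_id": 100, "user_id": 1},
                            {"txn_id": 101, "user_id": 2}],
    "mongo.users": [{"user_id": 1, "name": "Ann"},
                    {"user_id": 2, "name": "Bob"}],
    "hdfs.events": [{"user_id": 1, "page": "/home"}],
}


def select_star(tables, limit):
    """Evaluate SELECT * FROM t1, t2, ... LIMIT n over in-memory tables."""
    # A comma-separated FROM list is a cross join of all tables.
    combos = product(*(sources[name] for name in tables))
    result = []
    # islice stops early, so the full cross product is never materialized.
    for combo in islice(combos, limit):
        merged = {}
        for table, row in zip(tables, combo):
            # Qualify each column with its table so equal names do not collide.
            for column, value in row.items():
                merged[f"{table}.{column}"] = value
        result.append(merged)
    return result


rows = select_star(["oracle.transactions", "mongo.users", "hdfs.events"],
                   limit=1)
print(rows)
```

Without an ORDER BY, SQL does not define which row a LIMIT returns; this sketch simply returns the first combination in iteration order.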
Flexibility with Strongly Defined Tiers and APIs
Apache Drill currently in development
• Heavy active development by multiple
  supporting organizations
• Available
  – Logical plan syntax and interpreter
  – Reference Interpreter
• In progress
  – SQL interpreter
  – Storage Engine implementations for Accumulo,
    Cassandra, HBase, and HDFS file formats
Conclusion & Questions
• Put Apache Drill on your roadmap; we’ll make your life
  easier

• Join the community
   – Code: http://github.com/apache/incubator-drill
   – Mailing List: drill-user@incubator.apache.org
   – Wiki: https://cwiki.apache.org/confluence/display/DRILL

• Access this presentation: http://bit.ly/Wo6DLd

• Contact Me:
   – jacques.drill@gmail.com

GBDC 2013-01-28
