SlideShare uma empresa Scribd logo
1 de 59
The MapR Big Data platform
M7 and Apache Drill
Michael Hausenblas, Chief Data Engineer EMEA, MapR
HUG France, Paris, 2013-06-05
Which
workloads do
you
encounter in
your
environment?
http://www.flickr.com/photos/kevinomara/2866648330/licensedunderCCBY-NC-ND2.0
Batch processing
… for recurring tasks such as large-scale data mining, ETL
offloading/data-warehousing  for the batch layer in Lambda
architecture
OLTP
… user-facing eCommerce transactions, real-time messaging at
scale (FB), time-series processing, etc.  for the serving layer in
Lambda architecture
Stream processing
… in order to handle stream sources such as social media feeds
or sensor data (mobile phones, RFID, weather stations, etc.) 
for the speed layer in Lambda architecture
Search/Information Retrieval
… retrieval of items from unstructured documents (plain
text, etc.), semi-structured data formats (JSON, etc.), as
well as data stores (MongoDB, CouchDB, etc.)
MapR Big Data platform
MAPR’S M7
HOW WE MADE HBASE ENTERPRISE-GRADE
HBase Adoption
 Used by 45% of Hadoop users
 Database Operations:
Large scale key-value store
Blob store
Lightweight OLTP
 Real-time Analytics:
Logistics Shopping cart
Billing Auction engine
Log analysis
HBase Issues
Reliability
•Compactions disrupt operations
•Very slow crash recovery
•Unreliable splitting
Business continuity
•Common hardware/software issues cause downtime
•Administration requires downtime
•No point-in-time recovery
•Complex backup process
Performance
•Many bottlenecks result in low throughput
•Limited data locality
•Limited # of tables
Manageability
•Compactions, splits and merges must be done manually (in reality)
•Basic operations like backup or table rename are complex
M7: Enterprise-grade HBase
EASY DEPENDABLE FAST
M7 Enterprise Grade
No RegionServers No
Compactions
Consistent Low Latency
Snapshots
Mirroring
Instant Recovery
No Manual Processes
Unified Namespace
Unified Namespace
$ pwd
/mapr/default/user/dave
$ ls
file1 file2 table1 table2
$ hbase shell
hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds
$ ls
file1 file2 table1 table2 table3
$ hadoop fs -ls /user/dave
Found 5 items
-rw-r--r-- 3 mapr mapr 16 2012-09-28 08:34 /user/dave/file1
-rw-r--r-- 3 mapr mapr 22 2012-09-28 08:34 /user/dave/file2
trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:32 /user/dave/table1
trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:33 /user/dave/table2
trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:38 /user/dave/table3
Fewer Layers
MapR%M7
Eliminating Compactions
HBase-style LevelDB-style M7
Examples BigTable, HBase,
Cassandra, Riak
Cassandra, Riak
WAF Low High Low
RAF High Low Low
I/O storms Yes No No
Disk space overhead High (2x) Low Low
Skewed data handling Bad Good Good
Rewrite large values Yes Yes No
Write-amplification factor (WAF): The ratio between writes to disk and application writes. Note that data must
be rewritten in every indexed structure.
Read-amplification factor (RAF): The ratio between reads from disk and application reads.
Skewed data handling: When inserting values with similar keys (eg, increasing keys, trending topic), do other
values also need to be rewritten?
Portability (In Both Ways)
• HBase applications work as is with M7
– No need to recompile
– No vendor lock-in
• You can also run Apache HBase on an M7 cluster
– Recommended during a migration
– Table names with a slash (/) are in M7, table names without a slash
are in Apache HBase (this can be overridden to allow table-by-table
migration)
• Use standard CopyTable tool to copy a table from HBase to M7 and
vice versa
– hbase org.apache.hadoop.hbase.mapreduce.CopyTable --
new.name=/user/tshiran/mytable mytable
MapR Big Data platform
Apache Drill
interactive, ad-hoc query at scale
http://www.flickr.com/photos/9479603@N02/4144121838/ licensed under CC BY-NC-ND 2.0
How to do
interactive
ad-hoc query
at scale?
Impala
Interactive Query (?)
low-latency
Use Case: Marketing Campaign
• Jane, a marketing analyst
• Determine target segments
• Data from different sources
Use Case: Logistics
• Supplier tracking and performance
• Queries
– Shipments from supplier ‘ACM’ in last 24h
– Shipments in region ‘US’ not from ‘ACM’
SUPPLIER_ID NAME REGION
ACM ACME Corp US
GAL GotALot Inc US
BAP Bits and Pieces Ltd Europe
ZUP Zu Pli Asia
{
"shipment": 100123,
"supplier": "ACM",
“timestamp": "2013-02-01",
"description": ”first delivery today”
},
{
"shipment": 100124,
"supplier": "BAP",
"timestamp": "2013-02-02",
"description": "hope you enjoy it”
}
…
Use Case: Crime Detection
• Online purchases
• Fraud, bilking, etc.
• Batch-generated overview
• Modes
– Explorative
– Alerts
Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable
And now for something completely different …
Google’s Dremel
http://research.google.com/pubs/pub36632.html
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton,
Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339
Dremel is a scalable, interactive ad-hoc
query system for analysis of read-only
nested data. By combining multi-level
execution trees and columnar data layout,
it is capable of running aggregation
queries over trillion-row tables in
seconds. The system scales to thousands of
CPUs and petabytes of data, and has
thousands of users at Google.
…
“
“
Dremel is a scalable, interactive ad-hoc
query system for analysis of read-only
nested data. By combining multi-level
execution trees and columnar data layout,
it is capable of running aggregation
queries over trillion-row tables in
seconds. The system scales to thousands of
CPUs and petabytes of data, and has
thousands of users at Google.
…
Google’s Dremel
multi-level execution trees
columnar data layout
Google’s Dremel
nested data + schema column-striped representation
map nested data to tables
Google’s Dremel
experiments:
datasets & query performance
Back to Apache Drill …
Apache Drill–key facts
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Plug-able data sources
• Nested data is a first-class citizen
• Schema is optional
• Community driven, open, 100’s involved
High-level Architecture
Principled Query Execution
• Source query—what we want to do (analyst
friendly)
• Logical Plan— what we want to do (language
agnostic, computer friendly)
• Physical Plan—how we want to do it (the best
way we can tell)
• Execution Plan—where we want to do it
Principled Query Execution
Source
Query Parser
Logical
Plan Optimizer
Physical
Plan Execution
SQL 2003
DrQL
MongoQL
DSL
scanner APITopology
CF
etc.
query: [
{
@id: "log",
op: "sequence",
do: [
{
op: "scan",
source: “logs”
},
{
op: "filter",
condition:
"x > 3”
},
parser API
Wire-level Architecture
• Each node: Drillbit - maximize data locality
• Co-ordination, query planning, execution, etc, are distributed
Storage
Process
Drillbit
node
Storage
Process
Drillbit
node
Storage
Process
Drillbit
node
Storage
Process
Drillbit
node
Wire-level Architecture
• Curator/Zookeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality
information, etc.
Curator/Zk
Distributed Cache
Storage
Process
Drillbit
node
Storage
Process
Drillbit
node
Storage
Process
Drillbit
node
Storage
Process
Drillbit
node
Distributed Cache Distributed Cache Distributed Cache
Wire-level Architecture
• Originating Drillbit acts as foreman: manages query execution,
scheduling, locality information, etc.
• Streaming data communication avoiding SerDe
Curator/Zk
Distributed Cache
Storage
Process
Drillbit
node
Storage
Process
Drillbit
node
Storage
Process
Drillbit
node
Storage
Process
Drillbit
node
Distributed Cache Distributed Cache Distributed Cache
Wire-level Architecture
Foreman turns into
root of the multi-level
execution tree, leafs
activate their storage
engine interface.
node
node node
Curator/Zk
On the shoulders of giants …
• Jackson for JSON SerDe for metadata
• Typesafe HOCON for configuration and module management
• Netty4 as core RPC engine, protobuf for communication
• Vanilla Java, Larray and Netty ByteBuf for off-heap large data structures
• Hazelcast for distributed cache
• Netflix Curator on top of Zookeeper for service registry
• Optiq for SQL parsing and cost optimization
• Parquet (http://parquet.io) as native columnar format
• Janino for expression compilation
• ASM for ByteCode manipulation
• Yammer Metrics for metrics
• Guava extensively
• Carrot HPC for primitive collections
Key features
• Full SQL – ANSI SQL 2003
• Nested Data as first class citizen
• Optional Schema
• Extensibility Points …
Extensibility Points
• Source query  parser API
• Custom operators, UDF  logical plan
• Serving tree, CF, topology  physical plan/optimizer
• Data sources &formats  scanner API
Source
Query Parser
Logical
Plan Optimizer
Physical
Plan Execution
… and Hadoop?
• How is it different to Hive, Cascading, etc.?
• Complementary use cases*
• … use Apache Drill
– Find record with specified condition
– Aggregation under dynamic conditions
• … use MapReduce
– Data mining with multiple iterations
– ETL
*)https://cloud.google.com/files/BigQueryTechnicalWP.pdf
LET’S GET OUR HANDS DIRTY…
Basic Demo
https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
{
"id": "0001",
"type": "donut",
”ppu": 0.55,
"batters":
{
"batter”:
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
…
data source: donuts.json
query:[ {
op:"sequence",
do:[
{
op: "scan",
ref: "donuts",
source: "local-logs",
selection: {data: "activity"}
},
{
op: "filter",
expr: "donuts.ppu < 2.00"
},
…
logical plan: simple_plan.json
result: out.json
{
"sales" : 700.0,
"typeCount" : 1,
"quantity" : 700,
"ppu" : 1.0
}
{
"sales" : 109.71,
"typeCount" : 2,
"quantity" : 159,
"ppu" : 0.69
}
{
"sales" : 184.25,
"typeCount" : 2,
"quantity" : 335,
"ppu" : 0.55
}
SELECT
t.cf1.name as name,
SUM(t.cf1.sales) as total_sales
FROM m7://cluster1/sales t
GROUP BY name
ORDER BY by total_sales desc
sequence: [
{ op: scan, storageengine: m7,
selection: {table: sales}}
{ op: project, projections: [
{ref: name, expr: cf1.name},
{ref: sales, expr: cf1.sales}]}
{ op: segment, ref: by_name, exprs: [name]}
{ op: collapsingaggregate, target: by_name,
carryovers: [name],
aggregations: [{ref: total_sales, expr:
sum(name)}]}
{ op: order, ordering: [{order: desc, expr:
total_sales}]}
{ op: store, storageengine: screen}
]
{
@id: 1, pop: m7scan, cluster: def,
table: sales, cols: [cf1.name, cf2.name]
}
{
@id: 2, op: hash-random-exchange,
input: 1, expr: 1
}
{
@id: 3, op: sorting-hash-aggregate, input: 2,
grouping: 1, aggr:[sum(2)], carry: [1], sort:
~agrr[0]
}
{
@id: 4, op: screen, input: 4
}
Execution Plan
• Break physical plan into fragments
• Determine quantity of parallelization for each
task based on estimated costs
• Assign particular nodes based on affinity, load
and topology
BE A PART OF IT!
Status
• Heavy development by multiple organizations
• Available
– Logical plan (ADSP)
– Reference interpreter
– Basic SQL parser
– Basic demo
Status
June 2013
• Full SQL support (+JDBC)
• Physical plan
• In-memory compressed data interfaces
• Distributed execution
Status
June 2013
• HBase and MySQL storage engine
• User Interfaces
User Interfaces
User Interfaces
• API—DrillClient
– Encapsulates endpoint discovery
– Supports logical and physical plan submission,
query cancellation, query status
– Supports streaming return results
• JDBC driver, converting JDBC into DrillClient
communication.
• REST proxy for DrillClient
User Interfaces
Contributing
Contributions appreciated (not only code drops) …
• Test data & test queries
• Use case scenarios (textual/SQL queries)
• Documentation
• Further schedule
– Alpha Q2
– Beta Q3
Kudos to …
• Julian Hyde, Pentaho
• Lisen Mu, XingCloud
• Tim Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS
• Jacques Nadeau, MapR
• Ted Dunning, MapR
Engage!
• Follow @ApacheDrill on Twitter
• Sign up at mailing lists (user | dev)
http://incubator.apache.org/drill/mailing-lists.html
• Standing G+ hangouts every Tuesday at 18:00 CET
http://j.mp/apache-drill-hangouts
• Keep an eye on http://drill-user.org/

Mais conteúdo relacionado

Mais procurados

Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillMapR Technologies
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillDataWorks Summit
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...The Hive
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache DrillCharles Givre
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Vince Gonzalez
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillMapR Technologies
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 

Mais procurados (20)

Apache Drill
Apache DrillApache Drill
Apache Drill
 
Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache Drill
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
HUG France - Apache Drill
HUG France - Apache DrillHUG France - Apache Drill
HUG France - Apache Drill
 

Destaque

Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreModern Data Stack France
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04Ted Dunning
 
Analyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien CabotAnalyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien CabotModern Data Stack France
 
Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Modern Data Stack France
 
Cassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaCassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaModern Data Stack France
 
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiCassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiModern Data Stack France
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopHortonworks
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Cedric CARBONE
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainHadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainModern Data Stack France
 

Destaque (20)

Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Cascalog présenté par Bertrand Dechoux
Cascalog présenté par Bertrand DechouxCascalog présenté par Bertrand Dechoux
Cascalog présenté par Bertrand Dechoux
 
Hadoop on Azure
Hadoop on AzureHadoop on Azure
Hadoop on Azure
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 
IBM Stream au Hadoop User Group
IBM Stream au Hadoop User GroupIBM Stream au Hadoop User Group
IBM Stream au Hadoop User Group
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04
 
Analyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien CabotAnalyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien Cabot
 
Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)
 
Cassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaCassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy Hanna
 
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiCassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on Hadoop
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainHadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
 
Dépasser map() et reduce()
Dépasser map() et reduce()Dépasser map() et reduce()
Dépasser map() et reduce()
 
Hadoop chez Kobojo
Hadoop chez KobojoHadoop chez Kobojo
Hadoop chez Kobojo
 
Big Data et SEO, par Vincent Heuschling
Big Data et SEO, par Vincent HeuschlingBig Data et SEO, par Vincent Heuschling
Big Data et SEO, par Vincent Heuschling
 
HCatalog
HCatalogHCatalog
HCatalog
 
Hadopp Vue d'ensemble
Hadopp Vue d'ensembleHadopp Vue d'ensemble
Hadopp Vue d'ensemble
 
Hadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas VialHadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas Vial
 

Semelhante a M7 and Apache Drill, Micheal Hausenblas

Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill MapR Technologies
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsMapR Technologies
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 
Mastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisMastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisTeradata Aster
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingJen Aman
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSAmazon Web Services
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist SoftServe
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-PatternsDouglas Moore
 
Xadoop - new approaches to data analytics
Xadoop - new approaches to data analyticsXadoop - new approaches to data analytics
Xadoop - new approaches to data analyticsMaxim Grinev
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageSATOSHI TAGOMORI
 

Semelhante a M7 and Apache Drill, Micheal Hausenblas (20)

Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Mastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisMastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and Analysis
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
 
Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
 
Drill dchug-29 nov2012
Drill dchug-29 nov2012Drill dchug-29 nov2012
Drill dchug-29 nov2012
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWS
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
Xadoop - new approaches to data analytics
Xadoop - new approaches to data analyticsXadoop - new approaches to data analytics
Xadoop - new approaches to data analytics
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 

Mais de Modern Data Stack France

Talend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark MeetupTalend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark MeetupModern Data Stack France
 
Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017Modern Data Stack France
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Modern Data Stack France
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...Modern Data Stack France
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
HUG France - 20160114 industrialisation_process_big_data CanalPlus
HUG France -  20160114 industrialisation_process_big_data CanalPlusHUG France -  20160114 industrialisation_process_big_data CanalPlus
HUG France - 20160114 industrialisation_process_big_data CanalPlusModern Data Stack France
 
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)Modern Data Stack France
 
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015Modern Data Stack France
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Modern Data Stack France
 
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015Modern Data Stack France
 
June Spark meetup : search as recommandation
June Spark meetup : search as recommandationJune Spark meetup : search as recommandation
June Spark meetup : search as recommandationModern Data Stack France
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Modern Data Stack France
 
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamielParis Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamielModern Data Stack France
 

Mais de Modern Data Stack France (20)

Stash - Data FinOPS
Stash - Data FinOPSStash - Data FinOPS
Stash - Data FinOPS
 
Vue d'ensemble Dremio
Vue d'ensemble DremioVue d'ensemble Dremio
Vue d'ensemble Dremio
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Talend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark MeetupTalend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark Meetup
 
Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Hug janvier 2016 -EDF
Hug   janvier 2016 -EDFHug   janvier 2016 -EDF
Hug janvier 2016 -EDF
 
HUG France - 20160114 industrialisation_process_big_data CanalPlus
HUG France -  20160114 industrialisation_process_big_data CanalPlusHUG France -  20160114 industrialisation_process_big_data CanalPlus
HUG France - 20160114 industrialisation_process_big_data CanalPlus
 
Hugfr SPARK & RIAK -20160114_hug_france
Hugfr  SPARK & RIAK -20160114_hug_franceHugfr  SPARK & RIAK -20160114_hug_france
Hugfr SPARK & RIAK -20160114_hug_france
 
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
 
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
 
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
 
Spark dataframe
Spark dataframeSpark dataframe
Spark dataframe
 
June Spark meetup : search as recommandation
June Spark meetup : search as recommandationJune Spark meetup : search as recommandation
June Spark meetup : search as recommandation
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Spark meetup at viadeo
Spark meetup at viadeoSpark meetup at viadeo
Spark meetup at viadeo
 
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamielParis Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
 

Último

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

M7 and Apache Drill, Micheal Hausenblas

  • 1. The MapR Big Data platform M7 and Apache Drill Michael Hausenblas, Chief Data Engineer EMEA, MapR HUG France, Paris, 2013-06-05
  • 2.
  • 4. Batch processing … for recurring tasks such as large-scale data mining, ETL offloading/data-warehousing  for the batch layer in Lambda architecture
  • 5. OLTP … user-facing eCommerce transactions, real-time messaging at scale (FB), time-series processing, etc.  for the serving layer in Lambda architecture
  • 6. Stream processing … in order to handle stream sources such as social media feeds or sensor data (mobile phones, RFID, weather stations, etc.)  for the speed layer in Lambda architecture
  • 7. Search/Information Retrieval … retrieval of items from unstructured documents (plain text, etc.), semi-structured data formats (JSON, etc.), as well as data stores (MongoDB, CouchDB, etc.)
  • 8. MapR Big Data platform
  • 9. MAPR’S M7 HOW WE MADE HBASE ENTERPRISE-GRADE
  • 10. HBase Adoption  Used by 45% of Hadoop users  Database Operations: Large scale key-value store Blob store Lightweight OLTP  Real-time Analytics: Logistics Shopping cart Billing Auction engine Log analysis
  • 11. HBase Issues Reliability •Compactions disrupt operations •Very slow crash recovery •Unreliable splitting Business continuity •Common hardware/software issues cause downtime •Administration requires downtime •No point-in-time recovery •Complex backup process Performance •Many bottlenecks result in low throughput •Limited data locality •Limited # of tables Manageability •Compactions, splits and merges must be done manually (in reality) •Basic operations like backup or table rename are complex
  • 12. M7: Enterprise-grade HBase EASY DEPENDABLE FAST M7 Enterprise Grade No RegionServers No Compactions Consistent Low Latency Snapshots Mirroring Instant Recovery No Manual Processes
  • 14. Unified Namespace $ pwd /mapr/default/user/dave $ ls file1 file2 table1 table2 $ hbase shell hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3' 0 row(s) in 0.1570 seconds $ ls file1 file2 table1 table2 table3 $ hadoop fs -ls /user/dave Found 5 items -rw-r--r-- 3 mapr mapr 16 2012-09-28 08:34 /user/dave/file1 -rw-r--r-- 3 mapr mapr 22 2012-09-28 08:34 /user/dave/file2 trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:32 /user/dave/table1 trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:33 /user/dave/table2 trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:38 /user/dave/table3
  • 16. Eliminating Compactions HBase-style LevelDB-style M7 Examples BigTable, HBase, Cassandra, Riak Cassandra, Riak WAF Low High Low RAF High Low Low I/O storms Yes No No Disk space overhead High (2x) Low Low Skewed data handling Bad Good Good Rewrite large values Yes Yes No Write-amplification factor (WAF): The ratio between writes to disk and application writes. Note that data must be rewritten in every indexed structure. Read-amplification factor (RAF): The ratio between reads from disk and application reads. Skewed data handling: When inserting values with similar keys (eg, increasing keys, trending topic), do other values also need to be rewritten?
  • 17. Portability (In Both Ways) • HBase applications work as is with M7 – No need to recompile – No vendor lock-in • You can also run Apache HBase on an M7 cluster – Recommended during a migration – Table names with a slash (/) are in M7, table names without a slash are in Apache HBase (this can be overridden to allow table-by-table migration) • Use standard CopyTable tool to copy a table from HBase to M7 and vice versa – hbase org.apache.hadoop.hbase.mapreduce.CopyTable -- new.name=/user/tshiran/mytable mytable
  • 18. MapR Big Data platform
  • 20. http://www.flickr.com/photos/9479603@N02/4144121838/ licensed under CC BY-NC-ND 2.0 How to do interactive ad-hoc query at scale?
  • 22. Use Case: Marketing Campaign • Jane, a marketing analyst • Determine target segments • Data from different sources
  • 23. Use Case: Logistics • Supplier tracking and performance • Queries – Shipments from supplier ‘ACM’ in last 24h – Shipments in region ‘US’ not from ‘ACM’ SUPPLIER_ID NAME REGION ACM ACME Corp US GAL GotALot Inc US BAP Bits and Pieces Ltd Europe ZUP Zu Pli Asia { "shipment": 100123, "supplier": "ACM", “timestamp": "2013-02-01", "description": ”first delivery today” }, { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it” } …
  • 24. Use Case: Crime Detection • Online purchases • Fraud, bilking, etc. • Batch-generated overview • Modes – Explorative – Alerts
  • 25. Requirements • Support for different data sources • Support for different query interfaces • Low-latency/real-time • Ad-hoc queries • Scalable, reliable
  • 26. And now for something completely different …
  • 27. Google’s Dremel http://research.google.com/pubs/pub36632.html Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339 Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. … “ “ Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …
  • 28. Google’s Dremel multi-level execution trees columnar data layout
  • 29. Google’s Dremel nested data + schema column-striped representation map nested data to tables
  • 31. Back to Apache Drill …
  • 32. Apache Drill–key facts • Inspired by Google’s Dremel • Standard SQL 2003 support • Plug-able data sources • Nested data is a first-class citizen • Schema is optional • Community driven, open, 100’s involved
  • 34. Principled Query Execution • Source query—what we want to do (analyst friendly) • Logical Plan— what we want to do (language agnostic, computer friendly) • Physical Plan—how we want to do it (the best way we can tell) • Execution Plan—where we want to do it
  • 35. Principled Query Execution Source Query Parser Logical Plan Optimizer Physical Plan Execution SQL 2003 DrQL MongoQL DSL scanner APITopology CF etc. query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” }, parser API
  • 36. Wire-level Architecture • Each node: Drillbit - maximize data locality • Co-ordination, query planning, execution, etc, are distributed Storage Process Drillbit node Storage Process Drillbit node Storage Process Drillbit node Storage Process Drillbit node
  • 37. Wire-level Architecture • Curator/Zookeeper for ephemeral cluster membership info • Distributed cache (Hazelcast) for metadata, locality information, etc. Curator/Zk Distributed Cache Storage Process Drillbit node Storage Process Drillbit node Storage Process Drillbit node Storage Process Drillbit node Distributed Cache Distributed Cache Distributed Cache
  • 38. Wire-level Architecture • Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc. • Streaming data communication avoiding SerDe Curator/Zk Distributed Cache Storage Process Drillbit node Storage Process Drillbit node Storage Process Drillbit node Storage Process Drillbit node Distributed Cache Distributed Cache Distributed Cache
  • 39. Wire-level Architecture Foreman turns into root of the multi-level execution tree, leafs activate their storage engine interface. node node node Curator/Zk
  • 40. On the shoulders of giants … • Jackson for JSON SerDe for metadata • Typesafe HOCON for configuration and module management • Netty4 as core RPC engine, protobuf for communication • Vanilla Java, Larray and Netty ByteBuf for off-heap large data structures • Hazelcast for distributed cache • Netflix Curator on top of Zookeeper for service registry • Optiq for SQL parsing and cost optimization • Parquet (http://parquet.io) as native columnar format • Janino for expression compilation • ASM for ByteCode manipulation • Yammer Metrics for metrics • Guava extensively • Carrot HPC for primitive collections
  • 41. Key features • Full SQL – ANSI SQL 2003 • Nested Data as first class citizen • Optional Schema • Extensibility Points …
  • 42. Extensibility Points • Source query  parser API • Custom operators, UDF  logical plan • Serving tree, CF, topology  physical plan/optimizer • Data sources &formats  scanner API Source Query Parser Logical Plan Optimizer Physical Plan Execution
  • 43. … and Hadoop? • How is it different to Hive, Cascading, etc.? • Complementary use cases* • … use Apache Drill – Find record with specified condition – Aggregation under dynamic conditions • … use MapReduce – Data mining with multiple iterations – ETL *)https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  • 44. LET’S GET OUR HANDS DIRTY…
  • 45. Basic Demo https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo { "id": "0001", "type": "donut", ”ppu": 0.55, "batters": { "batter”: [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, … data source: donuts.json query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, … logical plan: simple_plan.json result: out.json { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 } { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 } { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
  • 46. SELECT t.cf1.name as name, SUM(t.cf1.sales) as total_sales FROM m7://cluster1/sales t GROUP BY name ORDER BY by total_sales desc
  • 47. sequence: [ { op: scan, storageengine: m7, selection: {table: sales}} { op: project, projections: [ {ref: name, expr: cf1.name}, {ref: sales, expr: cf1.sales}]} { op: segment, ref: by_name, exprs: [name]} { op: collapsingaggregate, target: by_name, carryovers: [name], aggregations: [{ref: total_sales, expr: sum(name)}]} { op: order, ordering: [{order: desc, expr: total_sales}]} { op: store, storageengine: screen} ]
  • 48. { @id: 1, pop: m7scan, cluster: def, table: sales, cols: [cf1.name, cf2.name] } { @id: 2, op: hash-random-exchange, input: 1, expr: 1 } { @id: 3, op: sorting-hash-aggregate, input: 2, grouping: 1, aggr:[sum(2)], carry: [1], sort: ~agrr[0] } { @id: 4, op: screen, input: 4 }
  • 49. Execution Plan • Break physical plan into fragments • Determine quantity of parallelization for each task based on estimated costs • Assign particular nodes based on affinity, load and topology
  • 50. BE A PART OF IT!
  • 51. Status • Heavy development by multiple organizations • Available – Logical plan (ADSP) – Reference interpreter – Basic SQL parser – Basic demo
  • 52. Status June 2013 • Full SQL support (+JDBC) • Physical plan • In-memory compressed data interfaces • Distributed execution
  • 53. Status June 2013 • HBase and MySQL storage engine • User Interfaces
  • 55. User Interfaces • API—DrillClient – Encapsulates endpoint discovery – Supports logical and physical plan submission, query cancellation, query status – Supports streaming return results • JDBC driver, converting JDBC into DrillClient communication. • REST proxy for DrillClient
  • 57. Contributing Contributions appreciated (not only code drops) … • Test data & test queries • Use case scenarios (textual/SQL queries) • Documentation • Further schedule – Alpha Q2 – Beta Q3
  • 58. Kudos to … • Julian Hyde, Pentaho • Lisen Mu, XingCloud • Tim Chen, Microsoft • Chris Merrick, RJMetrics • David Alves, UT Austin • Sree Vaadi, SSS • Jacques Nadeau, MapR • Ted Dunning, MapR
  • 59. Engage! • Follow @ApacheDrill on Twitter • Sign up at mailing lists (user | dev) http://incubator.apache.org/drill/mailing-lists.html • Standing G+ hangouts every Tuesday at 18:00 CET http://j.mp/apache-drill-hangouts • Keep an eye on http://drill-user.org/

Notas do Editor

  1. http://solr-vs-elasticsearch.com/
  2. (This is a ASR-35 at DEC mainframe–other console terminals used were Teletype model 35 Teletypes)Allowing the user to issue ad-hoc queries is essential: often, the user might not necessarily know ahead of time what queries to issue. Also, one may need to react to changing circumstances. The lack of tools to perform interactive ad-hoc analysis at scale is a gap that Apache Drill fills.
  3. Hive: compile to MR, Aster: external tables in MPP, Oracle/MySQL: export MR results to RDBMSDrill, Impala, CitusDB: real-time
  4. Suppose a marketing analyst trying to experiment with ways to do targeting of user segments for next campaign. Needs access to web logs stored in Hadoop, and also needs to access user profiles stored in MongoDB as well as access to transaction data stored in a conventional database.
  5. Geo-spatial + time series data with highly discriminative queries (timeframe, region, etc.)
  6. Re ad-hoc:You might not know ahead of time what queries you will want to make. You may need to react to changing circumstances.
  7. Two innovations: handle nested-data column style (column-striped representation) and multi-level execution trees
  8. repetition levels (r) — at what repeated field in the field’s path the value has repeated.definition levels (d) — how many fields in path thatcould be undefined (because they are optional or repeated) are actually presentOnly repeated fields increment the repetition level, only non-required fields increment the definition level. Required fields are always defined and do not need a definition level. Non repeated fields do not need a repetition level.An optional field requires one extra bit to store zero if it is NULL and one if it is defined. NULL values do not need to be stored as the definition level captures this information.
  9. Source query - Human (eg DSL) or tool written(eg SQL/ANSI compliant) query Source query is parsed and transformed to produce the logical planLogical plan: dataflow of what should logically be doneTypically, the logical plan lives in memory in the form of Java objects, but also has a textual formThe logical query is then transformed and optimized into the physical plan.Optimizer introduces of parallel computation, taking topology into accountOptimizer handles columnar data to improve processing speedThe physical plan represents the actual structure of computation as it is done by the systemHow physical and exchange operators should be appliedAssignment to particular nodes and cores + actual query execution per node
  10. Drillbits per node, maximize data localityCo-ordination, query planning, optimization, scheduling, execution are distributedBy default, Drillbits hold all roles, modules can optionally be disabled.Any node/Drillbit can act as endpoint for particular query.
  11. Zookeeper maintains ephemeral cluster membership information onlySmall distributed cache utilizing embedded Hazelcast maintains information about individual queue depth, cached query plans, metadata, locality information, etc.
  12. Originating Drillbit acts as foreman, manages all execution for their particular query, scheduling based on priority, queue depth and locality information.Drillbit data communication is streaming and avoids any serialization/deserialization
  13. Red: originating drillbit, is the root of the multi-level execution tree, per query/jobLeafs use their storage engine interface to scan respective data source (DB, file, etc.)
  14. Relation of Drill to HadoopHadoop = HDFS + MapReduceDrill for:Finding particular records with specified conditions. For example, to findrequest logs with specified account ID.Quick aggregation of statistics with dynamically-changing conditions. For example, getting a summary of request traffic volume from the previous night for a web application and draw a graph from it.Trial-and-error data analysis. For example, identifying the cause of trouble and aggregating values by various conditions, including by hour, day and etc...MapReduce: Executing a complex data mining on Big Data which requires multiple iterations and paths of data processing with programmed algorithms.Executing large join operations across huge datasets.Exporting large amount of data after processing.
  15. Designed to be as easy as possible for language implementers to utilizeDon’t constrain ourselves to SQL specific paradigm – support complex data type operators such as collapse and expand as wellAllow late typing
  16. Insert points of parallelization where optimizer thinks they are necessaryPick the right version of each operatorApply projection and other push-down rules into capable operators