Beyond Batch
  HBase, Drill, & Storm

  Brad Anderson


©MapR Technologies
whoami
• Brad Anderson

• Solutions Architect at MapR (Atlanta)

• ATLHUG co-chair

• ‘boorad’ most places (twitter, github)

• banderson@maprtech.com
•    The open enterprise-grade distribution for Hadoop
     • Easy, dependable and fast
     • Open source with standards-based extensions

•    MapR is deployed at thousands of companies
     • From small Internet startups to the world’s largest enterprises

•    MapR customers analyze massive amounts of data:
     • Hundreds of billions of events daily
     • 90% of the world’s Internet population monthly
     • $1 trillion in retail purchases annually

•    MapR Cloud Partners
     • Google to provide Hadoop on Google Compute Engine
     • Amazon for Elastic Map Reduce + instances
Beyond Batch
• HBase & M7

• Apache Drill

• Storm




Latency Matters

         Batch       Interactive   Streaming




HBase Issues
Reliability
• Compactions disrupt operations
• Very slow crash recovery
• Unreliable splitting

Business continuity
• Common hardware/software issues cause downtime
• Administration requires downtime
• No point-in-time recovery
• Complex backup process

Performance
• Many bottlenecks result in low throughput
• Limited data locality
• Limited # of tables

Manageability
• Compactions, splits and merges must be done manually (in reality)
• Basic operations like backup or table rename are complex
M7
• An integrated system for unstructured and structured data
     – Unified namespace for files and tables
     – Data management
     – Data protection
     – Disaster recovery
     – No additional administration

• An architecture that delivers reliability and performance
     – Fewer layers
     – No compactions
     – Seamless splits
     – Automatic merges
     – Single network hop
     – Instant recovery
     – Reduced read and write amplification

Unified Namespace
$ pwd
/mapr/default/user/boorad

$ ls
file1 file2 table1 table2

$ hbase shell
hbase(main):003:0> create '/user/boorad/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds

$ ls
file1 file2 table1 table2 table3

$ hadoop fs -ls /user/boorad
Found 5 items
-rw-r--r-- 3 mapr mapr       16 2012-09-28 08:34 /user/boorad/file1
-rw-r--r-- 3 mapr mapr       22 2012-09-28 08:34 /user/boorad/file2
trwxr-xr-x 3 mapr mapr       2 2012-09-28 08:32 /user/boorad/table1
trwxr-xr-x 3 mapr mapr       2 2012-09-28 08:33 /user/boorad/table2
trwxr-xr-x 3 mapr mapr       2 2012-09-28 08:38 /user/boorad/table3
Simplifying HBase Architecture
[Diagram: three storage stacks, listed top to bottom]
• Other distributions: HBase (JVM), DFS (JVM), ext3, disks
• Second stack: HBase (JVM), MapR, disks
• Third stack: Unified, disks

No RegionServers?
• No daemons to manage
• One network hop
• One cache

Region Assignment




Instant Recovery

• Apache HBase experiences an outage when any node crashes
     – Each RegionServer replays WAL before any region can be recovered
     – All regions served by that RegionServer cannot be accessed
• M7 provides instant recovery
     – M7 uses small WALs
         • Multiple WALs per region vs. 1 per RegionServer (1000 regions)
     – Instant recovery on put
     – 1000-10000x faster recovery on get
• How?
     – M7 leverages unique MapR-FS capabilities, not impacted by HDFS limitations
         • Append support
         • No limit to # of files

LSMT (FTW)
• Traditional disk-based index structures like B-Trees are expensive to maintain in real-time
• Log Structured Merge Trees reduce the cost by deferring and batching index changes
• Writes
     – Writes go to an in-memory index
         • And a commit log in case the node crashes and recovery is needed
     – The in-memory index is occasionally merged into the disk-based index
         • This may trigger a compaction
• Reads
     – Reads hit the in-memory index and the disk-based index
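
The write and read paths above can be sketched in a few lines of Java. This is only an in-memory illustration, not HBase or M7 code: the class name `LsmSketch`, the `flushThreshold` parameter, and the use of sorted maps to stand in for on-disk runs are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of a log-structured merge tree.
// Sorted maps stand in for on-disk runs; nothing here touches a real disk.
final class LsmSketch {
    private final int flushThreshold;
    private final List<String> commitLog = new ArrayList<>();   // stand-in for the commit log
    private TreeMap<String, String> memIndex = new TreeMap<>(); // in-memory index
    private final List<TreeMap<String, String>> diskRuns = new ArrayList<>(); // newest run last

    LsmSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Write path: append to the commit log, then update the in-memory index.
    void put(String key, String value) {
        commitLog.add(key + "=" + value);
        memIndex.put(key, value);
        if (memIndex.size() >= flushThreshold) {
            flush();
        }
    }

    // Deferred, batched index change: the whole in-memory index becomes one sorted run.
    private void flush() {
        diskRuns.add(memIndex);
        memIndex = new TreeMap<>();
        commitLog.clear(); // flushed entries no longer need replay after a crash
    }

    // Read path: check the in-memory index first, then runs from newest to oldest.
    String get(String key) {
        String value = memIndex.get(key);
        if (value != null) {
            return value;
        }
        for (int i = diskRuns.size() - 1; i >= 0; i--) {
            value = diskRuns.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    // Compaction: merge all runs into one so that reads touch fewer structures.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> run : diskRuns) {
            merged.putAll(run); // newer runs are applied last and overwrite older values
        }
        diskRuns.clear();
        diskRuns.add(merged);
    }

    int runCount() {
        return diskRuns.size();
    }
}
```

In this sketch a read may have to look at every run, which is the read amplification that compaction reduces, at the price of rewriting data that was already written once.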
Storage Subsystem Performance
What does it cost to merge the in-memory index into the disk-based index?

                       HBase-style                        LevelDB-style     M7
Examples               BigTable, HBase, Cassandra, Riak   Cassandra, Riak   M7
WAF                    Low                                High              Low
RAF                    High                               Low               Low
I/O storms             Yes                                No                No
Disk space overhead    High (2x)                          Low               Low
Skewed data handling   Bad                                Good              Good
Rewrite large values   Yes                                Yes               No

Terminology:
• Write-amplification factor (WAF): The ratio between writes to disk and application writes. Note that data must be rewritten in every indexed structure.
• Read-amplification factor (RAF): The ratio between reads from disk and application reads.
• Skewed data handling: When inserting values with similar keys (eg, increasing
Other M7 Features
• Smaller disk footprint
     – HBase stores key & column name for every version of every cell
     – M7 never repeats the key or column name

• Columnar layout
     – HBase supports 2-3 column families in practice
     – M7 supports 64 column families

• Online schema changes
     – No need to disable table to add/remove/modify column families

Big Data Picture
                      Batch processing    Interactive analysis      Stream processing
Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
Data volume           TBs to PBs          GBs to PBs                Continuous stream
Programming model     MapReduce           Queries                   DAG
Users                 Developers          Analysts and Developers   Developers
Google project        MapReduce           Dremel
Open source project   Hadoop MapReduce    Apache Drill              Storm, S4

Google Dremel
• Interactive analysis of large-scale datasets
      • Trillion records at interactive speeds
      • Complementary to MapReduce
      • Used by thousands of Google employees
      • Paper published at VLDB 2010
• Model
      • Nested data model with schema
          • Most data at Google is stored/transferred in Protocol Buffers
          • Normalization (to relational) is prohibitive
      • SQL-like query language with nested data support
• Implementation
      • Column-based storage and processing
      • In-situ data access (GFS and Bigtable)
      • Tree architecture as in Web search (and databases)
Google BigQuery
• Hosted Dremel (Dremel as a Service)
• CLI (bq) and Web UI
• Import data from Google Cloud Storage or local files
          • Files must be in CSV format
          • Nested data not supported [yet] except built-in datasets
          • Schema definition required




DrQL Example (from the Dremel paper)

Input record:
 DocId: 10
 Links
  Forward: 20
  Forward: 40
  Forward: 60
 Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
 Name
  Url: 'http://B'
 Name
  Language
    Code: 'en-gb'
    Country: 'gb'

Query:
 SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
 FROM t
 WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Output:
 Id: 10
 Name
  Cnt: 2
  Language
    Str: 'http://A,en-us'
    Str: 'http://A,en'
 Name
  Cnt: 0

Data Flow




Extensibility
• Nested query languages
      •   Pluggable model
      •   DrQL
      •   Mongo Query Language
      •   Cascading
• Distributed execution engine
      • Extensible model (eg, Dryad)
      • Low-latency
      • Fault tolerant



Extensibility
• Nested data formats
      • Pluggable model
        • Column-based (ColumnIO/Dremel, Trevni, RCFile)
        • Row-based (RecordIO, Avro, JSON, CSV)
        • Schema (Protocol Buffers, Avro, CSV)
        • Schema-less (JSON, BSON)
• Scalable data sources
      • Pluggable model
      • Hadoop
      • HBase


Architecture


• Only the execution engine knows the physical attributes of the
  cluster
      • # nodes, hardware, file locations, …


• Public interfaces enable extensibility
      • Developers can build parsers for new query languages
      • Developers can provide an execution plan directly


• Each level of the plan has a human readable representation
      • Facilitates debugging and unit testing
Architecture




Query Components
• Query components:
      •   SELECT
      •   FROM
      •   WHERE
      •   GROUP BY
      •   HAVING
      •   (JOIN)

• Key logical operators:
      •   Scan
      •   Filter
      •   Aggregate
      •   (Join)
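
As a rough illustration of how the query components map onto the Scan, Filter and Aggregate operators, the sketch below runs a query of the shape `SELECT country, SUM(bytes) FROM visits WHERE bytes > ? GROUP BY country` over an in-memory list using plain Java streams. It is not Drill code: the `Visit` type, its fields and the method name are invented for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of Scan -> Filter -> Aggregate over an in-memory list.
final class LogicalOperatorsSketch {
    // Invented record type; a real engine would read rows from a data source.
    static final class Visit {
        final String country;
        final int bytes;

        Visit(String country, int bytes) {
            this.country = country;
            this.bytes = bytes;
        }
    }

    // SELECT country, SUM(bytes) FROM visits WHERE bytes > minBytes GROUP BY country
    static Map<String, Integer> totalBytesByCountry(List<Visit> visits, int minBytes) {
        return visits.stream()                                   // Scan: iterate over the source
                .filter((Visit v) -> v.bytes > minBytes)         // Filter: the WHERE clause
                .collect(Collectors.groupingBy(                  // Aggregate: GROUP BY + SUM
                        (Visit v) -> v.country,
                        TreeMap::new,
                        Collectors.summingInt((Visit v) -> v.bytes)));
    }
}
```

A distributed engine would split the same three steps across nodes; the sketch only shows the order in which the logical operators apply.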
Execution Engine Layers
• Drill execution engine has two layers
      • Operator layer is serialization-aware
          • Processes individual records
      • Execution layer is not serialization-aware
          • Processes batches of records (blobs)
          • Responsible for communication, dependencies and fault tolerance




Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
     • Column-based and row-based
     • Schema and schema-less

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Fast
• C/C++ core with Java support
     • Google C++ style guide
• Min latency and max throughput (limited only by hardware)

Dependable
• No SPOF
• Instant recovery from crashes

Hadoop Integration
• Hadoop data sources
      • Hadoop FileSystem API (HDFS/MapR-FS)
      • HBase
• Hadoop data formats
      • Apache Avro
      • RCFile
• MapReduce-based tools to create column-based
  formats




Fully Open




Storm




Before Storm
[Diagram: queues and workers]

Example (simplified)

Storm
• Guaranteed data processing
• Horizontal scalability
• Fault-tolerance
• No intermediate message brokers!
• Higher level abstraction than message passing
• “Just works”

Concepts




Streams
[Diagram: a row of tuples]
Unbounded sequence of tuples

Spouts
[Diagram: a spout emitting two streams of tuples]
Source of streams

Spouts

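public interface ISpout extends Serializable {
  // Called once when the spout task starts; keep the collector for emitting tuples
  void open(Map conf,
         TopologyContext context,
         SpoutOutputCollector collector);
  // Called when the spout is going to be shut down
  void close();
  // Called repeatedly; emit the next tuple here if one is available
  void nextTuple();
  // The tuple emitted with this message id was fully processed
  void ack(Object msgId);
  // The tuple emitted with this message id failed to be fully processed
  void fail(Object msgId);
}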
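
The `ack` and `fail` callbacks are what make guaranteed processing possible: a spout can keep each emitted message until it is acknowledged and re-emit it on failure. The sketch below mimics that contract without depending on Storm. `ReplayingSpoutSketch` is a hypothetical class, it does not implement `ISpout`, and it returns message ids from `nextTuple()` only to keep the example self-contained.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-in that mirrors the nextTuple/ack/fail contract without Storm.
final class ReplayingSpoutSketch {
    private final Queue<String> pending = new ArrayDeque<>();   // messages not yet emitted
    private final Map<Long, String> inFlight = new HashMap<>(); // emitted, awaiting ack or fail
    private long nextId = 0;

    ReplayingSpoutSketch(List<String> messages) {
        pending.addAll(messages);
    }

    // Emits the next message and returns its id, or -1 when nothing is pending.
    long nextTuple() {
        String message = pending.poll();
        if (message == null) {
            return -1;
        }
        long id = nextId++;
        inFlight.put(id, message);
        return id;
    }

    // Looks up an in-flight message by id (only used to inspect the sketch).
    String messageFor(long msgId) {
        return inFlight.get(msgId);
    }

    // Fully processed: the message can be forgotten.
    void ack(long msgId) {
        inFlight.remove(msgId);
    }

    // Processing failed: queue the message again so that it is re-emitted.
    void fail(long msgId) {
        String message = inFlight.remove(msgId);
        if (message != null) {
            pending.add(message);
        }
    }

    boolean isDone() {
        return pending.isEmpty() && inFlight.isEmpty();
    }
}
```

A replayed message gets a new id in this sketch, so downstream processing has to tolerate seeing the same payload more than once.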



Bolts
[Diagram: a stream of tuples entering a bolt and new streams leaving it]
Processes input streams and produces new streams

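Bolts
  import java.util.Map;

  // Package names follow Storm 0.8/0.9 (backtype.storm);
  // Apache Storm 1.0 and later use org.apache.storm instead.
  import backtype.storm.task.OutputCollector;
  import backtype.storm.task.TopologyContext;
  import backtype.storm.topology.OutputFieldsDeclarer;
  import backtype.storm.topology.base.BaseRichBolt;
  import backtype.storm.tuple.Fields;
  import backtype.storm.tuple.Tuple;
  import backtype.storm.tuple.Values;

  public class DoubleAndTripleBolt extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf,
                        TopologyContext context,
                        OutputCollector collector) {
      _collector = collector;
    }

    @Override
    public void execute(Tuple input) {
      int val = input.getInteger(0);
      // Anchor the new tuple to the input so a downstream failure is reported to the spout
      _collector.emit(input, new Values(val * 2, val * 3));
      // Acknowledge the input once its output has been emitted
      _collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("double", "triple"));
    }
  }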
Topologies




                     Network of spouts and bolts
Trident
Cascading for Storm




Trident
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
   topology.newStream("spout1", spout)
    .each(new Fields("sentence"),
        new Split(),
        new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(),
                  new Count(),
                   new Fields("count"))
    .parallelismHint(6);
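
For readers new to Trident, the result this pipeline maintains is a running count per word. The sketch below computes the same kind of count over a bounded, in-memory list with plain Java streams. It is not Trident: there is no spout, no distribution and no persistent state, and `WordCountSketch` and `countWords` are names invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical single-process equivalent of split -> groupBy -> count.
final class WordCountSketch {
    static Map<String, Long> countWords(List<String> sentences) {
        return sentences.stream()
                // Comparable to Split: one sentence becomes many words
                .flatMap(sentence -> Arrays.stream(sentence.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // Comparable to groupBy("word") followed by Count
                .collect(Collectors.groupingBy(
                        Function.identity(),
                        TreeMap::new,
                        Collectors.counting()));
    }
}
```

The Trident version differs in that the stream never ends, so the counts live in a state store and are updated batch by batch instead of being returned once.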




Interoperability




Spouts
• Kafka (with transactions)
• Kestrel
• JMS
• AMQP
• Beanstalkd

Bolts
• Functions
• Filters
• Aggregation
• Joins
• Talk to databases, Hadoop write-behind

Storm
[Diagram, built up over four slides. Raw Data flows to realtime processes (next to a Queue) and to batch processes (next to Hadoop); both lead to Apps and Business Value. The second build adds the label "Parallel Cluster Ingest"; the final build no longer shows the Queue.]

Get Involved!
• Get more details on M7
      • http://mapr.com/products/mapr-editions/m7-edition

• Join the Apache Drill mailing list
      • drill-dev-subscribe@incubator.apache.org

• Watch TailSpout development
      • https://github.com/{tdunning | boorad}/mapr-spout

• Join MapR
      • jobs@mapr.com
      • banderson@maprtech.com

• @boorad
©MapR Technologies

Mais conteúdo relacionado

Mais procurados

HBase @ Twitter
HBase @ TwitterHBase @ Twitter
HBase @ Twitterctrezzo
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APImcsrivas
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityMapR Technologies
 
Challenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David TuckerChallenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David TuckerMapR Technologies
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR Technologies
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Realtime Apache Hadoop at Facebook
Realtime Apache Hadoop at FacebookRealtime Apache Hadoop at Facebook
Realtime Apache Hadoop at Facebookparallellabs
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
エンタープライズ NoSQL/HBase プラットフォーム – MapR M7 エディション - db tech showcase 大阪 2014 201...
エンタープライズ NoSQL/HBase プラットフォーム – MapR M7 エディション - db tech showcase 大阪 2014 201...エンタープライズ NoSQL/HBase プラットフォーム – MapR M7 エディション - db tech showcase 大阪 2014 201...
エンタープライズ NoSQL/HBase プラットフォーム – MapR M7 エディション - db tech showcase 大阪 2014 201...MapR Technologies Japan
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 

Mais procurados (20)

HUG slides on NFS and ODBC
HUG slides on NFS and ODBCHUG slides on NFS and ODBC
HUG slides on NFS and ODBC
 
HBase @ Twitter
HBase @ TwitterHBase @ Twitter
HBase @ Twitter
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
 
HBase with MapR
HBase with MapRHBase with MapR
HBase with MapR
 
Challenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David TuckerChallenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David Tucker
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document Database
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Realtime Apache Hadoop at Facebook
Realtime Apache Hadoop at FacebookRealtime Apache Hadoop at Facebook
Realtime Apache Hadoop at Facebook
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Drill dchug-29 nov2012
Drill dchug-29 nov2012Drill dchug-29 nov2012
Drill dchug-29 nov2012
 
エンタープライズ NoSQL/HBase プラットフォーム – MapR M7 エディション - db tech showcase 大阪 2014 201...
エンタープライズ NoSQL/HBase プラットフォーム – MapR M7 エディション - db tech showcase 大阪 2014 201...エンタープライズ NoSQL/HBase プラットフォーム – MapR M7 エディション - db tech showcase 大阪 2014 201...
エンタープライズ NoSQL/HBase プラットフォーム – MapR M7 エディション - db tech showcase 大阪 2014 201...
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on Mesos
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 

Destaque

Link Building: Y como sobrevivir a las mascotas de Google | Iday
Link Building: Y como sobrevivir a las mascotas de Google | IdayLink Building: Y como sobrevivir a las mascotas de Google | Iday
Link Building: Y como sobrevivir a las mascotas de Google | IdayPablo Baselice
 
Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matte...
Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matte...Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matte...
Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matte...NoSQLmatters
 
Event-Stream Processing with Kafka
Event-Stream Processing with KafkaEvent-Stream Processing with Kafka
Event-Stream Processing with KafkaTim Lossen
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013boorad
 
L’impegno di OGC per gli standard e la loro divulgazione: benefici per la dif...
L’impegno di OGC per gli standard e la loro divulgazione: benefici per la dif...L’impegno di OGC per gli standard e la loro divulgazione: benefici per la dif...
L’impegno di OGC per gli standard e la loro divulgazione: benefici per la dif...Andrea Borruso
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBasedave_revell
 

Destaque (6)

Link Building: Y como sobrevivir a las mascotas de Google | Iday
Link Building: Y como sobrevivir a las mascotas de Google | IdayLink Building: Y como sobrevivir a las mascotas de Google | Iday
Link Building: Y como sobrevivir a las mascotas de Google | Iday
 
Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matte...
Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matte...Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matte...
Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase - NoSQL matte...
 
Event-Stream Processing with Kafka
Event-Stream Processing with KafkaEvent-Stream Processing with Kafka
Event-Stream Processing with Kafka
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
L’impegno di OGC per gli standard e la loro divulgazione: benefici per la dif...
L’impegno di OGC per gli standard e la loro divulgazione: benefici per la dif...L’impegno di OGC per gli standard e la loro divulgazione: benefici per la dif...
L’impegno di OGC per gli standard e la loro divulgazione: benefici per la dif...
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 

Semelhante a TriHUG - Beyond Batch

Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Timothy Spann
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesshnkr_rmchndrn
 
High-Performance Big Data Analytics with RDMA over NVM and NVMe-SSD
High-Performance Big Data Analytics with RDMA over NVM and NVMe-SSDHigh-Performance Big Data Analytics with RDMA over NVM and NVMe-SSD
High-Performance Big Data Analytics with RDMA over NVM and NVMe-SSDinside-BigData.com
 
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges DataWorks Summit
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRData Con LA
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukAndrii Vozniuk
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheSandeepTaksande
 

TriHUG - Beyond Batch

  • 1. Beyond Batch HBase, Drill, & Storm Brad Anderson ©MapR Technologies
  • 2. whoami • Brad Anderson • Solutions Architect at MapR (Atlanta) • ATLHUG co-chair • ‘boorad’ most places (twitter, github) • banderson@maprtech.com ©MapR Technologies
  • 3. The open enterprise-grade distribution for Hadoop • Easy, dependable and fast • Open source with standards-based extensions • MapR is deployed at 1000’s of companies • From small Internet startups to the world’s largest enterprises • MapR customers analyze massive amounts of data: • Hundreds of billions of events daily • 90% of the world’s Internet population monthly • $1 trillion in retail purchases annually • MapR Cloud Partners • Google to provide Hadoop on Google Compute Engine • Amazon for Elastic Map Reduce + instances ©MapR Technologies
  • 4. Beyond Batch • HBase & M7 • Apache Drill • Storm ©MapR Technologies
  • 5. Latency Matters Batch Interactive Streaming ©MapR Technologies
  • 6. HBase Issues
      Reliability
        • Compactions disrupt operations
        • Very slow crash recovery
        • Unreliable splitting
      Business continuity
        • Common hardware/software issues cause downtime
        • Administration requires downtime
        • No point-in-time recovery
        • Complex backup process
      Performance
        • Many bottlenecks result in low throughput
        • Limited data locality
        • Limited # of tables
      Manageability
        • Compactions, splits and merges must be done manually (in reality)
        • Basic operations like backup or table rename are complex
  • 7. M7
      • An integrated system for unstructured and structured data
        – Unified namespace for files and tables
        – Data management
        – Data protection
        – Disaster recovery
        – No additional administration
      • An architecture that delivers reliability and performance
        – Fewer layers
        – No compactions
        – Seamless splits
        – Automatic merges
        – Single network hop
        – Instant recovery
        – Reduced read and write amplification
  • 8. Unified Namespace
      $ pwd
      /mapr/default/user/boorad
      $ ls
      file1 file2 table1 table2
      $ hbase shell
      hbase(main):003:0> create '/user/boorad/table3', 'cf1', 'cf2', 'cf3'
      0 row(s) in 0.1570 seconds
      $ ls
      file1 file2 table1 table2 table3
      $ hadoop fs -ls /user/boorad
      Found 5 items
      -rw-r--r-- 3 mapr mapr 16 2012-09-28 08:34 /user/boorad/file1
      -rw-r--r-- 3 mapr mapr 22 2012-09-28 08:34 /user/boorad/file2
      trwxr-xr-x 3 mapr mapr  2 2012-09-28 08:32 /user/boorad/table1
      trwxr-xr-x 3 mapr mapr  2 2012-09-28 08:33 /user/boorad/table2
      trwxr-xr-x 3 mapr mapr  2 2012-09-28 08:38 /user/boorad/table3
  • 9. Simplifying HBase Architecture
      [Diagram: storage stacks labeled HBase, JVM, DFS, ext3 and Disks under "Other Distributions", shown beside a "MapR Unified" stack over Disks]
  • 10. No RegionServers?
      One network hop
      No daemons to manage
      One cache
  • 14. Instant Recovery
      • Apache HBase experiences an outage when any node crashes
        – Each RegionServer replays WAL before any region can be recovered
        – All regions served by that RegionServer cannot be accessed
      • M7 provides instant recovery
        – M7 uses small WALs
          • Multiple WALs per region vs. 1 per RegionServer (1000 regions)
        – Instant recovery on put
        – 1000-10000x faster recovery on get
      • How?
        – M7 leverages unique MapR-FS capabilities, not impacted by HDFS limitations
          • Append support
          • No limit to # of files
  • 15. LSMT (FTW)
      • Traditional disk-based index structures like B-Trees are expensive to maintain in real-time
      • Log Structured Merge Trees reduce the cost by deferring and batching index changes
      • Writes
        – Writes go to an in-memory index
          • And a commit log in case the node crashes and recovery is needed
        – The in-memory index is occasionally merged into the disk-based index
          • This may trigger a compaction
      • Reads
        – Reads hit the in-memory index and the disk-based index
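The write and read paths on the LSMT slide can be sketched in a few lines of plain Java. This is a toy, in-memory illustration only, not how HBase or M7 is implemented: the class name `LsmSketch`, the flush threshold, and the "compact when more than two segments exist" rule are all assumptions made for the example, and the "disk" segments are just sorted maps held in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy log-structured merge tree. All names and thresholds are hypothetical. */
class LsmSketch {
    private final int memtableLimit;                                  // flush threshold, in entries
    private final List<String> commitLog = new ArrayList<>();         // stand-in for an on-disk commit log
    private TreeMap<String, String> memtable = new TreeMap<>();       // the in-memory index
    private final List<TreeMap<String, String>> segments = new ArrayList<>(); // "disk" segments, oldest first

    LsmSketch(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    /** Writes go to the commit log and the in-memory index; a full index is flushed. */
    void put(String key, String value) {
        commitLog.add(key + "=" + value);   // logged first so a crash could be replayed
        memtable.put(key, value);
        if (memtable.size() >= memtableLimit) {
            flush();
        }
    }

    /** Reads check the in-memory index first, then segments from newest to oldest. */
    String get(String key) {
        String value = memtable.get(key);
        if (value != null) {
            return value;
        }
        for (int i = segments.size() - 1; i >= 0; i--) {
            value = segments.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    /** Moves the in-memory index into the segment list; may trigger a compaction. */
    private void flush() {
        segments.add(memtable);
        memtable = new TreeMap<>();
        commitLog.clear();                  // flushed entries no longer need replay
        if (segments.size() > 2) {
            compact();
        }
    }

    /** Merges all segments into one; newer values overwrite older ones. */
    private void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> segment : segments) {
            merged.putAll(segment);         // iterating oldest first, so the newest value wins
        }
        segments.clear();
        segments.add(merged);
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        LsmSketch lsm = new LsmSketch(2);
        lsm.put("a", "1");
        lsm.put("b", "2");                  // second entry fills the memtable and triggers a flush
        lsm.put("a", "3");                  // newer value lives only in the memtable
        System.out.println(lsm.get("a") + " " + lsm.get("b") + " " + lsm.segmentCount());
    }
}
```

The compaction step is where the rewrite cost comes from: every merge copies data that was already written once, which is the write amplification the next slide compares across designs.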
  • 16. Storage Subsystem Performance
      What does it cost to merge the in-memory index into the disk-based index?

                             HBase-style                       LevelDB-style    M7
      Examples               BigTable, HBase, Cassandra, Riak  Cassandra, Riak  M7
      WAF                    Low                               High             Low
      RAF                    High                              Low              Low
      I/O storms             Yes                               No               No
      Disk space overhead    High (2x)                         Low              Low
      Skewed data handling   Bad                               Good             Good
      Rewrite large values   Yes                               Yes              No

      Terminology:
      • Write-amplification factor (WAF): The ratio between writes to disk and application writes. Note that data must be rewritten in every indexed structure.
      • Read-amplification factor (RAF): The ratio between reads from disk and application reads.
      • Skewed data handling: When inserting values with similar keys (eg, increasing
  • 17. Other M7 Features
      • Smaller disk footprint
        – HBase stores key & column name for every version of every cell
        – M7 never repeats the key or column name
      • Columnar layout
        – HBase supports 2-3 column families in practice
        – M7 supports 64 column families
      • Online schema changes
        – No need to disable table to add/remove/modify column families
  • 19-20. Big Data Picture (slide 20 adds Apache Drill to the last row)

                            Batch processing    Interactive analysis      Stream processing
      Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
      Data volume           TBs to PBs          GBs to PBs                Continuous stream
      Programming model     MapReduce           Queries                   DAG
      Users                 Developers          Analysts and Developers   Developers
      Google project        MapReduce           Dremel
      Open source project   Hadoop MapReduce    Apache Drill              Storm, S4
  • 21. Google Dremel
      • Interactive analysis of large-scale datasets
        • Trillion records at interactive speeds
        • Complementary to MapReduce
        • Used by thousands of Google employees
        • Paper published at VLDB 2010
      • Model
        • Nested data model with schema
          • Most data at Google is stored/transferred in Protocol Buffers
          • Normalization (to relational) is prohibitive
        • SQL-like query language with nested data support
      • Implementation
        • Column-based storage and processing
        • In-situ data access (GFS and Bigtable)
        • Tree architecture as in Web search (and databases)
  • 22. Google BigQuery • Hosted Dremel (Dremel as a Service) • CLI (bq) and Web UI • Import data from Google Cloud Storage or local files • Files must be in CSV format • Nested data not supported [yet] except built-in datasets • Schema definition required ©MapR Technologies
  • 23. DrQL Example
      Input record:
        DocId: 10
        Links
          Forward: 20
          Forward: 40
          Forward: 60
        Name
          Language
            Code: 'en-us'
            Country: 'us'
          Language
            Code: 'en'
          Url: 'http://A'
        Name
          Url: 'http://B'
        Name
          Language
            Code: 'en-gb'
            Country: 'gb'

      Query:
        SELECT DocId AS Id,
          COUNT(Name.Language.Code) WITHIN Name AS Cnt,
          Name.Url + ',' + Name.Language.Code AS Str
        FROM t
        WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

      Result:
        Id: 10
        Name
          Cnt: 2
          Language
            Str: 'http://A,en-us'
          Language
            Str: 'http://A,en'
        Name
          Cnt: 0

      * Example from the Dremel paper
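To make the `WITHIN Name` aggregation on the DrQL slide concrete, the sketch below hand-evaluates that one query against that one record in plain Java. It is only an illustration of the query's semantics as shown on the slide, not Dremel's or Drill's execution model; the class and method names (`DrqlWithinSketch`, `query`, `sampleRecord`) are hypothetical, and the `Links.Forward` values are left out because the query never reads them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-evaluates the slide's DrQL query on the slide's record. Names are hypothetical. */
class DrqlWithinSketch {
    static class Language {
        final String code;
        final String country;               // optional field, may be null
        Language(String code, String country) {
            this.code = code;
            this.country = country;
        }
    }

    static class Name {
        final String url;                   // optional field, may be null
        final List<Language> languages;     // repeated field, may be empty
        Name(String url, List<Language> languages) {
            this.url = url;
            this.languages = languages;
        }
    }

    static class Doc {
        final int docId;
        final List<Name> names;
        Doc(int docId, List<Name> names) {
            this.docId = docId;
            this.names = names;
        }
    }

    /** Returns the result as flat "field: value" lines, in output order. */
    static List<String> query(Doc doc) {
        List<String> out = new ArrayList<>();
        if (doc.docId >= 20) {              // WHERE ... DocId < 20
            return out;
        }
        out.add("Id: " + doc.docId);        // SELECT DocId AS Id
        for (Name name : doc.names) {
            // WHERE REGEXP(Name.Url, '^http'): a Name without a matching Url is dropped
            if (name.url == null || !name.url.startsWith("http")) {
                continue;
            }
            // COUNT(Name.Language.Code) WITHIN Name: one count per surviving Name
            long cnt = name.languages.stream().filter(l -> l.code != null).count();
            out.add("Cnt: " + cnt);
            // Name.Url + ',' + Name.Language.Code AS Str: one value per Language
            for (Language language : name.languages) {
                out.add("Str: " + name.url + "," + language.code);
            }
        }
        return out;
    }

    /** The record from the slide, minus Links.Forward. */
    static Doc sampleRecord() {
        return new Doc(10, List.of(
            new Name("http://A", List.of(new Language("en-us", "us"), new Language("en", null))),
            new Name("http://B", List.of()),
            new Name(null, List.of(new Language("en-gb", "gb")))));
    }

    public static void main(String[] args) {
        query(sampleRecord()).forEach(System.out::println);
    }
}
```

The third `Name` has a `Language` but no `Url`, so the `WHERE` clause removes it, which is why the result on the slide contains only two `Name` groups.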
  • 25. Extensibility • Nested query languages • Pluggable model • DrQL • Mongo Query Language • Cascading • Distributed execution engine • Extensible model (eg, Dryad) • Low-latency • Fault tolerant ©MapR Technologies
  • 26. Extensibility • Nested data formats • Pluggable model • Column-based (ColumnIO/Dremel, Trevni, RCFile) • Row-based (RecordIO, Avro, JSON, CSV) • Schema (Protocol Buffers, Avro, CSV) • Schema-less (JSON, BSON) • Scalable data sources • Pluggable model • Hadoop • HBase ©MapR Technologies
  • 27. Architecture • Only the execution engine knows the physical attributes of the cluster • # nodes, hardware, file locations, … • Public interfaces enable extensibility • Developers can build parsers for new query languages • Developers can provide an execution plan directly • Each level of the plan has a human readable representation • Facilitates debugging and unit testing ©MapR Technologies
  • 29. Query Components • Query components: • SELECT • FROM • WHERE • GROUP BY • HAVING • (JOIN) • Key logical operators: • Scan • Filter • Aggregate • (Join) ©MapR Technologies
  • 30. Execution Engine Layers • Drill execution engine has two layers • Operator layer is serialization-aware • Processes individual records • Execution layer is not serialization-aware • Processes batches of records (blobs) • Responsible for communication, dependencies and fault tolerance ©MapR Technologies
  • 31. Design Principles
      Flexible
        • Pluggable query languages
        • Extensible execution engine
        • Pluggable data formats
          • Column-based and row-based
          • Schema and schema-less
      Easy
        • Unzip and run
        • Zero configuration
        • Reverse DNS not needed
        • IP addresses can change
        • Clear and concise log messages
      Fast
        • C/C++ core with Java support
        • Google C++ style guide
        • Min latency and max throughput (limited only by hardware)
      Dependable
        • No SPOF
        • Instant recovery from crashes
  • 32. Hadoop Integration • Hadoop data sources • Hadoop FileSystem API (HDFS/MapR-FS) • HBase • Hadoop data formats • Apache Avro • RCFile • MapReduce-based tools to create column-based formats ©MapR Technologies
  • 35. Before Storm Queues Workers ©MapR Technologies
  • 37. Storm Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing “Just works” ©MapR Technologies
  • 39. Streams: Unbounded sequence of tuples [diagram: a row of tuples]
  • 40. Spouts: Source of streams [diagram: a spout emitting two rows of tuples]
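  • 41. Spouts
      public interface ISpout extends Serializable {
          // Called once when the spout task starts; keep the collector for emitting tuples.
          void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
          void close();
          // Called repeatedly; emit the next tuple here if one is available.
          void nextTuple();
          // Called when the tuple emitted with msgId has been fully processed.
          void ack(Object msgId);
          // Called when the tuple emitted with msgId failed and may need replaying.
          void fail(Object msgId);
      }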
  • 42. Bolts: Processes input streams and produces new streams [diagram: rows of tuples flowing into and out of a bolt]
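  • 43. Bolts
      // Package names are those of the Storm 0.8/0.9 era; Storm 1.0+ uses org.apache.storm instead.
      import java.util.Map;
      import backtype.storm.task.OutputCollector;
      import backtype.storm.task.TopologyContext;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.base.BaseRichBolt;
      import backtype.storm.tuple.Fields;
      import backtype.storm.tuple.Tuple;
      import backtype.storm.tuple.Values;

      public class DoubleAndTripleBolt extends BaseRichBolt {
          private OutputCollector _collector;

          // BaseRichBolt.prepare receives an OutputCollector (not the older OutputCollectorBase).
          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              _collector = collector;
          }

          public void execute(Tuple input) {
              int val = input.getInteger(0);
              // Anchor the new tuple to the input so a failure downstream replays the input.
              _collector.emit(input, new Values(val*2, val*3));
              _collector.ack(input);
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("double", "triple"));
          }
      }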
  • 44. Topologies Network of spouts and bolts ©MapR Technologies
  • 46. Trident
      TridentTopology topology = new TridentTopology();
      TridentState wordCounts =
          topology.newStream("spout1", spout)
              .each(new Fields("sentence"), new Split(), new Fields("word"))
              .groupBy(new Fields("word"))
              .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
              .parallelismHint(6);
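For readers new to Trident, the pipeline on this slide computes a running word count. The plain-Java sketch below shows the same three steps (split sentences into words, group by word, count) on a fixed list of sentences. It uses no Storm classes and is only a single-process illustration: Trident itself processes an unbounded stream in batches across a cluster and keeps the counts in a state store, none of which is modeled here. The class name `WordCountSketch` and the sample sentences are assumptions for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process illustration of what the Trident word-count pipeline computes. */
class WordCountSketch {
    static Map<String, Long> count(List<String> sentences) {
        Map<String, Long> counts = new TreeMap<>();   // stands in for the persistent state
        for (String sentence : sentences) {
            // Corresponds to each(new Fields("sentence"), new Split(), new Fields("word"))
            for (String word : sentence.split(" ")) {
                if (word.isEmpty()) {
                    continue;                          // skip artifacts of repeated spaces
                }
                // Corresponds to groupBy("word") followed by the Count aggregator
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(List.of(
            "the cow jumped over the moon",
            "the man went to the store"));
        counts.forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

The Trident version expresses the same logic in about as many lines while also handling batching, parallelism and state updates, which is the "higher level abstraction" point the deck makes about Storm.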
  • 48. Spouts •Kafka (with transactions) •Kestrel •JMS •AMQP •Beanstalkd ©MapR Technologies
  • 49. Bolts •Functions •Filters •Aggregation •Joins •Talk to databases, Hadoop write-behind ©MapR Technologies
  • 50. Storm realtime processes Apps Queue Raw Data Business Value Hadoop batch processes ©MapR Technologies
  • 51. Storm realtime processes Apps Queue Raw Data Business Value Hadoop Parallel Cluster Ingest batch processes ©MapR Technologies
  • 52. Storm realtime processes Apps Queue Raw Data Business Value Hadoop batch processes ©MapR Technologies
  • 53. Storm realtime processes Apps Raw Data Business Value Hadoop batch processes ©MapR Technologies
  • 54. Get Involved!
      • Get more details on M7
        • http://mapr.com/products/mapr-editions/m7-edition
      • Join the Apache Drill mailing list
        • drill-dev-subscribe@incubator.apache.org
      • Watch TailSpout development
        • https://github.com/{tdunning | boorad}/mapr-spout
      • Join MapR
        • jobs@mapr.com
      • banderson@maprtech.com
      • @boorad

Editor's Notes

  5. HBase - random reads/writes - 45% of all Hadoop clusters
  19. Drill
      Remove schema requirement
      In-situ for real since we’ll support multiple formats
      Note: MR needed for big joins so to speak
  20. Drill
      Will support nested
      No schema required
  21. Protocol buffers are conceptual data model
      Will support multiple data models
      Will have to define a way to explain data format (filtering, fields, etc)
      Schema-less will have perf penalty
      HBase will be one format
  22. Likely to support these
      Could add HiveQL and more as well. Could even be clever and support HiveQL to MR or Drill based upon query
      Pig as well
      Pluggability
        Data format
        Query language
      Something 6-9 months alpha quality
      Community driven, I can’t speak for project
      MapR
        FS gives better chunk size control
        NFS support may make small test drivers easier
        Unified namespace will allow multi-cluster access
        Might even have drill component that autoformats data
      Read only model
  23. Example query that Drill should support
      Need to talk more here about what Dremel does
  24. Load data into Drill (optional)
      Could just use as is in “row” format
      Multiple query languages
      Pluggability very important
  25. Note: we have an already partially built execution engine
  26. Note: we have an already partially built execution engine
  33. Be prepared for Apache questions
      Committer vs committee vs contributor
      If can’t answer question, ask them to answer and contribute
      Lisa - Need landing page
      References to paper and such at end
  37. scaling is painful
      poor fault tolerance
      coding is hard
  40. tweets, stock ticks, manufacturing machine data, sensor messages
  45. DAG
      runs continuously
  46. abstractions like Cascading, Hive, Pig make MR approachable
      code size reduction
  49. kestrel - via thrift
      kafka - transactional topologies, idempotency, process only once
      activemq
  51. current architecture
      data ingest tool for hadoop (avoid Flume madness)
  52. current architecture
      data ingest tool for hadoop (avoid Flume madness)