August 21 2012 – Toronto Hadoop User Group
a.k.a. THUGs
Introduction to Hadoop:
 Pretty Picture Version
{Due credit to Todd’s Magic}
Why are we here?

    • Become exposed to the core concepts of
      Hadoop
    • Understand the projects within Hadoop
      and how they fit together
    • Review Common Use Cases for Hadoop
    • Share beginner experiences with Hadoop
    • Ask a @$%$#-load of questions about
      Hadoop

                   ©2011 Cloudera, Inc. All Rights Reserved.
What I won’t be able to give you…

    • A complete introduction to the technology
      (takes too long)
    • Enough information to begin development or
      implementation of Hadoop (too complicated)
    • Enough information to install and configure
      Hadoop (I recommend you start with the
      Cloudera VMware image individually or
      Cloudera Manager for a real cluster)
    • A hands-on Pig-fest or Hive-fest (that’s
      a THUG meetup to come…)

Users of Cloudera
    Financial, Web, Telecom, Media, Retail & Consumer

Hadoop Use Cases
    Use Case: ADVANCED ANALYTICS      Industry          Use Case: DATA PROCESSING
    Application                                         Application
    Social Network Analysis           Web               Clickstream Sessionization
    Content Optimization              Media             Clickstream Sessionization
    Network Analytics                 Telco             Mediation
    Loyalty & Promotions Analysis     Retail            Data Factory
    Fraud Analysis                    Financial         Trade Reconciliation
    Entity Analysis                   Federal           SIGINT
    Sequencing Analysis               Bioinformatics    Genome Mapping

CDH
    • File System Mount: FUSE-DFS
    • UI Framework: HUE
    • SDK: HUE SDK
    • Workflow: Apache Oozie
    • Scheduling: Apache Oozie
    • Metadata: Apache Hive
    • Query / Analytics: Apache Pig, Apache Hive, Apache Mahout
    • Data Integration: Apache Flume, Apache Sqoop
    • Storage and processing core: HDFS, MapReduce
    • Fast Read/Write Access: Apache HBase
    • Coordination: Apache ZooKeeper

                            ©2012 Cloudera, Inc. | Company confidential
Typical Data Pipeline

    [Diagram: Data Sources, (Temporary) Storage, Processing Layer, Data Warehouse, Marts, Archive]

Typical Data Pipeline with Hadoop

    [Diagram: Data Sources feed Original Source Data into Hadoop through Sqoop and Flume.
    Hadoop comprises Oozie, Pig, Hive, MapReduce and HDFS. Result or Calculated Data
    moves through Sqoop to the Data Warehouse and Marts.]

Several advantages

    •   Store more data, cheaply
    •   Use commodity hardware
    •   Scale linearly, predictably
    •   Tolerate hardware failure
    •   Turn data into strategic asset
        – Ad hoc analytics
        – Predictive analytics



Several more advantages

 • Get long term view of data
 • Add unstructured, semi-structured data


 • Change schema on the fly (late binding)
 • Integrate with existing infrastructure




HDFS

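 Self-healing, high bandwidth

 [Diagram: a file split into blocks 1 to 5 is written to HDFS and spread over five
 nodes holding blocks (2, 4, 5), (1, 2, 5), (1, 3, 4), (2, 3, 5) and (1, 3, 4), so
 every block is stored three times.]

 HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
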
                                 ©2012 Cloudera, Inc. All Rights Reserved.
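The redundancy on the HDFS slide can be checked with a few lines of Python. This is only a toy model, not HDFS code: the block layout is read off the diagram, while the node names and the helper functions are made up for illustration.

```python
# Block-to-node layout as drawn on the slide: five nodes, five blocks,
# each block stored on three nodes. Node names are hypothetical.
PLACEMENT = {
    "node_a": {2, 4, 5},
    "node_b": {1, 2, 5},
    "node_c": {1, 3, 4},
    "node_d": {2, 3, 5},
    "node_e": {1, 3, 4},
}


def replica_counts(placement):
    """Count how many live nodes hold each block."""
    counts = {}
    for blocks in placement.values():
        for block in blocks:
            counts[block] = counts.get(block, 0) + 1
    return counts


def fail_node(placement, node):
    """Return a copy of the placement with one node removed."""
    return {name: blocks for name, blocks in placement.items() if name != node}


def under_replicated(placement, target=3):
    """List blocks with fewer than `target` copies left on live nodes."""
    counts = replica_counts(placement)
    return sorted(block for block, n in counts.items() if n < target)


after_failure = fail_node(PLACEMENT, "node_b")
print(replica_counts(PLACEMENT))
print(under_replicated(after_failure))
```

Losing the node that holds blocks 1, 2 and 5 leaves two copies of each, so no data is lost; those are the blocks a self-healing file system would copy again to get back to three.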
HDFS

 Self-healing, high bandwidth

 [Diagram: the same cluster with the node holding blocks (1, 2, 5) missing. The
 remaining nodes hold (2, 4, 5), (1, 3, 4), (2, 3, 5) and (1, 3, 4), so every block
 is still available.]

 HDFS breaks incoming files into blocks and stores them redundantly across the cluster.

MapReduce: Map

• Records from the data source (lines out of files, rows of a
  database, etc.) are fed into the map function as key-value
  pairs: e.g., (filename, line).

• map() produces one or more intermediate values along
  with an output key from the input.

  [Diagram: Map Task -> (key 1, values), (key 2, values), (key 3, values)
  -> Shuffle Phase -> (key, int. values) per key -> Reduce Task -> Final (key, values)]

MapReduce: Reduce

• After the map phase is over, all the intermediate values for
  a given output key are combined together into a list

• reduce() combines those intermediate values into one or
  more final values for that same output key

  [Diagram: Map Task -> (key 1, values), (key 2, values), (key 3, values)
  -> Shuffle Phase -> (key, int. values) per key -> Reduce Task -> Final (key, values)]

MapReduce: Execution




MapReduce: WordCount
          Input text: The cat sat on the mat. The aardvark sat on the sofa.

      Mapping          Shuffling            Reducing       Final Result
      The, 1           aardvark, 1          aardvark, 1    aardvark, 1
      cat, 1           cat, 1               cat, 1         cat, 1
      sat, 1           mat, 1               mat, 1         mat, 1
      on, 1            on [1, 1]            on, 2          on, 2
      the, 1           sat [1, 1]           sat, 2         sat, 2
      mat, 1           sofa, 1              sofa, 1        sofa, 1
      The, 1           the [1, 1, 1, 1]     the, 4         the, 4
      aardvark, 1
      sat, 1
      on, 1
      the, 1
      sofa, 1

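The WordCount slide can be replayed in plain Python to see the three phases. This is a single-process sketch, not Hadoop API code; the function names are made up, and lowercasing plus dropping punctuation is an assumption taken from the slide merging "The" and "the" into one key.

```python
import re
from collections import defaultdict

TEXT = "The cat sat on the mat. The aardvark sat on the sofa."


def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    # Assumption: words are lowercased and punctuation is ignored.
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def shuffle_phase(pairs):
    """Group all intermediate values by key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Sum the list of counts collected for one word."""
    return key, sum(values)


grouped = shuffle_phase(map_phase(TEXT))
result = dict(reduce_phase(key, values) for key, values in grouped.items())
print(result)
```

On a cluster the map and reduce calls would run as separate tasks on different machines, with the shuffle moving data between them; here they are ordinary function calls.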
Sqoop: RDBMS to HDFS




Sqoop: HDFS to RDBMS




FlumeNG: High-level Architecture

     [Diagram: four Clients, two Agents and a further downstream Agent. Inside an
     Agent, a Source feeds Channel 1 and Channel 2, which are drained by Sink 1
     and Sink 2.]

     Examples
     Sources: Avro, netcat, exec
     Channels: memory, JDBC
     Sinks: HDFS, Avro

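The source, channel and sink roles inside an agent can be pictured with a toy Python class. This is not the Flume API: the `Agent` class, its methods and the two list-backed sinks are hypothetical, and copying every event into both channels is an assumption about what the diagram shows.

```python
from collections import deque


class Agent:
    """Toy agent: one source fanning out to several channel/sink pairs."""

    def __init__(self, sinks):
        # One in-memory channel (a FIFO queue) per sink.
        self.channels = [deque() for _ in sinks]
        self.sinks = sinks

    def source(self, event):
        """Accept an event and copy it into every channel."""
        for channel in self.channels:
            channel.append(event)

    def drain(self):
        """Let each sink take everything waiting in its own channel."""
        for channel, sink in zip(self.channels, self.sinks):
            while channel:
                sink(channel.popleft())


# Two lists stand in for an HDFS sink and an Avro sink.
hdfs_like, avro_like = [], []
agent = Agent(sinks=[hdfs_like.append, avro_like.append])
for line in ["event-1", "event-2"]:
    agent.source(line)
agent.drain()
```

The channel is what decouples the two sides: the source can keep accepting events while a slow sink catches up.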
HBase: Table Structure
                     Column family “contents”                   Column family “anchor_text”
Row Key              Column Key   Timestamp       Cell          Column Key   Timestamp       Cell
Com.cloudera.info                 1273716197868   <html>…       Bar.com      1273871824184   Cloudera!...
Com.cloudera.www                  1273746289103   <html>…       Baz.org      1273871962874   Hadoop!...
Com.foo.www                       1273698729045   <html>…
Com.foo.www                       1273699734191   <html>…       Bar.gov      1273879456211   Edu.foo…
…

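One way to read the table structure is as nested maps: row key, then column family, then column key, then timestamp. The sketch below models two rows from the slide as plain Python dicts. It is not an HBase client; `get_cell` is a made-up helper, and using an empty string for the blank column key under "contents" is an assumption.

```python
# row key -> column family -> column key -> {timestamp: cell value}
table = {
    "Com.cloudera.www": {
        "contents": {"": {1273746289103: "<html>..."}},
        "anchor_text": {"Baz.org": {1273871962874: "Hadoop!..."}},
    },
    "Com.foo.www": {
        # Two timestamps under one row and column: two versions of the same cell.
        "contents": {"": {1273698729045: "<html>...", 1273699734191: "<html>..."}},
        "anchor_text": {"Bar.gov": {1273879456211: "Edu.foo..."}},
    },
}


def get_cell(table, row, family, column, timestamp=None):
    """Return (timestamp, value) for a cell, defaulting to the newest version."""
    versions = table[row][family][column]
    ts = max(versions) if timestamp is None else timestamp
    return ts, versions[ts]


print(get_cell(table, "Com.foo.www", "contents", ""))
print(get_cell(table, "Com.foo.www", "contents", "", timestamp=1273698729045))
```

Read this way, the two Com.foo.www lines on the slide are one row with two timestamped versions of its contents cell.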
HBase: Architecture




Hive
SQL-based data warehousing application
      Language is SQL-like
      Features for analyzing very large data sets
         Partition columns, Sampling, Buckets


                  -- The extracted slide omits the table aliases and the second table.
                  -- `kjv` is an assumed name for the table aliased as k.
                  SELECT
                     s.word, s.freq, k.freq
                  FROM shakespeare s
                  JOIN kjv k ON (s.word = k.word)
                  WHERE s.freq >= 5;

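To see what a word-frequency join like the one on the Hive slide returns, the same shape of query can be run locally. The sketch below uses SQLite from the Python standard library as a stand-in; it is not Hive. The second table name `kjv`, the column layout and all sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shakespeare (word TEXT, freq INTEGER);
    CREATE TABLE kjv (word TEXT, freq INTEGER);
    -- Made-up sample rows, not real corpus counts.
    INSERT INTO shakespeare VALUES ('the', 25), ('thou', 7), ('rare', 2);
    INSERT INTO kjv VALUES ('the', 60), ('thou', 9), ('begat', 4);
""")

# Words that appear in both tables and at least 5 times in the first one.
rows = conn.execute("""
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s
    JOIN kjv k ON (s.word = k.word)
    WHERE s.freq >= 5
    ORDER BY s.word
""").fetchall()
print(rows)
```

In Hive the same statement would be compiled into MapReduce jobs over files in HDFS rather than run against a local database.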
Pig

Data-flow oriented language – “Pig Latin”
      Datatypes include sets, associative arrays, tuples
      High-level language for routing data, allows easy
       integration of Java for complex tasks


        emps = LOAD 'people.txt' AS (id,name,salary);
        rich = FILTER emps BY salary > 200000;
        sorted_rich = ORDER rich BY salary DESC;
        STORE sorted_rich INTO 'rich_people.txt';

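For readers new to Pig, the four steps of the script (load, filter, order, store) map onto ordinary list operations. The Python sketch below mirrors them on a tiny in-memory sample. The sample rows are made up, the `load` helper is hypothetical, and the tab delimiter reflects Pig's default loader; nothing here runs on Hadoop.

```python
import csv
import io

# Made-up sample standing in for people.txt (tab-separated fields).
PEOPLE_TXT = "1\tAda\t250000\n2\tBob\t90000\n3\tCy\t310000\n"


def load(text):
    """LOAD: parse tab-separated lines into (id, name, salary) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(int(i), name, int(salary)) for i, name, salary in reader]


emps = load(PEOPLE_TXT)
# FILTER emps BY salary > 200000
rich = [emp for emp in emps if emp[2] > 200000]
# ORDER rich BY salary DESC
sorted_rich = sorted(rich, key=lambda emp: emp[2], reverse=True)
# STORE: serialize back to tab-separated text
output = "\n".join("\t".join(map(str, emp)) for emp in sorted_rich)
print(output)
```

The difference is scale: Pig turns each of these steps into distributed jobs, so the same four lines work when people.txt no longer fits on one machine.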
Oozie
 Workflow/coordination service to manage data processing
 jobs for Hadoop




Hadoop Security

 • Authentication is secured by Kerberos v5 and integrated with LDAP
 • Hadoop server can ensure that users and groups are who they say they are
 • Job Control includes Access Control Lists, which means Jobs can specify who
   can view logs, counters, configurations and who can modify a job
 • Tasks now run as the user who launched the job

Typical Use Cases
Common Challenges

1    Network Analysis and Sessionization
2    Content Optimization and Engagement Modeling
3    Usage Analysis and Mediation
4    Entity Surveillance and Signal Monitoring
5    Recommendations and Modeling
6    Loyalty, Promotion Analysis and Targeting
7    Fraud Analysis, Reconciliation and Risk
8    Time series Analysis, Mapping and Modeling


What Can Hadoop Do For You?
    Two Core Use Cases, Applied Across Verticals

    1 ADVANCED ANALYTICS              VERTICAL          2 DATA PROCESSING
    (industry term)                                     (industry term)
    Social Network Analysis           Web               Clickstream Sessionization
    Content Optimization              Media             Engagement
    Network Analytics                 Telco             Mediation
    Loyalty & Promotions Analysis     Retail            Data Factory
    Fraud Analysis                    Financial         Trade Reconciliation
    Entity Analysis                   Federal           SIGINT
    Sequencing Analysis               Bioinformatics    Genome Mapping

Financial Services

1    Customer Risk Analysis

2    Surveillance and Fraud Detection
3    Central Data Repository
4    Personalization and Asset Management
5    Market Risk Modeling
6    Trade Performance Analytics




Customer Risk Analysis

 • Build comprehensive data picture of customer side risk
     – Publish a consolidated set of attributes for analysis
     – Map ratings across products
 • Parse and aggregate data from different sources
     – Credit and debit cards, product payments, deposits and savings
     – Banking activity, browsing behavior, call logs, e-mails and chats
 • Merge data into a single view
     – A “fuzzy join” among data sources
     – Structure and normalize attributes
     – Sentiment analysis, pattern recognition

                            Copyright 2010 Cloudera Inc. All rights reserved
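A "fuzzy join" matches records whose keys are similar rather than identical. The slide does not say which technique was used, so the sketch below swaps in simple string similarity from Python's standard `difflib`. The field names, sample records and the 0.8 threshold are all made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two strings, ignoring case (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_join(left, right, key, threshold=0.8):
    """Pair each left record with its most similar right record, if close enough."""
    matches = []
    for rec in left:
        best = max(right, key=lambda other: similarity(rec[key], other[key]), default=None)
        if best is not None and similarity(rec[key], best[key]) >= threshold:
            # Merge both records; fields from the left record win on conflict.
            matches.append({**best, **rec})
    return matches


# Hypothetical records from two source systems with inconsistent spellings.
cards = [{"name": "Jon A. Smith", "card_spend": 1200}]
deposits = [{"name": "John A Smith", "balance": 5400}, {"name": "Jane Doe", "balance": 100}]

joined = fuzzy_join(cards, deposits, key="name")
print(joined)
```

Comparing every pair like this is quadratic, which is one reason such joins get pushed onto a cluster once the sources grow.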
Surveillance and Fraud Detection

 • Trade surveillance records activity in a central repository
     – Centralized logging across all execution platforms
     – Structured and raw log data from multiple applications
 • Pattern recognition detects anomalies/harmful behavior
     – Feature set and timeline vector are very dynamic
     – Schema on read provides flexibility for analysis
 • Data is primarily served and processed in HDFS with MR
     – Data filtering and projection in Pig and Hive
     – Statistical modeling of data sets in R or SAS

Central Data Repository

 • Financial data is messy due to many interacting systems
     – Personal data is obfuscated for security and records get out of sync
     – Trades need to be “sessionized” into accounts and products
     – Discrepancies are difficult to reconcile, need to track corrections
 • Hadoop is a centralized platform for data collection
     – Single source for data, processing happens on the platform
     – Metadata used to track information lifecycle
     – Workflows run and monitor data transformation pipelines
 • Data served via APIs or in batch
     – Single version of the truth, data processed and cleansed centrally
     – Clear audit trail of data dependencies and usage

Personalization and Asset Mgmt

 • Institutional and personal investing services
     – Arms investor with sophisticated models for their positions
     – Success measured by upsell and conversion (as well as profit)
 • Data analysis across distinct data sources
     – Market data and individual assets by investor
     – Investor strategy, goals and interactive behavior
 • Data sources combined in HDFS
     – Models written in Pig with UDFs and generated regularly
     – Reports for sales and fed into online recommendation system

Market Risk Modeling

 • Evaluating asset risk is very data intensive
     – Trade volumes have increased dramatically
     – Classic indicators at the daily level don’t provide a clear picture
 • Trends across complex instruments can be hard to spot
     – Models require massive brute force calculation
     – Multiple models built in batch and in parallel
 • Data is primarily structured and sourced from RDBMS
     – Transactional data sqooped to combine with market feeds
     – Resulting predictions sqooped and served via RDBMS

Trade Performance Analytics

 • Increased Demands on Trade Analytics
     – Regulatory requirements for best price trading across exchanges
     – Increased competition and scrutiny adds a focus on optimization
 • Trade Analytics becomes a Clickstream problem
     – Trade execution systems include order trails and execution logs
     – Sessionized across order systems and combined with system logs
 • Processing, Analysis and Audit Trail all in Hadoop
     – KPIs summarized as regular reports written in Hive
     – Data available for historical analysis and discovery

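Sessionization, which comes up for clickstreams and trade logs alike, means grouping a stream of events per key and cutting a new session whenever the gap between events gets too long. The sketch below assumes events are (key, timestamp in seconds) pairs and uses a 30-minute inactivity gap; the slides specify neither, and the account identifiers are made up.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap_seconds=1800):
    """Group (key, timestamp) events into per-key sessions split by inactivity gaps."""
    sessions = []
    ordered = sorted(events, key=itemgetter(0, 1))
    for key, key_events in groupby(ordered, key=itemgetter(0)):
        current = []
        for _, ts in key_events:
            # A long enough pause closes the current session and starts a new one.
            if current and ts - current[-1] > gap_seconds:
                sessions.append((key, current))
                current = []
            current.append(ts)
        sessions.append((key, current))
    return sessions


# Hypothetical order events for two accounts, deliberately out of order.
events = [("acct-1", 600), ("acct-2", 100), ("acct-1", 0), ("acct-1", 4000)]
print(sessionize(events))
```

The sort-then-group structure is what makes this a natural MapReduce job: the shuffle delivers each key's events together, and the reducer walks them in time order.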
Science and Energy

1    Genomics
2    Utilities and Power Grid
3    Smart Meters
4    Biodiversity Indexing
5    Network Failures
6    Seismic Data




Genomics

 • Cost of DNA Sequencing Falling Very Fast
     – Raw data needs to be aligned and matched
     – Scientists want to collect and analyze these sequences
 • Hadoop Can Read Native Format
     – hadoop-bam Java library for manipulation of Binary Alignment/Map
     – Alignment, SNP discovery, genotyping
 • Genomic Tools Based On Hadoop
     – SEAL: distributed short read alignment
     – BlastReduce: parallel read mapping
     – Crossbow: whole genome re-sequencing analysis
     – Cloudburst: sensitive MapReduce alignment

Utilities and the Power Grid

 • Power grid is aging and maintained incrementally
     – Failures hard to predict and can have cascading effects
     – Looking at vibration of transformers over time to find patterns
 • Predicting failure of grid equipment
     – Supervised learning to scan time series data for fuzzy patterns
     – Identify likely faulting equipment for targeted replacement
 • Hadoop based tools to model equipment behavior
     – openPDC project: http://openpdc.codeplex.com
     – Lumberyard: indexing time series data for low latency fuzzy queries

Smart Meter Example Workflow

 • Looking at usage patterns in home smart meter data
     – How to educate consumers to save energy
     – Capacity planning for the grid
 • Individual analysis is critical
     – Personalized reporting to consumers
     – Predictive modeling of peak usage and potential cost savings
 • Hadoop for collection, reporting and analysis
     – Collect time series samples in Hadoop
     – Partition at various granularities and roll up reports and models

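Rolling up time series at several granularities amounts to summing readings under bucket keys of different widths. The sketch below does this for a handful of readings; the (meter id, timestamp, kWh) record layout, the meter name and the values are all made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(samples, fmt):
    """Sum kWh readings into buckets named by a strftime pattern."""
    totals = defaultdict(float)
    for meter_id, timestamp, kwh in samples:
        totals[(meter_id, timestamp.strftime(fmt))] += kwh
    return dict(totals)


# Hypothetical readings from one meter over two days.
samples = [
    ("meter-1", datetime(2012, 8, 21, 9, 0), 0.5),
    ("meter-1", datetime(2012, 8, 21, 9, 30), 0.25),
    ("meter-1", datetime(2012, 8, 21, 18, 0), 1.0),
    ("meter-1", datetime(2012, 8, 22, 9, 0), 0.75),
]

hourly = roll_up(samples, "%Y-%m-%d %H")
daily = roll_up(samples, "%Y-%m-%d")
print(hourly)
print(daily)
```

Changing the format string is all it takes to move between hourly, daily or monthly partitions, which is the "various granularities" idea in miniature.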
Biodiversity Indexing

 • Consolidation and serving of Biological data
     – Provide free and open access to biodiversity data
     – Collection, search, discovery and access to a variety of data
 • Data matching and cleansing
     – Geography, Water/land mapping
     – Dictionaries and taxonomic services
 • Data is harvested into multiple RDBMS
     – Sqoop to Hadoop for processing workflows and index generation
     – Sqoop back to MySQL for Web app serving
     – Future development is to crawl into and serve from HBase

Preventing Network Failure

 • Need to model and understand network behavior
     – Better understand how the network reacts to fluctuations
     – Discrete anomalies may, in fact, be interconnected
 • Collection and forensic analysis of emerging patterns
     – Record the data exhaust: all metrics, logs, traffic metadata
     – Identify leading indicators of component failure
 • New techniques when all data is available
     – Expand the range of indexing techniques
     – Starting with simple scans and moving to more complex data mining

Processing Seismic Data

Optimize the IO-intensive phases of seismic processing
   Incorporate additional parallelism where it makes sense
   Simplify gather/transpose operations with MapReduce
Seismic Unix for Core Algorithms
   Well-known, used at many grad programs in geophysics
   SU file format can be easily transformed for processing on HDFS
Hadoop Streaming
   Seismic Unix, SEPlib, Javaseis - non-Java code in MR
   Framework is aware of parameter files needed by SU commands




                        Copyright 2011 Cloudera Inc. All rights reserved
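Hadoop Streaming lets a mapper and reducer be any program that reads lines on standard input and writes tab-separated key/value lines on standard output, which is how non-Java code gets into MapReduce. The sketch below shows that contract in Python. The comma-separated "trace id, sample value" record layout is hypothetical and has nothing to do with the real SU file format.

```python
def mapper(lines):
    """Turn each input line into a tab-separated key/value line."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 2:
            continue  # skip malformed records
        trace_id, sample = fields[0], fields[1]
        yield f"{trace_id}\t{sample}"


def reducer(lines):
    """Sum values per key; input must arrive sorted by key, as it does after the shuffle."""
    current, total = None, 0.0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        yield f"{current}\t{total}"


# Local dry run: map, sort (standing in for the shuffle), then reduce.
mapped = list(mapper(["t1,1.5\n", "bad\n", "t1,2.5\n", "t2,4\n"]))
reduced = list(reducer(sorted(mapped)))
print(reduced)
```

In an actual streaming job, each function would sit in its own script that loops over `sys.stdin` and prints its results, and the framework would do the sorting between them.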
Retail and Manufacturing

1    Customer Churn
2    Brand and Sentiment Analysis
3    Point of Sales
4    Pricing Models
5    Customer Loyalty
6    Targeted Offers




Customer Churn Analysis

 • Understanding Customer Behavior and Preferences
     – Rapidly test and build behavioral model of customer
     – Combine disparate data sources (transactional, social, etc.)
 • Structure and analyze with Hadoop
     – Traversing usage and social graphs
     – Pattern identification and recognition to find indicators
 • Feature Extraction to find Root Causes
     – Defining attributes and modeling statistical significance
     – Combinations and sequence of attributes and actions factor in

Brands and Sentiment Analysis

 • Internet generates a lot of chatter about brands
     – Understanding what’s being said is crucial to protecting brand value
     – Facebook, Twitter generate a lot of data for a global top brand
 • Capturing and Processing direct feedback
     – Better engagement and alerting via Sentiment Analysis
     – Not yet ready for fully automated customer service
 • Hadoop handles the diverse data types and processing
     – Sources of data changing and semantics continuously evolving
     – Sophistication of algorithms is improving daily

Point of Sale Transaction Analysis

 • Lots of machine generated data available
     – Line items, stock, coupons, ads
     – Stored in various formats
 • Pattern recognition enables constant reassessment
     – Optimizing across multiple data sources
     – Demand prediction based on
 • Joining multiple data sets for more insight
     – Retail Supply Chain
     – Weather and Financial data

Pricing Models

 • Retailers have increased flexibility in pricing
     – Comparison shopping is dynamic
     – Customer weighs combined value and time to delivery
 • Understand how prices affect purchasing
     – New techniques apply such as A/B testing and spot discounts
     – Motivations can be difficult to discern, need to look for correlations
 • Combinations multiply, Hadoop provides scale to analyze
     – Bundles can have incentive discounts
     – Clustering and supervised learning to group attributes

Customer Loyalty

 • Comparison shopping is making Retail hyper-competitive
     – Discount programs, e-mail correspondence entice shoppers
     – Brand loyalty means attention to detail and service
 • Customer lifecycle is more than purchases
     – Browsing and online data used to capture customer attention
     – Loyalty programs bridge the gap between purchases
 • Reach into online channels
     – Online engagement is personalized just as in store
     – Connecting online and in store shows customer awareness

Targeted Offers

 • The checkout lane is everywhere
     – Cookies track users through ad impressions
     – Purchasing behavior is time sensitive
 • Logs collected from on-site and off-site browsing
     – Data is ingested incrementally
     – Process happens at a variety of time scales
 • Data logged to HBase as primary store
     – Some events naturally associate, others require deeper analysis
     – Random access useful for debugging algorithms

Web and e-Commerce

1    Online Media

2    Mobile
3    Online Gaming
4    Search Quality
5    Recommendations
6    Influence




Online Media

 • Centralized platform for consolidated log processing
     – Many online properties each with separate sys, ad, ops logs
     – Different standards and techniques for processing
 • Data feeds are varied
     – Advertising logs, website traffic feeds from 3rd party providers,
       system logs, application logs and other operational metrics
 • Data pipeline can be normalized
     – Cleansing, standard analytics and reporting
     – Soon an exploratory platform as well as storage across all properties

Mobile

 • Mobile advertisement platform
     – Measuring metrics: impressions, clicks, actions and conversions
     – Most metrics are arbitrary text strings (data is dirty)
 • Stringent SLAs for delivering results
     – SLA of several minutes between event and report to advertisers
     – SLA also covers data accuracy
 • Hadoop for ETL, analytics, reporting
     – HBase for serving results to advertisers
     – Mimics the popular online analytics services

Online Gaming

 • Consolidating data silos for a holistic view of users
     – Various silos of data: user reg, financial, game play, web
     – Popular games simulate real world sports
 • First goal is accessibility
     – Multiple businesses can access all data
     – Game play metrics are extremely detailed (think sensor data)
 • Second is exploratory
     – Distributions, event triggers, distinct counts and association rates
     – Compute online statistics such as leaderboards

Search Quality

 • Understand user search behavior
     – Improve service, assess quality of results
     – Understand load, identify trends, generate predictive search
 • Search query logs stored in HDFS
     – Hive based aggregation
     – Sqoop to RDBMS for end user analytics
 • Now focused on internal monitoring
     – Analytics have become a critical part of the service
     – Where are analytic needs growing?
     – What data about searches do people want to see?

Recommendations and Forecasting

 • Collect and serve personalization information
     – Wide variety of constantly changing data sources
     – Data guaranteed to be messy
 • Data ingestion includes collection of raw data
     – Filtering and fixing of poorly formatted data
     – Normalization and matching across data sources
 • Analysis looks for reliable attributes and groupings
     – Interpretation (e.g. gender by name)
     – Aggregation across likely matching identifiers
     – Identify possible predicted attributes or preferences

Influence

 • Collect a fire hose of data about social commentary
     – Personal opinions, references to opinions, links
     – Look for tracking and referencing (like very messy page rank)
 • Hadoop to bucket and prepare for analysis
     – Metadata and distinct topics
     – Social graph scoring, bot and spam detection
 • Hadoop stack used throughout
     – Pig and Java, coordinated with Oozie
     – Batch serve data in CSV and load to HBase for API servers

August 2012

Cloudera University
Sarah Sproehnle
Why invest in training?

• Maximize your investment in a new
  technology
• Make fewer mistakes by learning the best
  practices
• Cheaper and easier to cross-train than
  hire
  – Existing DBAs, Analysts and System
    Administrators can become Hadoop users
Cloudera University
 • Experience
     – We’ve trained over 12,000 people
     – Our courses incorporate the best practices that Cloudera has learned
       from supporting our customers
 • Depth of courseware
     – A comprehensive, role-based curriculum
     – We can train your entire staff in all aspects of CDH
 • Geographical coverage
     – We offer public and private classes in over 20 countries including
       US, Canada, Brazil, Germany, UK, Poland, Spain, Israel, France, The
       Netherlands, South Africa, China, India, Australia and Singapore
 • Certification
     – Available worldwide at Pearson VUE (vouchers included in our courses)
     – Certifications for Developers (CCDH), Admins (CCAH), and HBase
       (CCSHB)



©2011 Cloudera, Inc. All Rights Reserved. Confidential.
Reproduction or redistribution without written permission is prohibited.
Value proposition of private training
• 12k/day for up to 20 students
  – NEW: 8k/day for up to 10 students
  – Price includes courseware, lab materials, cert
    vouchers (for Dev, Admin, HBase), and T&E
• We can tailor a class
  – We have ~3 weeks of content that we can mix
    and match into a customized class
  – Saves the customer’s time by covering the most
    relevant topics, cutting out nonessential material
• Customer chooses location and date
• We’re under NDA
Learning paths




Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 

hadoop 101 aug 21 2012 tohug

  • 1. August 21 2012 – Toronto Hadoop User Group a.k.a. THUGs Introduction to Hadoop: Pretty Picture Version {Due credit to Todd’s Magic}
  • 2. Why are we here? • Become exposed to the core concepts of Hadoop • Understand the projects within Hadoop and how they fit together • Review Common Use Cases for Hadoop • Share beginner experiences with Hadoop • Ask a @$%$#-load of questions about Hadoop 2 ©2011 Cloudera, Inc. All Rights Reserved.
  • 3. What I won’t be able to give you… • A complete introduction to the technology (takes too long) • Enough information to begin development or implementation of Hadoop (too complicated) • Enough information to install and configure Hadoop (I recommend you start with the Cloudera VMWare image individually or Cloudera Manager for a real cluster) • Have a hands-on Pig-fest or Hive-fest (that’s a THUG meetup to come…) 3 ©2011 Cloudera, Inc. All Rights Reserved.
  • 4. Users of Cloudera Financial Retail & Web Telecom Media Consumer 4 ©2011 Cloudera, Inc. All Rights Reserved.
  • 5. Hadoop Use Cases (Advanced Analytics application | Industry | Data Processing application): Social Network Analysis | Web | Clickstream Sessionization; Content Optimization | Media | Clickstream Sessionization; Network Analytics | Telco | Mediation; Loyalty & Promotions Analysis | Retail | Data Factory; Fraud Analysis | Financial | Trade Reconciliation; Entity Analysis | Federal | SIGINT; Sequencing Analysis | Bioinformatics | Genome Mapping
  • 6. CDH File System Mount: FUSE-DFS; UI Framework: HUE; SDK: HUE SDK; Workflow: Apache Oozie; Scheduling: Apache Oozie; Metadata: Apache Hive; Query / Analytics: Apache Pig, Apache Hive, Apache Mahout; Data Integration: Apache Flume, Apache Sqoop; Fast Read/Write Access: Apache HBase; HDFS, MapReduce; Coordination: Apache ZooKeeper
  • 7. Typical Data Pipeline Marts Processing Layer Data Sources Data (Temporary) Warehouse Storage Archive 7 ©2011 Cloudera, Inc. All Rights Reserved.
  • 8. Typical Data Pipeline with Hadoop Hadoop Marts Oozie Result or Calculated Data Original Source Data Data Sources Pig Data Hive Warehouse MapReduce Sqoop Sqoop Flume HDFS 8 ©2011 Cloudera, Inc. All Rights Reserved.
  • 9. Several advantages • Store more data, cheaply • Use commodity hardware • Scale linearly, predictably • Tolerate hardware failure • Turn data into strategic asset – Ad hoc analytics – Predictive analytics 9 ©2012 Cloudera, Inc. | Company confidential
  • 10. Several more advantages • Get long term view of data • Add unstructured, semi-structured data • Change schema on the fly (late binding) • Integrate with existing infrastructure 10 ©2012 Cloudera, Inc. | Company confidential
  • 11. HDFS: Self-healing, high bandwidth [Diagram: a file split into blocks 1 to 5, with copies of each block spread across the cluster nodes] HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
  • 12. HDFS: Self-healing, high bandwidth [Diagram: the same layout with some block copies missing] HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
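The block-and-replica idea on the two HDFS slides can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how HDFS is implemented: the function names, the 100-byte block size, the four node names and the round-robin placement are all hypothetical (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Toy round-robin placement (an assumption, not the HDFS policy).

    Each block lands on `replication` distinct nodes, which requires
    replication <= len(nodes).
    """
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: dict, failed: set) -> bool:
    """A file stays readable if every block keeps at least one live copy."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


data = b"x" * 450
blocks = split_into_blocks(data, block_size=100)  # five blocks, last one short
nodes = ["node1", "node2", "node3", "node4"]      # hypothetical node names
placement = place_replicas(len(blocks), nodes)
```

With three copies per block, losing any two of the four nodes still leaves every block readable; losing three does not.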
  • 13. MapReduce: Map • Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs: e.g., (filename, line). • map() produces one or more intermediate values along with an output key from the input. [Diagram: Map Tasks emit (key, intermediate values) → Shuffle Phase → Reduce Tasks → final (key, values)]
  • 14. MapReduce: Reduce • After the map phase is over, all the intermediate values for a given output key are combined together into a list • reduce() combines those intermediate values into one or more final values for that same output key [Diagram: Map Tasks emit (key, intermediate values) → Shuffle Phase → Reduce Tasks → final (key, values)]
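The map, shuffle and reduce steps described on these two slides can be sketched as a single-process Python function. It only illustrates the data flow; it is not Hadoop's API, and `run_mapreduce`, `line_length_map` and `sum_reduce` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle and reduce in one process over (key, value) records."""
    # Map + shuffle: collect every intermediate value under its output key.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce: each output key sees the full list of its intermediate values.
    return {
        out_key: reduce_fn(out_key, intermediate[out_key])
        for out_key in sorted(intermediate)
    }


def line_length_map(filename, line):
    """Input is a (filename, line) pair, as on the slide; emit the line length."""
    yield filename, len(line)


def sum_reduce(key, values):
    """Combine the intermediate values for one key into a single total."""
    return sum(values)


records = [("a.txt", "hello"), ("b.txt", "hi"), ("a.txt", "world!")]
totals = run_mapreduce(records, line_length_map, sum_reduce)
```

Here `totals` maps each file name to the number of characters seen for it. In a real cluster the map calls and the reduce calls would run as separate tasks on different machines.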
  • 15. MapReduce: Execution 15 ©2011 Cloudera, Inc. All Rights Reserved.
  • 16. MapReduce: WordCount Input text: "The cat sat on the mat. The aardvark sat on the sofa." Mapping: (The, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1), (The, 1), (aardvark, 1), (sat, 1), (on, 1), (the, 1), (sofa, 1). Shuffling: aardvark [1], cat [1], mat [1], on [1, 1], sat [1, 1], sofa [1], the [1, 1, 1, 1]. Reducing / Final Result: (aardvark, 1), (cat, 1), (mat, 1), (on, 2), (sat, 2), (sofa, 1), (the, 4).
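The WordCount slide can be reproduced with a small Python sketch. Lower-casing words and stripping punctuation are assumptions made so that "The" and "the" land in the same bucket, which matches the slide's final count of 4; the function names are hypothetical.

```python
import re
from collections import defaultdict


def map_words(line):
    """Mapping: emit (word, 1) for every word in the line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffling: group the emitted 1s by word."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return dict(sorted(groups.items()))


def reduce_counts(groups):
    """Reducing: sum the list of 1s for each word."""
    return {word: sum(ones) for word, ones in groups.items()}


lines = ["The cat sat on the mat.", "The aardvark sat on the sofa."]
pairs = [pair for line in lines for pair in map_words(line)]
groups = shuffle(pairs)
counts = reduce_counts(groups)
```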
  • 17. Sqoop: RDBMS to HDFS 17 ©2011 Cloudera, Inc. All Rights Reserved.
  • 18. Sqoop: HDFS to RDBMS 18 ©2011 Cloudera, Inc. All Rights Reserved.
  • 19. FlumeNG: High-level Architecture [Diagram: several Clients send events to Agents; inside an Agent, a Source feeds Channel 1 and Channel 2, which are drained by Sink 1 and Sink 2] Examples: Sources: Avro, netcat, exec; Channels: memory, JDBC; Sinks: HDFS, Avro
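The source, channel and sink roles named on this slide can be sketched as plain Python objects. This is a conceptual toy, not Flume's API or configuration format: the class names are hypothetical, and copying every event to both channels is an assumption about what the diagram shows.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer that sits between a source and a sink."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        if len(self.events) >= self.capacity:
            raise OverflowError("channel is full")
        self.events.append(event)

    def take(self):
        """Return the oldest buffered event, or None when the channel is empty."""
        return self.events.popleft() if self.events else None


class ListSink:
    """Stand-in for an HDFS or Avro sink: it just remembers what it delivered."""

    def __init__(self, channel):
        self.channel = channel
        self.delivered = []

    def drain(self):
        while (event := self.channel.take()) is not None:
            self.delivered.append(event)


def run_source(lines, channels):
    """Stand-in for a source: copy each incoming event into every channel."""
    for line in lines:
        for channel in channels:
            channel.put(line)


channel_1, channel_2 = MemoryChannel(), MemoryChannel()
sink_1, sink_2 = ListSink(channel_1), ListSink(channel_2)
run_source(["event a", "event b"], [channel_1, channel_2])
sink_1.drain()
sink_2.drain()
```

The channel is what decouples the two sides: the source can keep accepting events while a sink is slow, up to the channel's capacity.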
  • 20. HBase: Table Structure Column family “contents” Column family “anchor_text” Row Key Column Timestamp Cell Column Timestamp Cell Key Key Com.cloudera.info 1273716197868 <html> Bar.com 1273871824184 Cloudera!... … Com.cloudera.www 1273746289103 <html> Baz.org 1273871962874 Hadoop!... … Com.foo.www 1273698729045 <html> … Com.foo.www 1273699734191 <html> Bar.gov 1273879456211 Edu.foo… … … 20 ©2011 Cloudera, Inc. All Rights Reserved.
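The table layout on this slide (row key, column family, column key, timestamp, cell) can be modelled as nested dictionaries. This is a sketch of the data model only, not the HBase client API; the helper names and the column key "html" are hypothetical, and the row keys are written in lower case here although the slide shows them capitalized.

```python
def put(table, row, family, column, timestamp, value):
    """Store one cell version under row -> family -> column -> timestamp."""
    table.setdefault(row, {}).setdefault(family, {}).setdefault(column, {})[timestamp] = value


def get_latest(table, row, family, column):
    """Return (timestamp, value) of the newest version, or None if absent."""
    versions = table.get(row, {}).get(family, {}).get(column, {})
    if not versions:
        return None
    latest = max(versions)
    return latest, versions[latest]


def scan_rows(table, start, stop):
    """Rows are kept in key order, so a range scan is a sorted slice."""
    return [row for row in sorted(table) if start <= row < stop]


table = {}
put(table, "com.foo.www", "contents", "html", 1273698729045, "<html>v1")
put(table, "com.foo.www", "contents", "html", 1273699734191, "<html>v2")
put(table, "com.cloudera.www", "contents", "html", 1273746289103, "<html>")
put(table, "com.cloudera.info", "anchor_text", "bar.com", 1273871824184, "Cloudera!")

latest = get_latest(table, "com.foo.www", "contents", "html")
cloudera_rows = scan_rows(table, "com.cloudera", "com.cloudera~")
```

Reversed domain names as row keys keep pages from the same site next to each other, which is why the range scan over the "com.cloudera" prefix returns both Cloudera rows.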
  • 21. HBase: Architecture 21 ©2011 Cloudera, Inc. All Rights Reserved.
  • 22. Hive SQL-based data warehousing application • Language is SQL-like • Features for analyzing very large data sets • Partition columns, Sampling, Buckets SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5;
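To see what the join on this slide returns, the same query shape can be run against two tiny in-memory tables with Python's built-in sqlite3 module. This is SQLite rather than Hive, so it shows only the SQL semantics. The second table name `kjv`, the aliases `s` and `k`, and all rows and frequencies are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE shakespeare (word TEXT, freq INTEGER);
    CREATE TABLE kjv (word TEXT, freq INTEGER);
    -- Made-up frequencies, for illustration only.
    INSERT INTO shakespeare VALUES ('the', 10), ('love', 7), ('zounds', 3);
    INSERT INTO kjv VALUES ('the', 20), ('love', 3), ('begat', 9);
    """
)

# Words present in both tables whose frequency in the first is at least 5.
rows = conn.execute(
    """
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN kjv k ON (s.word = k.word)
    WHERE s.freq >= 5
    ORDER BY s.word
    """
).fetchall()
conn.close()
```

'zounds' is dropped by the frequency filter and 'begat' by the join, leaving one row each for 'love' and 'the'.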
  • 23. Pig Data-flow oriented language: “Pig Latin” • Datatypes include sets, associative arrays, tuples • High-level language for routing data, allows easy integration of Java for complex tasks emps = LOAD 'people.txt' AS (id, name, salary); rich = FILTER emps BY salary > 200000; sorted_rich = ORDER rich BY salary DESC; STORE sorted_rich INTO 'rich_people.txt';
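The four Pig statements on this slide form a LOAD, FILTER, ORDER, STORE pipeline. A rough Python equivalent on in-memory text makes each step explicit. The sample rows are invented, and tab-separated input is assumed because that is the default delimiter of Pig's standard loader; nothing here runs on Hadoop.

```python
import csv
import io

# Invented contents standing in for people.txt: id, name, salary (tab-separated).
PEOPLE_TXT = "1\talice\t250000\n2\tbob\t90000\n3\tcarol\t310000\n"


def load(text):
    """LOAD ... AS (id, name, salary): parse rows into typed tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(int(emp_id), name, int(salary)) for emp_id, name, salary in reader]


def store(rows):
    """STORE ... INTO: serialise rows back to tab-separated text."""
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


emps = load(PEOPLE_TXT)
rich = [emp for emp in emps if emp[2] > 200000]                   # FILTER ... BY
sorted_rich = sorted(rich, key=lambda emp: emp[2], reverse=True)  # ORDER ... DESC
rich_people_txt = store(sorted_rich)
```

Pig would compile the same four steps into MapReduce jobs, so the script stays this short even when the input no longer fits on one machine.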
  • 24. Oozie Workflow/coordination service to manage data processing jobs for Hadoop 24 ©2011 Cloudera, Inc. All Rights Reserved.
  • 25. Oozie Workflow/coordination service to manage data processing jobs for Hadoop 25 ©2011 Cloudera, Inc. All Rights Reserved.
  • 26. Hadoop Security • Authentication is secured by Kerberos v5 and integrated with LDAP • Hadoop server can ensure that users and groups are who they say they are • Job Control includes Access Control Lists, which means Jobs can specify who can view logs, counters, configurations and who can modify a job • Tasks now run as the user who launched the job
  • 27. Typical Use Cases ©2011 Cloudera, Inc. All Rights Reserved. 27
  • 28. Common Challenges 1 Network Analysis and Sessionization 2 Content Optimization and Engagement Modeling 3 Usage Analysis and Mediation 4 Entity Surveillance and Signal Monitoring 5 Recommendations and Modeling 6 Loyalty, Promotion Analysis and Targeting 7 Fraud Analysis, Reconciliation and Risk 8 Time series Analysis, Mapping and Modeling 28 ©2011 Cloudera, Inc. All Rights Reserved.
  • 29. What Can Hadoop Do For You? Two Core Use Cases Applied Across Verticals (Advanced Analytics industry term | Vertical | Data Processing industry term): Social Network Analysis | Web | Clickstream Sessionization; Content Optimization | Media | Engagement; Network Analytics | Telco | Mediation; Loyalty & Promotions Analysis | Retail | Data Factory; Fraud Analysis | Financial | Trade Reconciliation; Entity Analysis | Federal | SIGINT; Sequencing Analysis | Bioinformatics | Genome Mapping
  • 30. Financial Services 1 Customer Risk Analysis 2 Surveillance and Fraud Detection 3 Central Data Repository 4 Personalization and Asset Management 5 Market Risk Modeling 6 Trade Performance Analytics 30 ©2011 Cloudera, Inc. All Rights Reserved.
  • 31. Customer Risk Analysis Build comprehensive data picture of customer side risk Publish a consolidated set of attributes for analysis Map ratings across products Parse and aggregate data from different sources Credit and debit cards, product payments, deposits and savings Banking activity, browsing behavior, call logs, e-mails and chats Merge data into a single view A “fuzzy join” among data sources Structure and normalize attributes Sentiment analysis, pattern recognition
  • 32. Surveillance and Fraud Detection Trade surveillance records activity in a central repository Centralized logging across all execution platforms Structured and raw log data from multiple applications Pattern recognition detects anomalies/harmful behavior Feature set and timeline vector are very dynamic Schema on read provides flexibility for analysis Data is primarily served and processed in HDFS with MR Data filtering and projection in Pig and Hive Statistical modeling of data sets in R or SAS
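The "fuzzy join" mentioned on this slide means matching records across sources that share no exact key. A minimal Python sketch using the standard library's difflib is shown below. The record layout, the normalisation rule and the 0.8 threshold are assumptions for illustration; the slide does not say which matching technique was used.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case, drop punctuation and collapse whitespace before comparing."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity score between 0.0 and 1.0 on the normalised names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_join(left, right, threshold=0.8):
    """Pair each left record with its most similar right record, if close enough."""
    matches = []
    for left_rec in left:
        best = max(right, key=lambda rec: similarity(left_rec["name"], rec["name"]))
        if similarity(left_rec["name"], best["name"]) >= threshold:
            matches.append((left_rec, best))
    return matches


# Hypothetical records from two sources that lack a shared customer id.
cards = [{"name": "Jane A. Smith"}, {"name": "Robert Jones"}, {"name": "Wei Chen"}]
deposits = [{"name": "jane a smith"}, {"name": "Rob Jones"}, {"name": "Maria Garcia"}]
matched = fuzzy_join(cards, deposits)
matched_names = [(l["name"], r["name"]) for l, r in matched]
```

This compares every left record with every right record, which is fine for a sketch; at Hadoop scale the comparison would typically be restricted to candidate groups first.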
  • 33. Central Data Repository Financial Data messy due to many interacting systems Personal data is obfuscated for security and records get out of sync Trades need to be “sessionized” into accounts and products Discrepancies are difficult to reconcile, need to track corrections Hadoop is a centralized platform for data collection Single source for data, processing happens on the platform Metadata used to track information lifecycle Workflows run and monitor data transformation pipelines Data served via APIs or in Batch Single version of the truth, data processed and cleansed centrally Clear audit trail of data dependencies and usage 33 Copyright 2010 Cloudera Inc. All rights reserved
  • 34. Personalization and Asset Mgmt Institutional and personal investing services Arms investor with sophisticated models for their positions Success measured by upsell and conversion (as well as profit) Data analysis across distinct data sources Market data and individual assets by investor Investor strategy, goals and interactive behavior Data sources combined in HDFS Models written in Pig with UDFs and generated regularly Reports for sales and fed into online recommendation system 34 ©2011 Cloudera, Inc. All Rights Reserved.
  • 35. Market Risk Modeling Evaluating asset risk is very data intensive Trade volumes have increased dramatically Classic indicators at the daily level don’t provide a clear picture Trends across complex instruments can be hard to spot Models require massive brute force calculation Multiple models built in batch and in parallel Data is primarily structured and sourced from RDBMS Transactional data sqooped to combine with market feeds Resulting predictions sqooped and served via RDBMS 35 ©2011 Cloudera, Inc. All Rights Reserved.
  • 36. Trade Performance Analytics Increased Demands on Trade Analytics Regulatory requirements for best price trading across exchanges Increased competition and scrutiny adds a focus on optimization Trade Analytics becomes a Clickstream problem Trade execution systems include order trails and execution logs Sessionized across order systems and combined with system logs Processing, Analysis and Audit Trail all in Hadoop KPIs summarized as regular reports written in Hive Data available for historical analysis and discovery 36 ©2011 Cloudera, Inc. All Rights Reserved.
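Sessionizing, which this slide applies to order trails and execution logs, groups a stream of timestamped events per key and starts a new session whenever the gap between consecutive events exceeds a threshold. The sketch below shows that rule in Python; the order ids, the timestamps in seconds and the 60-second gap are hypothetical.

```python
from collections import defaultdict


def sessionize(events, gap_seconds):
    """Split each key's events into sessions separated by long idle gaps.

    `events` is an iterable of (key, timestamp) pairs in any order.
    Returns {key: [[timestamps of session 1], [timestamps of session 2], ...]}.
    """
    by_key = defaultdict(list)
    for key, timestamp in events:
        by_key[key].append(timestamp)

    sessions = {}
    for key, stamps in by_key.items():
        stamps.sort()
        current = [stamps[0]]
        finished = []
        for timestamp in stamps[1:]:
            if timestamp - current[-1] > gap_seconds:
                finished.append(current)  # gap too long: close the session
                current = [timestamp]
            else:
                current.append(timestamp)
        finished.append(current)
        sessions[key] = finished
    return sessions


events = [
    ("order-1", 0), ("order-1", 20), ("order-2", 5),
    ("order-1", 500), ("order-1", 510),
]
sessions = sessionize(events, gap_seconds=60)
```

In MapReduce terms the grouping by key is the shuffle and the gap-splitting loop is the reduce function.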
  • 37. Science and Energy 1 Genomics 2 Utilities and Power Grid 3 Smart Meters 4 Biodiversity Indexing 5 Network Failures 6 Seismic Data 37 ©2011 Cloudera, Inc. All Rights Reserved.
  • 38. Genomics Cost of DNA Sequencing Falling Very Fast Raw data needs to be aligned and matched Scientists want to collect and analyze these sequences Hadoop Can Read Native Format hadoop-bam Java library for manipulation of Binary Alignment/Map Alignment, SNP discovery, genotyping Genomic Tools Based On Hadoop SEAL – distributed short read alignment BlastReduce – parallel read mapping Crossbow – whole genome re-sequencing analysis Cloudburst - sensitive MapReduce alignment 38 Copyright 2010 Cloudera Inc. All rights reserved
  • 39. Utilities and the Power Grid Power grid is aging and maintained incrementally Failures hard to predict and can have cascading effects Looking at vibration of transformers over time to find patterns Predicting failure of grid equipment Supervised learning to scan time series data for fuzzy patterns Identify likely faulting equipment for targeted replacement Hadoop based tools to model equipment behavior openPDC project: http://openpdc.codeplex.com Lumberyard - indexing time series data for low latency fuzzy queries
  • 40. Smart Meter Example Workflow Looking at usage patterns in home smart meter data How to educate consumers to save energy Capacity planning for the grid Individual analysis is critical Personalized reporting to consumers Predictive modeling of peak usage and potential cost savings Hadoop for collection, reporting and analysis Collect time series samples in Hadoop Partition at various granularities and roll up reports and models 40 Copyright 2010 Cloudera Inc. All rights reserved
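"Partition at various granularities and roll up" can be illustrated by summing meter readings per hour, day or month. The sketch below is a single-machine Python version; the meter id, the readings and the three granularity names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# strftime patterns that truncate a timestamp to the chosen bucket.
BUCKET_FORMATS = {"hour": "%Y-%m-%d %H:00", "day": "%Y-%m-%d", "month": "%Y-%m"}


def roll_up(samples, granularity):
    """Sum kWh per (meter, time bucket) at the requested granularity."""
    bucket_format = BUCKET_FORMATS[granularity]
    totals = defaultdict(float)
    for meter_id, timestamp, kwh in samples:
        totals[(meter_id, timestamp.strftime(bucket_format))] += kwh
    return dict(totals)


samples = [
    ("m1", datetime(2012, 8, 21, 9, 0), 0.5),
    ("m1", datetime(2012, 8, 21, 9, 30), 0.25),
    ("m1", datetime(2012, 8, 21, 18, 0), 1.0),
    ("m1", datetime(2012, 8, 22, 9, 0), 0.5),
]
hourly = roll_up(samples, "hour")
daily = roll_up(samples, "day")
monthly = roll_up(samples, "month")
```

The same samples feed all three reports; only the bucket key changes, which is what makes re-partitioning cheap once the raw time series is stored.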
  • 41. Biodiversity Indexing Consolidation and serving of Biological data Provide free and open access to biodiversity data Collection, search, discovery and access to a variety of data Data matching and cleansing Geography, Water/land mapping Dictionaries and taxonomic services Data is harvested into multiple RDBMS Sqoop to Hadoop for processing workflows and index generation Sqoop back to MySQL for Web app serving Future development is to crawl into and serve from HBase 41 ©2011 Cloudera, Inc. All Rights Reserved.
  • 42. Preventing Network Failure Need to Model and understand Network behavior Better understanding how the network reacts to fluctuations Discrete anomalies may, in fact, be interconnected Collection and forensic analysis of emerging patterns Record the data exhaust – all metrics, logs, traffic metadata Identify leading indicators of component failure New techniques when all data is available Expand the range of indexing techniques Starting with simple scans to more complex data mining 42 ©2011 Cloudera, Inc. All Rights Reserved.
  • 43. Processing Seismic Data Optimize the IO-intensive phases of seismic processing Incorporate additional parallelism where it makes sense Simplify gather/transpose operations with MapReduce Seismic Unix for Core Algorithms Well-known, used at many grad programs in geophysics SU file format can be easily transformed for processing on HDFS Hadoop Streaming Seismic Unix, SEPlib, Javaseis - non-Java code in MR Framework is aware of parameter files needed by SU commands Copyright 2011 Cloudera Inc. All rights reserved
  • 44. Retail and Manufacturing 1 Customer Churn 2 Brand and Sentiment Analysis 3 Point of Sales 4 Pricing Models 5 Customer Loyalty 6 Targeted Offers 44 ©2011 Cloudera, Inc. All Rights Reserved.
  • 45. Customer Churn Analysis Understanding Customer Behavior and Preferences Rapidly test and build behavioral model of customer Combine disparate data sources (transactional, social, etc.) Structure and analyze with Hadoop Traversing usage and social graphs Pattern identification and recognition to find indicators Feature Extraction to find Root Causes Defining attributes and modeling statistical significance Combinations and sequence of attributes and actions factor in
  • 46. Brands and Sentiment Analysis Internet generates a lot of chatter about brands Understanding what’s being said is crucial to protecting brand value Facebook, Twitter generate a lot of data for a global top brand Capturing and Processing direct feedback Better engagement and alerting via Sentiment Analysis Not yet ready for fully automated customer service Hadoop handles the diverse data types and processing Sources of data changing and semantics continuously evolving Sophistication of algorithms is improving daily 46 Copyright 2010 Cloudera Inc. All rights reserved
  • 47. Point of Sale Transaction Analysis Lots of machine generated data available Line items, stock, coupons, ads Stored in various formats Pattern recognition enables constant reassessment Optimizing across multiple data sources Demand prediction based on Joining multiple data sets for more insight Retail Supply Chain Weather and Financial data
  • 48. Pricing Models Retailers have increased flexibility in pricing Comparison shopping is dynamic Customer weighs combined value and time to delivery Understand how prices affect purchasing New techniques apply such as A/B testing and spot discounts Motivations can be difficult to discern, need to look for correlations Combinations multiply, Hadoop provides scale to analyze Bundles can have incentive discounts Clustering and supervised learning to group attributes 48 ©2011 Cloudera, Inc. All Rights Reserved.
  • 49. Customer Loyalty Comparison shopping is making Retail hyper-competitive Discount programs, e-mail correspondence entice shoppers Brand loyalty means attention to detail and service Customer lifecycle is more than purchases Browsing and online data used to capture customer attention Loyalty programs bridge the gap between purchases Reach into online channels Online engagement is personalized just as in store Connecting online and in store shows customer awareness 49 ©2011 Cloudera, Inc. All Rights Reserved.
  • 50. Targeted Offers The checkout lane is everywhere Cookies track users through ad impressions Purchasing behavior is time sensitive Logs collected from on-site and off-site browsing Data is ingested incrementally Process happens at a variety of time scales Data logged to HBase as primary store Some events naturally associate, others require deeper analysis Random access useful for debugging algorithms 50 ©2011 Cloudera, Inc. All Rights Reserved.
  • 51. Web and e-Commerce 1 Online Media 2 Mobile 3 Online Gaming 4 Search Quality 5 Recommendations 6 Influence 51 ©2011 Cloudera, Inc. All Rights Reserved.
  • 52. Online Media Centralized platform for consolidated log processing Many online properties each with separate sys, ad, ops logs Different standards and techniques for processing Data feeds are varied Advertising logs, website traffic feeds from 3rd party providers, system logs, application logs and other operational metrics Data pipeline can be normalized Cleansing, standard analytics and reporting Soon an exploratory platform as well as storage across all properties 52 ©2011 Cloudera, Inc. All Rights Reserved.
  • 53. Mobile Mobile advertisement platform Measuring metrics impressions, clicks, actions and conversions. Most metrics are arbitrary text strings (data is dirty) Stringent SLAs for delivering results SLA of several minutes between event and report to advertisers SLA also covers data accuracy Hadoop for ETL, Analytics, reporting HBase for serving results to advertisers Mimics the popular online analytics services 53 ©2011 Cloudera, Inc. All Rights Reserved.
  • 54. Online Gaming Consolidating data silos for a holistic view of users Various silos of data: user reg, financial, game play, web Popular games simulate real world sports First goal is accessibility Multiple businesses can access all data Game play metrics are extremely detailed (think sensor data) Second is exploratory Distributions, event triggers, distinct counts and association rates Compute online statistics such as leaderboards
  • 55. Search Quality Understand user search behavior Improve service, assess quality of results Understand load, identify trends, generate predictive search Search query logs stored in HDFS Hive based aggregation Sqoop to RDBMS for end user analytics Now focused on internal monitoring Analytics have become a critical part of the service Where are analytic needs growing? What data about searches do people want to see? 55 ©2011 Cloudera, Inc. All Rights Reserved.
  • 56. Recommendations and Forecasting Collect and serve personalization information Wide variety of constantly changing data sources Data guaranteed to be messy Data ingestion includes collection of raw data Filtering and fixing of poorly formatted data Normalization and matching across data sources Analysis looks for reliable attributes and groupings Interpretation (e.g. gender by name) Aggregation across likely matching identifiers Identify possible predicted attributes or preferences 56 Copyright 2010 Cloudera Inc. All rights reserved
  • 57. Influence Collect a fire hose of data about social commentary Personal opinions, references to opinions, links Look for tracking and referencing (like very messy page rank) Hadoop to bucket and prepare for analysis Meta data and distinct topics Social graph scoring, bot and spam detection Hadoop stack used throughout Pig and Java, coordinated with Oozie Batch serve data in CSV and load to HBase for API servers 57 ©2011 Cloudera, Inc. All Rights Reserved.
  • 59. Why invest in training? • Maximize your investment in a new technology • Make fewer mistakes by learning the best practices • Cheaper and easier to cross-train than hire – Existing DBAs, Analysts and System Administrators can become Hadoop users
  • 60. Cloudera University • Experience – We’ve trained over 12,000 people – Our courses incorporate the best practices that Cloudera has learned from supporting our customers • Depth of courseware – A comprehensive, role-based curriculum – We can train your entire staff in all aspects of CDH • Geographical coverage – We offer public and private classes in over 20 countries including US, Canada, Brazil, Germany, UK, Poland, Spain, Israel, France, The Netherlands, South Africa, China, India, Australia and Singapore • Certification – Available worldwide at Pearson VUE (vouchers included in our courses) – Certifications for Developers (CCDH), Admins (CCAH), and HBase (CCSHB) 60 ©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
  • 61. Value proposition of private training • 12k/day for up to 20 students – NEW: 8k/day for up to 10 students – Price includes courseware, lab materials, cert vouchers (for Dev, Admin, HBase), and T&E • We can tailor a class – We have ~ 3 weeks of content that we can mix and match into a customized class – Saves the customer’s time by covering the most relevant topics, cutting out non essential material • Customer chooses location and date • We’re under NDA
  • 62. Learning paths 62 ©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.

Editor's Notes

  1. Customers experience many pain points when leveraging this architecture for Big Data. Here are 3 of the most common.
  2. Hadoop typically solves two types of problems:Advanced AnalyticsData processingThese go by different terms in different industriesThe applicability of these solutions is broadWe’ve successfully deployed Hadoop and helped solve a diverse set of business problems
  3. FinSvc companies are realizing that they need to understand the fundamental risk in their customer base.All of a bank’s working capital originals with customers.Being able to better predict fluctuations can help them optimize how to put that capital to work.
  4. FinSvc need to analyze trades both for regulatory requirements as well as for internal surveillance and detecting fraud (internal and external).To date this primarily involves looking at transactions and sampling data.Hadoop enables access to detailed data and non-transactional data.
  5. FinSvc companies have many data sources and many consumers of data.Multiple data processing paths can lead to discrepancies in data as well as redundancies in work.A central repository manages all in bound data, takes requests for processing and delivers data sets.This makes the data reliable and traceable. Also FinSvc data is messy and often needs to be updated or restated.A central location can improve tracing all the dependent data sets that need to be reprocessed.
  6. Bank is becoming increasingly competitive, very similar to retail.It used to be you banked with your location credit union for life.Now every company you have a different 401k, you have some 529s somewhere, checking, mortgage, etc.Competitive pressure has driven down fees (despite recent complaints about new fees).Banks now need to compete on what they can offer on top of the ubiquitous financial products.Enter personalized asset management – merge financial models of market trends with personalized portfolios and goals.Embarrassingly parallel, can be offered self-service or via a sales person.
  7. Assessing actual risk exposure in investments is incredible complex.Multi-tiered instruments have lots of variables.Trends that cross the instruments have complex relationships.This is all well structured data with intricate and fluid relationships.Add that the trade volumes have skyrocketed and this clearly becomes a Hadoop problem.
  8. There are regulatory requirements for trade analytics (e.g. RegNMS) that need to be audited.The margins on trades can be razor thin and there’s value in analyzing trade performance.Trade execution platforms and algorithms are incredibly complicated.This is timeseries data, which looks a lot like clickstream data.Tracing particular trades through systems – in effect sessionizing them – and comparing to performance metrics is a classic Hadoop problem.
  9. There’s a yearly revolution in life sciences every time the cost of sequencing falls and the throughput doubles.The existing HPC systems can’t keep up with the amount of data.Hadoop allows scientists to combine data and processing into one scale out gridThere are already numerous libraries available to tackle these problems
  10. A big challenge in our electrical grid is that the infrastructure has grown incrementally over the past 100 yearsWe can’t wholesale replace it – both because of cost and riskIn order to prevent brown outs and black outs caused by component failure the TVA (responsible for the east coast electrical grid) is analyzing for patterns that can predict likely failure.This uses a combination of supervised learning and time series indexing to detect and analyze how components are behaving
  11. Smart Meters are opening up a whole new world of data about how people consumeelectricity (vs how it’s delivered).There are two particular focuses initially – one is to turn this data into education to help consumers be smarter about their electrical use.The other is to help in better capacity planning.
  12. An area you might not consider as being on the cutting edge of technology is in biodiversity indexing.One of the advantages of Hadoop is that it can store any kind of data in any format.It gives you the ability to cleanse that data repeatedly and turn it into well defined structured data.If you need to adjust how you tackle that data, it’s always available in raw form.The final results can be served out of a traditional database or HBase.
  13. We relay today on networks as much as we rely on electricityThis puts a heavy strain on the underlying network infrastructure.Closely monitoring those networks results in a flood of data (the largest network we’re aware of collects several hundred TB/day).Much of the monitoring is data exhaust – not fundamentally required to operating the Network but highly indicative of how it is functioning.
  14. Seismic readings generate massive data volumes when mapping out the topology of the planet.These are typically collected on large storage farms keeping only sampled or aggregated measurements.Then they’re transferred to HPC grids to perform the complex model definition.Hadoop opens the door towards using standard well known libraries in parallel and run them on the same grid that is storing the data.This reduces the need for sampling and significantly speeds up processing.
  15. Companies have been able to analyze customer churn based on when other customers are leaving. Hadoop for the first time helps them capture behaviors leading up to customer loss to help predict when these events are likely.This gives companies more time to respond to possible customer loss.This involves traversing the social graph (customers rarely leave one at a time) and identifying and recognizing patterns that are leading indicators.
  16. Much of the discussion about brands today happens on social media. This not only affects how a company is perceived but can have a direct influence on relationships with customers and the ability to sell. Hadoop is a natural solution for gathering and contextualizing discussions about company brands and products.
  17. Point of sale analysis includes many different types of data today, from standard POS data to online, coupon-based and mixed. Companies need to track data from many different sources in different formats to understand their sales in depth. Hadoop can be used to better understand the supply chain or to incorporate external data to explain sales behaviors.
  18. It used to be that prices were set by region or season and updated periodically. Today pricing can be completely dynamic – especially for online retailers. Consumers are able to comparison shop with a few keystrokes. Customers also weigh the value of their purchase against time to delivery. Taking all these behaviors into account in a hyper-competitive market is complex. Hadoop is being used to tackle these challenges, and new techniques are being applied to understand correlations and the effects of bundles and incentive discounts, and to cluster customers by a variety of attributes, not just as one type of consumer or another.
  19. Customer loyalty used to be taken for granted. The programs were designed to help track customer purchases with finer granularity. Today customer loyalty is being used to bridge the gap between purchases. When customers can easily comparison shop, the incentives to stay with the same vendor are not clear. Loyalty programs are being designed not just to track or encourage customers to shop but to build a relationship with the customer, so that the next time they shop, they prefer the brand that has been thinking of them and their needs. Loyalty programs can also be used to make timely offers: for example, when a customer is expected to run out of a particular product, provide a coupon that offers an upsell.
  20. The Internet has expanded the world of offers from candy and magazines while you wait in the checkout line to anywhere and everywhere. Using modern ad networks, companies can track their customers after they’ve left their site. This opens up possibilities to re-capture customers who have not yet bought or to cross-sell and upsell even after the transaction is complete. Companies use technologies such as HBase to incrementally monitor where their customers are going. Algorithms can then be run on incremental data at a variety of time scales.
  21. An online media group within a larger brand name company has multiple separately branded and operated sites. Each has different systems for logs, including ad logs and ops logs, and different techniques for processing them. Hadoop provides a centralized platform for all of these properties to collect their system logs, ad logs and ops logs. Hadoop is also loaded with website feeds from 3rd party providers and operational metrics. This creates a standard platform for analytics and reporting. They will soon turn on exploratory access and will provide centralized storage services for all properties.
  22. A mobile ad platform measures standard metrics, but most of the data is arbitrary text since it can be defined by 3rd party developers. There are multiple SLAs for reporting to advertisers as well as for data accuracy. Log data is collected into HDFS and prepped, then loaded into HBase. HBase is used to serve results to advertisers in a similar fashion to general purpose online analytics services.
  23. An online gaming vendor has multiple silos, one for each user interaction (registration, payments, game play, web interaction). The most popular games are very dynamic (simulating real world sports). The first goal is to grant multiple businesses access to all of the data. In particular, the game play metrics (telemetry data) are extremely detailed, similar to sensor data. The second goal is exploratory analysis, for example looking at distributions in game play behavior or for event triggers. A lot of the initial analysis is basic count distinct on a wide variety of attributes and combinations of attributes to look for correlated behaviors. Hadoop is also used to compute online statistics such as leaderboards.
  24. Search quality is measured by the user’s ability to not only find what they want but complete the transaction or take a next step. Understanding the user’s goals is very difficult and the search trends vary over time. Fundamentally improving the service and assessing quality means logging everything into HDFS and rolling up your sleeves. This customer uses Hive mostly for aggregation and uses Sqoop to move the results into an RDBMS to publish to end users. Analytics have now become a critical part of the service (e.g. generating predictive search). Now they are focusing on where analytic needs are growing and what new data about searches the business wants to see.
  25. Recommendation engines are popular applications on Hadoop. There are a wide variety of constantly changing sources and the data is always messy. At data ingestion this requires filtering and fixing of poorly formatted data. These processes are constantly changing as the data changes. Data is then normalized and matched across data sources. In some cases this means interpretation and filling in fields; in other cases it involves aggregation across fuzzy matched identifiers. These also require quality checks.
  26. Measuring influence on the internet involves collecting a fire hose of data that includes opinions, references and links. Think of this as a very messy and very dynamic PageRank, but you’re ranking people and brands. Hadoop is used to prep all the data – identify metadata and distinct topics (which change). Hadoop is also used to score the social graph and filter out bots and spam. This is all tied together with Pig and Java and coordinated with Oozie. Data is then batch served in CSV and loaded into HBase to back an API.
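
To make the "messy PageRank over people and brands" idea in note 26 concrete, here is a minimal sketch of a PageRank-style power iteration in plain Python. The notes do not say which scoring algorithm was actually used, and the real pipeline ran as Pig and Java jobs on Hadoop, so treat this as an illustration only. The function name `rank_influence`, the damping value 0.85, and the sample edge list are all assumptions made for the example.

```python
def rank_influence(edges, damping=0.85, iterations=50):
    """Score nodes of a directed graph with a basic PageRank power iteration.

    edges: iterable of (source, target) pairs, e.g. "alice mentions brand".
    Returns a dict mapping each node to a score; the scores sum to 1.
    """
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}

    # Build the outgoing adjacency list for every node.
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Nodes with no outgoing links spread their score evenly over all nodes.
        dangling = sum(scores[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_scores = {node: base for node in nodes}

        # Every other node splits its damped score across its targets.
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * scores[src] / len(targets)
                for dst in targets:
                    new_scores[dst] += share
        scores = new_scores

    return scores


if __name__ == "__main__":
    # Hypothetical mentions: two people reference a brand, the brand replies to one.
    sample = [("alice", "brand"), ("bob", "brand"), ("brand", "alice")]
    for node, score in sorted(rank_influence(sample).items(), key=lambda kv: -kv[1]):
        print(f"{node}\t{score:.4f}")
```

In this toy graph the brand ranks highest because both people link to it. At the scale described in the notes, each iteration would be a distributed job over the edge list rather than an in-memory loop, and bot and spam filtering would happen before scoring.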