Big Data with HBase and Hadoop at Adobe
Cosmin Lehene
Programatica, November, 2010




Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe confidential.
Who am I


Cosmin Lehene

Adobe Services and Infrastructure Team = SaaS services
HBase and Hadoop contributor


clehene@adobe.com
@clehene


http://hstack.org
Why I am here today


§     Riding the elephant since 2008


§     Analytics, BI, Machine Learning
§     Images, Videos, Flash, Web, etc.




Opaque Data (logs, archives)


§     Web traffic
§     Business events
§     User interactions
§     Infrastructure data
          §  Database logs, web server logs, etc.

§     Etc.



http://commons.wikimedia.org/wiki/File:AWI-core-archive_hg.jpg





http://www.google.com/images?q=data+visualization
Can I


§     JOIN everything?
§     Increase user engagement?
§     Increase conversion rate?


§     Make $$$? :)
§     Fast and cheap?


Understand data and extract meaning
Real-time access to meaningful data




Agenda




noSQL 101
Scaling RDBMS


§     Scale up
          §  More memory

          §  More CPU

          §  Faster disks, SAN, etc.




§     Problems
          §  Expensive

          §  There’s a limit

Scaling RDBMS


§     Scale horizontally
          §  Replication (reads)

          §  Sharding/ Horizontal Partitioning (writes)

                  §    Server 1: a-m, Server 2: m-z
          §  Denormalization
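
A minimal sketch of the range-based sharding described above, routing a key by its first letter. The server names and the `route` helper are hypothetical, and the split is assumed to be a-m / n-z so that no key belongs to two servers.

```python
import bisect

# Hypothetical shard map: (last first-letter handled, server name).
# Assumption: the ranges are a-m and n-z, with no overlap.
SHARDS = [("m", "server1"), ("z", "server2")]
UPPER_BOUNDS = [bound for bound, _ in SHARDS]


def route(key: str) -> str:
    """Return the server responsible for `key`, based on its first letter."""
    if not key:
        raise ValueError("empty key")
    first = key[0].lower()
    if not "a" <= first <= "z":
        raise ValueError(f"unsupported key: {key!r}")
    # Find the first shard whose upper bound is >= the key's first letter.
    index = bisect.bisect_left(UPPER_BOUNDS, first)
    return SHARDS[index][1]
```

Every write for a given key lands on one server, which is what spreads the write load. A lookup that is not by key has no such shortcut and has to visit every shard.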




Replication




Sharding




Sharding




Sharding & Replication




Scaling RDBMS problems


§     Hard to repartition/reshard
          §  Pre-allocate shards: 2, 3, 100

§     Query each shard
§     High operational costs
§     Eventual consistency




Enter noSQL – the beginning


§     Google: BigTable
§     Amazon: Dynamo
§     Memcached




Data Models


§     Key-value
§     Columnar/Tabular
§     Document oriented
§     Graph




Architectures


§     Distributed hash tables
§     Consistent Hashing
§     Gossip
§     Vector clocks
§     Locality groups
§     Partitioning, replication
§     etc.
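
Of the techniques listed, consistent hashing is the easiest to show in a few lines. The sketch below is a toy ring with virtual nodes. The `HashRing` class, the MD5 hash and the replica count are illustrative assumptions, not taken from any of the systems named in the talk.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (any stable hash would do)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self._replicas = replicas
        self._points = []   # sorted ring positions
        self._owners = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `replicas` virtual points for `node` on the ring."""
        for i in range(self._replicas):
            point = _hash(f"{node}#{i}")
            self._owners[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        """Drop every virtual point that belongs to `node`."""
        self._points = [p for p in self._points if self._owners[p] != node]
        self._owners = {p: n for p, n in self._owners.items() if n != node}

    def lookup(self, key):
        """Return the node owning `key`: the first point clockwise from its hash."""
        index = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owners[self._points[index]]
```

When a node is removed, only the keys it owned move to other nodes. Keys owned by the surviving nodes stay where they were, which is the property that makes repartitioning cheaper than with fixed shard ranges.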
Properties


§     Scalability
§     Failover
§     Durability
§     Consistency
§     Availability
§     Partition Tolerance
§     Etc.
Cartesian Product




What do all these have in common?

§     Different data models
§     Different architectures
§     Different properties

noSQL
Hadoop




http://hadoop.apache.org

§     HDFS (distributed fs)
§     Map-reduce (distributed processing)



Adobe Media Player

    Increase video consumption




AMP

 §     Recommendations
 §     Related content
 §     Related users




Video logs

 §     X watched movie A (comedy)
 §     Y watched movie B (drama)
 §     Z watched movie C (thriller)
 §     Z watched movie A (comedy)
 §     X watched movie D (technology)
 §     Y watched movie C (thriller)
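
To compare users, logs like these first have to be turned into one vector per user. The sketch below does that in map/reduce style in a single Python process. It illustrates the idea only and is not the Hadoop job used in the project. The tuple layout and the function names are assumptions.

```python
from collections import defaultdict

# The six log lines from the slide, as (user, movie, genre) tuples.
LOGS = [
    ("X", "A", "comedy"),
    ("Y", "B", "drama"),
    ("Z", "C", "thriller"),
    ("Z", "A", "comedy"),
    ("X", "D", "technology"),
    ("Y", "C", "thriller"),
]


def map_phase(record):
    """Emit ((user, genre), 1) for each view event."""
    user, _movie, genre = record
    yield (user, genre), 1


def reduce_phase(pairs):
    """Sum the counts per (user, genre) key, as a reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return totals


def user_vectors(logs):
    """Return {user: {genre: views}} built from the reduced counts."""
    pairs = (pair for record in logs for pair in map_phase(record))
    vectors = defaultdict(dict)
    for (user, genre), count in reduce_phase(pairs).items():
        vectors[user][genre] = count
    return dict(vectors)
```

Each user ends up with a sparse vector with one dimension per genre, which is the input a similarity or clustering step needs.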


Which users are alike?

 §     Compare every 2 users?
 §     5M vectors
 §     120 dimensions
 §     Distance is not enough – needed groups
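
The slides do not name the grouping algorithm. As an assumption, the sketch below uses plain k-means, one common way to turn pairwise distances into groups without comparing every two users. Seeding the centroids with the first `k` points is a simplification made here for determinism.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids
```

Each iteration costs one distance computation per user per centroid, so the work grows with users times clusters rather than with users squared.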




How?




Cluster projections


§     1 month
§     6GB
§     700k Users
§     114 genres
§     7 nodes
§     5 hours
§     27 clusters
Game Constellations

 §     Processing Shockwave logs




Lessons learned


 Need:
           §  Fine-grained access

           §  Incremental updates

           §  Deal with changes in the original dataset

           §  Real-time data serving




http://hbase.apache.org

 §     Sparse, distributed, persistent multidimensional sorted map
 §     Column oriented store
 §     Autosharding
 §     Data locality

Data Model

  table: row: family: column: value: version

  domain.com/x.swf
                 swf:
                          swf:size = 1876 bytes | 1876 bytes
                          swf:fps = 30
                          swf:avm = 3

                 html:
                          embed = dynamic

                 status:
                          last_crawl = 2010/11/26 | last_crawl = 2010/11/25

  domain.com/y.swf
  domain.com/z.swf
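
The layout above can be read as a nested map: row, then family:column, then version, then value. The sketch below models one table that way in memory. `ToyTable` and its integer timestamps are hypothetical, meant only to show how several versions of one cell coexist. It is not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one table: row -> column -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell; older versions are kept."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cells = self._rows.get(row, {}).get(column, {})
        return sorted(cells.items(), reverse=True)[:versions]


table = ToyTable()
table.put("domain.com/x.swf", "status:last_crawl", "2010/11/25", timestamp=1)
table.put("domain.com/x.swf", "status:last_crawl", "2010/11/26", timestamp=2)
table.put("domain.com/x.swf", "swf:fps", 30, timestamp=2)
```

Rows only hold the columns that were written, which is what "sparse" means here: `domain.com/x.swf` has no `html:embed` cell until one is put.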





API


§     Get
§     Put
§     Delete
§     Scan
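
A toy store exposing the same four operations, as a sketch of why Scan is cheap when row keys are kept sorted. `SortedStore` and its method signatures are assumptions made for illustration and do not mirror the real HBase client classes.

```python
import bisect


class SortedStore:
    """Toy key-value store that keeps row keys sorted, so range scans are cheap."""

    def __init__(self):
        self._keys = []    # sorted row keys
        self._data = {}    # row key -> {column: value}

    def put(self, row, column, value):
        if row not in self._data:
            bisect.insort(self._keys, row)
            self._data[row] = {}
        self._data[row][column] = value

    def get(self, row):
        """Return a copy of the row's columns, or an empty dict if absent."""
        return dict(self._data.get(row, {}))

    def delete(self, row):
        if row in self._data:
            del self._data[row]
            self._keys.remove(row)

    def scan(self, start, stop):
        """Yield (row, columns) for start <= row < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for row in self._keys[lo:hi]:
            yield row, dict(self._data[row])
```

Because the keys are sorted, all rows sharing a prefix such as `domain.com/` sit next to each other, so a scan over that prefix reads one contiguous range instead of the whole table.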




Flash

    How is Flash used




How is Flash used in the “wild”?

 §     AVM popularity
 §     Frame rates
 §     Video formats
 §     SWF size
 §     Flex data structures
 §     …


How




How




                                                                                          max 1000


The hard way

 §     Hadoop
 §     Nutch
 §     HBase




Workflow

 §     Crawl:
           §    Nutch (seed: top-1m.csv Alexa)
           §    Detect Flash embed, JavaScript
 §     Browse:
           §    Hadoop + FF + FP (chromeless)
           §    Dump stack traces, memory, swf bytes, etc.
 §     Process:
           §    Parse stack traces, rank, etc.
 §     Export:
           §    HBase: swf table
           §    MD5, swf bytecode, memory, load time, etc.





Benefits

 §     Security fixes
 §     Optimization
 §     Prioritize based on real usage
 §     Testing




SaasBase – HBase++ as a service

 §     Data storage (HBase + HDFS)
           §  Domains, tables

           §  API: create, put, get, scan




 §     Analytics (HBase + Hadoop + query engine)
           §  Reports, dimensions, metrics

           §  API: query



photoshop.com

    Image analytics




photoshop.com




 §     1B assets (images, videos, other)
           §  120M with EXIF metadata

 §     1.5 petabytes
 §     Home-grown distributed storage




Intelligence

 §     Targeting users:
           §    Professionals or Amateurs?
           §    Where are pictures taken?



 §     Targeting partners:
           §    Popular cameras



 §     Tracking campaigns
           §    New accounts
Stats

 §     7 Machines (16 cores, 24 x 10K RPM SATA, 32GB RAM, 1Gbps)


 §     Map 700M records
 §     2hrs, 41mins
 §     Map output: 1.9B records (~80GB)



Lessons

 §     SUM, COUNT, AVG, MIN, MAX, GROUP BY, HAVING, etc.
 §     Rollup, drilldown, segmentation
 -----------------------------------------------------------


 It’s all about Dimensions & Metrics
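
A small sketch of what "dimensions and metrics" means in practice: records are grouped by a tuple of dimension values and a metric is aggregated per group. The sample records, field names and the `aggregate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical image records: two dimensions (camera, country), one metric (size).
RECORDS = [
    {"camera": "A", "country": "RO", "size": 10},
    {"camera": "A", "country": "US", "size": 20},
    {"camera": "B", "country": "RO", "size": 30},
]


def aggregate(records, dimensions, metric):
    """GROUP BY `dimensions`, returning COUNT, SUM, MIN, MAX and AVG of `metric`."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        groups[key].append(record[metric])
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }
```

Calling `aggregate(RECORDS, ["camera"], "size")` is a rollup, and adding `"country"` to the dimension list is a drilldown over the same data.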



Recap



 §     Hadoop + Mahout + PIG (User clusters)
 §     HBase + Hadoop + Nutch + MySQL (Flash analytics)
 §     HBase + Hadoop (EXIF Explorer, image analytics)




Business Catalyst

    Analytics




BC




 §     End to end platform for online businesses
 §     E-commerce, Blogging, CRM, email marketing
 §     Analytics: web traffic, affiliates, sales, etc.




Successtrophe

 §     Analytics is troublesome
           §  SQL database was slow for analytics

 §     Over 50 different reports
 §     Over 100,000 websites
 §     Billions of page views




Requirements

 §     Fast incremental processing
 §     Custom reporting
 §     Filtering, segmentation, rollups, drilldowns
 §     Variable time ranges


 §  Fast


Solution

 §     Continuous processing (every 10 minutes)
 §     Report definitions: dimensions, metrics
 §     Real-time queries: directly from HBase




Workflow

 §     Import logs -> HBase
 §     Incrementally process/index last 24 hours
 §     Serve from HBase
           §  Index scans

           §  Runtime aggregation
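
One way such a workflow could look, as a sketch under stated assumptions: counters are pre-aggregated into time buckets under sortable row keys, and a query scans the key range for its time window and sums the buckets at read time. The key layout, the 10-minute bucket size and the `ReportIndex` class are hypothetical and not taken from the actual system.

```python
import bisect

BUCKET_SECONDS = 600  # assumed 10-minute buckets


def index_key(report, dimension, timestamp):
    """Row key: report, dimension value, zero-padded bucket start (sorts by time)."""
    bucket = timestamp - timestamp % BUCKET_SECONDS
    return f"{report}|{dimension}|{bucket:010d}"


class ReportIndex:
    """Toy index: sorted row keys mapping to pre-aggregated counters."""

    def __init__(self):
        self._keys = []      # sorted row keys
        self._counts = {}    # row key -> pre-aggregated metric

    def record(self, report, dimension, timestamp, amount=1):
        """Incremental processing step: bump the counter for one bucket."""
        key = index_key(report, dimension, timestamp)
        if key not in self._counts:
            bisect.insort(self._keys, key)
            self._counts[key] = 0
        self._counts[key] += amount

    def query(self, report, dimension, start, end):
        """Runtime aggregation: scan buckets in [start, end) and sum them.

        `start` and `end` are rounded down to bucket boundaries.
        """
        lo = bisect.bisect_left(self._keys, index_key(report, dimension, start))
        hi = bisect.bisect_left(self._keys, index_key(report, dimension, end))
        return sum(self._counts[k] for k in self._keys[lo:hi])
```

A variable time range then costs one scan over as many buckets as the range covers, regardless of how many raw log lines fed those buckets.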




Stats

 §     1 datacenter, 10 months of data processed in 1 hour, 24 minutes
 §     > 3 Billion report items generated




Lessons

 §     UNIQUE is harder
           §  E.g.: unique visitors, visitor loyalty

 §     Space vs. time
 §     Sorting magic
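
Why UNIQUE is harder than SUM or COUNT, as a small sketch: page views from separate time buckets can simply be added, but unique visitors cannot, because the same visitor may appear in several buckets. The bucket labels and visitor IDs below are hypothetical.

```python
# Hypothetical visitor IDs seen in two consecutive time buckets.
BUCKETS = {
    "10:00": ["u1", "u2", "u3"],
    "10:10": ["u2", "u3", "u4"],
}


def page_views(buckets):
    """Additive metric: per-bucket counts can simply be summed."""
    return sum(len(visitors) for visitors in buckets.values())


def naive_unique_visitors(buckets):
    """Wrong: summing per-bucket uniques double-counts returning visitors."""
    return sum(len(set(visitors)) for visitors in buckets.values())


def unique_visitors(buckets):
    """Correct: merge the per-bucket sets, which costs space per bucket."""
    seen = set()
    for visitors in buckets.values():
        seen.update(visitors)
    return len(seen)
```

Getting the right answer means keeping the visitor sets, or recomputing from raw data, rather than a single number per bucket. That is the space versus time trade-off in its simplest form.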




Not just web analytics


 X Analytics


 §     Feed in any file format (w3c, apache, tsv, etc.)
 §     Tag the dimensions and metrics
 §     Process (incremental)
 §     Query in real-time


Nothing but the hstack

 §     structured data storage: HBase
 §     file storage: HDFS
 §     data processing: Hadoop




Conclusions

 §     Keep data
 §     Understand data
 §     Explore data
 §     Extract meaning




http://hstack.org
http://hbase.apache.org
http://hadoop.apache.org
http://mahout.apache.org
http://nutch.apache.org
                                                                                          ®	





Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe con dential.   72
                                                                                          7

Mais conteúdo relacionado

Mais procurados

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInDataWorks Summit
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®confluent
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponCloudera, Inc.
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksKnoldus Inc.
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSDatabricks
 

HBase and Hadoop at Adobe

  • 1. Big Data with HBase and Hadoop at Adobe
      Cosmin Lehene
      Programatica, November, 2010
      Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe confidential.
  • 2. Who am I
      Cosmin Lehene
      Adobe Services and Infrastructure Team = SaaS services
      HBase and Hadoop contributor
      clehene@adobe.com
      @clehene
      http://hstack.org
  • 3. Why I am here today
      §  Riding the elephant since 2008
      §  Analytics, BI, Machine Learning
      §  Images, Videos, Flash, Web, etc.
  • 4. Opaque Data (logs, archives)
      §  Web traffic
      §  Business events
      §  User interactions
      §  Infrastructure data
          §  Database logs, web server logs, etc.
      §  Etc.
  • 5. http://commons.wikimedia.org/wiki/File:AWI-core-archive_hg.jpg
  • 6. http://www.google.com/images?q=data+visualization
  • 7. Can I
      §  JOIN everything?
      §  Increase user engagement?
      §  Increase conversion rate?
      §  Make $$$? :)
      §  Fast and cheap?
  • 8. Understand data and extract meaning
      Real-time access to meaningful data
  • 9. Agenda
  • 10. noSQL 101
  • 11. Scaling RDBMS
      §  Scale up
      §  More memory
      §  More CPU
      §  Faster disks, SAN, etc.
      §  Problems
      §  Expensive
      §  There’s a limit
  • 12. Scaling RDBMS
      §  Scale horizontally
      §  Replication (reads)
      §  Sharding / Horizontal Partitioning (writes)
      §  Server 1: a-m, Server 2: m-z
      §  Denormalization
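The range split named on the slide above (Server 1: a-m, Server 2: m-z) can be sketched as a lookup on the first letter of the key. This is only an illustration: the server names and the `route` helper are hypothetical, and because the slide lists "m" in both ranges, the sketch assumes keys starting with "m" go to the second server.

```python
import bisect

# Hypothetical shard map: the first letter at which each server's range starts.
# Mirrors the slide's split: server1 owns a-l, server2 owns m-z (assumption).
SHARD_STARTS = ["a", "m"]
SHARD_SERVERS = ["server1", "server2"]


def route(key: str) -> str:
    """Return the server that owns `key` under range partitioning."""
    first = key[:1].lower()
    # bisect_right - 1 is the last range whose start is <= the key's first letter.
    index = bisect.bisect_right(SHARD_STARTS, first) - 1
    # Keys that sort before "a" (digits, empty string) fall back to the first shard.
    return SHARD_SERVERS[max(index, 0)]


print(route("alice"), route("mike"), route("zoe"))
```

Adding a third server means changing `SHARD_STARTS` and moving every row whose range changed, which is the repartitioning cost the later "Scaling RDBMS problems" slide refers to.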
  • 13. Replication
  • 14. Sharding
  • 15. Sharding
  • 16. Sharding & Replication
  • 17. Scaling RDBMS problems
      §  Hard to repartition/reshard
      §  Pre allocate shards 2, 3, 100
      §  Query each shard
      §  High operational costs
      §  Eventual consistency
  • 18. Enter noSQL – the beginning
      §  Google: BigTable
      §  Amazon: Dynamo
      §  Memcached
  • 19. Data Models
      §  Key-value
      §  Columnar/Tabular
      §  Document oriented
      §  Graph
  • 20. Architectures
      §  Distributed hash tables
      §  Consistent Hashing
      §  Gossip
      §  Vector clocks
      §  Locality groups
      §  Partitioning, replication
      §  etc.
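Of the architectures listed above, consistent hashing is the easiest to show in a few lines. The sketch below is a generic hash ring with virtual nodes, not code from the talk or from any particular store; the `HashRing` class, the node names and the choice of SHA-256 are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring `replicas` times to even out the load.
        ring = []
        for node in nodes:
            for i in range(replicas):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._nodes = [n for _, n in ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        index = bisect.bisect_left(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[index]


before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user{i}" for i in range(1000)]
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

Unlike the fixed a-m / m-z split, adding a node here only moves the keys that the new node takes over; every other key stays where it was.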
  • 21. Properties
      §  Scalability
      §  Failover
      §  Durability
      §  Consistency
      §  Availability
      §  Partition Tolerance
      §  Etc.
  • 22. Cartesian Product
  • 23. What do all these have in common
      noSQL
      §  Different data models
      §  Different architectures
      §  Different properties
  • 24. Hadoop
      http://hadoop.apache.org
      §  HDFS (distributed fs)
      §  Map-reduce (distributed processing)
  • 25. Adobe Media Player
      Increase video consumption
  • 26. AMP
      §  Recommendations
      §  Related content
      §  Related users
  • 27. Video logs
      §  X watched movie A (comedy)
      §  Y watched movie B (drama)
      §  Z watched movie C (thriller)
      §  Z watched movie A (comedy)
      §  X watched movie D (technology)
      §  Y watched movie C (thriller)
  • 28. Which users are alike?
      §  Compare every 2 users?
      §  5M vectors
      §  120 dimensions
      §  Distance is not enough – needed groups
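To make the "vectors" and "distance" on the slide above concrete, the sketch below turns the six log lines from the "Video logs" slide into per-user genre vectors and compares two users with cosine similarity. The choice of cosine similarity and the helper names are assumptions; the slides do not say which distance was used.

```python
import math
from collections import Counter

# The six events from the "Video logs" slide, reduced to (user, genre).
LOG = [
    ("X", "comedy"),
    ("Y", "drama"),
    ("Z", "thriller"),
    ("Z", "comedy"),
    ("X", "technology"),
    ("Y", "thriller"),
]


def genre_vectors(log):
    """Build one sparse vector per user: genre -> number of videos watched."""
    vectors = {}
    for user, genre in log:
        vectors.setdefault(user, Counter())[genre] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse genre vectors (0.0 if either is empty)."""
    dot = sum(count * b[genre] for genre, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = genre_vectors(LOG)
print(cosine(vectors["X"], vectors["Z"]))  # X and Z share comedy
print(cosine(vectors["X"], vectors["Y"]))  # X and Y share no genre
```

Comparing every pair this way grows quadratically with the number of users, which is one reason the slide asks for groups rather than pairwise distances.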
  • 29. How?
  • 30. Cluster projections
      §  1 month
      §  6GB
      §  700k Users
      §  114 genres
      §  7 nodes
      §  5 hours
      §  27 clusters
  • 31. Game Constellations
      §  Processing Shockwave logs
  • 32. Lessons learned
      Need:
      §  Fine grain access
      §  Incremental updates
      §  Deal with changes in the original dataset
      §  Real-time data serving
  • 33.
  • 34. http://hbase.apache.org
      §  Sparse, distributed, persistent multidimensional sorted map
      §  Column oriented store
      §  Autosharding
      §  Data locality
  • 35. Data Model
      table: row: family: column: value: version
      domain.com/x.swf
          swf:     swf:size = 1876 bytes | 1876 bytes
                   swf:fps = 30
                   swf:avm = 3
          html:    embed = dynamic
          status:  last_crawl = 2010/11/26 | last_crawl = 2010/11/25
      domain.com/y.swf
      domain.com/z.swf
  • 36. API
      §  Get
      §  Put
      §  Delete
      §  Scan
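The data model and the four API verbs above can be imitated with a small in-memory structure. The sketch below is not the HBase client API; `TinyTable` and its method signatures are hypothetical stand-ins that only illustrate the idea of a sorted map from row key to column family, column and versioned value.

```python
import bisect


class TinyTable:
    """In-memory stand-in for an HBase-style table (not the real client API)."""

    def __init__(self):
        self._rows = {}         # row key -> {(family, column): [(version, value), ...]}
        self._sorted_keys = []  # row keys kept sorted so a scan is a range read

    def put(self, row, family, column, value, version):
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        cells = self._rows[row].setdefault((family, column), [])
        cells.append((version, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest version first

    def get(self, row, family, column):
        """Return the newest value of a cell, or None if it does not exist."""
        cells = self._rows.get(row, {}).get((family, column))
        return cells[0][1] if cells else None

    def delete(self, row):
        if row in self._rows:
            del self._rows[row]
            self._sorted_keys.remove(row)

    def scan(self, start_row, stop_row):
        """Return row keys in [start_row, stop_row) in sorted order."""
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        return self._sorted_keys[lo:hi]


table = TinyTable()
table.put("domain.com/x.swf", "status", "last_crawl", "2010/11/25", version=1)
table.put("domain.com/x.swf", "status", "last_crawl", "2010/11/26", version=2)
table.put("domain.com/y.swf", "swf", "fps", 30, version=1)
table.put("domain.com/z.swf", "swf", "fps", 24, version=1)
print(table.get("domain.com/x.swf", "status", "last_crawl"))
print(table.scan("domain.com/x", "domain.com/z"))
```

Because row keys are kept sorted, all rows for one domain sit next to each other and a scan over a key prefix reads them in one pass. The `fps = 24` value for `z.swf` is made-up sample data.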
  • 37. Flash
      How is flash used
  • 38. How is flash used in the “wild”?
      §  AVM popularity
      §  Frame rates
      §  Video formats
      §  SWF size
      §  Flex data structures
      §  …
  • 39. How
  • 40. How
      max 1000
  • 41. The hard way
      §  Hadoop
      §  Nutch
      §  HBase
  • 42. Workflow
      §  Crawl:
      §  Nutch (seed: top-1m.csv Alexa)
      §  Detect flash embed, javascript
      §  Browse:
      §  Hadoop + FF + FP (chromeless)
      §  Dump stack traces, memory, swf bytes, etc.
      §  Process:
      §  Parse stack traces, rank, etc.
      §  Export:
      §  HBase: swf table
      §  Md5, swf bytecode, memory, load time, etc.
  • 43.
  • 44.
  • 45.
  • 46. Benefits
      §  Security fixes
      §  Optimization
      §  Prioritize based on real usage
      §  Testing
  • 47. SaasBase – HBase++ as a service
      §  Data storage (HBase + HDFS)
      §  Domains, tables
      §  API: create, put, get, scan
      §  Analytics (HBase + Hadoop + query engine)
      §  Reports, dimensions, metrics
      §  API: query
  • 48. photoshop.com
      Image analytics
  • 49.
  • 50. photoshop.com
      §  1B assets (images, videos, other)
      §  120M with EXIF metadata
      §  1.5 petabytes
      §  Home grown distributed storage
  • 51. Intelligence
      §  Targeting users:
      §  Professionals or Amateurs?
      §  Where are pictures taken?
      §  Targeting partners:
      §  Popular cameras
      §  Tracking campaigns
      §  New accounts
  • 52.
  • 53.
  • 54.
  • 55.
  • 56. Stats
      §  7 Machines (16 cores, 24 x 10K RPM SATA, 32GB RAM, 1Gbps)
      §  Map 700M records
      §  2hrs, 41mins
      §  Map output: 1.9B records (~80GB)
  • 57. Lessons
      §  SUM, COUNT, AVG, MIN, MAX, GROUP BY, HAVING, etc.
      §  Rollup, drilldown, segmentation
      It’s all about Dimensions & Metrics
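The "dimensions and metrics" lesson above can be sketched as a single grouping function: dimensions are the fields you group by, metrics are the numbers you aggregate. The `rollup` helper and the sample records are hypothetical and are not taken from the EXIF data described in the talk.

```python
from collections import defaultdict


def rollup(records, dimensions, metric):
    """GROUP BY `dimensions` and compute COUNT, SUM, MIN, MAX, AVG of `metric`."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        groups[key].append(record[metric])
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


# Made-up sample records: one per image, with two dimensions and one metric.
records = [
    {"camera": "A", "country": "RO", "megapixels": 10},
    {"camera": "A", "country": "US", "megapixels": 12},
    {"camera": "B", "country": "RO", "megapixels": 8},
]

by_camera = rollup(records, ["camera"], "megapixels")                         # rollup
by_camera_country = rollup(records, ["camera", "country"], "megapixels")      # drilldown
print(by_camera[("A",)])
print(by_camera_country[("A", "RO")])
```

Rolling up is grouping by fewer dimensions and drilling down is grouping by more; segmentation would be a filter on `records` before the call.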
  • 58. Recap
      §  Hadoop + Mahout + PIG (User clusters)
      §  HBase + Hadoop + Nutch + MySQL (Flash analytics)
      §  HBase + Hadoop (EXIF Explorer, image analytics)
  • 59. Business Catalyst Analytics
  • 60. BC
      §  End to end platform for online businesses
      §  E-commerce, Blogging, CRM, email marketing
      §  Analytics: web traffic, affiliates, sales, etc.
  • 61.
  • 62.
  • 63. Successtrophe
      §  Analytics is troublesome
      §  SQL database was slow for analytics
      §  Over 50 different reports
      §  Over 100,000 websites
      §  Billions of page views
  • 64. Requirements
      §  Fast incremental processing
      §  Custom reporting
      §  Filtering, segmentation, rollups, drilldowns
      §  Variable time ranges
      §  Fast
  • 65. Solution
      §  Continuous processing (every 10 minutes)
      §  Reports definition: dimensions, metrics
      §  Real-time queries: directly from HBase
  • 66. Workflow
      §  Import Logs -> HBase
      §  Incrementally process/index last 24 hours
      §  Serve from HBase
      §  Index scans
      §  Runtime aggregation
  • 67. Stats
      §  1 datacenter, 10 months = 1 hour, 24 minutes
      §  > 3 Billion report items generated
  • 68. Lessons
      §  UNIQUE is harder
      §  E.g.: Unique visitors, Visitor loyalty
      §  Space vs. time
      §  Sorting magic
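One way to read "UNIQUE is harder" and "Space vs. time" above: counts and sums from separate days can simply be added together, but unique visitors cannot, because the same visitor may appear on several days. The sketch below uses made-up visitor IDs and a hypothetical `unique_visitors` helper to show the difference.

```python
def unique_visitors(daily_visitors):
    """Return (unique visitors per day, unique visitors over the whole range)."""
    per_day = {day: len(set(ids)) for day, ids in daily_visitors.items()}
    # The overall figure needs the raw IDs (space) or a re-scan of the logs (time);
    # it cannot be derived from the per-day counts alone.
    overall = len(set().union(*daily_visitors.values()))
    return per_day, overall


# Made-up page-view log: visitor IDs seen on each day.
daily = {
    "mon": ["x", "y", "x"],
    "tue": ["y", "z"],
}

per_day, overall = unique_visitors(daily)
print(per_day, overall, sum(per_day.values()))
```

Here the per-day uniques add up to 4 while the true two-day figure is 3, so a report over a variable time range has to either store the ID sets or recompute them at query time.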
  • 69. Not just web analytics
      X Analytics
      §  Feed in any file format (w3c, apache, tsv, etc.)
      §  Tag the dimensions and metrics
      §  Process (incremental)
      §  Query in real-time
  • 70. Nothing but the hstack
      §  structured data storage: HBase
      §  file storage: HDFS
      §  data processing: Hadoop
  • 71. Conclusions
      §  Keep data
      §  Understand data
      §  Explore data
      §  Extract meaning
  • 72. http://hstack.org
      http://hbase.apache.org
      http://hadoop.apache.org
      http://mahout.apache.org
      http://nutch.apache.org