WibiData: Analyzing Large-Scale
                User Data with Hadoop and HBase

                 Aaron Kimball – CTO




                                             Odiago, Inc.
Developed By:
WibiData is…
      • A large-scale storage, serving, and analysis
        platform
      • For user- or other entity-centric data




WibiData use cases
      • Product/content recommendations
                – “Because you liked book X, you may like book Y”
      • Ad targeting
                – “Because of your interest in sports, check out…”
      • Social network analysis
                – “You may know these people…”
      • Fraud detection, anti-spam, search
        personalization…

Use case characteristics
      • Have a large number of users
      • Want to store (large) transaction data as well
        as derived data (e.g., recommendations)
      • Need to serve recommendations interactively
      • Require a combination of offline and
        on-the-fly computation



A typical workflow




Challenges
      • Support real-time retrieval of profile data
      • Store a long transactional data history
      • Keep related data logically and physically close
      • Update data in a timely fashion without
        wasting computation
      • Fault tolerance
      • Data schema changes over time


WibiData architecture




                          Certified Technology product
HBase data model
      • Data in cells, addressed by four “coordinates”
                – Row Id (primary key)
                – Column family
                – Column “qualifier”
                – Timestamp
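
The four coordinates can be pictured as a composite key. The sketch below is a toy in-memory stand-in, not the HBase client API; the class name `ToyTable` and the sample row, family, and qualifier names are assumptions made for illustration.

```python
# Toy in-memory stand-in for an HBase table (NOT the HBase client API).
# Every cell is addressed by (row_id, family, qualifier, timestamp).
class ToyTable:
    def __init__(self):
        self._cells = {}

    def put(self, row_id, family, qualifier, timestamp, value):
        self._cells[(row_id, family, qualifier, timestamp)] = value

    def get(self, row_id, family, qualifier, timestamp=None):
        """Return the value at an exact timestamp, or the newest version."""
        if timestamp is not None:
            return self._cells.get((row_id, family, qualifier, timestamp))
        versions = [
            (ts, value)
            for (r, f, q, ts), value in self._cells.items()
            if (r, f, q) == (row_id, family, qualifier)
        ]
        if not versions:
            return None
        # Pick the version with the largest timestamp.
        return max(versions, key=lambda pair: pair[0])[1]


# Hypothetical sample data: two versions of the same cell.
table = ToyTable()
table.put("user42", "info", "email", 100, "old@example.com")
table.put("user42", "info", "email", 200, "new@example.com")
latest = table.get("user42", "info", "email")
older = table.get("user42", "info", "email", timestamp=100)
```

Leaving out the timestamp returns the newest version, which mirrors the common case of reading the current value of a cell.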




Schema free: not what you want
      • HBase may not impose a schema, but your
        data still has one
      • Up to the application to determine how to
        organize & interpret data
      • You still need to pick a serialization system




Schemas = trade-offs
      • Different schemas enable efficient
        storage/retrieval/analysis of different types of
        data
      • Physical organization of data still makes a big
        difference
                – Especially with respect to read/write patterns




WibiData workloads
      • Large number of fat rows (one per user)
      • Each row updated relatively few times/day
                – Though updates may involve large records
      • Raw data written to one set of columns
      • Processed results read from another
                – Often with an interactive latency requirement
      • Needs to support complex data types


Serialization with Avro
      • Apache Avro provides flexible serialization
      • All data written along with its “writer schema”
      • Reader schema may differ from the writer’s

      {
                "type": "record",
                "name": "LongList",
                "fields" : [
                  {"name": "value", "type": "long"},
                  {"name": "next", "type": ["LongList", "null"]}
                ]
      }


Serialization with Avro
      • No code generation required
      • Producers and consumers of data can migrate
        independently
      • Data format migrations do not require
        structural changes to underlying data
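
A rough way to see why readers and writers can migrate independently is to mimic two of Avro's schema-resolution rules by hand. The sketch below does not use the Avro library; the `User` schemas, the `resolve` helper, and the sample record are hypothetical, and only flat records are handled.

```python
import json

# Toy illustration of Avro-style schema resolution (NOT the Avro library).
# It mimics two resolution rules for flat records:
#   1. fields the reader does not declare are dropped
#   2. fields the writer did not write take the reader's default
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "email", "type": "string"},
  {"name": "age", "type": "int"}
]}
""")

READER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "email", "type": "string"},
  {"name": "country", "type": "string", "default": "unknown"}
]}
""")


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    written = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# Data written by an old producer, read by a newer consumer.
stored = {"email": "a@example.com", "age": 30}
seen_by_reader = resolve(stored, WRITER_SCHEMA, READER_SCHEMA)
```

The stored bytes never change here; only the reader's view of them does, which is the point of the last bullet above.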




WibiData: An extended data model



                <column>
                  <name>email</name>
                  <description>Email address</description>
                  <schema>"string"</schema>
                </column>



      • Columns or whole families have common Avro
        schemas for evolvable storage and retrieval

WibiData: An extended data model




      • Column families are a logical concept
      • Data is physically arranged in locality groups
      • Row ids are hashed for uniform write pressure
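
The slides do not say how the hashing is done. One common approach, assumed here, is to prefix each row id with a few bytes of its hash so that sequential ids no longer sort next to each other. The function name `hashed_row_key`, the choice of MD5, and the two-byte prefix are all illustrative assumptions.

```python
import hashlib
from collections import Counter


def hashed_row_key(user_id: str, prefix_len: int = 2) -> bytes:
    """Prefix the row id with the first bytes of its MD5 digest.

    Assumption: the original id is kept after the prefix so that it can
    still be recovered from the key.
    """
    raw = user_id.encode("utf-8")
    prefix = hashlib.md5(raw).digest()[:prefix_len]
    return prefix + raw


# Sequential ids would otherwise all land in the same key range.
keys = [hashed_row_key(f"user{i:06d}") for i in range(1000)]

# Count how many distinct leading bytes the keys start with.
buckets = Counter(key[0] for key in keys)
```

With unhashed ids every key would begin with the same byte (`u`); with the hash prefix the leading bytes are spread over most of the 256 possible values, so writes are spread over many key ranges.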


WibiData: An extended data model




      • Wibi uses 3-d storage
      • Data is often sorted by timestamp
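
Treating time as the third dimension means a single cell holds a series of versions rather than one value. A minimal sketch of such a cell, assuming versions are kept sorted by timestamp; the class `VersionedCell` and the sample events are hypothetical.

```python
import bisect


class VersionedCell:
    """Toy cell that keeps every version, sorted by ascending timestamp."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def put(self, timestamp, value):
        i = bisect.bisect_left(self._timestamps, timestamp)
        if i < len(self._timestamps) and self._timestamps[i] == timestamp:
            self._values[i] = value  # same timestamp: overwrite that version
        else:
            self._timestamps.insert(i, timestamp)
            self._values.insert(i, value)

    def between(self, start, end):
        """Versions with start <= timestamp < end, newest first."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        window = list(zip(self._timestamps[lo:hi], self._values[lo:hi]))
        return window[::-1]


# Writes may arrive out of order; reads still come back sorted.
clicks = VersionedCell()
clicks.put(300, "ad-c")
clicks.put(100, "ad-a")
clicks.put(200, "ad-b")
recent = clicks.between(100, 300)
```

Because the versions are sorted, a time-range read is two binary searches and a slice rather than a scan of the whole history.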



Analyzing data: Producers




      • Producers create derived column values
      • Produce operator works on one row at a time
                – Can be run in MapReduce, or on a one-off basis
      • Produce is a row mutation operator
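
The producer idea can be sketched as a function that reads some columns of one row and writes a derived column back to the same row. This is not the WibiData API; the column names `raw:clicks` and `derived:top_category`, and both function names, are assumptions for illustration.

```python
def produce_top_category(row):
    """Derive a column from raw data in the same row (one row at a time)."""
    clicks = row.get("raw:clicks", {})  # hypothetical column: category -> count
    if clicks:
        row["derived:top_category"] = max(clicks, key=clicks.get)
    return row


def run_producer(producer, table):
    """Apply a producer to every row; each call sees only its own row."""
    for row_id in table:
        table[row_id] = producer(table[row_id])


# Hypothetical table: row id -> {column name: value}.
users = {
    "u1": {"raw:clicks": {"sports": 3, "books": 1}},
    "u2": {"raw:clicks": {}},
}
run_producer(produce_top_category, users)
```

Because a producer never looks at other rows, the same function could be called for a single row on demand or mapped over the whole table in a batch job.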

Analyzing data: Gatherers




   • Gatherers aggregate data across all rows
   • Always run within MapReduce
   • A bridge between rows and (key, value) pairs
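
A gatherer can be pictured as the map side of a MapReduce job: it turns each row into zero or more (key, value) pairs, which are then grouped and reduced. The sketch below runs in a single process and is not the WibiData or Hadoop API; `gather_country`, `run_gatherer`, and the `info:country` column are hypothetical.

```python
from collections import defaultdict


def gather_country(row_id, row):
    """Emit one (key, value) pair per row: the user's country and a count."""
    yield row.get("info:country", "unknown"), 1


def run_gatherer(gatherer, reducer, table):
    """Single-process stand-in for a MapReduce run over all rows."""
    groups = defaultdict(list)
    for row_id, row in table.items():  # map phase: rows -> (key, value)
        for key, value in gatherer(row_id, row):
            groups[key].append(value)
    # reduce phase: one reducer call per key
    return {key: reducer(values) for key, values in groups.items()}


# Hypothetical table: row id -> {column name: value}.
profiles = {
    "u1": {"info:country": "US"},
    "u2": {"info:country": "FR"},
    "u3": {"info:country": "US"},
    "u4": {},
}
users_per_country = run_gatherer(gather_country, sum, profiles)
```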


Interactive access: REST API
                PUT request                GET request




      • REST API provides interactive access
      • Producers can be triggered “on demand” to
        create fresh recommendations


Example: Ad Targeting





                       Producer

Gathering Category Associations
      • Gather observed behavior
      • Associate interests, clicks
      • … for all pairs
      • And aggregate across all users
                Map phase:        Reduce phase:
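
The category-association steps can be sketched as a map phase that emits every (interest, clicked category) pair for a user and a reduce phase that sums the pairs across users. The row layout (`interests` and `clicks` lists) and the function names are assumptions; the original diagrams are not recoverable from the text.

```python
from collections import Counter
from itertools import product


def gather_associations(row):
    """Map phase: emit ((interest, clicked_category), 1) for all pairs."""
    for interest, clicked in product(row["interests"], row["clicks"]):
        yield (interest, clicked), 1


def aggregate(rows):
    """Reduce phase: sum the counts for each pair across all users."""
    counts = Counter()
    for row in rows:
        for pair, n in gather_associations(row):
            counts[pair] += n
    return counts


# Hypothetical per-user rows.
rows = [
    {"interests": ["sports"], "clicks": ["shoes", "tickets"]},
    {"interests": ["sports", "music"], "clicks": ["tickets"]},
]
associations = aggregate(rows)
```

The resulting counts say how often an interest co-occurs with a clicked ad category, which is the kind of signal an ad-targeting producer could later read back per user.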




Conclusions
      • Hadoop, HBase, Avro form the core of a
        large-scale machine learning/analysis platform
      • How you set up your schema matters
      • Producer/gatherer programming model allows
        computations over tables to be expressed
        naturally; works with MapReduce




www.wibidata.com / @wibidata
                    Aaron Kimball – aaron@odiago.com



