SlideShare uma empresa Scribd logo
1 de 15
Baixar para ler offline
Hadoop @ eBuddy
eBuddy
Web based chat (Started in 2003)
● Initially no statistics, msn only
● Started basic logging in 2004
● Today
  ○ 34.467.010.693 login records (34x109)
  ○ It takes about 40min to select them all.
XMS (Launched May 23, 2011)
● Today
   ○ 1.334.794.121 records (1,3x109)
Website (google analytics)
Banners (openx)
Warehousing needs
● Product owners
  ○ Comparing product version
     ■ avg duration
     ■ msg sent/received
  ○ Churn analysis
  ○ Feature analysis
● Marketing
  ○ What countries should we focus on
  ○ What people should we target?
● Sales
  ○ Sell banners in countries/products.
● Operations/Dev
  ○ Help solve bugs
  ○ Blocked in countries/providers
Interesting to know
● Developers are Java centric
● Hosting in the US but BI people in Amsterdam
● 18 hadoop nodes each having
    ○ 16 cores
    ○ 24G ram
    ○ 4x400G HD's
●   We make money with banners
    ○ So don't expect deep pockets
Warehouse timeline
● Traditional rdbms (2004)
● Custom mapreduce code (2008)
  ○ Joining two files (merge join/map join?)
  ○ Repeating code
  ○ Consider abstraction
  ○ Changing data changing code?
● Pig scripts (2008/2009)
  ○ Much simpler to read but domain specific
● Hive (2009)
  ○ Generic sql but with some limitations
  ○ Existing tools can be used
Hive
● Hey I already know this:
select *
from table1 t1
  left outer join table2 t2 on (t1.id = t2.id)
where t2.id is null;


● Java programmers will like this:
  ○ Spring JdbcTemplates
  ○ Existing jdbc tools (SQuirreL)
  ○ Syntax highlighting
  ○ Code completion
Present
● App servers log to mysql
  ○ Brittle but it works
● Hive
  ○ Sql (most developers know this)
  ○ Partition pruning issues
  ○ No rollup queries
● ETL
  ○ Star schema
  ○ Fair scheduling (ETL vs BI)
     ■ reserved for etl pool
     ■ don't start reducers until 90% mappers done
  ○ Lzo on all jobs
● MicroStrategy (odbc)
● SQuirreL (jdbc)
Future
● Look at users from a to z
  ○ website logs
  ○ banners
● Cassandra handler for hive
  ○ Looking at contact lists (not just size)
● Streaming ETL
  ○ flume
      ■ No more mysql & scripts
      ■ Directly write into the correct partition
  ○ avro
      ■ Less schema related problems
  ○ snappy
      ■ Lightweight compression
Questions?
Hive partition pruning
● Won't work
select count(*)
from chatsessions cs
  inner join calendar c on (c.cldr_id = cs.login_cldr_id)
where c.iso_date = '2012-06-14';


● Will work
select cldr_id from calendar where iso_date = '2012-06-14';
select count(*) from chatsessions where login_cldr_id in (1234);
Left outer join in Pig
A = LOAD 'file1' USING PigStorage(',') AS (a1:int,a2:chararray);
B = LOAD 'file2' USING PigStorage(',') AS (b1:int,b2:chararray);
C = COGROUP A BY a1, B BY b1 OUTER;
X = FILTER C BY IsEmpty(B);
Z = FOREACH X GENERATE flatten(A.a2);
DUMP Z;
● avro & hive: https://issues.apache.org/jira/browse/HIVE-
  895

● flume:
   https://cwiki.apache.org/FLUME/

Mais conteúdo relacionado

Mais procurados

Atmosphere 2018: Wojciech Krysmann- INFRA AS CODE - TERRAFORM DEEP DIVE AND B...
Atmosphere 2018: Wojciech Krysmann- INFRA AS CODE - TERRAFORM DEEP DIVE AND B...Atmosphere 2018: Wojciech Krysmann- INFRA AS CODE - TERRAFORM DEEP DIVE AND B...
Atmosphere 2018: Wojciech Krysmann- INFRA AS CODE - TERRAFORM DEEP DIVE AND B...PROIDEA
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей РыбакOntico
 
Talend connect BE Vincent Harcq - Talend ESB - DI
Talend connect BE Vincent Harcq - Talend  ESB - DITalend connect BE Vincent Harcq - Talend  ESB - DI
Talend connect BE Vincent Harcq - Talend ESB - DIVincent Harcq
 
Neo4j Spatial at LocationDay 2013 in Malmö
Neo4j Spatial at LocationDay 2013 in MalmöNeo4j Spatial at LocationDay 2013 in Malmö
Neo4j Spatial at LocationDay 2013 in MalmöCraig Taverner
 
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)Alexey Rybak
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & MarquezJulien Le Dem
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big dataLars Albertsson
 
Intro To Graph Databases - Oxana Goriuc
Intro To Graph Databases - Oxana GoriucIntro To Graph Databases - Oxana Goriuc
Intro To Graph Databases - Oxana GoriucFraugster
 
FastReport VCL6 Nuremberg 2018
FastReport VCL6 Nuremberg 2018FastReport VCL6 Nuremberg 2018
FastReport VCL6 Nuremberg 2018Fast Reports
 
Distributed Logging System Using Elasticsearch Logstash,Beat,Kibana Stack and...
Distributed Logging System Using Elasticsearch Logstash,Beat,Kibana Stack and...Distributed Logging System Using Elasticsearch Logstash,Beat,Kibana Stack and...
Distributed Logging System Using Elasticsearch Logstash,Beat,Kibana Stack and...Sanjog Kumar Dash
 
Distributed unique id generation
Distributed unique id generationDistributed unique id generation
Distributed unique id generationTung Nguyen
 
Challenges in knowledge graph visualization
Challenges in knowledge graph visualizationChallenges in knowledge graph visualization
Challenges in knowledge graph visualizationGraphAware
 
ConvNetJS & CaffeJS
ConvNetJS & CaffeJSConvNetJS & CaffeJS
ConvNetJS & CaffeJSAnyline
 
Cypher for Apache Spark
Cypher for Apache SparkCypher for Apache Spark
Cypher for Apache SparkopenCypher
 
Customer segmentation scbcn17
Customer segmentation scbcn17Customer segmentation scbcn17
Customer segmentation scbcn17Julio Martinez
 
Efficient analysis of large scale digital circuits and parasitic informations
Efficient analysis of large scale digital circuits and parasitic informationsEfficient analysis of large scale digital circuits and parasitic informations
Efficient analysis of large scale digital circuits and parasitic informationsDimitris Akridas
 
Introduction to GraphX | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to GraphX | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to GraphX | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to GraphX | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
The immutable database datomic
The immutable database   datomicThe immutable database   datomic
The immutable database datomicLaurence Chen
 

Mais procurados (20)

Atmosphere 2018: Wojciech Krysmann- INFRA AS CODE - TERRAFORM DEEP DIVE AND B...
Atmosphere 2018: Wojciech Krysmann- INFRA AS CODE - TERRAFORM DEEP DIVE AND B...Atmosphere 2018: Wojciech Krysmann- INFRA AS CODE - TERRAFORM DEEP DIVE AND B...
Atmosphere 2018: Wojciech Krysmann- INFRA AS CODE - TERRAFORM DEEP DIVE AND B...
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
 
Talend connect BE Vincent Harcq - Talend ESB - DI
Talend connect BE Vincent Harcq - Talend  ESB - DITalend connect BE Vincent Harcq - Talend  ESB - DI
Talend connect BE Vincent Harcq - Talend ESB - DI
 
Neo4j Spatial at LocationDay 2013 in Malmö
Neo4j Spatial at LocationDay 2013 in MalmöNeo4j Spatial at LocationDay 2013 in Malmö
Neo4j Spatial at LocationDay 2013 in Malmö
 
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Ad Placement Challenge
Ad Placement ChallengeAd Placement Challenge
Ad Placement Challenge
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
 
Intro To Graph Databases - Oxana Goriuc
Intro To Graph Databases - Oxana GoriucIntro To Graph Databases - Oxana Goriuc
Intro To Graph Databases - Oxana Goriuc
 
FastReport VCL6 Nuremberg 2018
FastReport VCL6 Nuremberg 2018FastReport VCL6 Nuremberg 2018
FastReport VCL6 Nuremberg 2018
 
Distributed Logging System Using Elasticsearch Logstash,Beat,Kibana Stack and...
Distributed Logging System Using Elasticsearch Logstash,Beat,Kibana Stack and...Distributed Logging System Using Elasticsearch Logstash,Beat,Kibana Stack and...
Distributed Logging System Using Elasticsearch Logstash,Beat,Kibana Stack and...
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Distributed unique id generation
Distributed unique id generationDistributed unique id generation
Distributed unique id generation
 
Challenges in knowledge graph visualization
Challenges in knowledge graph visualizationChallenges in knowledge graph visualization
Challenges in knowledge graph visualization
 
ConvNetJS & CaffeJS
ConvNetJS & CaffeJSConvNetJS & CaffeJS
ConvNetJS & CaffeJS
 
Cypher for Apache Spark
Cypher for Apache SparkCypher for Apache Spark
Cypher for Apache Spark
 
Customer segmentation scbcn17
Customer segmentation scbcn17Customer segmentation scbcn17
Customer segmentation scbcn17
 
Efficient analysis of large scale digital circuits and parasitic informations
Efficient analysis of large scale digital circuits and parasitic informationsEfficient analysis of large scale digital circuits and parasitic informations
Efficient analysis of large scale digital circuits and parasitic informations
 
Introduction to GraphX | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to GraphX | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to GraphX | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to GraphX | Big Data Hadoop Spark Tutorial | CloudxLab
 
The immutable database datomic
The immutable database   datomicThe immutable database   datomic
The immutable database datomic
 

Destaque

Extending WordPress. Making use of Custom Post Types
Extending WordPress. Making use of Custom Post TypesExtending WordPress. Making use of Custom Post Types
Extending WordPress. Making use of Custom Post TypesUtsav Singh Rathour
 
When to use WordPress MultiSite WordCamp Nepal 2012
When to use WordPress MultiSite WordCamp Nepal 2012When to use WordPress MultiSite WordCamp Nepal 2012
When to use WordPress MultiSite WordCamp Nepal 2012Utsav Singh Rathour
 
Must see & experience while in australia
Must see & experience while in australiaMust see & experience while in australia
Must see & experience while in australiaMaiju Heinonen
 
Claro luna partitura
Claro luna partituraClaro luna partitura
Claro luna partituraNa Re
 
Nr16 atividades e operações perigosas
Nr16 atividades e operações perigosasNr16 atividades e operações perigosas
Nr16 atividades e operações perigosasCarlos Colombo
 
Ttg on twitter (1)
Ttg on twitter (1)Ttg on twitter (1)
Ttg on twitter (1)drpdwilkins
 
wine and grape with france regions.......
wine and grape with france regions.......wine and grape with france regions.......
wine and grape with france regions.......vikas dobhal
 
WordCamps and how you can make the most of it
WordCamps and how you can make the most of itWordCamps and how you can make the most of it
WordCamps and how you can make the most of itUtsav Singh Rathour
 
What are child themes, and why use them
What are child themes, and why use themWhat are child themes, and why use them
What are child themes, and why use themUtsav Singh Rathour
 

Destaque (19)

La familia
La familiaLa familia
La familia
 
Hive jdbc
Hive jdbcHive jdbc
Hive jdbc
 
Power profesiones
Power profesionesPower profesiones
Power profesiones
 
Extending WordPress. Making use of Custom Post Types
Extending WordPress. Making use of Custom Post TypesExtending WordPress. Making use of Custom Post Types
Extending WordPress. Making use of Custom Post Types
 
Alimentos saludable
Alimentos saludableAlimentos saludable
Alimentos saludable
 
Introducao blue solar
Introducao blue solarIntroducao blue solar
Introducao blue solar
 
Working with WordPress themes
Working with WordPress themesWorking with WordPress themes
Working with WordPress themes
 
When to use WordPress MultiSite WordCamp Nepal 2012
When to use WordPress MultiSite WordCamp Nepal 2012When to use WordPress MultiSite WordCamp Nepal 2012
When to use WordPress MultiSite WordCamp Nepal 2012
 
Must see & experience while in australia
Must see & experience while in australiaMust see & experience while in australia
Must see & experience while in australia
 
Claro luna partitura
Claro luna partituraClaro luna partitura
Claro luna partitura
 
Nr16 atividades e operações perigosas
Nr16 atividades e operações perigosasNr16 atividades e operações perigosas
Nr16 atividades e operações perigosas
 
Ttg on twitter (1)
Ttg on twitter (1)Ttg on twitter (1)
Ttg on twitter (1)
 
Power profesiones
Power profesionesPower profesiones
Power profesiones
 
Power profesiones
Power profesionesPower profesiones
Power profesiones
 
wine and grape with france regions.......
wine and grape with france regions.......wine and grape with france regions.......
wine and grape with france regions.......
 
WordCamps and how you can make the most of it
WordCamps and how you can make the most of itWordCamps and how you can make the most of it
WordCamps and how you can make the most of it
 
Plan anual 2015 cc ee noveno
Plan anual 2015 cc ee novenoPlan anual 2015 cc ee noveno
Plan anual 2015 cc ee noveno
 
What are child themes, and why use them
What are child themes, and why use themWhat are child themes, and why use them
What are child themes, and why use them
 
Branding strategy
Branding strategyBranding strategy
Branding strategy
 

Semelhante a Hadoop @ eBuddy

TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseHakan Ilter
 
Dart the better Javascript 2015
Dart the better Javascript 2015Dart the better Javascript 2015
Dart the better Javascript 2015Jorg Janke
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodbDeep Kapadia
 
Austin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_dataAustin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_dataAlex Pinkin
 
Dfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopDfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopTamas K Lengyel
 
BlaBlaCar Elastic Search Feedback
BlaBlaCar Elastic Search FeedbackBlaBlaCar Elastic Search Feedback
BlaBlaCar Elastic Search Feedbacksinfomicien
 
Devoxx : being productive with JHipster
Devoxx : being productive with JHipsterDevoxx : being productive with JHipster
Devoxx : being productive with JHipsterJulien Dubois
 
Scaling up and accelerating Drupal 8 with NoSQL
Scaling up and accelerating Drupal 8 with NoSQLScaling up and accelerating Drupal 8 with NoSQL
Scaling up and accelerating Drupal 8 with NoSQLOSInet
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series databasefelixbarny
 
Kibana+ElasticSearch+LogStash to handle Log messages on Prod servers
Kibana+ElasticSearch+LogStash to handle Log messages on Prod serversKibana+ElasticSearch+LogStash to handle Log messages on Prod servers
Kibana+ElasticSearch+LogStash to handle Log messages on Prod serversHYS Enterprise
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
How to build TiDB
How to build TiDBHow to build TiDB
How to build TiDBPingCAP
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseGruter
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseJihoon Son
 

Semelhante a Hadoop @ eBuddy (20)

TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
Dart the better Javascript 2015
Dart the better Javascript 2015Dart the better Javascript 2015
Dart the better Javascript 2015
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 
Mongodb meetup
Mongodb meetupMongodb meetup
Mongodb meetup
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Austin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_dataAustin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_data
 
Dfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopDfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshop
 
BlaBlaCar Elastic Search Feedback
BlaBlaCar Elastic Search FeedbackBlaBlaCar Elastic Search Feedback
BlaBlaCar Elastic Search Feedback
 
Devoxx : being productive with JHipster
Devoxx : being productive with JHipsterDevoxx : being productive with JHipster
Devoxx : being productive with JHipster
 
Scaling up and accelerating Drupal 8 with NoSQL
Scaling up and accelerating Drupal 8 with NoSQLScaling up and accelerating Drupal 8 with NoSQL
Scaling up and accelerating Drupal 8 with NoSQL
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Scaling xtext
Scaling xtextScaling xtext
Scaling xtext
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
Kibana+ElasticSearch+LogStash to handle Log messages on Prod servers
Kibana+ElasticSearch+LogStash to handle Log messages on Prod serversKibana+ElasticSearch+LogStash to handle Log messages on Prod servers
Kibana+ElasticSearch+LogStash to handle Log messages on Prod servers
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
How to build TiDB
How to build TiDBHow to build TiDB
How to build TiDB
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 

Último

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 

Último (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

Hadoop @ eBuddy

  • 2. eBuddy Web based chat (Started in 2003) ● Initially no statistics, msn only ● Started basic logging in 2004 ● Today ○ 34.467.010.693 login records (34x109) ○ It takes about 40min to select them all. XMS (Launched May 23, 2011) ● Today ○ 1.334.794.121 records (1,3x109) Website (google analytics) Banners (openx)
  • 3. Warehousing needs ● Product owners ○ Comparing product version ■ avg duration ■ msg sent/received ○ Churn analysis ○ Feature analysis ● Marketing ○ What countries should we focus on ○ What people should we target? ● Sales ○ Sell banners in countries/products. ● Operations/Dev ○ Help solve bugs ○ Blocked in countries/providers
  • 4.
  • 5.
  • 6. Interesting to know ● Developers are Java centric ● Hosting in the US but BI people in Amsterdam ● 18 hadoop nodes each having ○ 16 cores ○ 24G ram ○ 4x400G HD's ● We make money with banners ○ So don't expect deep pockets
  • 7. Warehouse timeline ● Traditional rdbms (2004) ● Custom mapreduce code (2008) ○ Joining two files (merge join/map join?) ○ Repeating code ○ Consider abstraction ○ Changing data changing code? ● Pig scripts (2008/2009) ○ Much simpler to read but domain specific ● Hive (2009) ○ Generic sql but with some limitations ○ Existing tools can be used
  • 8. Hive ● Hey I already know this: select * from table1 t1 left outer join table2 t2 on (t1.id = t2.id) where t2.id is null; ● Java programmers will like this: ○ Spring JdbcTemplates ○ Existing jdbc tools (SQuirreL) ○ Syntax highlighting ○ Code completion
  • 9. Present ● App servers log to mysql ○ Brittle but it works ● Hive ○ Sql (most developers know this) ○ Partition pruning issues ○ No rollup queries ● ETL ○ Star schema ○ Fair scheduling (ETL vs BI) ■ reserved for etl pool ■ don't start reducers until 90% mappers done ○ Lzo on all jobs ● MicroStrategy (odbc) ● SQuirreL (jdbc)
  • 10. Future ● Look at users from a to z ○ website logs ○ banners ● Cassandra handler for hive ○ Looking at contact lists (not just size) ● Streaming ETL ○ flume ■ No more mysql & scripts ■ Directly write into the correct partition ○ avro ■ Less schema related problems ○ snappy ■ Lightweight compression
  • 12. Hive partition pruning ● Won't work select count(*) from chatsessions cs inner join calendar c on (c.cldr_id = cs.login_cldr_id) where c.iso_date = '2012-06-14'; ● Will work select cldr_id from calendar where iso_date = '2012-06-14'; select count(*) from chatsessions where login_cldr_id in (1234);
  • 13.
  • 14. Left outer join in Pig A = LOAD 'file1' USING PigStorage(',') AS (a1:int,a2:chararray); B = LOAD 'file2' USING PigStorage(',') AS (b1:int,b2:chararray); C = COGROUP A BY a1, B BY b1 OUTER; X = FILTER C BY IsEmpty(B); Z = FOREACH X GENERATE flatten(A.a2); DUMP Z;
  • 15. ● avro & hive: https://issues.apache.org/jira/browse/HIVE- 895 ● flume: https://cwiki.apache.org/FLUME/