Who is talking?
• Tomáš Červenka


• Quick Bio:
   • Slovakia
   • Cambridge CompSci + Management
   • Google – AdSense for TV
   • VisualDNA – Software Engineer -> … -> CTO


• @tomascervenka
                                                 2
What is this talk about?
• What is Hive?
   • What is it useful for?
   • What is it not useful for?
• Where to start?
   • Amazon EMR + S3
   • Simple example
• How do we use Hive at VisualDNA?
   • What is VisualDNA, anyway?
   • Use cases: reporting, analytics, ML
   • Tips and tricks
• Q&A
                                           3
What is Hive?
• Data warehousing solution built on top of Hadoop
• Input format agnostic: can read CSV, JSON, Thrift, SequenceFile…
• Initially developed at Facebook, later open-sourced as an Apache project (it started as a Hadoop subproject)


• In simple terms, it gives you a SQL-like interface to query Big Data.
   • HiveQL together with custom mappers and reducers gives you
     enough flexibility to write most data processing back-ends.


• Hive compiles your HiveQL queries to a set of MapReduce jobs and
 uses Hadoop to deliver the results of the query.
                                                                      4
Why is HiveQL important?




                           http://howfuckedismydatabase.com/nosql/
                                                                 5
What is Hive useful for?
• Big Data analytics
   • Running queries over large semi-structured datasets
   • Makes filtering, aggregation, joins etc. very easy
   • Hides the complexity of MapReduce (MR) => can be used by non-developers


• Big Data processing
   • Efficient and effective way to write data pipelines
   • Easy way to parallelise computationally complex queries
   • Scales nicely with amount of data and cluster size
                                                               6
What is Hive not useful for?
• Real time analytics or processing
   • Even small queries can take tens of seconds or minutes
   • Can’t build Hive (or Hadoop for that matter) into a real-time flow


• Algorithms which are difficult to parallelise
   • Almost everything can be expressed as a number of MR steps
   • But for such algorithms MR is almost always sub-optimal
   • If your data is small, R or scripting is often better and faster


• Another downside: Hive (on EMR) tends to be a pain to debug.
                                                                       7
How to start with Hive?
• Build your own Hadoop cluster + install Hive
   • The “right” way to do it, might take some time for multi-node setup


• Spinning up an EMR cluster
   • The quick and cheap way to do it.
   • You need an Amazon AWS account and some data on S3.
   • You need the EMR Ruby command-line client installed and configured locally.
   • You need to spin up an EMR cluster in interactive mode. Voilà.
   $ emr --create --alive --name "MY JOB" --hive-interactive --num-instances 8 \
       --instance-type cc1.4xlarge --hive-versions 0.8.1 \
       --bootstrap-action "s3://my-bucket/emr-bootstrap"
                                                                                   8
Getting Started with Hive
• SSH into your cluster (your namenode)
   • $ emr --ssh j-AHF0QE733K8F

• Run screen (you’ll quickly find out why)
   • $ screen

• Run Hive
   • $ hive

• Welcome to Hive interface!
• Monitor Hive
   • $ elinks http://localhost:9100

• Terminate the cluster when you are done (this also ends Hive)
   • $ emr --terminate j-AHF0QE733K8F
                                             9
Example – CTR by ad by day
add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;

CREATE EXTERNAL TABLE
    events (
         time string,
         action string,
         id string)
PARTITIONED BY
    (d string)
ROW FORMAT SERDE
    'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ('paths'='time,action,id')
LOCATION
    's3://my-bucket/events/';

ALTER TABLE events ADD PARTITION (d='2012-07-09');
ALTER TABLE events ADD PARTITION (d='2012-07-08');
ALTER TABLE events ADD PARTITION (d='2012-07-07');
                                                                     10
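Registering one partition per day by hand gets tedious once data accumulates. A minimal shell sketch, assuming the same `events` table and `d=YYYY-MM-DD` partition naming as in the example above, that writes the statements into a script file (the file name `add_partitions.q` is made up for illustration):

```shell
# Generate one ALTER TABLE statement per date and collect them in a Hive script.
# IF NOT EXISTS makes the script safe to re-run for dates already registered.
for d in 2012-07-07 2012-07-08 2012-07-09; do
    printf "ALTER TABLE events ADD IF NOT EXISTS PARTITION (d='%s');\n" "$d"
done > add_partitions.q
```

The resulting file could then be run with `hive -f add_partitions.q` on the master node.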
Example – CTR by ad by day
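CREATE EXTERNAL TABLE
    ad_stats (d string, id string, impressions bigint, clicks bigint, ctr float)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION 's3://my-bucket/ad-stats/';

-- FULL OUTER JOIN leaves one side NULL when an ad has only clicks or only
-- impressions on a day, so take keys from either side and default counts to 0.
INSERT OVERWRITE TABLE ad_stats
SELECT
    COALESCE(i.d, c.d),
    COALESCE(i.id, c.id),
    COALESCE(i.impressions, CAST(0 AS BIGINT)),
    COALESCE(c.clicks, CAST(0 AS BIGINT)),
    COALESCE(c.clicks, CAST(0 AS BIGINT)) / i.impressions  -- NULL if no impressions
FROM (
    SELECT d, id, count(1) as clicks
    FROM events
    WHERE action = 'CLICK'
    GROUP BY d, id
) c FULL OUTER JOIN (
    SELECT d, id, count(1) as impressions
    FROM events
    WHERE action = 'IMPRESSION'
    GROUP BY d, id
) i ON (c.d = i.d AND c.id = i.id);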
                                                                                   11
What is VisualDNA, anyway?
• Leading audience profiling technology and audience data provider
• Global reach of ≈ 100 million monthly unique users
• Data for ad targeting, risk, personalisation, recommendations…


• Use Cassandra, Hadoop, Hive, Redis, Scala, PHP, Java in production
• Running on a mix of AWS and physical HW in London

              We’re hiring (like mad)
           Back-End, Front-End and Research Engineers
                     www.visualdna.com/careers
                                                                       12
How do we use Hive?
• Main data source is: events
• Events are (mainly) associated with user actions
   • Conversions, pageviews, impressions, clicks, syncs…
   • Contain user ID, timestamp, browser info, geo, event info…
• Roughly 50M of them a day = 50 GB of text
   • JSON format, one JSON object per line, validated input
   • Coming from 8 event trackers, rotated every 5 minutes
   • Partitioned by date (d=2012-07-09)
   • Storing all of them on S3. Never deleting anything.
                                                                  13
Use case #1: Analytics
• Analytics queries on our events table
   • How many people started each quiz in the last 3 months?
   • Give me the IDs of people visiting football section on Mirror today.
   • Give me a histogram of frequency of visits per user
   • …
• Best thing about Hive: non-developers use it (after we wrote a wiki)
   • Can simplify further by using Karmasphere on AWS
• Downsides:
   • Takes time to spin up the cluster on AWS.
   • Takes time to execute simple queries. Very big queries often fail.
   • We are replacing a lot of the “how many” queries with Redis.
                                                                            14
Use case #2: Reporting pipeline
• Interactive mode in Hive is only part of the picture.
• Hive can also run scripted queries for you:
   •   $ emr --create --hive-script --name "Test" --num-instances 2 \
           --slave-instance-type cc1.4xlarge --master-instance-type c1.medium \
           --arg hive_script.q --args "-d,PARTITION=2012-07-09,-d,RUNDATE=2012-07-10"
   •   Note: arguments are accessible in the hive query: ${PARTITION}
   •   Rule of thumb: always run queries by hand first, script them if you’re sure they work
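
As an illustration of the argument substitution, here is a sketch of what a parameterised `hive_script.q` might contain, written out from the shell. The query itself is an assumption (a per-action event count on the `events` table from the earlier example), not the script used at VisualDNA.

```shell
# Write a small parameterised Hive script. The quoted heredoc delimiter ('EOF')
# stops the shell from expanding the variable, so Hive sees it unchanged.
cat > hive_script.q <<'EOF'
-- PARTITION is supplied at launch time, e.g. -d PARTITION=2012-07-09
ALTER TABLE events ADD IF NOT EXISTS PARTITION (d='${PARTITION}');

SELECT action, count(1) AS n
FROM events
WHERE d = '${PARTITION}'
GROUP BY action;
EOF
```

On a machine with Hive installed, the same script could be tried by hand first with `hive -d PARTITION=2012-07-09 -f hive_script.q`, in line with the rule of thumb above.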

• Reporting is repeated analytics => similar queries, but run regularly
• Hive drives a lot of our reporting tools and provides data for Redis
• We use cron + bash scripts to schedule, run and monitor Hive jobs
   • Poll emr for status of the job until finished (success or fail)
   • Suggestions for better tools?
                                                                             15
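The polling step can be sketched in a few lines of shell. This is an illustration, not the scripts used at VisualDNA: `wait_for_job` is a made-up helper, the state names (`COMPLETED`, `FAILED`, `TERMINATED`) are assumptions, and the command that prints the job state is passed in by the caller, since it depends on how the `emr` output is parsed.

```shell
# wait_for_job INTERVAL_SECONDS MAX_POLLS STATUS_COMMAND [ARGS...]
# Runs STATUS_COMMAND repeatedly; it must print a single state name per call.
# Returns 0 on COMPLETED, 1 on FAILED or TERMINATED, 2 if MAX_POLLS is reached.
wait_for_job() {
    local interval="$1"
    local max_polls="$2"
    shift 2
    local polls=0
    local state
    while [ "$polls" -lt "$max_polls" ]; do
        state="$("$@")"
        case "$state" in
            COMPLETED) return 0 ;;
            FAILED|TERMINATED) return 1 ;;
        esac
        # Any other state (assumed: still starting or running) means keep waiting.
        polls=$((polls + 1))
        sleep "$interval"
    done
    return 2
}
```

A cron-driven script might call it as `wait_for_job 60 180 my_job_state j-AHF0QE733K8F`, where `my_job_state` is a hypothetical wrapper that prints the current state of the given job flow, and then alert on a non-zero exit code.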
Use case #3: Inference Engine (ML)
• Inference Engine helps us scale audience data to 100M+ profiles
• In principle, extrapolates quiz profile data over user behaviour online
• At its heart, it’s a few hundred lines of Hive queries
• Every day, fetches users from Cassandra and sifts through events:
   • Update profiles for pages visited by profiled users yesterday
   • Update profiles for users based on their behaviour yesterday
• Input is about 2M users, 50M events; output is 5-10M user profiles
• Runs in < 3 hours with 10 large instances -> parallelises nicely
   • Could have used Apache Mahout, but it was single-threaded back then
• Biggest issues? Global sorts, running out of memory/disk on joins.
                                                                            16
Tips and Tricks
• Performance related
   • On AWS, S3 is often the bottleneck. Use cc1.* or cc2.* instances.
       • Copy from S3 to internal table if you query it multiple times.
       • Use compression for output. Plenty of CPU cycles for this.
   • Use SequenceFile format and internal tables where applicable.
   • Use MapJoin wherever possible (SELECT /*+ MAPJOIN(table) */ …).

   • Avoid SerDes and TRANSFORM mappers if possible.
   • Don’t sort (ORDER BY) unless you really have to => it forces 1 reducer.
   • Partition your data (input and/or output) if you can.
• Might make your life easier
   • If queries start stalling, add more instances. Debugging is painful.
   • Use arguments to pass in commands / partitions (if you need to).
                                                                            17
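Several of the performance tips can be combined in one script. The sketch below writes such a script from the shell. It reuses the `events` table from the earlier example; `events_local`, the small lookup table `ads` (with columns `id` and `name`) and the file name `tips.q` are hypothetical, and the Gzip codec is just one possible choice.

```shell
# Write a Hive script that applies three of the tips: compressed output,
# a local SequenceFile copy of one S3 partition, and a map-side join.
cat > tips.q <<'EOF'
-- Compress job output; there are usually spare CPU cycles for this.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Copy one day from the S3-backed external table into an internal table,
-- so that repeated queries do not go back to S3 each time.
CREATE TABLE events_local STORED AS SEQUENCEFILE AS
SELECT time, action, id, d FROM events WHERE d = '2012-07-09';

-- Map-side join: the small table (alias a) is loaded into memory by each mapper.
SELECT /*+ MAPJOIN(a) */ e.id, a.name, count(1) AS clicks
FROM events_local e JOIN ads a ON (e.id = a.id)
WHERE e.action = 'CLICK'
GROUP BY e.id, a.name;
EOF
```

Whether each setting helps depends on the cluster and data, so it is worth timing a query with and without it.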
Q&A



• Thank you for your time!

• Hope this was a bit useful – let me know your feedback.

• Any questions?




                                                            18

First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA

 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA

  • 1. @
  • 2. Who is talking?
     • Tomáš Červenka
     • Quick Bio:
        • Slovakia
        • Cambridge CompSci + Management
        • Google – Adsense for TV
        • VisualDNA – Software Engineer -> … -> CTO
     • @tomascervenka
  • 3. What is this talk about?
     • What is Hive?
        • What is it useful for?
        • What is it not useful for?
     • Where to start?
        • Amazon EMR + S3
        • Simple example
     • How do we use Hive at VisualDNA?
        • What is VisualDNA, anyway?
        • Use cases: reporting, analytics, ML
        • Tips and tricks
     • Q&A
  • 4. What is Hive?
     • Data warehousing solution built on top of Hadoop
     • Input format agnostic: can read CSV, JSON, Thrift, SequenceFile…
     • Initially developed at Facebook, became part of Apache Hadoop
     • In simple terms, it gives you a SQL-like interface to query Big Data.
        • HiveQL together with custom mappers and reducers gives you enough flexibility to write most data processing back-ends.
     • Hive compiles your HiveQL queries to a set of MapReduce jobs and uses Hadoop to deliver the results of the query.
  • 5. Why is HiveQL important?
     • http://howfuckedismydatabase.com/nosql/
  • 6. What is Hive useful for?
     • Big Data analytics
        • Running queries over large semi-structured datasets
        • Makes filtering, aggregation, joins etc. very easy
        • Hides the complexity of MR => used by non-developers
     • Big Data processing
        • Efficient and effective way to write data pipelines
        • Easy way to parallelise computationally complex queries
        • Scales nicely with amount of data and cluster size
  • 7. What is Hive not useful for?
     • Real time analytics or processing
        • Even small queries can take tens of seconds or minutes
        • Can’t build Hive (or Hadoop for that matter) into real-time flow
     • Algorithms which are difficult to parallelise
        • Almost everything can be expressed in a number of MR steps
        • Almost always MR is sub-optimal
     • If your data is small, R or scripting is often better and faster
     • Another downside: Hive (on EMR) tends to be a pain to debug
  • 8. How to start with Hive?
     • Build your own Hadoop cluster + install Hive
        • The “right” way to do it; might take some time for a multi-node setup
     • Spinning up an EMR cluster
        • The quick and cheap way to do it.
        • You need an Amazon AWS account and some data on S3.
        • You need the EMR Ruby library installed and configured locally.
        • You need to spin up an EMR cluster in interactive mode. Voilà.
        $ emr --create --alive --name "MY JOB" --hive-interactive \
              --num-instances 8 --instance-type cc1.4xlarge \
              --hive-versions 0.8.1 \
              --bootstrap-action "s3://my-bucket/emr-bootstrap"
  • 9. Getting Started with Hive
     • SSH into your cluster (your namenode)
        $ emr --ssh j-AHF0QE733K8F
     • Run screen (you’ll quickly find out why)
        $ screen
     • Run Hive
        $ hive
        • Welcome to the Hive interface!
     • Monitor Hive
        $ elinks http://localhost:9100
     • Terminate Hive
        $ emr --terminate j-AHF0QE733K8F
  • 10. Example – CTR by ad by day
        add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;

        CREATE EXTERNAL TABLE events (
          time string, action string, id string)
        PARTITIONED BY (d string)
        ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
        WITH SERDEPROPERTIES ('paths'='time,action,id')
        LOCATION 's3://my-bucket/events/';

        ALTER TABLE events ADD PARTITION (d='2012-07-09');
        ALTER TABLE events ADD PARTITION (d='2012-07-08');
        ALTER TABLE events ADD PARTITION (d='2012-07-07');
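As a rough illustration of what the table definition above asks the SerDe to do, the following Python sketch parses one JSON object per line and projects the three configured paths into columns. It is not the Amazon JsonSerde implementation: the sample records and the `project` helper are made up, and returning None (NULL) for a missing field is an assumption rather than verified SerDe behaviour.

```python
import json

# Column paths, mirroring the 'paths' SerDe property in the table definition.
PATHS = ["time", "action", "id"]


def project(line, paths=PATHS):
    """Parse one JSON line and return the configured fields as a row tuple.

    Fields missing from the record become None (assumed NULL behaviour).
    """
    record = json.loads(line)
    return tuple(record.get(path) for path in paths)


# Hypothetical input lines, one JSON object per line.
sample_lines = [
    '{"time": "2012-07-09T10:00:00", "action": "IMPRESSION", "id": "ad-1", "geo": "UK"}',
    '{"time": "2012-07-09T10:00:05", "action": "CLICK", "id": "ad-1"}',
    '{"time": "2012-07-09T10:01:00", "action": "IMPRESSION"}',
]

rows = [project(line) for line in sample_lines]
```

Fields that are not listed in the paths (such as the invented `geo` above) are simply ignored, which is what lets a three-column table sit on top of much richer event records.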
  • 11. Example – CTR by ad by day
        CREATE EXTERNAL TABLE ad_stats (
          d string, id string, impressions bigint, clicks bigint, ctr float)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
        LOCATION 's3://my-bucket/ad-stats/';

        -- COALESCE picks the non-NULL side: after a FULL OUTER JOIN, an ad with
        -- only clicks or only impressions has NULLs in the other subquery's columns.
        INSERT OVERWRITE TABLE ad_stats
        SELECT COALESCE(i.d, c.d), COALESCE(i.id, c.id),
               impressions, clicks, clicks/impressions
        FROM (
          SELECT d, id, count(1) AS clicks
          FROM events
          WHERE action = 'CLICK'
          GROUP BY d, id
        ) c
        FULL OUTER JOIN (
          SELECT d, id, count(1) AS impressions
          FROM events
          WHERE action = 'IMPRESSION'
          GROUP BY d, id
        ) i
        ON (c.d = i.d AND c.id = i.id);
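To make the join semantics of the query above concrete, here is a small Python sketch of the same computation: count clicks and impressions per (day, ad), then combine them the way a FULL OUTER JOIN would, so that an ad present on only one side still produces a row. The event tuples are invented sample data, and None stands in for SQL NULL.

```python
from collections import Counter

# Invented sample events: (d, id, action)
events = [
    ("2012-07-09", "ad-1", "IMPRESSION"),
    ("2012-07-09", "ad-1", "IMPRESSION"),
    ("2012-07-09", "ad-1", "CLICK"),
    ("2012-07-09", "ad-2", "IMPRESSION"),
    ("2012-07-08", "ad-3", "CLICK"),
]


def ctr_by_ad_by_day(events):
    """Return {(d, id): (impressions, clicks, ctr)} with None where a side is missing."""
    clicks = Counter((d, ad) for d, ad, action in events if action == "CLICK")
    impressions = Counter((d, ad) for d, ad, action in events if action == "IMPRESSION")
    stats = {}
    # Union of keys from both sides = the rows a full outer join would emit.
    for key in sorted(set(clicks) | set(impressions)):
        c = clicks.get(key)        # None if the ad had no clicks that day
        i = impressions.get(key)   # None if the ad had no impressions that day
        ctr = c / i if c is not None and i is not None else None
        stats[key] = (i, c, ctr)
    return stats


ad_stats = ctr_by_ad_by_day(events)
```

An ad with impressions but no clicks ends up with a NULL click count and CTR rather than 0. If zeros are preferred, wrapping the counts in something like COALESCE(clicks, 0) in the query would be one way to get them.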
  • 12. What is VisualDNA, anyway?
     • Leading audience profiling technology and audience data provider
     • Global reach of ≈ 100 million monthly unique users
     • Data for ad targeting, risk, personalisation, recommendations…
     • Use Cassandra, Hadoop, Hive, Redis, Scala, PHP, Java in production
     • Running on a mix of AWS and physical HW in London
     • We’re hiring (like mad): Back-End, Front-End and Research Engineers
        • www.visualdna.com/careers
  • 13. How do we use Hive?
     • Main data source is: events
        • Events are associated with user actions (mainly)
        • Conversions, pageviews, impressions, clicks, syncs…
        • Contain user ID, timestamp, browser info, geo, event info…
     • Roughly 50M of them a day ≈ 50 GB of text
        • JSON format, one JSON object per line, validated input
        • Coming from 8 event trackers, rotated every 5 mins
        • Partitioned by date (d=2012-07-09)
     • Storing all of them on S3. Never deleting anything.
  • 14. Use case #1: Analytics
     • Analytics queries on our events table
        • How many people started each quiz in the last 3 months?
        • Give me the IDs of people visiting the football section on Mirror today.
        • Give me a histogram of frequency of visits per user
        • …
     • Best thing about Hive: non-developers use it (after we wrote a wiki)
        • Can simplify further by using Karmasphere on AWS
     • Downsides:
        • Takes time to spin up the cluster on AWS.
        • Takes time to execute simple queries. Very big queries often fail.
        • Replacing a lot of the “how many” queries by Redis.
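The histogram question above is a two-level aggregation: first count visits per user, then count how many users share each visit count. A minimal Python sketch of that shape follows. The user IDs are invented, and every event is treated as a visit, which is a simplifying assumption (a real query would filter on the relevant event type first).

```python
from collections import Counter

# Invented sample: one user ID per visit event.
user_events = ["u1", "u2", "u1", "u3", "u1", "u2"]


def visit_histogram(user_ids):
    """Return {visit_count: number_of_users_with_that_count}."""
    visits_per_user = Counter(user_ids)              # inner aggregation: GROUP BY user
    return dict(Counter(visits_per_user.values()))   # outer aggregation: GROUP BY visit count


histogram = visit_histogram(user_events)
```

In HiveQL the same shape would be a GROUP BY over a subquery that is itself a GROUP BY, which Hive would typically run as two chained MapReduce jobs.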
  • 15. Use case #2: Reporting pipeline
     • Interactive mode in Hive is only part of the picture.
     • Hive can also run scripted queries for you:
        $ emr --create --hive-script --name "Test" --num-instances 2 \
              --slave-instance-type cc1.4xlarge --master-instance-type c1.medium \
              --arg hive_script.q \
              --args "-d,PARTITION=2012-07-09,-d,RUNDATE=2012-07-10"
        • Note: arguments are accessible in the Hive query: ${PARTITION}
        • Rule of thumb: always run queries by hand first; script them once you’re sure they work
     • Reporting is repeated analytics => similar queries, but run regularly
        • Hive drives a lot of our reporting tools and provides data for Redis
        • We use cron + bash scripts to schedule, run and monitor Hive jobs
        • Poll emr for status of the job until finished (success or fail)
        • Suggestions for better tools?
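The schedule, run and poll loop described above can be sketched in a few lines. This is not the cron + bash setup from the talk, only a Python illustration of the same idea. `get_state` is a hypothetical callable wrapping however you query the job status, the set of terminal states is an assumption to check against what your EMR client reports, and "partition = day before the run date" is inferred from the example arguments on the slide.

```python
import time
from datetime import date, timedelta

# Assumed terminal job-flow states; verify against your EMR client's output.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED"}


def build_hive_args(run_date):
    """Build the --args value, assuming the partition is the day before the run date."""
    partition = run_date - timedelta(days=1)
    return "-d,PARTITION={},-d,RUNDATE={}".format(
        partition.isoformat(), run_date.isoformat()
    )


def wait_for_job(get_state, poll_seconds=60, max_polls=1000, sleep=time.sleep):
    """Call get_state() until it returns a terminal state, then return that state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")


# Demo with a fake status source instead of a real cluster.
fake_states = iter(["STARTING", "RUNNING", "RUNNING", "COMPLETED"])
final_state = wait_for_job(
    lambda: next(fake_states), poll_seconds=0, sleep=lambda seconds: None
)
hive_args = build_hive_args(date(2012, 7, 10))
```

Passing the status source and the sleep function in as parameters keeps the loop testable without a cluster. A polling budget also avoids a cron job that hangs forever if the cluster never reaches a terminal state.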
  • 16. Use case #3: Inference Engine (ML)
     • Inference Engine helps us scale audience data to 100M+ profiles
        • In principle, extrapolates quiz profile data over user behaviour online
        • At its heart, it’s a few hundred lines of Hive queries
     • Every day, fetches users from Cassandra and sifts through events:
        • Update profiles for pages visited by profiled users yesterday
        • Update profiles for users based on their behaviour yesterday
     • Input is about 2M users, 50M events; output is 5-10M user profiles
     • Runs in < 3 hours with 10 large instances -> parallelises nicely
     • Could use Apache Mahout, but it was single-threaded back then
     • Biggest issues? Global sorts, running out of memory/disk on joins.
  • 17. Tips and Tricks
     • Performance related
        • On AWS, S3 is often the bottleneck. Use cc1.* or cc2.* instances.
        • Copy from S3 to an internal table if you query it multiple times.
        • Use compression for output. Plenty of CPU cycles for this.
        • Use SequenceFile format and internal tables where applicable.
        • Use MapJoin wherever possible (SELECT /*+ MAPJOIN(table) */).
        • Avoid SerDes and TRANSFORM mappers if possible.
        • Don’t sort (ORDER BY) unless you really have to => 1 reducer.
        • Partition your data (input and/or output) if you can.
     • Might make your life easier
        • If queries start stalling, add more instances. Debugging is painful.
        • Use arguments to pass in commands / partitions (if you need to).
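The idea behind the MapJoin tip is that a table small enough to fit in memory can be loaded into a hash table on every mapper, so the large table is joined as it streams past and the join needs no reduce-side shuffle. A minimal Python sketch of that idea follows. It is not Hive's implementation, and the table contents and column names are invented.

```python
def map_join(big_rows, small_rows, big_key, small_key):
    """Inner-join a streamed big table against an in-memory small table."""
    # Build phase: hash the small table once.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    # Probe phase: stream the big table and emit merged rows for each match.
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield {**match, **row}


# Invented sample tables.
ads = [
    {"id": "ad-1", "campaign": "summer"},
    {"id": "ad-2", "campaign": "winter"},
]
events = [
    {"id": "ad-1", "action": "CLICK"},
    {"id": "ad-3", "action": "CLICK"},
    {"id": "ad-2", "action": "IMPRESSION"},
]

joined = list(map_join(events, ads, "id", "id"))
```

The trade-off is memory: the approach only works while the small side fits comfortably on each node, which is why it is a hint for specific tables rather than a default for every join.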
  • 18. Q&A
     • Thank you for your time!
     • Hope this was a bit useful – let me know your feedback.
     • Any questions?