Introduction to Data Analysis with Hadoop and Hive
Jonathan Seidman
ChicagoDB
February 21, 2011
About Me


•  Lead Engineer on Business Intelligence/Data Infrastructure team at
   Orbitz, former member of Machine Learning team	

•  Co-organizer/founder of Chicago Hadoop User Group
   (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/)

•  Recovering Java developer	

•  jseidman@orbitz.com	

•  @jseidman	

•  @OrbitzTalent	





                                                                        page 2
Why Hadoop and Hive?	





                          page 3
Some Hadoop “Clichés” (Which are still true…)


Hadoop allows you to store and process data that was
 previously impractical because of cost, technical issues,
 etc.	





                                                             page 4
[Chart: $ per managed TB, contrasting "utterly redonkulous amounts of money" with "more reasonable amounts of money"]





                                                                      page 6
Adding data to our data warehouse also requires a lengthy
 plan/implement/deploy cycle.	



Because of the expense and time involved, our data teams need to be
 very judicious about which data gets added. This means
 that potentially valuable data may not be saved.





                                                            page 7
Hadoop brings our cost per TB down to $1500 (or even less)	





                                                                page 8
Hadoop Distributed File System

HDFS provides economical, reliable, fault-tolerant, and
 scalable storage of very large datasets across machines in
 a cluster.





                                                              page 9
Some Hadoop “Clichés” (Which are still true…)


 Hadoop places no constraints on how data is processed.	





                                                             page 10
Some Hadoop “Clichés” (Which are still true…)


 Hadoop makes it relatively easy to efficiently process all the
  data stored in HDFS.	



 MapReduce is a programming model for efficient
  distributed processing, designed to reliably perform
  computations on large volumes of data in parallel.

 MapReduce removes much of the burden of writing
  distributed computations.
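
To make the programming model concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps for a word count. It is only an illustration: the function names are hypothetical, everything runs in one process, and a real Hadoop job distributes each phase across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The programmer supplies only the map and reduce functions; the grouping step in between is what the framework takes care of.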




                                                            page 11
The Problem with MapReduce

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}




                                                                                                                                                                page 12
Hive Overview

Hive is an open-source data warehousing solution built on top of
  Hadoop which allows for easy data summarization, ad-hoc querying
  and analysis of large datasets stored in Hadoop.	



Developed at Facebook to provide a structured data model over Hadoop
 data.	



Simplifies Hadoop data analysis – users can use a familiar SQL model
  rather than writing low-level custom code.



Hive queries are compiled into Hadoop MapReduce jobs.	



Designed for scalability, not low latency. 	


                                                                       page 13
Hive provides the basis for a new data analysis infrastructure.	



We currently run Hive 0.6.0 with Cloudera CDH2 (Hadoop 0.20.1)	





                                                                     page 14
Hive Architecture (Simplified)




                                 page 15
Hive Overview – Comparison to Traditional DBMS Systems


Although Hive uses a model familiar to database users, it does not
  support a full relational model and only supports a subset of SQL.	



Schema on read vs. schema on write	



What Hadoop/Hive offers is highly scalable and fault-tolerant
 processing of very large data sets.	



However, Hive is moving more and more towards being a parallel
  DBMS.
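
The "schema on read" point can be illustrated with a small sketch: the stored data stays as raw text, and column names and types are applied only when a row is read. The column list and sample lines are hypothetical, and turning an unparsable field into None is an assumption made here to mimic a NULL result, not a description of Hive internals.

```python
# Raw tab-delimited lines as they might sit in storage; nothing is validated on write.
RAW_LINES = [
    "s1\tChicago\t2",
    "s2\tDenver\tnot-a-number",
]

# Hypothetical schema: (column name, type conversion) in column order.
SCHEMA = [("session_id", str), ("destination", str), ("number_of_guests", int)]


def read_with_schema(line, schema):
    """Apply the schema at read time; fields that fail conversion become None."""
    row = {}
    for (name, cast), raw in zip(schema, line.split("\t")):
        try:
            row[name] = cast(raw)
        except ValueError:
            row[name] = None
    return row
```

With schema on write, the second line would be rejected at load time; here it loads without complaint and the problem only surfaces when the column is read.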





                                                                          page 16
Hive - Data Model


Tables – analogous to tables in a standard RDBMS.	



Partitions and buckets – Allow Hive to prune data during
 query processing.	
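
As a rough illustration of pruning, the sketch below models partitions as named groups of rows keyed by a dt value and selects only the group a query needs, so the others are never scanned. The partition names, the sample rows, and the use of CRC32 as the bucket hash are all assumptions made for the example; Hive's own hashing is not shown here.

```python
import zlib

# Hypothetical partitions, one per value of a dt partition column.
PARTITIONS = {
    "dt=2011-02-19": ["rowA", "rowB"],
    "dt=2011-02-20": ["rowC"],
    "dt=2011-02-21": ["rowD", "rowE"],
}


def prune_partitions(partitions, wanted_dt):
    """Keep only the partitions whose dt value matches the query filter."""
    wanted = "dt=" + wanted_dt
    return {name: rows for name, rows in partitions.items() if name == wanted}


def bucket_for(key, num_buckets):
    """Assign a key to a bucket with a stable hash (CRC32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets
```

A filter such as dt = '2011-02-20' touches one partition out of three, and rows with the same key always land in the same bucket.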





                                                           page 17
Not Yet, But Soon


Multiple databases	



Views	



Indexes	





                        page 18
Hive – Data Types

Supports primitive types such as int, double, and string.	



Also supports complex types such as structs, maps (key/
 value tuples), and arrays (indexable lists).	





                                                               page 19
Extensible Storage Model


Row formats determine how records are stored.	



Row format is defined by a SerDe (Serializer-Deserializer).	



Container format is determined by the file format.	
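
The role of a SerDe can be sketched as a pair of functions that convert between a row object and its stored form. The class below is a toy, hypothetical tab-delimited format written for illustration; it does not reproduce Hive's SerDe interface.

```python
class TabDelimitedSerDe:
    """Toy row format: fields joined by tabs, in column order."""

    def __init__(self, columns):
        self.columns = columns

    def serialize(self, row):
        """Turn a row (dict of column -> value) into one stored line."""
        return "\t".join(str(row[c]) for c in self.columns)

    def deserialize(self, line):
        """Turn one stored line back into a dict of column -> string."""
        return dict(zip(self.columns, line.split("\t")))
```

Swapping in a different class with the same two methods would change how records are laid out without changing the queries that read them, which is the point of keeping the row format pluggable.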





                                                                page 20
Hive – Hive Query Language

HiveQL – Supports basic SQL-like operations such as select, join,
  aggregate, union, sub-queries, etc.	



HiveQL queries are compiled into MapReduce processes.	



Supports embedding custom MapReduce scripts.	



Built-in support for standard relational, arithmetic, and boolean
 operators.



Supports aggregate functions, including statistical functions (avg,
  standard deviation, covariance, percentiles).	



                                                                      page 21
Hive – User Defined Functions

HiveQL is extensible through user-defined functions
 implemented in Java.



Also supports aggregation functions.	



Provides table functions when more than one value needs to
  be returned.	





                                                             page 22
Hive – User Defined Functions
Example UDF – Find hotel’s position in an impression list:

package com.orbitz.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * returns hotel_id's position given a hotel_id and impression list
 */
public final class GetPos extends UDF {
    public Text evaluate(final Text hotel_id, final Text impressions) {
        if (hotel_id == null || impressions == null)
            return null;

        String[] hotels = impressions.toString().split(";");
        String position;
        String id = hotel_id.toString();
        int begin = 0, end = 0;

        for (int i = 0; i < hotels.length; i++) {
            begin = hotels[i].indexOf(",");
            end = hotels[i].lastIndexOf(",");
            position = hotels[i].substring(begin + 1, end);
            if (id.equals(hotels[i].substring(0, begin)))
                return new Text(position);
        }
        return null;
    }
}




                                                                              page 23
Hive – User Defined Functions

hive> add jar path-to-jar/pos.jar;
hive> create temporary function getpos as 'com.orbitz.hive.GetPos';
hive> select getpos('1', '1,3,100.00;2,1,100.00');
…
3
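
To see why the call above returns 3, the parsing logic of the UDF can be re-expressed as a short Python sketch. This is not the Orbitz code: get_pos is a hypothetical stand-in, it assumes each impression has the form hotel_id,position,price, and it skips entries with no comma, which the Java version does not guard against.

```python
def get_pos(hotel_id, impressions):
    """Return the position of hotel_id in a ';'-separated impression list."""
    if hotel_id is None or impressions is None:
        return None
    for entry in impressions.split(";"):
        begin = entry.find(",")   # end of the hotel id
        end = entry.rfind(",")    # start of the last field
        if begin == -1:
            continue              # malformed entry: no fields to compare
        if entry[:begin] == hotel_id:
            return entry[begin + 1:end]
    return None
```

For the input '1,3,100.00;2,1,100.00', the first entry has hotel id 1 and the text between its first and last comma is 3.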




                                            page 24
Hive MapReduce

Allows analysis not possible through standard HiveQL
 queries.	



Can be implemented in any language.	





                                                       page 25
Hive MapReduce

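#!/usr/bin/python
# parse_impressions.py: emit one "hotel<TAB>position" row per impression.
import sys

for line in sys.stdin:
    # Impressions are separated by ';'; normalize the separator before splitting.
    line = line.strip().replace(';', '|')
    impressions = line.split('|')
    for impression in impressions:
        fields = impression.split(',')
        # Skip empty or malformed impressions that lack a position field.
        if len(fields) >= 2:
            print("%s\t%s" % (fields[0], fields[1]))


hive> ADD FILE /home/jseidman/parse_impressions.py;
hive> FROM
    >   hotel_searches
    > SELECT
    >   TRANSFORM(impressions)
    > USING
    >   'parse_impressions.py'
    > AS
    >   hotel, pos;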
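
The transform script can be exercised without a cluster by wrapping the same parsing logic in a function and feeding it a sample line. The sketch below does that; parse_impressions as a function name and the sample input are assumptions for illustration, and the impression format hotel_id,position,price is inferred from the UDF example earlier in the deck.

```python
def parse_impressions(lines):
    """Yield one "hotel<TAB>position" string per impression in each input line."""
    for line in lines:
        # Same separator handling as the streaming script: ';' and '|' both split.
        for impression in line.strip().replace(";", "|").split("|"):
            fields = impression.split(",")
            if len(fields) >= 2:
                yield "%s\t%s" % (fields[0], fields[1])
```

Each yielded string corresponds to one output row that the query would read back as the hotel and pos columns.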





                                                             page 26
Processing Web Analytics Logs

Hive provides the infrastructure to support analysis of web
 analytics logs stored in Hadoop	



Used to support analysis for machine learning tasks, cache
 optimization, keyword performance, etc.	





                                                              page 27
Processing Flow – Step 1




                           page 28
Processing Flow – Step 2




                           page 29
Processing Flow – Step 3




                           page 30
Processing Flow – Step 4




                           page 31
Processing Flow – Step 5




                           page 32
Processing Flow – Step 6




                           page 33
Importing Prepared Data to Hive

$HIVE_HOME/bin/hive -e "LOAD DATA INPATH
  '/output/part-00000' OVERWRITE INTO
  TABLE hotel_searches PARTITION(dt='$YEAR-$MONTH-$DAY')"


CREATE TABLE hotel_searches(
  session_id STRING, host STRING, visitors_ip STRING,
  search_date STRING, search_time STRING, dept_date STRING,
  ret_date STRING, destination STRING, location_id STRING,
  number_of_guests INT, number_of_rooms INT,
  impressions STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
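
A daily load like this is usually driven by a script that fills in the date. The sketch below builds the argument list for hive -e for a given day; build_load_command and its default path and table are illustrative assumptions taken from the example above, and the function only constructs the command without running it.

```python
import datetime


def build_load_command(day, hdfs_path="/output/part-00000", table="hotel_searches"):
    """Build the argv for `hive -e` that loads one day's file into its dt partition."""
    statement = (
        "LOAD DATA INPATH '%s' OVERWRITE INTO TABLE %s PARTITION(dt='%s')"
        % (hdfs_path, table, day.strftime("%Y-%m-%d"))
    )
    return ["hive", "-e", statement]
```

The returned list could be handed to subprocess.run on a machine where the hive client is installed.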



                                                             page 34
Exporting Data from Hive Tables


hive> INSERT OVERWRITE LOCAL DIRECTORY
    > '/tmp/searches.dat'
    > SELECT * FROM hotel_searches;




                                           page 35
Analyzing Prepared Data


Example - Find the Position of Each Booked Hotel in Search Results:	


   CREATE TABLE positions(
     session_id STRING,
     booked_hotel_id STRING,
     position INT);


   INSERT OVERWRITE TABLE
     positions
   SELECT
     h.session_id, h.booked_hotel_id, i.position
   FROM
     hotel_impressions i JOIN hotel_bookings h
   ON
     (h.booked_hotel_id = i.hotel_id and h.session_id = i.session_id);
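
The semantics of this join can be sketched in a few lines as an in-memory hash join on the pair (session_id, hotel_id). This illustrates what the query computes, not how Hive executes it; the function name and the dictionary-based row layout are assumptions for the example.

```python
from collections import defaultdict


def join_positions(impressions, bookings):
    """Inner-join bookings to impressions on (session_id, hotel_id)."""
    # Index impressions by the join key.
    by_key = defaultdict(list)
    for imp in impressions:
        by_key[(imp["session_id"], imp["hotel_id"])].append(imp["position"])
    # Pair each booking with every impression that shares its key.
    results = []
    for b in bookings:
        key = (b["session_id"], b["booked_hotel_id"])
        for position in by_key.get(key, []):
            results.append((b["session_id"], b["booked_hotel_id"], position))
    return results
```

Each output tuple matches one row of the positions table: the session, the booked hotel, and where that hotel appeared in the search results.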




                                                                              page 36
Analyzing Prepared Data


Example - Aggregate Booking Position by Location by Day:	

   CREATE TABLE position_aggregate_by_day(
     location_id STRING,
     booking_date STRING,
     position INT,
     pcount INT);


   INSERT OVERWRITE TABLE
     position_aggregate_by_day
   SELECT
     h.location_id, h.booking_date, i.position, count(1)
   FROM
     hotel_bookings h JOIN hotel_impressions i
   ON
     (i.hotel_id = h.booked_hotel_id and i.session_id = h.session_id)
   GROUP BY
     h.location_id, h.booking_date, i.position;
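
The GROUP BY with count(1) amounts to counting rows per (location_id, booking_date, position) triple. A minimal sketch of that aggregation over already-joined rows follows; the function name and the sample row layout are hypothetical.

```python
from collections import Counter


def aggregate_positions(joined_rows):
    """Count rows per (location_id, booking_date, position) group."""
    return Counter(
        (r["location_id"], r["booking_date"], r["position"]) for r in joined_rows
    )
```

Each key of the resulting counter corresponds to one row of position_aggregate_by_day, with the count playing the role of pcount.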




                                                                        page 37
Hive vs. Pig


Both are high-level languages that compile to MapReduce, but HiveQL is
 declarative and SQL-like, while Pig Latin is a procedural data-flow
 scripting language.



Explicit schema vs. implicit schema.	



Hive metadata can be accessed by external tools.	





                                                               page 38
Hive vs. HBase


HBase is a column-oriented key-value store rather than a
 SQL model.



HBase offers lower latency and random access to data.	



Hive/HBase integration was recently released, allowing
 Hive queries to be executed over HBase tables.	





                                                           page 39
Hive – Lessons Learned


Job scheduling – Default Hadoop scheduling is FIFO. Consider using
  something like the fair scheduler.	



Multi-user Hive – Default install is single user. Multi-user installs
 require an external relational store.	



set mapred.reduce.tasks is your friend.	



Migrating Hive between clusters is not fun.	



Documentation is still a little sparse.	
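
On the mapred.reduce.tasks point: the setting controls how many reduce tasks a job runs, and map output keys are divided among those tasks. The sketch below shows hash-modulo assignment of keys to reducers. It is modelled loosely on hash partitioning and uses a Java String style hash as a stand-in; the function names are hypothetical, ASCII keys are assumed, and the exact hash Hadoop applies to a given key type is not reproduced here.

```python
def java_string_hash(s):
    """32-bit hash in the style of Java's String.hashCode (ASCII keys assumed)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def reducer_for(key, num_reduce_tasks):
    """Pick a reducer index by masking off the sign bit and taking the modulo."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks
```

With a single reduce task every key goes to reducer 0, which is why raising the number of reduce tasks can relieve a job whose reduce phase is the bottleneck.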




                                                                        page 40
References


•  Hadoop project: http://hadoop.apache.org/	

•  Hive project: http://hadoop.apache.org/hive/	

•  Hive – A Petabyte Scale Data Warehouse Using Hadoop:
   http://i.stanford.edu/~ragho/hive-icde2010.pdf	

•  Hadoop: The Definitive Guide, Second Edition, Tom White, O’Reilly
   Press, 2011

•  Hive Evolution, John Sichi, November 2010:
   http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010





                                                                     page 41

Foundations strata sf-2019_finalJonathan Seidman
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Jonathan Seidman
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Jonathan Seidman
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Jonathan Seidman
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Jonathan Seidman
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Jonathan Seidman
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Jonathan Seidman
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Jonathan Seidman
 

Mais de Jonathan Seidman (12)

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_final
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 

Último

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Último (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011

  • 1. Introduction to Data Analysis with Hadoop and Hive Jonathan Seidman ChicagoDB February 21 | 2011
  • 2. About Me •  Lead Engineer on the Business Intelligence/Data Infrastructure team at Orbitz, former member of the Machine Learning team •  Co-organizer/founder of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/) •  Recovering Java developer •  jseidman@orbitz.com •  @jseidman •  @OrbitzTalent
  • 3. Why Hadoop and Hive?
  • 4. Some Hadoop “Clichés” (Which are still true…) Hadoop allows you to store and process data that was previously impractical to keep because of cost, technical issues, etc.
  • 5. [Chart: $ per managed TB] Utterly redonkulous amounts of money
  • 6. [Chart: $ per managed TB] Utterly redonkulous amounts of money vs. more reasonable amounts of money
  • 7. Adding data to our data warehouse also requires a lengthy plan/implement/deploy cycle. Because of the expense and time involved, our data teams need to be very judicious about which data gets added. This means that potentially valuable data may not be saved.
  • 8. Hadoop brings our cost per TB down to $1500 (or even less)
  • 9. Hadoop Distributed File System HDFS provides economical, reliable, fault-tolerant and scalable storage of very large datasets across machines in a cluster.
  • 10. Some Hadoop “Clichés” (Which are still true…) Hadoop places no constraints on how data is processed.
  • 11. Some Hadoop “Clichés” (Which are still true…) Hadoop makes it relatively easy to efficiently process all the data stored in HDFS. MapReduce is a programming model for efficient distributed processing, designed to reliably perform computations on large volumes of data in parallel. MapReduce removes much of the burden of writing distributed computations.
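The programming model described on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for the example, and a real cluster would run the phases in parallel across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every token in one input line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["hive on hadoop", "hadoop stores data"])` counts `hadoop` twice and every other word once. The user only writes the mapper and reducer; grouping by key is the part the framework takes over.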
  • 12. The Problem with MapReduce
      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.*;
      import org.apache.hadoop.util.*;

      public class WordCount {

          // Mapper: emits (word, 1) for every token in an input line.
          public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
              private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();

              public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                  String line = value.toString();
                  StringTokenizer tokenizer = new StringTokenizer(line);
                  while (tokenizer.hasMoreTokens()) {
                      word.set(tokenizer.nextToken());
                      output.collect(word, one);
                  }
              }
          }

          // Reducer: sums the counts emitted for each word.
          public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
              public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                  int sum = 0;
                  while (values.hasNext()) {
                      sum += values.next().get();
                  }
                  output.collect(key, new IntWritable(sum));
              }
          }

          // Driver: configures the job and submits it.
          public static void main(String[] args) throws Exception {
              JobConf conf = new JobConf(WordCount.class);
              conf.setJobName("wordcount");

              conf.setOutputKeyClass(Text.class);
              conf.setOutputValueClass(IntWritable.class);

              conf.setMapperClass(Map.class);
              conf.setCombinerClass(Reduce.class);
              conf.setReducerClass(Reduce.class);

              conf.setInputFormat(TextInputFormat.class);
              conf.setOutputFormat(TextOutputFormat.class);

              FileInputFormat.setInputPaths(conf, new Path(args[0]));
              FileOutputFormat.setOutputPath(conf, new Path(args[1]));

              JobClient.runJob(conf);
          }
      }
  • 13. Hive Overview Hive is an open-source data warehousing solution built on top of Hadoop which allows for easy data summarization, ad-hoc querying and analysis of large datasets stored in Hadoop. Developed at Facebook to provide a structured data model over Hadoop data. Simplifies Hadoop data analysis – users can use a familiar SQL model rather than writing low-level custom code. Hive queries are compiled into Hadoop MapReduce jobs. Designed for scalability, not low latency.
  • 14. Hive provides the basis for a new data analysis infrastructure. We currently run Hive 0.6.0 with Cloudera CDH2 (Hadoop 0.20.1).
  • 16. Hive Overview – Comparison to Traditional DBMS Systems Although Hive uses a model familiar to database users, it does not support a full relational model and supports only a subset of SQL. Hive applies schema on read, whereas a traditional DBMS enforces schema on write. What Hadoop/Hive offers is highly scalable and fault-tolerant processing of very large data sets. However, Hive is moving more and more towards being a parallel DBMS.
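To make "schema on read" concrete, here is a minimal Python sketch, assuming tab-delimited text rows. The column list in `SCHEMA` and the function `read_with_schema` are hypothetical names chosen for the example; the point is only that the raw line is stored untouched and types are applied when the row is read, with values that do not fit the declared type surfacing as NULL (`None` here).

```python
# Hypothetical column names and types; they are applied only at read time.
SCHEMA = [("session_id", str), ("number_of_guests", int), ("number_of_rooms", int)]


def read_with_schema(raw_line, schema=SCHEMA, delimiter="\t"):
    """Interpret one stored text line against a schema supplied by the reader."""
    fields = raw_line.rstrip("\n").split(delimiter)
    row = {}
    for index, (name, cast) in enumerate(schema):
        if index >= len(fields):
            row[name] = None  # missing trailing column reads as NULL
            continue
        try:
            row[name] = cast(fields[index])
        except ValueError:
            row[name] = None  # value does not fit the declared type
    return row
```

With schema on write, the malformed row would be rejected at load time instead. Here loading is just a file copy, and the cost of interpretation is paid by every query.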
  • 17. Hive – Data Model Tables – analogous to tables in a standard RDBMS. Partitions and buckets – allow Hive to prune data during query processing.
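The pruning idea can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the `PARTITIONS` mapping stands in for one directory per value of a partition column `dt`, and CRC32 is only a stand-in hash, not the function Hive uses for bucketing.

```python
import zlib

# Hypothetical layout: one directory per value of the partition column dt.
PARTITIONS = {
    "dt=2011-02-19": ["part-00000"],
    "dt=2011-02-20": ["part-00000"],
    "dt=2011-02-21": ["part-00000", "part-00001"],
}


def prune_partitions(partitions, dt):
    """Keep only the directories whose partition value matches the predicate."""
    wanted = "dt=%s" % dt
    return {name: files for name, files in partitions.items() if name == wanted}


def bucket_for(key, num_buckets):
    """Assign a row to a bucket by hashing the bucketing column (stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets
```

A query restricted to `dt = '2011-02-21'` then only has to scan the files returned by `prune_partitions(PARTITIONS, "2011-02-21")`, and rows with the same bucketing key always land in the same bucket.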
  • 18. Not Yet, But Soon •  Multiple databases •  Views •  Indexes
  • 19. Hive – Data Types Supports primitive types such as int, double, and string. Also supports complex types such as structs, maps (key/value tuples), and arrays (indexable lists).
  • 20. Extensible Storage Model Row formats determine how records are stored. Row format is defined by a SerDe (Serializer-Deserializer). Container format is determined by the file format.
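As a rough picture of what a SerDe does, the following Python class converts between a delimited text record and a row of named columns. It is a toy written for this note, not Hive's Java SerDe interface, and the class and method names are assumptions.

```python
class DelimitedSerDe:
    """Toy serializer/deserializer for delimited text rows."""

    def __init__(self, columns, delimiter="\t"):
        self.columns = columns
        self.delimiter = delimiter

    def deserialize(self, record):
        """Turn one stored text record into a dict keyed by column name."""
        values = record.rstrip("\n").split(self.delimiter)
        return dict(zip(self.columns, values))

    def serialize(self, row):
        """Turn a row dict back into one delimited text record."""
        return self.delimiter.join(str(row[c]) for c in self.columns)
```

Swapping the row format means swapping this object, while the container (for example a plain text file) stays the same. That separation is the extensibility the slide refers to.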
  • 21. Hive – Hive Query Language HiveQL – Supports basic SQL-like operations such as select, join, aggregate, union, sub-queries, etc. HiveQL queries are compiled into MapReduce processes. Supports embedding custom MapReduce scripts. Built-in support for standard relational, arithmetic, and boolean operators. Supports aggregate functions, including statistical functions (avg, standard deviation, covariance, percentiles).
  • 22. Hive – User Defined Functions HiveQL is extensible through user defined functions implemented in Java. Also supports aggregation functions. Provides table functions when more than one value needs to be returned.
  • 23. Hive – User Defined Functions Example UDF – Find a hotel’s position in an impression list:
      package com.orbitz.hive;

      import org.apache.hadoop.hive.ql.exec.UDF;
      import org.apache.hadoop.io.Text;

      /**
       * returns hotel_id's position given a hotel_id and impression list
       */
      public final class GetPos extends UDF {
          public Text evaluate(final Text hotel_id, final Text impressions) {
              if (hotel_id == null || impressions == null)
                  return null;
              String[] hotels = impressions.toString().split(";");
              String position;
              String id = hotel_id.toString();
              int begin = 0, end = 0;
              for (int i = 0; i < hotels.length; i++) {
                  begin = hotels[i].indexOf(",");
                  end = hotels[i].lastIndexOf(",");
                  position = hotels[i].substring(begin + 1, end);
                  if (id.equals(hotels[i].substring(0, begin)))
                      return new Text(position);
              }
              return null;
          }
      }
  • 24. Hive – User Defined Functions
      hive> add jar path-to-jar/pos.jar;
      hive> create temporary function getpos as 'com.orbitz.hive.GetPos';
      hive> select getpos('1', '1,3,100.00;2,1,100.00');
      …
      hive> 3
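The logic of the `GetPos` UDF can be checked outside Hive with a short Python equivalent. This is a sketch for experimenting with the string handling only; the name `get_pos` and the guard for malformed entries are additions that are not part of the Java class above.

```python
def get_pos(hotel_id, impressions):
    """Return the position of hotel_id in a ';'-separated impression list.

    Each impression is assumed to look like 'hotel_id,position,price'.
    """
    if hotel_id is None or impressions is None:
        return None
    for entry in impressions.split(";"):
        begin = entry.find(",")
        end = entry.rfind(",")
        if begin == -1 or begin == end:
            continue  # skip entries without two commas (guard added in this sketch)
        if entry[:begin] == hotel_id:
            return entry[begin + 1:end]
    return None
```

For the arguments used in the Hive session, `get_pos("1", "1,3,100.00;2,1,100.00")` returns `"3"`: the first impression belongs to hotel 1 and the text between its first and last comma is the position.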
  • 25. Hive MapReduce Allows analysis not possible through standard HiveQL queries. Can be implemented in any language.
  • 26. Hive MapReduce
      #!/usr/bin/python
      import sys

      for line in sys.stdin:
          # Normalize the impression separator, then split into impressions.
          line = line.strip().replace(';', '|')
          impressions = line.split('|')
          for impression in impressions:
              fields = impression.split(',')
              # Emit hotel and position as one tab-separated output row.
              print("%s\t%s" % (fields[0], fields[1]))

      hive> ADD FILE /home/jseidman/parse_impressions.py;
      hive> FROM
          >   hotel_searches
          > SELECT
          >   TRANSFORM(impressions)
          > USING
          >   'parse_impressions.py'
          > AS
          >   hotel, pos;
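The streaming contract behind `TRANSFORM` is that the script reads input rows as lines and writes tab-separated output rows. The Python sketch below mimics that contract in memory so the parsing can be tried without a cluster; `parse_impressions` and `to_tsv` are hypothetical helper names, and the sample row reuses the impression format from the UDF example.

```python
def parse_impressions(lines):
    """Yield one (hotel, pos) pair per impression found in the input rows."""
    for line in lines:
        for impression in line.strip().replace(";", "|").split("|"):
            fields = impression.split(",")
            if len(fields) >= 2:  # ignore empty or malformed impressions
                yield fields[0], fields[1]


def to_tsv(pairs):
    """Format pairs the way a streaming script would write them to stdout."""
    return ["%s\t%s" % pair for pair in pairs]
```

One input row can fan out into several output rows: `to_tsv(parse_impressions(["1,3,100.00;2,1,100.00\n"]))` yields one line for hotel 1 at position 3 and one for hotel 2 at position 1. That fan-out is what plain HiveQL functions on a single column cannot express.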
  • 27. Processing Web Analytics Logs Hive provides the infrastructure to support analysis of web analytics logs stored in Hadoop. Used to support analysis for machine learning tasks, cache optimization, keyword performance, etc.
  • 28. Processing Flow – Step 1
  • 29. Processing Flow – Step 2
  • 30. Processing Flow – Step 3
  • 31. Processing Flow – Step 4
  • 32. Processing Flow – Step 5
  • 33. Processing Flow – Step 6
  • 34. Importing Prepared Data to Hive
      $HIVE_HOME/bin/hive -e "LOAD DATA INPATH
        '/output/part-00000' OVERWRITE INTO
        TABLE hotel_searches PARTITION(dt='$YEAR-$MONTH-$DAY')"

      CREATE TABLE hotel_searches(
        session_id STRING, host STRING, visitors_ip STRING, search_date STRING,
        search_time STRING, dept_date STRING, ret_date STRING, destination STRING,
        location_id STRING, number_of_guests INT, number_of_rooms INT,
        impressions STRING)
      PARTITIONED BY (dt STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;
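A daily load like the one above is typically driven from a script. The sketch below builds the same `hive -e` invocation for a given date as an argument list suitable for `subprocess`. The helper name `build_load_command` and the fallback path `/usr/lib/hive` are assumptions for the example; the source path and table name are taken from the slide.

```python
import datetime
import os


def build_load_command(day, source="/output/part-00000", table="hotel_searches"):
    """Build the argument list for loading one day's output into its partition."""
    # /usr/lib/hive is only a placeholder used when HIVE_HOME is not set.
    hive_home = os.environ.get("HIVE_HOME", "/usr/lib/hive")
    hive = os.path.join(hive_home, "bin", "hive")
    statement = (
        "LOAD DATA INPATH '%s' OVERWRITE INTO TABLE %s PARTITION(dt='%s')"
        % (source, table, day.strftime("%Y-%m-%d"))
    )
    return [hive, "-e", statement]
```

For example, `build_load_command(datetime.date(2011, 2, 21))` targets the partition `dt='2011-02-21'`. Passing the list to `subprocess.check_call` avoids shell quoting problems around the embedded single quotes.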
  • 35. Exporting Data from Hive Tables
      hive> INSERT OVERWRITE LOCAL DIRECTORY
          > '/tmp/searches.dat'
          > SELECT * FROM hotel_searches;
  • 36. Analyzing Prepared Data Example – Find the Position of Each Booked Hotel in Search Results:
      CREATE TABLE positions(
        session_id STRING,
        booked_hotel_id STRING,
        position INT);

      INSERT OVERWRITE TABLE
        positions
      SELECT
        h.session_id, h.booked_hotel_id, i.position
      FROM
        hotel_impressions i JOIN hotel_bookings h
      ON
        (h.booked_hotel_id = i.hotel_id and h.session_id = i.session_id);
  • 37. Analyzing Prepared Data Example – Aggregate Booking Position by Location by Day:
      CREATE TABLE position_aggregate_by_day(
        location_id STRING,
        booking_date STRING,
        position INT,
        pcount INT);

      INSERT OVERWRITE TABLE
        position_aggregate_by_day
      SELECT
        h.location_id, h.booking_date, i.position, count(1)
      FROM
        hotel_bookings h JOIN hotel_impressions i
      ON
        (i.hotel_id = h.booked_hotel_id and i.session_id = h.session_id)
      GROUP BY
        h.location_id, h.booking_date, i.position;
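What the join and `GROUP BY` in this query compute can be traced on a handful of rows in plain Python. The sample `bookings` and `impressions` rows are invented for illustration, and `aggregate_positions` is a hypothetical name; the column names follow the tables used on the slides.

```python
from collections import Counter

# Invented sample rows, for illustration only.
bookings = [
    {"session_id": "s1", "booked_hotel_id": "1", "location_id": "CHI", "booking_date": "2011-02-21"},
    {"session_id": "s2", "booked_hotel_id": "2", "location_id": "CHI", "booking_date": "2011-02-21"},
    {"session_id": "s3", "booked_hotel_id": "1", "location_id": "NYC", "booking_date": "2011-02-21"},
]
impressions = [
    {"session_id": "s1", "hotel_id": "1", "position": 3},
    {"session_id": "s1", "hotel_id": "2", "position": 1},
    {"session_id": "s2", "hotel_id": "2", "position": 3},
    {"session_id": "s3", "hotel_id": "1", "position": 1},
]


def aggregate_positions(bookings, impressions):
    """Join on (session_id, hotel_id), then count rows per (location, date, position)."""
    by_key = {}
    for i in impressions:
        by_key.setdefault((i["session_id"], i["hotel_id"]), []).append(i["position"])
    counts = Counter()
    for b in bookings:
        for position in by_key.get((b["session_id"], b["booked_hotel_id"]), []):
            counts[(b["location_id"], b["booking_date"], position)] += 1
    return counts
```

On the sample rows, two Chicago bookings were made from position 3 and one New York booking from position 1. Each key of the returned counter corresponds to one row of `position_aggregate_by_day`, with the count playing the role of `pcount`.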
  • 38. Hive vs. Pig Both are high-level languages that compile to MapReduce jobs, but Hive is SQL-like and declarative, while Pig Latin is a data-flow scripting language. Explicit schema vs. implicit schema. Hive metadata can be accessed by external tools.
  • 39. Hive vs. HBase HBase is a column-oriented key-value store as opposed to an SQL model. HBase offers lower latency and random access to data. Hive/HBase integration was recently released, allowing Hive queries to be executed over HBase tables.
  • 40. Hive – Lessons Learned •  Job scheduling – Default Hadoop scheduling is FIFO. Consider using something like the fair scheduler. •  Multi-user Hive – Default install is single user. Multi-user installs require an external relational store for the metastore. •  set mapred.reduce.tasks is your friend. •  Migrating Hive between clusters is not fun. •  Documentation is still a little sparse.
  • 41. References •  Hadoop project: http://hadoop.apache.org/ •  Hive project: http://hadoop.apache.org/hive/ •  Hive – A Petabyte Scale Data Warehouse Using Hadoop: http://i.stanford.edu/~ragho/hive-icde2010.pdf •  Hadoop: The Definitive Guide, Second Edition, Tom White, O’Reilly Press, 2011 •  Hive Evolution, John Sichi, November 2010: http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010