By,
Anuja Gunale-Kasle
CONTENTS:
PIG background
PIG Architecture
PIG Latin Basics
PIG Execution Modes
PIG Processing: loading and transforming data
PIG Built-in functions
Filtering, grouping, sorting data
Installation of PIG and PIG Latin commands
Apache Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.
 The story goes that the researchers
working on the project initially
referred to it simply as 'the
language'. Eventually they needed
to call it something.
 Off the top of his head, one
researcher suggested Pig, and the
name stuck.
 It is quirky yet memorable and easy
to spell.
 While some have hinted that the
name sounds coy or silly, it has
provided us with an entertaining
nomenclature, such as Pig Latin for
the language, Grunt for the shell,
and PiggyBank for the CPAN-like
shared repository.
 This Pig tutorial provides basic and advanced concepts of Pig and is designed for beginners and professionals alike.
 Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. It was developed by Yahoo. The language of Pig is Pig Latin.
 Pig provides an engine for executing data flows in parallel on Hadoop. It
includes a language, Pig Latin, for expressing these data flows.
 Pig Latin includes operators for many of the traditional data operations
(join, sort, filter, etc.), as well as the ability for users to develop their own
functions for reading, processing, and writing data.
 Pig is an Apache open source project. This means users are free to
download it as source or binary, use it for themselves, contribute to it,
and—under the terms of the Apache License—use it in their products and
change it as they see fit.
Apache Pig is a high-level data flow platform for executing
MapReduce programs of Hadoop. The language used for Pig
is Pig Latin.
Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS. Apart from that, Pig can also execute its jobs on Apache Tez or Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be achieved using Pig can also be achieved using Java MapReduce code.
FEATURES OF APACHE PIG
Join Datasets
Sort Datasets
Filter
Data Types
Group By
User Defined Functions
ADVANTAGES OF APACHE PIG
•Less code - Pig needs fewer lines of code to perform any given operation.
•Reusability - Pig code is flexible enough to be reused.
•Nested data types - Pig provides useful nested data types such as tuple, bag, and map.
HIVE VS PIG VS SQL – WHEN TO USE WHAT?
When to Use Hive
 Facebook widely uses Apache Hive for analytical purposes, and usually promotes Hive due to its extensive feature list and similarities with SQL. Here are some of the scenarios where Apache Hive is ideal:
• To query large datasets: Apache Hive is especially suited to analytics on huge datasets. It offers an easy way to quickly carry out complex queries and inspect datasets stored in the Hadoop ecosystem.
• For extensibility: Apache Hive provides a range of user APIs that help in building custom behaviour for the query engine.
• For someone familiar with SQL concepts: If you are familiar with SQL, Hive will be very easy to use, as you will see many similarities between the two. Hive uses clauses like select, where, order by, and group by, just as SQL does.
• To work on structured data: Hive is widely adopted for working with structured data.
• To analyse historical data: Apache Hive is a great tool for analysing and querying historical data collected over a period of time.
When to Use Pig
 Apache Pig, developed at Yahoo Research in 2006, is famous for its extensibility and optimization scope. The language uses a multi-query approach that reduces the time spent scanning data. It usually runs on the client side of a Hadoop cluster and is quite easy to use for anyone familiar with the SQL ecosystem. You can use Apache Pig in the following scenarios:
• As an ETL tool: Apache Pig is an excellent ETL (Extract-Transform-Load) tool for big data. It is a data flow system that uses Pig Latin, a simple language for data queries and manipulation.
• As a programmer with scripting knowledge: Programmers with scripting knowledge can learn to use Apache Pig easily and efficiently.
• For fast processing: Apache Pig is faster than Hive because it uses a multi-query approach, and it is famous worldwide for its speed.
• When you don't want to work with a schema: With Apache Pig, there is no need to create a schema before loading data.
• For SQL-like functions: It has many functions similar to SQL, along with the cogroup function.
When to Use SQL
 SQL is a general-purpose database management language used around the globe. It has been evolving to meet user expectations for decades. It is declarative and hence focuses explicitly on 'what' is needed.
 It is popularly used for transactional as well as analytical queries. When the requirements are not too demanding, SQL works as an excellent tool. Here are a few scenarios:
• For better performance: SQL is famous for its ability to pull data quickly and frequently. It supports OLAP (Online Analytical Processing) applications and performs well for them; Hive, by contrast, is slow for online transactional needs.
• When the datasets are small: SQL works well with small datasets and performs much better for smaller amounts of data. It also offers many ways to optimise queries.
• For frequent data manipulation: If your requirements involve frequent modification of records, or you need to update a large number of records often, SQL performs these activities well. SQL also provides an entirely interactive experience to the user.
APACHE PIG RUN MODES
 Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
• It executes in a single JVM and is used for development, experimentation, and prototyping.
• Here, files are installed and run using localhost.
• The local mode works on the local file system; the input and output data are stored in the local file system.
 The command to start the Grunt shell in local mode:
$ pig -x local
MapReduce Mode
• The MapReduce mode is also known as Hadoop Mode.
• It is the default mode.
• In this mode, Pig translates Pig Latin into MapReduce jobs and executes them on the cluster.
• It can be executed against a pseudo-distributed or fully distributed Hadoop installation.
• Here, the input and output data are present on HDFS.
 The command to start the Grunt shell in MapReduce mode:
$ pig (or, explicitly, $ pig -x mapreduce)
WAYS TO EXECUTE PIG PROGRAM
These are the following ways of executing a Pig program on
local and MapReduce mode: -
• Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we can enter Pig Latin statements and commands interactively at the command line.
• Batch Mode - In this mode, we can run a script file having a
.pig extension. These files contain Pig Latin commands.
• Embedded Mode - In this mode, we can define our own functions, called UDFs (User Defined Functions), using programming languages such as Java and Python.
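For example, a script file (here a hypothetical wordcount.pig) can be run in batch mode under either run mode:
$ pig -x local wordcount.pig   # batch run in local mode
$ pig wordcount.pig            # batch run in MapReduce mode (the default)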
PIG ARCHITECTURE
The language used to analyse data in Hadoop using Pig is known
as Pig Latin.
It is a high-level data processing language which provides a rich
set of data types and operators to perform various operations on
the data.
To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded).
After execution, these scripts will go through a series of
transformations applied by the Pig Framework, to produce the
desired output.
Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job easy.
The architecture of Apache Pig is shown below.
APACHE PIG COMPONENTS
As shown in the figure, there are various components in the Apache
Pig framework. Let us take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax
of the script, does type checking, and other miscellaneous checks.
The output of the parser will be a DAG (directed acyclic graph), which
represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical
plan into a series of MapReduce jobs.
Execution engine
The MapReduce jobs are submitted to Hadoop in a sorted order.
Finally, these MapReduce jobs are executed on Hadoop, producing the desired results.
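The plans produced by this pipeline can be inspected with the EXPLAIN command, which prints the logical, physical, and MapReduce plans for a relation (processed here is a hypothetical alias):
grunt> EXPLAIN processed;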
PIG LATIN DATA MODEL
The data model of Pig Latin is
fully nested and it allows
complex non-atomic datatypes
such as map and tuple.
Given below is the
diagrammatical representation
of Pig Latin’s data model.
An atomic value is one that is indivisible within the context of a database field definition (e.g. integer, real, a code of some sort, etc.).
Field values that are not atomic are of two undesirable types (Elmasri & Navathe 1989, pp. 139, 141): composite and multivalued.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. A piece of data, or a simple atomic value, is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples.
In other words, a collection of tuples (non-unique) is known as a
bag.
Each tuple can have any number of fields (flexible schema).
A bag is represented by ‘{}’.
 It is similar to a table in RDBMS, but unlike a table in RDBMS, it
is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the
same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known
as inner bag.
Example − (Raja, 30, {(9848022338, raja@gmail.com)})
Map
A map (or data map) is a set of key-value pairs.
The key needs to be of type chararray and should be unique. The value may be of any type. A map is represented by ‘[]’.
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin
are unordered (there is no guarantee that tuples are
processed in any particular order).
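A minimal sketch of how these types can appear together in a LOAD schema (the file name and field names are hypothetical):
people = LOAD 'people.txt'
         AS (name:chararray, age:int,
             contacts:bag{c:(phone:chararray, email:chararray)},
             props:map[]);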
The Grunt shell is Pig's interactive command shell.
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
A Pig script can be executed in the Grunt shell, the native shell provided by Apache Pig for executing Pig queries.
We can invoke shell commands using sh and fs.
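For example, fs runs HDFS file system commands and sh runs local shell commands from within Grunt (the path shown is hypothetical):
grunt> fs -ls /user/data;
grunt> sh ls -l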
JOB EXECUTION FLOW
The developer creates the scripts, which are stored in the local file system.
When the developer submits a Pig script, it is passed to the Pig Latin compiler.
The compiler then splits the task and runs a series of MR jobs.
Meanwhile, the Pig compiler retrieves data from HDFS.
After the MR jobs run, the output file goes back to HDFS.
a. Pig Execution Modes
We can run Pig in two execution modes.
These modes depend on where the Pig script is going to run and on where the data resides; data can be stored on a single machine or in a distributed environment such as a cluster.
There are also three different ways to run Pig programs:
Non-interactive shell or script mode - the user creates a script file, loads the code, and executes the script.
Grunt shell or interactive shell - for running Apache Pig commands interactively.
Embedded mode - Pig statements are run from within a Java program, much as JDBC is used to run SQL programs from Java.
b. Pig Local mode
In this mode, Pig runs in a single JVM and accesses the local file system.
This mode is better for dealing with small data sets.
Parallel mapper execution is not possible, because older versions of Hadoop are not thread-safe.
The user can pass -x local to enter Pig's local mode of execution.
In this mode, Pig always looks for a local file system path when loading data.
c. Pig Map Reduce Mode
In this mode, the user needs a proper Hadoop cluster setup and installation. By default, Apache Pig runs in MR mode. Pig translates queries into MapReduce jobs and runs them on top of the Hadoop cluster; hence, this mode runs MapReduce on a distributed cluster.
Statements like LOAD and STORE read data from the HDFS file system and write output back to it.
d. Storing Results
Intermediate data is generated during the processing of MR jobs.
Pig stores this data in a non-permanent location on HDFS: a temporary location is created inside HDFS for storing the intermediate data.
We can use DUMP to get the final results on the output screen.
The output results are stored using the STORE operator.
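A minimal sketch (the relation and path names are hypothetical):
STORE results INTO 'output_dir' USING PigStorage(',');
DUMP results;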
Type        Description
int         4-byte integer
long        8-byte integer
float       4-byte (single-precision) floating point
double      8-byte (double-precision) floating point
bytearray   array of bytes; blob
chararray   string (“hello world”)
boolean     true/false (case-insensitive)
datetime    a date and time
biginteger  Java BigInteger
bigdecimal  Java BigDecimal
Type    Description
Tuple   ordered set of fields (a “row”/record)
Bag     collection of tuples (a “resultset”/table)
Map     a set of key-value pairs; keys must be of type chararray
BinStorage
Loads and stores data in machine-readable (binary) format
PigStorage
Loads and stores data as structured, field delimited text files
TextLoader
Loads unstructured data in UTF-8 format
PigDump
Stores data in UTF-8 format
YourOwnFormat!
via UDFs
Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage()
      AS (id, name, salary);
Each LOAD statement defines a new bag
Each bag can have multiple elements (atoms)
Each element can be referenced by name or position ($n)
A bag is immutable
A bag can be aliased and referenced later
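For example, a comma-delimited file could be loaded with an explicit delimiter and a typed schema (file and field names are hypothetical):
emps = LOAD 'employees.txt' USING PigStorage(',')
       AS (id:int, name:chararray, salary:double);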
STORE
Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
 Fails if directory exists
 Writes output files, part-[m|r]-xxxxx, to the directory
PigStorage can be used to specify a field delimiter
DUMP
Write output to screen
grunt> DUMP processed;
FOREACH
Applies expressions to every record in a bag (see the sketch after this list)
FILTER
Filters by expression
GROUP
Collect records with the same key
ORDER BY
Sorting
DISTINCT
Removes duplicates
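A minimal FOREACH sketch, using the hypothetical emps relation loaded earlier (the 10% raise is purely illustrative):
raised = FOREACH emps GENERATE name, salary * 1.1 AS new_salary;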
Use the FILTER operator to restrict tuples or rows of data
Basic syntax:
alias2 = FILTER alias1 BY expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT
(col2+col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
 Use the GROUP…ALL operator to group data
 Use GROUP when only one relation is involved
 Use COGROUP when multiple relations are involved
 Basic syntax:
alias2 = GROUP alias1 ALL;
 Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F)
(Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Use the ORDER…BY operator to sort a relation based on one
or more fields
Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
Use the DISTINCT operator to remove duplicate tuples in
a relation.
Basic syntax:
alias2 = DISTINCT alias1;
Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2 = DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
FLATTEN
Used to un-nest tuples as well as bags
INNER JOIN
Used to perform an inner join of two or more relations based on
common field values
OUTER JOIN
Used to perform left, right or full outer joins
SPLIT
Used to partition the contents of a relation into two or more
relations
SAMPLE
Used to select a random data sample with the stated sample
size
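Minimal sketches for three of these operators, reusing the numeric alias1 relation from the earlier examples:
SPLIT alias1 INTO small IF col1 < 5, large IF col1 >= 5;   -- two new relations
some = SAMPLE alias1 0.1;                                  -- ~10% random sample
grp = GROUP alias1 BY col1;
flat = FOREACH grp GENERATE group, FLATTEN(alias1);        -- un-nests the grouped bag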
Use the JOIN operator to perform an inner equi-join of two or more relations based on common field values
The JOIN operator always performs an inner join
Inner joins ignore null keys
Filter null keys before the join
JOIN and COGROUP operators perform similar
functions
 JOIN creates a flat set of output records
COGROUP creates a nested set of output records
DUMP Alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP Alias2;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
Join Alias1 to Alias2 by Col1:
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;
DUMP Alias3;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
Use the OUTER JOIN operator to perform left, right, or full
outer joins
 Pig Latin syntax closely adheres to the SQL standard
The keyword OUTER is optional
 keywords LEFT, RIGHT and FULL will imply left outer, right outer
and full outer joins respectively
Outer joins will only work provided the relations which need
to produce nulls (in the case of non-matching keys) have
schemas
Outer joins will only work for two-way joins
 To perform a multi-way outer join, chain multiple two-way outer join statements
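A minimal left outer join sketch on the Alias1/Alias2 relations above (both would need declared schemas for nulls to be produced on non-matching keys):
Alias4 = JOIN Alias1 BY Col1 LEFT OUTER, Alias2 BY Col1;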
Natively written in Java, packaged as a jar file
Other languages include Jython, JavaScript, Ruby,
Groovy, and Python
Register the jar with the REGISTER statement
Optionally, alias it with the DEFINE statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
DEFINE can be used to work with UDFs and also
streaming commands
Useful when dealing with complex input/output formats
/* read and write comma-delimited data */
DEFINE Y 'stream.pl' INPUT(stdin USING
PigStreaming(','))
OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
/* Define UDFs to a more readable format */
DEFINE MAXNUM
org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float, gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;
THANK YOU…