Introduction to pig
- 1. Apache Pig – Introduction and
Hands-on
Ravi Mutyala
Systems Architect, Hortonworks
Twitter: @rmutyala
© Hortonworks Inc. 2012
- 3. Topics
• What is Pig?
• Why Pig ?
• Language Features
• Labs
• 0.10.0 Features
• Features in the pipeline
• Q&A
- 4. What is Pig?
• System for processing large volumes of unstructured data
• Uses HDFS and MapReduce
• Data flow language
• Directed Acyclic Graph (DAG)
• Started at Yahoo! Research
• Joined the Apache Incubator in 2007
• Graduated to a subproject of Hadoop in 2008
• Top-level Apache project since 2010
- 5. Pig Philosophy
• Pigs eat anything
• Pigs live anywhere
• Pigs are domesticated animals
• Pigs can fly
- 6. Components
• Pig Engine – Parser, Optimizer and distributed query
execution
• Grunt – CLI shell
• Pig Latin – Procedural Language
- 7. Why Pig ?
• High level language that increases programmer
productivity.
• Designed for Parallel Data flow.
• Reduces complexity by abstracting low level Map and
Reduce jobs and Map Reduce job chaining
• Can be run on a client/gateway machine with no
configuration on the cluster
• Multiple versions of Pig can co-exist as long as each is
compatible with the cluster's Hadoop version.
- 8. Running Pig
A Pig Latin script can be executed in three modes
• MapReduce: Code executes as MapReduce on a
Hadoop Cluster
$ pig myscript.pig
• Local: Code executes locally in a single JVM using
local data
$ pig -x local myscript.pig
• Interactive: pig with no script starts the grunt shell
where commands can be run interactively
- 9. GRUNT shell
• fs -ls
• fs -cat filename
• fs -copyFromLocal localfile hdfsfile
- 10. Data Types
• Scalar Types
– int, long, float, double, chararray, bytearray, boolean, datetime
• Complex Types
– Map. Collection of key value pairs
– [name#alan, age#30]
– Tuple. Ordered set of values
– (alan,40,engineering)
– Bag. Unordered collection of tuples
– {(alan,40,engineering),(bob,45,sales)}
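The types above can be declared in a load schema. The following is a minimal sketch, assuming a hypothetical tab-delimited file `employees.txt` whose complex fields are written in the bracket notation shown above; the file name and every field name are illustrative and not part of the original slides.

```pig
-- Hypothetical input file and field names (assumptions for illustration).
-- name and age are scalars, props is a map, manager is a tuple,
-- and reports is a bag of tuples.
emp = LOAD 'employees.txt' AS (
    name:chararray,
    age:int,
    props:map[],
    manager:tuple(name:chararray, age:int),
    reports:bag{t:(name:chararray, dept:chararray)}
);
DESCRIBE emp;

-- Look up a map value with # and a tuple field with a dot.
info = FOREACH emp GENERATE name, props#'title', manager.name;
```

The `'title'` map key is likewise an assumed example; a key that is absent from a map yields null.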
- 11. • Relations and a set of operations that work on
relations
• Schema for relations is optional
• Positional references $0 … $n can be used for fields in relations
• null means the data is undefined.
• Any missing or invalid fields are loaded as null
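A short sketch of positional references and null handling, assuming a hypothetical comma-delimited file `data.txt` loaded without a schema; the file name and the alias `amount` are illustrative.

```pig
-- No schema is declared, so fields are addressed by position
-- and default to bytearray.
raw = LOAD 'data.txt' USING PigStorage(',');

-- Keep only rows whose second field is present.
present = FILTER raw BY $1 IS NOT NULL;

-- Cast a positional field; a value that cannot be converted becomes null.
typed = FOREACH present GENERATE $0, (int)$1 AS amount;
```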
- 12. Input and Output
• A = LOAD 'file' USING PigStorage(',') AS
(data1:datatype1, data2:datatype2.. );
• STORE A INTO 'file2' USING PigStorage(',');
• DUMP A;
• DESCRIBE A;
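Put together, the statements above might look like the following sketch. The file names and the three-field schema are assumptions chosen for illustration.

```pig
-- Hypothetical input file and schema.
A = LOAD 'input.csv' USING PigStorage(',')
    AS (name:chararray, age:int, dept:chararray);

-- Print the schema of A without running a job.
DESCRIBE A;

-- Run the pipeline and print the resulting tuples to the console.
DUMP A;

-- Write A as tab-delimited text; the target is an output directory.
STORE A INTO 'output_dir' USING PigStorage('\t');
```

DUMP is mainly useful for debugging on small inputs, while STORE is the usual way to persist results.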
- 13. Relational Operations
• GROUP A BY age;
• FOREACH A GENERATE $1 - $3;
• FILTER A BY $1 > 10;
• ORDER A BY $1 DESC, $2;
• JOIN A BY $1, B BY $5;
• JOIN A BY ($1, $5) LEFT OUTER, B BY ($2, $3);
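The operators above are normally chained, with each result assigned to an alias. A minimal sketch, assuming two hypothetical comma-delimited files and schemas that are not part of the original slides:

```pig
-- Hypothetical inputs and schemas.
emp  = LOAD 'employees.csv' USING PigStorage(',')
       AS (name:chararray, age:int, dept_id:int);
dept = LOAD 'departments.csv' USING PigStorage(',')
       AS (id:int, dept_name:chararray);

-- Keep employees older than 30.
older = FILTER emp BY age > 30;

-- GROUP yields tuples of (group, bag); the bag is named after the input relation.
by_dept = GROUP older BY dept_id;
counts  = FOREACH by_dept GENERATE group AS dept_id,
                                   COUNT(older) AS n,
                                   AVG(older.age) AS avg_age;

-- Left outer join keeps every row of counts, even without a matching department.
joined = JOIN counts BY dept_id LEFT OUTER, dept BY id;

-- After a join, fields can be disambiguated with the alias:: prefix.
sorted = ORDER joined BY counts::n DESC, dept::dept_name;
DUMP sorted;
```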
- 14. • LIMIT A 10;
• SAMPLE A 0.1;
• GROUP A BY $1 PARALLEL 10;
• User Defined Functions (UDFs) and Piggybank
register 'your_path_to_piggybank/piggybank.jar';
divs = load 'NYSE_dividends';
backwards = foreach divs generate
    org.apache.pig.piggybank.evaluation.string.Reverse($1);
- 15. • Invoking static Java methods
• FLATTEN
• TOKENIZE
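A common way to see TOKENIZE and FLATTEN together is a word count, followed by a static Java method call through one of Pig's built-in invoker UDFs. This is a sketch only; the input files `input.txt` and `numbers.txt` and all aliases are hypothetical.

```pig
-- Word count: TOKENIZE splits each line into a bag of words,
-- and FLATTEN turns that bag into one row per word.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;

-- Static Java method: wrap java.lang.Integer.toHexString(int) as a UDF.
DEFINE hex InvokeForString('java.lang.Integer.toHexString', 'int');
nums  = LOAD 'numbers.txt' AS (n:int);
hexed = FOREACH nums GENERATE n, hex(n);
DUMP hexed;
```

The invoker relies on reflection, so a dedicated UDF is usually the better choice when the call sits on a hot path.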
- 16. 0.10.0 Features
• Ruby UDFs
• PigStorage with schemas
• Additional UDF improvements
• Language Improvements
– Boolean type
– OTHERWISE branch for SPLIT
– Maps, Bags and Tuples can be generated without UDFs
– Register collection of jars
• Performance Improvements
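Two of the language improvements listed above can be sketched as follows, assuming Pig 0.10.0 or later and a hypothetical input file `students.txt`; the schema and aliases are illustrative.

```pig
-- Hypothetical input and schema.
A = LOAD 'students.txt' AS (name:chararray, age:int, gpa:double);

-- OTHERWISE designates the default relation of a SPLIT.
SPLIT A INTO minors IF age < 18, adults OTHERWISE;

-- Construct a tuple, a bag, and a map directly in GENERATE,
-- without calling TOTUPLE, TOBAG, or TOMAP.
B = FOREACH adults GENERATE (name, age), {(name, age)}, [name, gpa];
DESCRIBE B;
```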
- 17. Current work in progress
• DateTime data type
• CUBE, ROLLUP and RANK operators
• Native support for Windows
• Lower memory footprint
- 18. References
• Labs are from
– https://github.com/alanfgates/programmingpig
– https://github.com/michiard/CLOUDS-LAB
• 0.10.0 Features and current WIP
– http://www.slideshare.net/hortonworks/pig-out-to-hadoop by Alan
Gates
- 19. Hortonworks Training
The expert source for
Apache Hadoop training & certification
Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team.
– The “right” course, with the most extensive and realistic hands-on materials
– Provides an immersive experience into real-world Hadoop scenarios
– Public and Private courses available
Comprehensive Apache Hadoop Certification
– Become a trusted and valuable
Apache Hadoop expert
- 20. Thank You!
Questions & Answers
Ravi Mutyala
Systems Architect
Hortonworks
Twitter: @rmutyala
www.hortonworks.com