4. What is Big Data?
Data Source                  Size

Gigabytes - Normal size for relational databases
  Wikipedia Database Dump    9GB
  Open Street Map            19GB

Terabytes - Relational databases may start to experience scaling issues
  Common Crawl               81TB
  1000 Genomes               200TB

Petabytes - Relational databases struggle to scale without a lot of fine tuning
  Large Hadron Collider      15PB annually
Tuesday, April 9, 13
5. Working With Data
Expectation Reality
• Different File Formats
• Missing Values
• Inconsistent Schema
• Loosely Structured
• Lots of it
6. MapReduce
• Map - Emit key/value pairs from data
• Reduce - Collect data with common keys
• Tries to minimize moving data between nodes
Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview
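The map and reduce steps above can be sketched in plain Python with a word count. This is only a single-process illustration of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and the `shuffle` step stands in for the grouping a real framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit one (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key (a framework does this between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values that share a key into one result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["pig eats anything", "Pig eats data"])
```

On a cluster, the map calls run on the nodes that already hold the data, and only the emitted pairs travel across the network to the reducers.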
7. MapReduce Issues
• Very low-level abstraction
• Cumbersome Java API
• Unfamiliar to data analysts
• Rudimentary support for data pipelines
8. Pig
• Eats anything
• SQL-like, procedural data flow language
• Extensible with Java, Jython, Groovy, Ruby or JavaScript
• Provides opportunities to optimize workflows
13. Relational Operators
LIMIT GROUP FILTER CROSS
COGROUP JOIN STORE DISTINCT
FOREACH LOAD ORDER UNION
14. Built In Functions
COS SIN AVG SUM
COUNT RANDOM LOWER UPPER
CONCAT MAX MIN TOKENIZE
15. User Defined Functions
• Easy way to add arbitrary code to Pig
• Eval - Filter, aggregate, or evaluate
• Storage - Load/Store data
• Full support for Java and Jython
• Experimental support for Groovy, Ruby and JavaScript
27. Space Catalog
• 14,000+ objects in public catalog
• Use Two Line Element sets to propagate out positions and velocities
• Can generate over 100 million positions & velocities per day
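A quick back-of-the-envelope check of what those figures imply per object. It treats 14,000 objects and 100 million positions per day as round numbers, and the variable names are only illustrative.

```python
objects = 14000                        # size of the public catalog (rounded)
positions_per_day = 100 * 1000 * 1000  # positions & velocities generated per day

# Average number of propagated states per object per day.
per_object = positions_per_day / float(objects)

# Average spacing between consecutive states for one object, in seconds.
seconds_between = 86400 / per_object
```

Under these assumptions each object gets roughly seven thousand states per day, or about one every twelve seconds.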
28. Two Line Elements
ISS (ZARYA)
1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927
2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537
• Use Python script to convert to Pig friendly TSV
• Create Python UDF to parse TLE into parameters
• Use Python UDF with Java libraries to propagate out positions
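The first step, converting raw TLE text into tab-separated rows that Pig can load, might look roughly like the sketch below. It assumes the input is in the three-line form shown above (name, line 1, line 2). The function name `tle_to_tsv` and the three-column output layout are assumptions made for this example, not the script used in the talk.

```python
def tle_to_tsv(lines):
    """Yield one tab-separated row (name, line 1, line 2) per satellite."""
    # Drop blank lines and trailing whitespace, but keep the internal spacing:
    # TLE lines are fixed-width, so column positions must survive.
    cleaned = [line.rstrip() for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 2, 3):
        name, line1, line2 = cleaned[i:i + 3]
        if not (line1.startswith("1 ") and line2.startswith("2 ")):
            raise ValueError("Unexpected TLE layout near %r" % name)
        yield "\t".join([name.strip(), line1, line2])


sample = [
    "ISS (ZARYA)",
    "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927",
    "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537",
]
rows = list(tle_to_tsv(sample))
```

Keeping each element line intact in its own column leaves the field-level parsing to a UDF, where the fixed column positions can be sliced directly.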
29. Python UDFs
• Easy way to extend Pig with new functions
• Uses Jython which is at Python 2.5
• Cannot take advantage of libraries with C dependencies (e.g. numpy, scikits)
• Can use Java classes
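A minimal sketch of what such a UDF file can look like. When Pig runs a script through Jython it makes an `outputSchema` decorator available. The fallback definition below is only a stand-in so the file can also be imported and tested outside Pig. The function `catalog_number` and its schema string are illustrative choices, not taken from the talk.

```python
try:
    outputSchema  # provided by Pig when the script is run through Jython
except NameError:
    # Stand-in so the file can also be imported and tested outside Pig.
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("catalog_number:int")
def catalog_number(tle_line1):
    """Return the satellite catalog number from columns 3-7 of TLE line 1."""
    if tle_line1 is None:
        return None  # null inputs arrive as None; pass them through
    return int(tle_line1[2:7])
```

In a Pig script, a file like this (hypothetically saved as tle_udfs.py) would typically be registered with `REGISTER 'tle_udfs.py' USING jython AS tle;` and then called as `tle.catalog_number(line1)` inside a FOREACH.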
30. TLE parsing
Field: BSTAR Drag (Decimal Assumed)
Columns: 54-61
Example: -11606-4
def parse_tle_number(tle_number_string):
    """Convert a TLE 'decimal assumed' field such as '-11606-4' to a float."""
    split_string = tle_number_string.strip().split('-')
    if len(split_string) == 3:
        # Negative mantissa, negative exponent: '-11606-4' -> -0.11606e-4
        new_number = '-0.' + split_string[1] + 'e-' + split_string[2]
    elif len(split_string) == 2:
        # Positive mantissa, negative exponent: '11606-4' -> 0.11606e-4
        new_number = '0.' + split_string[0] + 'e-' + split_string[1]
    elif len(split_string) == 1:
        # No exponent: '0006703' -> 0.0006703
        new_number = '0.' + split_string[0]
    else:
        raise TypeError('Input is not in the TLE float format')
    return float(new_number)
Full parser at https://gist.github.com/shawnhermans/4569360
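Because TLE lines are fixed-width, individual fields can be read by slicing on their column positions. The sketch below pulls the BSTAR drag term (columns 54-61) and the epoch (columns 19-32) out of line 1 of the ISS example. The helper names `decimal_assumed` and `epoch_to_datetime` are made up for this illustration and are not taken from the linked gist.

```python
from datetime import datetime, timedelta

LINE1 = "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927"


def decimal_assumed(field):
    """Convert a TLE field like '-11606-4' (meaning -0.11606e-4) to a float."""
    field = field.strip()
    sign = "-" if field.startswith("-") else ""
    digits = field.lstrip("+-")
    mantissa, exponent = digits[:-2], digits[-2:]  # e.g. '11606' and '-4'
    return float("%s0.%se%s" % (sign, mantissa, exponent))


def epoch_to_datetime(field):
    """Convert a TLE epoch (YYDDD.DDDDDDDD) to a datetime."""
    year = int(field[:2])
    year += 2000 if year < 57 else 1900  # two-digit years 57-99 mean 19xx
    day_of_year = float(field[2:])       # day 1.0 is midnight on January 1
    return datetime(year, 1, 1) + timedelta(days=day_of_year - 1)


bstar = decimal_assumed(LINE1[53:61])    # columns 54-61
epoch = epoch_to_datetime(LINE1[18:32])  # columns 19-32
```

Python slices are zero-based and end-exclusive, so columns 54-61 become `[53:61]`. This only works if the original column spacing of the line has been preserved.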
46. Other Useful Tools
• Python-dateutil : Super-duper date parser
• Oozie : Hadoop workflow engine
• Piggybank and Elephant Bird : 3rd party Pig libraries
• Chardet : Character encoding detection library for Python
Tuesday, April 9, 13
47. Parting Thoughts
• Great ETL tool/language
• Flexible enough to write general purpose MapReduce jobs
• Limited, but emerging 3rd party libraries
• Jython for UDFs is extremely limiting (Spark?)
Twitter: @shawnhermans
Email: shawnhermans@gmail.com