Pig at LinkedIn, by Chris Riccomini (LinkedIn)
Pig is an integral part of data analytics at LinkedIn. Learn about LinkedIn’s analytic stack, and see how Pig is used to design, develop, and deliver data products at LinkedIn. We’ll explore a successful example of Pig deployment at LinkedIn, pain points, and integration with Azkaban, Voldemort, Hadoop, and the rest of LinkedIn’s ecosystem.
Chris Riccomini
Senior Data Scientist at LinkedIn
Involved in People You May Know, Who's Viewed My Profile, Avatara, and distributed computing at LinkedIn
Previously worked on PayPal's anti-fraud team as a data visualization engineer
Talking about LinkedIn's analytics environment, the motivation for Pig at LinkedIn, how we integrated it, and Pig in the future.
The analytics stack: Aster, Hadoop, Voldemort, Azkaban, Pig.
40% of the jobs we run are Pig.
Production products that use Pig:
PYMK (People You May Know), ads, profile stats, Jobs For You, Talent Match, Groups You Might Like, browse maps, the experimentation platform.
In early 2009 we were working on converting PYMK from Aster to Hadoop.
Everything was Java based.
We were tired of writing joins, filters, etc. (glue code).
Built and deployed Pig on a laptop while at a conference.
Wrote a serializer in a few days.
It significantly sped up delivery time for PYMK.
The motivation was not ad hoc queries, SQL, or business analytics.
The motivation was product analytics, and PRODUCTION products.
Stability was key.
Reproducibility was key.
Simplicity/understandability was key (both the scripts and the system itself).
"If it runs now, it will always run."
As streaming became more popular, Pig is still used as glue, but complex jobs are now written in Python instead of Java.
we use "voldemort" serialization (binary json) .. basically the same as avro
not much csv (pigstorage) used
some pain was involved in writing/updating the serializer (0.3 interface was insufficient)
we use "voldemort" serialization (binary json) .. basically the same as avro
not much csv (pigstorage) used
some pain was involved in writing/updating the serializer (0.3 interface was insufficient)
we use "voldemort" serialization (binary json) .. basically the same as avro
not much csv (pigstorage) used
some pain was involved in writing/updating the serializer (0.3 interface was insufficient)
We use Pig to read from and write to Voldemort.
All writes are currently done with read-only stores.
Reads are done using Roshan's Voldemort loader func.
You can also use Roshan's Voldemort store func to write directly to read-write stores (sketch below).
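A minimal sketch of how the two funcs fit together in a script. Only VoldemortStorage appears in the talk; the VoldemortStore name, the store URL format, and the field names are assumptions for illustration:

-- read with the Voldemort loader func (assumes the loader supplies a schema with viewee_id)
views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage();
-- count views per member
grouped = GROUP views BY viewee_id;
counts = FOREACH grouped GENERATE group AS viewee_id, COUNT(views) AS num_views;
-- hypothetical store func writing directly to a read-write store
STORE counts INTO 'tcp://voldemort-host:6666/profile-view-counts' USING VoldemortStore();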
One problem we had with Pig was how to handle folders partitioned by date (yyyy/mm/dd).
Some people were querying the root directory and filtering out only the days they needed.
Other people were writing custom jobs that would add only the subfolders they were interested in as input paths.
Our solution was to add a date-filter parameter to the Voldemort loader:
views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage('date.range', 'num.days=90;days.ago=1');
member_position = LOAD '/data/etl/replicated/member/member_position/#LATEST' USING VoldemortStorage();
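Conceptually, the first load's num.days=90;days.ago=1 filter expands the root into one dated input path per day in a rolling 90-day window ending yesterday; the dates below are illustrative:

/data/etl/tracking/extracted/profile-view/2010/06/20
/data/etl/tracking/extracted/profile-view/2010/06/19
...
/data/etl/tracking/extracted/profile-view/2010/03/23

The #LATEST token in the second load presumably resolves to the most recent dated subfolder, so the script always reads the newest snapshot.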
We use Azkaban (like a very simple version of Oozie).
Azkaban has a "pig" job type: specify type=pig and pig.script=path/to/pig/script.pig (example job file below).
It supports parameter passing between Azkaban properties and Pig parameters.
Azkaban also provides resource locking,
dependencies,
and scheduling.
This makes it very easy to write a production Pig job:
write the Pig file,
write the job file,
throw the Pig and job files into a zip,
upload the zip.
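A sketch of such a job file. Only type=pig and pig.script come from the talk; the dependencies key and the param.* convention for passing Pig parameters are assumptions about Azkaban's job-file format:

# view-counts.job (illustrative)
type=pig
pig.script=scripts/view-counts.pig
# hypothetical: run after another job in the same zip
dependencies=extract-views
# hypothetical: surfaced to the Pig script as $NUM_DAYS
param.NUM_DAYS=90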
We're just starting to use Pig for ad hoc analysis.
Mostly engineers are using it now.
Some business analysts are starting to use it.
We're also looking at Hive.
Coming up: Pig 0.8, Avro, Hive, UDFs.
Dates.
The promise of Pig as a generic MapReduce language (not just Hadoop).
Fix the data structures.
More JSON.