Hadoop Jute Record Python
1. Hadoop Record Reader in Python HUG: Nov 18 2009 Paul Tarjan http://paulisageek.com @ptarjan http://github.com/ptarjan/hadoop_record
2. Hey Jute… Tabs and newlines are good and all, but for lots of data, don’t do that
3. don’t make it bad... Hadoop has a native data storage format called Hadoop Record, or “Jute” (org.apache.hadoop.record) http://en.wikipedia.org/wiki/Jute
4. take a data structure… There is a Data Definition Language! module links { class Link { ustring URL; boolean isRelative; ustring anchorText; }; }
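rcc only generates C++/Java bindings for a DDL like this, but conceptually the Link record maps onto a plain class. A hedged Python sketch — the Link name and its three fields come from the DDL above; the dataclass itself is illustrative, not the output of any real rcc backend:

```python
from dataclasses import dataclass

# Hypothetical Python equivalent of the Jute DDL record above.
# Field names and types mirror the DDL; this is illustrative only.
@dataclass
class Link:
    URL: str          # ustring URL
    isRelative: bool  # boolean isRelative
    anchorText: str   # ustring anchorText

link = Link(URL="http://example.com/a", isRelative=False, anchorText="a page")
```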
5. and make it better… And a compiler $ rcc -lc++ inclrec.jr testrec.jr namespace inclrec { class RI : public hadoop::Record { private: int32_t I32; double D; std::string S;
6. remember, you can only use C++/Java $ rcc --help Usage: rcc --language [java|c++] ddl-files
7. then you can start to make it better… I wanted it in Python. It needs 2 parts: a parsing library and a DDL translator. I only did the first part; if you need the second part, let me know
9. you were made to go out and get her… http://github.com/ptarjan/hadoop_record
10. the minute you let her under your skin… I bet you thought I was done with “Hey Jude” references, eh? How I built it: Ply == lex and yacc. Parser == 234 lines, including tests! Outputs generic data types; you have to do the class transform yourself. You can use my lex and yacc stuff in your language of choice
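The “class transform yourself” step might look like this — assuming (hypothetically) the parsing library hands back a struct as a plain list of field values; both the Link class and that output shape are illustrative, not the library’s actual API:

```python
# Hypothetical: suppose the parser returns a struct as a generic list of
# field values, e.g. ["http://example.com", False, "click here"].
# Turning that generic output into a typed object is left to you:

class Link:
    def __init__(self, URL, isRelative, anchorText):
        self.URL = URL
        self.isRelative = isRelative
        self.anchorText = anchorText

    @classmethod
    def from_generic(cls, fields):
        # fields: generic parser output (assumed shape, for illustration)
        URL, isRelative, anchorText = fields
        return cls(URL, isRelative, anchorText)

link = Link.from_generic(["http://example.com", False, "click here"])
```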
11. and any time you feel the pain… Parsing the binary format is hard. Vector vs struct??? struct = "s{" record *("," record) "}" vector = "v{" [record *("," record)] "}" LazyString – don’t decode if not needed; 99% of my hadoop time was decoding strings I didn’t need. Binary on disk -> CSV -> python == wasteful. Hadoop unpacks zip files – name it .mod
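The struct/vector grammar above and the LazyString idea can both be sketched in a few lines of Python. Assumptions to be loud about: scalars here are bare unquoted tokens (the real textual format has quoting/escaping), structs become tuples and vectors become lists by my own choice, and LazyString’s decode-on-str behavior is a reconstruction of the idea, not the library’s actual class:

```python
def parse(text):
    """Recursive-descent sketch of the slide's textual grammar:
       struct = "s{" record *("," record) "}"
       vector = "v{" [record *("," record)] "}"
       Scalars are bare tokens here (a simplification)."""
    pos = 0

    def record():
        nonlocal pos
        if text.startswith("s{", pos):
            pos += 2
            items = [record()]              # struct: at least one record
            while text[pos] == ",":
                pos += 1
                items.append(record())
            assert text[pos] == "}"; pos += 1
            return tuple(items)             # structs -> tuples (my choice)
        if text.startswith("v{", pos):
            pos += 2
            items = []
            if text[pos] != "}":            # vector: possibly empty
                items.append(record())
                while text[pos] == ",":
                    pos += 1
                    items.append(record())
            assert text[pos] == "}"; pos += 1
            return items                    # vectors -> lists (my choice)
        start = pos                         # scalar: read to a delimiter
        while pos < len(text) and text[pos] not in ",}":
            pos += 1
        return text[start:pos]

    return record()


class LazyString:
    """Sketch of LazyString: keep raw bytes, decode only if actually used."""
    def __init__(self, raw):
        self.raw = raw
        self._decoded = None
    def __str__(self):
        if self._decoded is None:           # decode at most once, on demand
            self._decoded = self.raw.decode("utf-8")
        return self._decoded
```

For example, `parse("s{http://example.com/a,false,v{one,two}}")` yields `("http://example.com/a", "false", ["one", "two"])` — the `s{`/`v{` prefixes are what disambiguate vector vs struct.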
12. nanananana… Future work: DDL converter; integrate it officially; record writer (should be easy); SequenceFileAsOutputFormat; integrate your feedback