Hadoop Jute Record Python
1. Hadoop Record Reader in Python HUG: Nov 18 2009 Paul Tarjan http://paulisageek.com @ptarjan http://github.com/ptarjan/hadoop_record
2. Hey Jute… Tabs and newlines are good and all, but for lots of data, don’t do that
3. don’t make it bad... Hadoop has a native data storage format called Hadoop Record, or “Jute” (org.apache.hadoop.record) http://en.wikipedia.org/wiki/Jute
4. take a data structure… There is a Data Definition Language! module links { class Link { ustring URL; boolean isRelative; ustring anchorText; }; }
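rcc only generates C++/Java bindings for a DDL like this, but conceptually the Link record maps onto a plain class. A hedged Python sketch — the Link name and its three fields come from the DDL above; the dataclass itself is illustrative, not the output of any real rcc backend:

```python
from dataclasses import dataclass

# Hypothetical Python equivalent of the Jute DDL record above.
# Field names and types mirror the DDL; this is illustrative only.
@dataclass
class Link:
    URL: str          # ustring URL
    isRelative: bool  # boolean isRelative
    anchorText: str   # ustring anchorText

link = Link(URL="http://example.com/a", isRelative=False, anchorText="a page")
```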
5. and make it better… And a compiler $ rcc -lc++ inclrec.jr testrec.jr namespace inclrec { class RI : public hadoop::Record { private: int32_t I32; double D; std::string S;
6. remember, you can only use C++/Java $ rcc --help Usage: rcc --language [java|c++] ddl-files
7. then you can start to make it better… I wanted it in Python. It needs 2 parts: a parsing library and a DDL translator. I only did the first part; if you need the second part, let me know
9. you were made to go out and get her… http://github.com/ptarjan/hadoop_record
10. the minute you let her under your skin… I bet you thought I was done with “Hey Jude” references, eh? How I built it: Ply == lex and yacc. Parser == 234 lines, including tests! Outputs generic data types; you have to do the class transform yourself. You can use my lex and yacc stuff in your language of choice
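The “class transform yourself” step might look like this — assuming (hypothetically) the parsing library hands back a struct as a plain list of field values; both the Link class and that output shape are illustrative, not the library’s actual API:

```python
# Hypothetical: suppose the parser returns a struct as a generic list of
# field values, e.g. ["http://example.com", False, "click here"].
# Turning that generic output into a typed object is left to you:

class Link:
    def __init__(self, URL, isRelative, anchorText):
        self.URL = URL
        self.isRelative = isRelative
        self.anchorText = anchorText

    @classmethod
    def from_generic(cls, fields):
        # fields: generic parser output (assumed shape, for illustration)
        URL, isRelative, anchorText = fields
        return cls(URL, isRelative, anchorText)

link = Link.from_generic(["http://example.com", False, "click here"])
```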
11. and any time you feel the pain… Parsing the binary format is hard. Vector vs struct??? struct = "s{" record *("," record) "}" vector = "v{" [record *("," record)] "}" LazyString – don’t decode if not needed; 99% of my hadoop time was decoding strings I didn’t need. Binary on disk -> CSV -> python == wasteful. Hadoop unpacks zip files – name it .mod
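The struct/vector grammar above and the LazyString idea can both be sketched in a few lines of Python. Assumptions to be loud about: scalars here are bare unquoted tokens (the real textual format has quoting/escaping), structs become tuples and vectors become lists by my own choice, and LazyString’s decode-on-str behavior is a reconstruction of the idea, not the library’s actual class:

```python
def parse(text):
    """Recursive-descent sketch of the slide's textual grammar:
       struct = "s{" record *("," record) "}"
       vector = "v{" [record *("," record)] "}"
       Scalars are bare tokens here (a simplification)."""
    pos = 0

    def record():
        nonlocal pos
        if text.startswith("s{", pos):
            pos += 2
            items = [record()]              # struct: at least one record
            while text[pos] == ",":
                pos += 1
                items.append(record())
            assert text[pos] == "}"; pos += 1
            return tuple(items)             # structs -> tuples (my choice)
        if text.startswith("v{", pos):
            pos += 2
            items = []
            if text[pos] != "}":            # vector: possibly empty
                items.append(record())
                while text[pos] == ",":
                    pos += 1
                    items.append(record())
            assert text[pos] == "}"; pos += 1
            return items                    # vectors -> lists (my choice)
        start = pos                         # scalar: read to a delimiter
        while pos < len(text) and text[pos] not in ",}":
            pos += 1
        return text[start:pos]

    return record()


class LazyString:
    """Sketch of LazyString: keep raw bytes, decode only if actually used."""
    def __init__(self, raw):
        self.raw = raw
        self._decoded = None
    def __str__(self):
        if self._decoded is None:           # decode at most once, on demand
            self._decoded = self.raw.decode("utf-8")
        return self._decoded
```

For example, `parse("s{http://example.com/a,false,v{one,two}}")` yields `("http://example.com/a", "false", ["one", "two"])` — the `s{`/`v{` prefixes are what disambiguate vector vs struct.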
12. nanananana… Future work: DDL converter; integrate it officially; record writer (should be easy); SequenceFileAsOutputFormat; integrate your feedback