2. Thanks for Coming
• Extracting and organizing unstructured data may be less exciting than
creating visualizations, but it’s also important!
• Civic applications include:
• Open Government / Government Transparency
• Data Journalism
• This work also has commercial applications, which is why expensive
enterprise software has been created to address this problem.
3. Some of Our Challenges
• Government Financial Statements
• IRS Form 990s (Non-Profit Disclosures)
• House of Representatives Financial Disclosures
• Compiling a History of Torture
8. Three Inter-Related Problems …
• Extracting data from PDFs that contain embedded text
• Using Optical Character Recognition (OCR) to generate text from PDFs
of scans or photographs
• Transforming unstructured text and numbers into a form that can be
readily analyzed. A related IT term is ETL (Extract-Transform-Load)
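As a rough illustration of the third problem, here is a minimal Python sketch of a transform-and-load step. It assumes the text has already been pulled out of the PDF and that each line of interest is a label followed by a dollar amount, as in many financial statements. The pattern, the function names (parse_amount, transform, load_csv) and the sample lines are all hypothetical; real documents will need their own rules.

```python
import csv
import io
import re

# Hypothetical line shape: a label, a gap of spaces or dot leaders, then an amount.
LINE_PATTERN = re.compile(
    r"^(?P<label>.+?)[\s.]{2,}\$?(?P<amount>\(?-?\d[\d,]*(?:\.\d+)?\)?)\s*$"
)


def parse_amount(raw):
    """Turn '1,234' or '(1,234)' into a float; parentheses mean negative."""
    negative = raw.startswith("(") and raw.endswith(")")
    value = float(raw.strip("()").replace(",", ""))
    return -value if negative else value


def transform(text):
    """Keep only the lines that look like 'label ... amount'."""
    rows = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            rows.append((match.group("label").strip(),
                         parse_amount(match.group("amount"))))
    return rows


def load_csv(rows):
    """Write the rows as CSV text that a spreadsheet or database can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["label", "amount"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample text standing in for the output of an extraction step.
sample = """Notes to the financial statements
Total revenue      1,234,567
Grants paid ........ (45,000)
Program services  $2,500.50"""

rows = transform(sample)
csv_text = load_csv(rows)
```

Lines that do not fit the pattern (such as the heading in the sample) are silently skipped here; a real project would probably want to log them for review.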
9. … and some Open Source Solutions
• Extracting data from PDFs that contain embedded text
PDFBox, Poppler
• Using Optical Character Recognition (OCR) to generate text from PDFs
of scans or photographs
Tesseract
• Transforming unstructured text and numbers into a form that can be
readily analyzed. A related IT term is ETL (Extract-Transform-Load)
Tabula (for table identification), OpenRefine
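To make the first two rows more concrete, here is a hedged Python sketch that drives the Poppler and Tesseract command-line tools. It assumes pdftotext, pdftoppm and tesseract are installed and on the PATH. The helper names and the file names are hypothetical, and Tesseract is given page images here, so scanned PDFs are first rendered to PNG with Poppler's pdftoppm.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Poppler: extract embedded text, keeping the physical page layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Poppler: render each page of a scanned PDF to a PNG image for OCR."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def tesseract_command(image_path, output_base, lang="eng"):
    """Tesseract: OCR one page image; it writes output_base + '.txt'."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_if_available(command):
    """Run the command only if its executable can be found on the PATH."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# Hypothetical file names, for illustration only:
# run_if_available(pdftotext_command("statement.pdf", "statement.txt"))
# run_if_available(pdftoppm_command("scan.pdf", "page"))
```

A common first test is to run pdftotext and look at the result: if the output is empty or nearly so, the PDF is probably a scan and needs the OCR route.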
10. … or Licensed Solutions
• Extracting data from PDFs that contain embedded text
PDFlib TET (Text Extraction Toolkit)
• Using Optical Character Recognition (OCR) to generate text from PDFs
of scans or photographs
ABBYY (FineReader or Cloud SDK)
• Transforming unstructured text and numbers into a form that can be
readily analyzed. A related IT term is ETL (Extract-Transform-Load)
SIMX Text Converter
11. My Advice
• Choose a pre-specified challenge or pick another type of PDF that interests
you
• Establish a clear idea of what data you want to extract and how to arrange it
• Determine which of the three operations have to be performed on the PDFs
• Test a couple of tools on the PDFs you’re working with to see which works
better, or decide to develop your own solution from scratch
• Put together your solution, test it and check it into GitHub
• Don’t get discouraged: you may still have a great project even if you can only
achieve partial automation. This is about reducing manual work, not
necessarily eliminating it.
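One way to picture partial automation is a triage step: whatever the code can parse is handled automatically, and everything else is set aside for a person to check. The Python sketch below is only an illustration; triage and parse_pair are hypothetical names, and the "name: number" format is made up for the example.

```python
def triage(lines, parse):
    """Split lines into automatically parsed records and lines needing manual review."""
    parsed, needs_review = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines need neither parsing nor review
        try:
            parsed.append(parse(line))
        except ValueError:
            # Keep the line number so a reviewer can find it in the source.
            needs_review.append((number, line))
    return parsed, needs_review


def parse_pair(line):
    """Hypothetical parser: expects 'name: integer', e.g. 'Rent: 24,000'."""
    name, separator, value = line.partition(":")
    if not separator:
        raise ValueError("missing ':' separator")
    return name.strip(), int(value.strip().replace(",", ""))


# Made-up sample lines standing in for extracted text.
lines = ["Salaries: 120,000", "", "see attached schedule", "Rent: 24000"]
parsed, needs_review = triage(lines, parse_pair)
```

Even if a large share of lines end up in needs_review, the automated share is still manual work saved, and the review list tells you which parsing rules to add next.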
12. Rules
• Trying to keep rules to a minimum!
• You can work at RallyPad or anywhere else
• Unless I hear a groundswell of protest, hours at RallyPad will be:
• Tonight until 10:30
• Tomorrow from 8:00 to 6:00
• Sunday from 8:00 until judging, which will start at Noon
• To be eligible for judging:
• Your code must be fully open and checked into GitHub by Noon Sunday
• You can use open source or licensed components, but if you use the
latter, trial limitations must not handicap your solution at judging