4. Data Gathering
• Public data websites
o data.gov.in
o databank.worldbank.org
• Social websites
o facebook.com
o twitter.com
• Blogs, websites, etc. via scraping
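A minimal sketch of the scraping idea, using only grep and sed: pull the link targets out of a saved HTML page. The temporary file and the example.org URLs are placeholders invented for illustration; in practice the page would first be downloaded with a tool such as curl or wget. Pattern matching like this is fragile on real-world HTML, where a proper parser is usually the safer choice.

```shell
# Hypothetical saved page (stand-in for a downloaded blog or website).
page=$(mktemp)
cat > "$page" <<'EOF'
<html><body>
<a href="https://example.org/a.csv">Dataset A</a>
<a href="https://example.org/b.csv">Dataset B</a>
<p>No link here</p>
</body></html>
EOF

# Print only the href="..." attributes, then strip the attribute wrapper
# so that one bare URL remains per line.
links=$(grep -o 'href="[^"]*"' "$page" | sed 's/^href="//; s/"$//')

echo "$links"
rm -f "$page"
```

The resulting list of URLs can then be fed to a downloader, one line at a time.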
Twitter: @thinrhino
5. Data cleaning
• E.g. OpenRefine
o OpenRefine (formerly Google Refine) is a powerful tool for working with messy
data: cleaning it, transforming it from one format into another, extending it
with web services, and linking it to databases like Freebase.
o openrefine.org
6. Classic Unix Tools
• sed / awk
• Shell scripts
• GNU parallel
o Examples:
o cat rands20M.txt | awk '{s+=$1} END {print s}'
o cat rands20M.txt | parallel --pipe awk '{s+=$1} END {print s}' | awk '{s+=$1} END {print s}'
o wc -l bigfile.txt
o cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
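The examples above share one pattern: split the input into blocks, compute a partial result per block, then combine the partial results with the same awk program. The sketch below imitates that pattern with split and background jobs instead of GNU parallel, so it runs where parallel is not installed. It is a swapped-in illustration, not how parallel is implemented, and the file names (nums.txt, chunk_*) and the chunk size are assumptions.

```shell
# Scratch directory for the hypothetical sample data.
workdir=$(mktemp -d)

# Sample input: the integers 1..1000, one per line.
seq 1 1000 > "$workdir/nums.txt"

# Baseline: sum column 1 in a single awk process.
serial=$(awk '{s+=$1} END {print s}' "$workdir/nums.txt")

# Split into 250-line chunks and sum each chunk in its own background job.
split -l 250 "$workdir/nums.txt" "$workdir/chunk_"
for f in "$workdir"/chunk_*; do
  awk '{s+=$1} END {print s}' "$f" > "$f.sum" &
done
wait  # block until every partial sum has been written

# Combine the partial sums with the same awk program.
chunked=$(cat "$workdir"/chunk_*.sum | awk '{s+=$1} END {print s}')

echo "serial=$serial chunked=$chunked"
rm -rf "$workdir"
```

Both totals should agree. The same split-and-combine shape works for counting lines: replace the per-chunk awk with wc -l and keep the final summing step.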