One way to approach MapReduce jobs in Hadoop is to use streaming.
In other words, use any kind of script which can be run from a command
line and read/write data via stdin and stdout:
The following examples use Python scripts for Hadoop Streaming. One
really great benefit is that then you can dev/test/debug your MapReduce
code on small data sets from a command line simply by using pipes:
cat input.txt | mapper.py | sort | reducer.py
BTW, there are much better ways to handle Hadoop Streaming in Python
on Elastic MapReduce – for example, using the “boto” library. However,
these examples are kept simple so they’ll fit into a tech talk!