Mapper 1: RFC822 Parser
map_parse.py takes a list of URIs from which to read email messages, parses
each message, then emits several kinds of output tuples:
(doc_id, msg_uri, date)
(sender, receiver, doc_id)
(term, term_freq, doc_id)
(term, co_term, doc_id)
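
As a rough illustration of the four tuple kinds listed above, here is a minimal sketch of a per-message parse step built on Python's standard email package. It is not the actual contents of map_parse.py. The function name map_parse, the "doc"/"edge"/"term"/"pair" tags, the use of Message-ID as doc_id, the lowercase-letters tokenizer, and the reading of co_term as "another term in the same message" are all assumptions made for the example.

```python
import email
import re
from collections import Counter
from email.utils import getaddresses, parsedate_to_datetime
from itertools import combinations


def map_parse(msg_uri, raw_text):
    """Yield (kind, tuple) pairs for one RFC 822 message (hypothetical helper)."""
    msg = email.message_from_string(raw_text)

    # Assumption: the Message-ID header serves as doc_id; fall back to the URI.
    doc_id = (msg.get("Message-ID") or msg_uri).strip()

    # (doc_id, msg_uri, date); date is None if the header is missing or malformed.
    try:
        date = parsedate_to_datetime(msg.get("Date")).isoformat()
    except (TypeError, ValueError):
        date = None
    yield "doc", (doc_id, msg_uri, date)

    # (sender, receiver, doc_id): one tuple per sender/receiver pair.
    senders = [a for _, a in getaddresses(msg.get_all("From", [])) if a]
    receivers = [a for _, a in getaddresses(
        msg.get_all("To", []) + msg.get_all("Cc", [])) if a]
    for sender in senders:
        for receiver in receivers:
            yield "edge", (sender.lower(), receiver.lower(), doc_id)

    # Assumption: only single-part plain-text bodies are tokenized.
    body = "" if msg.is_multipart() else msg.get_payload()
    terms = Counter(re.findall(r"[a-z]+", body.lower()))

    # (term, term_freq, doc_id): one tuple per distinct term in the message.
    for term, freq in sorted(terms.items()):
        yield "term", (term, freq, doc_id)

    # (term, co_term, doc_id): each unordered pair of distinct terms, once.
    for term, co_term in combinations(sorted(terms), 2):
        yield "pair", (term, co_term, doc_id)
```

Note that the pair output grows quadratically with the number of distinct terms in a message, so a real mapper would likely filter stopwords or cap the pairs it emits.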
Note that our dataset includes approximately 500,000 email messages, with an
average of about 100 words in each message.
Also, there are 10E+5 unique terms. That count tends to stay roughly constant
across English texts, which is useful to know when configuring capacity.
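
For a back-of-envelope sense of output volume, the message count and average length quoted above can be multiplied out. The variable names below are illustrative, and the term-tuple figure is only an upper bound that assumes every word in a message is distinct.

```python
NUM_MESSAGES = 500_000          # approximate dataset size, from the text
AVG_WORDS_PER_MESSAGE = 100     # approximate average, from the text

# One (doc_id, msg_uri, date) tuple per message.
doc_tuples = NUM_MESSAGES

# Total word tokens the mapper has to scan.
total_tokens = NUM_MESSAGES * AVG_WORDS_PER_MESSAGE

# Upper bound on (term, term_freq, doc_id) tuples: reached only if no
# word repeats within any message; repeats collapse into term_freq.
max_term_tuples = total_tokens

print(f"doc tuples:        {doc_tuples:,}")
print(f"word tokens:       {total_tokens:,}")
print(f"term tuples (max): {max_term_tuples:,}")
```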