Bringing Data Science to the Speakers of Every Language
Robert Munro, PhD
CEO, Idibon
ACM Conference on Knowledge Discovery and Data Mining (KDD), 2014
idibon
About me: technology and global development
CEO, Idibon
Global text analytics in 50+ languages
Working with leaders in industry & social good
Industry: CTO / CIO
Energy infrastructure in Liberia and Sierra Leone
Global epidemic tracking
Crowdsourcing and natural language processing for disaster response
Other
Ph.D. in NLP from Stanford
Bicycled 20+ countries

Recommendations for language processing for social good
Look beyond English
There is inherent benefit in understanding and supporting speakers of every language
Employ people in those languages
Crowdsourced workers speak 100s of languages, and want to use them
Embrace the variation
You can’t rely on consistent spellings, but you can learn to model the diversity

How many languages are in the connected world? How has this changed?
[Chart: number of languages (y-axis) by year (x-axis). Value labels: 5, 5, 5, 5, 5, 5, 4.5, 4 and 50, 1500, 5000, 2000, 1400, 720, 540, 500]
Putting a phone in the hands of everyone on the planet is the easy part.
Understanding everyone is going to be more complicated.

Every human communication this year
Source: Ethnologue, Nationalencyklopedin

7% of our communications are digital; most are still direct spoken language.

If every online picture were worth a thousand words, pictures would double the size of social media.
[Block label: Every picture]

Every 3 months, the world's text messages exceed the word count of every book.
Every book. Ever.
Source: Google Books

Print communication is smaller than anything shown.
Ditto any one social network.

The Twitter “firehose” is about the size of the dot above the “i” in English.
It is still beyond the processing capacity of most organizations.
It might not be a representative sample of all human activity for your area of interest.

There are more than 6,000 other languages.
Only the top 1% are shown.

No language from the Americas made the cut.
[Block label: Quechua]

Email spam would be larger than every block except spoken Mandarin (官话).
Source: Mashable

Short messages (SMS and IM) make up 2% of the world’s communications.
They are the largest and most linguistically diverse form of written communication that has ever existed.
Number of PhDs focused on processing large volumes of short messages in low-resource languages? 1

If the Facebook “like” were a one-word language, it would be in the top 5% of languages by word count.

Your browser probably won't show Sundanese script (ᮘᮘᮘᮘᮘᮘᮘ)

Sundanese speakers outnumber the populations of New York, London, Tokyo and Moscow.
Combined.

You misread “Sundanese” as “Sudanese”, which is a variety of Arabic.
We have a blind spot when it comes to knowing that languages exist.

This is the breakdown of languages that most of our data is moving towards

January 12, 2010
An earthquake struck Haiti on January 12, 2010.
Most local services failed, but most cell towers remained functional.

Messages start streaming in

Mission 4636
Message translated, categorized & geolocated
Location is refined & actionable items are identified
Message: “Fanm gen tranche pou fè yon pitit nan Delmas 31”
Translation: Undergoing children delivery
Location: Delmas 31
Coordinates: 18.495746829274168, -72.31849193572998, rounded to (18.4957, -72.3185)
Category: Emergency

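The processed message above can be pictured as a small structured record. This is only a sketch: the class name, field names and rounding helper are assumptions made for illustration and do not describe the actual Mission 4636 data format. The longitude is written as negative, as in the rounded pair on the slide, because Haiti lies in the western hemisphere.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ProcessedMessage:
    """One SMS after translation, categorization and geolocation (hypothetical schema)."""
    original: str
    translation: str
    location_name: str
    latitude: float
    longitude: float
    category: str

    def short_coordinates(self, places: int = 4) -> Tuple[float, float]:
        """Round the coordinates for display, as in the rounded pair on the slide."""
        return (round(self.latitude, places), round(self.longitude, places))


message = ProcessedMessage(
    original="Fanm gen tranche pou fè yon pitit nan Delmas 31",
    translation="Undergoing children delivery",
    location_name="Delmas 31",
    latitude=18.495746829274168,
    longitude=-72.31849193572998,  # negative: west of the prime meridian
    category="Emergency",
)
print(message.short_coordinates())  # (18.4957, -72.3185)
```

Keeping the original text next to the translation lets a second worker refine the location later without losing the source message.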
Global collaboration
2,000 volunteers, with the work later transferred to paid workers in Haiti

Lopital Sacre-Coeur ki nan vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la.
“Sacre-Coeur Hospital which located in this village of Okap is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.”

Local knowledge
Workers collaborating to find locations:
Dalila: I need Thomassin Apo please
Apo: Kenscoff Route: Lat: 18.495746829274168, Long: -72.31849193572998
Apo: This Area after Petion-Ville and Pelerin 5 is not on Google Map. We have no streets name
‘here’ = anywhere
Feedback from responders:
"just got emergency SMS, child delivery, USCG are acting, and, the GPS coordinates of the location we got from someone of your team were 100% accurate!"
(18.4957, -72.3185)
[Map labels: Apo, Dalila, Haiti responders]
The ability for someone to make a real-time difference at any other place in the world:
Apo: I know this place like my pocket
Dalila: thank God u was here

How do we automate processing the world’s data?

English
Generations of standardization in spelling and simple morphology
Whole words suitable as features for NLP systems
Most other languages
Relatively complex morphology
Less (observed) standardized spellings
More dialectal variation

Haitian Kreyòl
No standard (widespread) spellings
More or less French spellings
More or less phonetic spellings
Frequent words (especially pronouns) are shortened and compounded
Regional slang / abbreviations

Haitian Kreyòl
mèsi, mesi, mèci, merci
Cap-Haïtien, Kapayisyen

The extent of the subword variation
>30 spellings of odwala (‘patient’) in Chichewa
>50% of the variants of ‘odwala’ occur only once in the data used here
Affixes and incorporation:
‘kwaodwala’ -> ‘kwa + odwala’
‘ndiodwala’ -> ‘ndi odwala’ (official ‘ngodwala’ not present)
Phonological/Orthographic:
‘odwara’ -> ‘odwala’
‘ndiwodwala’ -> ‘ndi (w) odwala’

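The four mappings above can be written as a few hand-made rewrite rules. This is a minimal sketch that assumes only the variants listed on the slide: the prefix list and the function name are hypothetical, and a real system would have to learn such rules from data rather than enumerate them.

```python
import re

# Hypothetical rule set, built only from the four examples on the slide.
STEM = "odwala"
PREFIXES = ("kwa", "ndi")  # affixes seen attached directly to the stem


def normalize_variant(token: str) -> str:
    """Map a spelling variant of 'odwala' to a canonical, space-separated form."""
    t = token.lower()
    # Orthographic alternation: 'r' written for 'l' at the end of the stem.
    t = re.sub(r"odwara$", STEM, t)
    for prefix in PREFIXES:
        if t.startswith(prefix) and t.endswith(STEM):
            # Whatever sits between prefix and stem: nothing, or a linking 'w'.
            middle = t[len(prefix):-len(STEM)]
            if middle in ("", "w"):
                return f"{prefix} {STEM}"
    return t


for variant in ("kwaodwala", "ndiodwala", "odwara", "ndiwodwala"):
    print(variant, "->", normalize_variant(variant))
```

Because more than half of the variants occur only once, whole-word features would rarely be seen twice, which is why rules below the word level are worth the effort.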
Chichewa
[Figure: the word odwala (‘patient’) in 600 text messages in Chichewa and the English translations]

Modeling the variation gives accurate results
1) Normalize spellings
2) Segment
3) Identify predictors
Example 1: ndimmafunamanthwala (‘I currently need medicine’)
Normalized: ndimafunamantwala
Segmented: ndi-ma-fun-a man-twala
Predictors: ndi -fun man-twala (“I need medicine”)
Category = “Request for aid”
Example 2: ndi kufuni mantwara (‘my want of medicine’)
Normalized: ndi kufuni mantwala
Segmented: ndi-ku-fun-i man-twala
Predictors: ndi -fun man-twala (“I need medicine”)
Category = “Request for aid”
1 in 5 classification errors with raw messages
1 in 20 classification errors after post-processing. Improves with scale.

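The three steps can be sketched end to end on the two example messages. Everything below is an assumption made for illustration: the regular expressions, the morpheme lists and the final decision rule are hand-written from these two examples only, whereas the approach described in the talk models the variation from data.

```python
import re

# Hypothetical rules and morpheme lists, hand-written from the two examples only.
VERB = re.compile(r"(ndi)\s*(ma|ku)?(fun)([ai])?")
NOUN = re.compile(r"man\s*twala")
PREDICTORS = {"ndi", "fun", "man", "twala"}  # tense markers and final vowels dropped


def normalize(text: str) -> str:
    """Step 1: collapse doubled letters and map spelling variants of 'mantwala'."""
    text = re.sub(r"(.)\1+", r"\1", text.lower())
    text = text.replace("thw", "tw")          # 'manthwala' -> 'mantwala'
    return re.sub(r"twara\b", "twala", text)  # 'mantwara' -> 'mantwala'


def segment(text: str) -> list:
    """Step 2: split the normalized message into known morphemes."""
    morphemes = []
    verb = VERB.search(text)
    if verb:
        morphemes += [group for group in verb.groups() if group]
    if NOUN.search(text):
        morphemes += ["man", "twala"]
    return morphemes


def classify(message: str) -> str:
    """Step 3: keep the predictive morphemes and apply a toy decision rule."""
    kept = [m for m in segment(normalize(message)) if m in PREDICTORS]
    return "Request for aid" if {"fun", "twala"} <= set(kept) else "Other"


for raw in ("ndimmafunamanthwala", "ndi kufuni mantwara"):
    print(raw, "->", segment(normalize(raw)), "->", classify(raw))
```

Both messages reduce to the same predictors (ndi, fun, man, twala), so a classifier sees one pattern instead of two unrelated strings.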
Comparison with English
[Chart: accuracy (micro-F) by percent of training data]

Taking it to the world

The benefits of understanding everyone
Human diseases eradicated in the last 75 years: smallpox
Increase in air travel in the last 75 years: [chart]

Reports of ‘strange new illnesses’ pre-date official records
H1N1 (Swine Flu): months (10% of world infected)
HIV: decades (35 million infected)
H5N1 (Bird Flu): weeks (>50% fatal)

…but the reports are in 1000s of languages
90% of ecological diversity, 90% of linguistic diversity
в предстоящий осенне-зимний период в Украине ожидаются две эпидемии гриппа (“two flu epidemics are expected in Ukraine in the coming autumn-winter period”)
مزيد من انفلونزا الطيور في مصر (“more bird flu in Egypt”)
香港现1例H5N1禽流感病例 曾游上海南京等地 (“Hong Kong reports 1 case of H5N1 bird flu; the patient had traveled to Shanghai, Nanjing and other places”)

Crowdsourcing, big data, and expert analysts
Most information is in plain language: multiple skills and processing strategies are required.

Digital Disease Discovery
Reports: millions per day, many languages, much noise
Big Data: machine learning for extraction, filtering & prioritization
Crowdsourcing: thousands of native-language speakers
Analysts: several domain experts
Global monitoring: safer world

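The filtering and prioritization stage can be illustrated with a toy scorer that drops irrelevant reports and passes the highest-scoring ones on to the crowd. This is a sketch under loud assumptions: the keyword weights, function names and English-only reports are all hypothetical, and the talk describes machine learning over many languages rather than a fixed keyword list.

```python
import heapq

# Hypothetical keyword weights; a real system would learn these from labeled reports.
KEYWORD_WEIGHTS = {"h5n1": 3.0, "epidemic": 2.5, "flu": 1.0}


def score(report: str) -> float:
    """Sum the weights of the known keywords in a report."""
    return sum(KEYWORD_WEIGHTS.get(word, 0.0) for word in report.lower().split())


def prioritize(reports, top_k):
    """Filter out zero-score reports, then return the top_k by score."""
    relevant = [report for report in reports if score(report) > 0]
    return heapq.nlargest(top_k, relevant, key=score)


reports = [
    "new h5n1 case reported",
    "football results",
    "flu season begins",
    "flu epidemic expected",
]
# The two highest-scoring reports would be sent on to native-language speakers.
print(prioritize(reports, top_k=2))
```

The point of the ordering is cost: machines see millions of reports, the crowd sees thousands, and the few domain experts see only what survives both.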
The impact of scalable monitoring
Found historical signals that pre-dated Google Flu Trends by 3 weeks and the CDC by 5 … on CNN
How can we filter/model media-driven amplification?
“I’m Jacqui Jeras with today’s cold and flu report ... across the mid-Atlantic states, a little bit of an increase”
January 4, 2008 CNN Weather

The impact of scalable monitoring
Tracked Ebola in Uganda 5 days before the World Health Organization.
“we were able to pull in much richer data from a larger number of sources, so we knew not just how many people were infected, but what kind of transport they took when they went from their village to the hospital in the nearest main town.”
Robert Munro, UN General Assembly on “Big Data and Global Development”
Age + gender + village + went to hospital.
This is personally identifying & could lead to persecution.
Open data would have a negative impact.

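One way to see why that combination is personally identifying is to count how many records share each combination of the four attributes: any combination that occurs once points at a single person. The records and field names below are invented for illustration and are not from the Uganda data.

```python
from collections import Counter

# Hypothetical records; every value here is invented for illustration.
records = [
    {"age": 34, "gender": "F", "village": "A", "went_to_hospital": True},
    {"age": 34, "gender": "F", "village": "A", "went_to_hospital": True},
    {"age": 61, "gender": "M", "village": "B", "went_to_hospital": True},
]
QUASI_IDENTIFIERS = ("age", "gender", "village", "went_to_hospital")


def unique_combinations(rows, keys=QUASI_IDENTIFIERS):
    """Return attribute combinations that match exactly one record."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# Each combination listed here singles out one individual.
print(unique_combinations(records))
```

In a small village almost every combination is unique, which is the argument against releasing such data openly.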
The impact of scalable monitoring
Tracked the E. coli outbreak in Germany 2 days before the ECDC.
How do we motivate information processing?
Margins are small: only for-profit big data and crowdsourcing can have a sustained impact

Idibon’s current work
Hurricane Sandy
Idibon’s CTO ran FEMA’s Aerial Damage Assessments.
We have >1,000,000 manual tags on communications.
MIT Humanitarian Response Lab
Identifying reports about supply-line interruptions.
Research data from a combination of crowdsourcing and natural language processing

Recommendations for language processing for social good
Look beyond English
There is inherent benefit in understanding and supporting speakers of every language
Employ people in those languages
Crowdsourced workers speak 100s of languages, and want to use them
Embrace the variation
You can’t rely on consistent spellings, but you can learn to model the diversity

Thank you
Robert Munro, PhD
CEO, Idibon
