Project Update - July 11, 2013
The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship 2013
www.dssg.io | dssg-ushahidi@googlegroups.com
Ushahidi Workflow
Ushahidi Workflow + DSSG
Data Sets
23,000 reports from 20 datasets
• 22% English
• 35% non-English
• 43% mixed languages
Each report includes text, category, location, and sometimes more data
Data Sets
Additional unusable datasets for various reasons (e.g. overly formulaic language)
What is the quality of the existing "gold standard" annotation?
Working on translations of
Data Set Differences
Afghanistan election (peaceful)
Kenyan election (less peaceful)
Current Task Status [July 11]
1) Suggest categories
2) Extract named entities (especially locations)
3) Detect language
End of presentation has more extensive technical details
Toy Demo
http://ec2-54-218-196-140.us-west-2.compute.amazonaws.com/home
Note this is ONLY a basic "toy" user interface to demonstrate the current prototype functionality.
Our plan is to deliver an open-source code library,
which Ushahidi will incorporate into the existing user interface.
If the link doesn't work, just look at the screenshots in the next slides. :)
Demo: Example #1
Demo: Example #2
Secondary Project Ideas
1. Detect private info to strip
2. Urgency assessment
3. Filtering irrelevant reports (not strictly spam)
4. Automatically proposing new [sub-]categories
5. Cluster similar (non-identical) reports
6. Hierarchical topic modelling / visualization
Evaluation Plans
• Tap into Ushahidi and crisis mapping communities for feedback
• Simulate past event with our system
• Success metrics:
o Increased annotator speed
o Increased annotator categorization accuracy
o Decreased annotator frustration/tedium
Feedback welcome!
Contact us at dssg-ushahidi@googlegroups.com
We would love your input!
See next 4 slides for technical details on our 4 tasks...
or skip if you're happy to stay unaware... :)
1) Suggest categories
Currently:
• Simple bag-of-words unigram features
• 1-vs.-all classification (scikit-learn)
• Little categories merged into fewer big categories
• Performance uninspiring :(
Future:
Bigrams... word frequency filter...
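As a rough illustration of the "Currently" bullets above, the sketch below builds unigram bag-of-words features and a one-vs.-all classifier with scikit-learn. The toy reports, the category names, the choice of logistic regression as the base classifier, and the helper `suggest_categories` are all assumptions made for illustration; they are not the project's actual data or code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; real reports and categories come from the datasets.
reports = [
    "ballot boxes missing at the polling station",
    "polling station opened late and voters are waiting",
    "youths fighting police near the market",
    "gunshots heard and people fighting in town",
    "road blocked by flood water after heavy rain",
    "bridge washed away by flood water",
]
categories = ["polling", "polling", "violence", "violence", "flooding", "flooding"]

# Unigram bag-of-words counts feeding one binary classifier per category.
classifier = OneVsRestClassifier(LogisticRegression())
model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), classifier)
model.fit(reports, categories)


def suggest_categories(text, top_k=2):
    """Return the top_k (category, score) pairs for one report, best first."""
    scores = model.predict_proba([text])[0]
    ranked = sorted(zip(classifier.classes_, scores),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


print(suggest_categories("voters waiting at the polling station"))
```

Under the same assumptions, setting `ngram_range=(1, 2)` would add bigrams, and `CountVectorizer`'s `min_df` parameter is one possible way to filter out rare words by document frequency.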
2) Extract named entities
Currently:
• NLTK's Named Entity Recognizer
• Eval: pretty good
Future:
• Train location-recognizer on datasets
• Merge types for non-location NEs
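The sketch below shows one way the NLTK recognizer's output could be post-processed so that locations are kept and all other entity types are merged, as the "Future" bullets suggest. The set of labels treated as location-like, the merged type name `OTHER`, and the helper names are assumptions for illustration. The example tree is built by hand so the snippet runs without downloading NLTK's models.

```python
import nltk
from nltk import Tree

# Assumption: these ne_chunk labels are treated as location-like.
LOCATION_LABELS = {"GPE", "LOCATION", "FACILITY"}


def chunk_report(text):
    """Run NLTK's tokenizer, POS tagger and NE chunker on one report.

    Needs the NLTK tokenizer, tagger and chunker models to be downloaded first.
    """
    return nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))


def extract_entities(tree):
    """Collect (entity text, merged type) pairs from an ne_chunk result."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # named-entity subtree; plain tokens are tuples
            words = " ".join(word for word, _tag in node.leaves())
            kind = "LOCATION" if node.label() in LOCATION_LABELS else "OTHER"
            entities.append((words, kind))
    return entities


# Hand-built tree in the shape ne_chunk returns (hypothetical report).
example = Tree("S", [
    ("Violence", "NN"), ("reported", "VBD"), ("in", "IN"),
    Tree("GPE", [("Nairobi", "NNP")]),
    ("by", "IN"),
    Tree("PERSON", [("John", "NNP"), ("Doe", "NNP")]),
])
print(extract_entities(example))
# → [('Nairobi', 'LOCATION'), ('John Doe', 'OTHER')]
```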
3) Detect Language
Currently:
• Existing packages (Bing, Python, ...)
Future:
• Evaluate quality
• Allow event-specific language bias
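One possible reading of "event-specific language bias" is to reweight a detector's per-language scores by how common each language is expected to be for a given event. The sketch below does this with a simple prior; the function name, the `floor` parameter, and all numbers are hypothetical, and the scores could come from any of the existing packages.

```python
def apply_language_bias(scores, prior, floor=0.01):
    """Reweight per-language detector scores by an event-specific prior.

    scores: language code -> non-negative score from any language detector.
    prior:  language code -> expected share of that language for the event.
    floor:  prior used for languages the event prior does not mention.
    """
    weighted = {lang: score * prior.get(lang, floor)
                for lang, score in scores.items()}
    total = sum(weighted.values())
    if total == 0:
        return dict(scores)  # nothing to reweight; fall back to the raw scores
    return {lang: value / total for lang, value in weighted.items()}


# Hypothetical numbers: a detector that slightly prefers English for a short
# message, and a prior for an event where Swahili is expected to be common.
detector_scores = {"en": 0.45, "sw": 0.40, "fr": 0.15}
event_prior = {"sw": 0.5, "en": 0.4, "fr": 0.1}

biased = apply_language_bias(detector_scores, event_prior)
print(max(biased, key=biased.get))
# → sw
```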
4) Near-Duplicate Detection
Currently:
• SimHash efficiently compares message texts via the distance between their hashes
Future:
• Evaluate quality more rigorously
• Explore other methods
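For readers unfamiliar with SimHash, here is a minimal sketch of the idea: each message is reduced to a 64-bit fingerprint, and near-duplicates are messages whose fingerprints differ in only a few bits (Hamming distance). The word-level tokenization, the use of MD5 as the per-token hash, and the threshold of 3 bits are assumptions for illustration, not necessarily what the prototype uses.

```python
import hashlib
import re

BITS = 64  # fingerprint width


def simhash(text):
    """Compute a 64-bit SimHash fingerprint from lowercased word tokens."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:BITS // 8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    """Assumed rule: fingerprints within max_distance bits are near-duplicates."""
    return hamming(simhash(text_a), simhash(text_b)) <= max_distance


# Same tokens, different case, spacing and punctuation: identical fingerprints.
print(hamming(simhash("Fire at the market in Nairobi"),
              simhash("fire   at the MARKET in nairobi!")))
# → 0
```

Because each message is compared through a fixed-size fingerprint rather than its full text, the comparison cost does not grow with message length.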

Data Science for Social Good and Ushahidi


Editor's Notes

  1. We're happy to give an update on our Ushahidi project. [Abe Gong]
  2. Citizens submit reports (via SMS, Twitter, and the web), which are reviewed by annotators. Categorizing, geolocating, stripping private info, etc. is a slow manual process.
  3. We're building a data wizardry system to support the manual annotation process.
  4. Notes on the secondary project ideas:
     1. Since Ushahidi reports are mostly public, private info should be hidden. Example: names, phone numbers, and addresses.
     4. Example: in the Haiti earthquake, we might observe unexpected robbery reports arising.
     5. This is mainly for a better workflow, because annotators can work better when they process similar reports together.
     6. To see which topics commonly occur in elections in general, and which topics occur only in the Kenyan election specifically.