ODSC East 2017: Data Science Models For Good

Harnessing the power of
data science in the
service of humanity.

Our Purpose: We amplify the impact of social organizations.
Our Customer: Social organizations that have a clear theory of
change for reducing human suffering
Our Competitive Advantage: Our network of pro bono data scientists
Our Product: Data science services, i.e. predictive analytics, machine
learning, AI
Our Style: Human-centered design, jargon-free, accessible
What We Do

DataKind and ICAAD
Classifying UPR Records

ICAAD: International Center for Advocates
Against Discrimination
Non-profit organization that combats structural discrimination through
monitoring global trends, fostering research and designing interventions
Promote religious freedom in France
Combat gender-based violence in the Pacific Islands
Better documentation of hate crimes
Mapping discrimination

What was the DataCorps problem?
We have a database of text records from the United
Nations: the “Universal Periodic Review”

Data: Universal Periodic Review

What was the DataCorps problem?
How do we leverage these UPR records to better
understand human rights conditions across the world?

Labeling with Sustainable Development Goals
Adopted in 2015, the SDGs are a set of seventeen aspirational goals that all UN
member states are committed to achieve, covering a broad range of human
rights and development issues
Successor to the Millennium Development Goals

Task: How do we map a UPR to an SDG(s)?

Deliverables
1) Build an MVP algorithm that systematically classifies Universal Periodic
Review (UPR) records using Sustainable Development Goals
2) Using the results from the algorithm, create a dashboard that visualizes
global patterns of discrimination
These two tools will enable ICAAD to better allocate their resources towards
the most important human rights interventions, as well as better disseminate
their findings to other related organizations.

UPR Data
Source              Number of Records    Number of Labels
ICAAD labeled       1247                 2351
DataKind labeled    349                  628
All-organic, self-harvested and hand-labeled...

Data Prep
1) Each UPR = (very short) document
2) Clean, tokenize and create (1,2)-grams
3) Create term-document matrix
4) Feed bag-of-words matrix into ML model
5) ML model = two step “ensemble”
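
The prep steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the tokenizer regex and the function names `ngrams_1_2` and `term_document_matrix` are assumptions made for the example.

```python
import re
from collections import Counter


def ngrams_1_2(text):
    """Lowercase, tokenize, and return unigrams followed by bigrams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def term_document_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per n-gram."""
    counts = [Counter(ngrams_1_2(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


# Two toy "records" (hypothetical text, not real UPR data)
docs = ["Combat gender-based violence", "Combat corruption"]
vocab, matrix = term_document_matrix(docs)
```

Each row of `matrix` is the bag-of-words vector that would be fed to the ML model.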

Machine Learning Layer: Multi-Label SVM
Support Vector Machine
Linear (no kernel)
Loss function: Squared hinge
Penalty type: L1
Regularization constant: 2.0
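
For reference, the settings above correspond to the standard L1-regularized, squared-hinge linear SVM objective. The sketch below assumes the multi-label problem is handled as one binary classifier per SDG (one-vs-rest), which the slides do not state; $x_i$ is the bag-of-words vector of record $i$, $y_i^{(k)} \in \{-1, +1\}$ marks whether record $i$ carries SDG $k$, and the bias term is omitted for brevity.

```latex
\min_{w_k} \; \lVert w_k \rVert_1
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i^{(k)}\, w_k^{\top} x_i \bigr)^{2},
\qquad C = 2.0
```

The L1 penalty tends to drive most n-gram weights to exactly zero, which suits a small dataset with a large vocabulary.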

Keyword Lookup Layer
If UPR text contains the word “corruption” → SDG #16
If UPR text contains the word “HIV” or “AIDS” → SDG #3
If UPR text contains the word “ICRMW” → SDG #10
And so on...
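
A minimal sketch of such a lookup, using only the three rules listed above. The dictionary name, the function name, and the case-insensitive whole-word matching are assumptions for illustration.

```python
import re

# Keyword -> SDG number, taken from the three rules on this slide
KEYWORD_TO_SDG = {
    "corruption": 16,
    "hiv": 3,
    "aids": 3,
    "icrmw": 10,
}


def keyword_labels(text):
    """Return the set of SDG numbers triggered by keywords in the text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return {sdg for keyword, sdg in KEYWORD_TO_SDG.items() if keyword in tokens}


labels = keyword_labels("Strengthen HIV/AIDS prevention and fight corruption")
```

Note that lowercasing makes "aids" match the verb as well as the acronym; a case-sensitive rule for acronyms would avoid that at the cost of missing lowercase mentions.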

Final Ensemble Model: CV Metrics
Metric       ML Layer    ML + Keyword Lookup
Precision    0.827       0.772
Recall       0.758       0.848
F1-Score     0.787       0.802
The ML layer does very well by itself, but adding the keyword layer
trades a little precision (0.827 to 0.772) for a large gain in recall
(0.758 to 0.848) and a better overall F1-score (0.787 to 0.802).
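
The precision/recall trade-off can be illustrated on toy data. The slides do not say how the two layers are combined; the sketch below assumes a simple union of the label sets, and computes micro-averaged metrics. All labels here are made up for the example.

```python
def combine(ml_labels, kw_labels):
    """Assumed ensemble rule: a label is predicted if either layer predicts it."""
    return set(ml_labels) | set(kw_labels)


def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall and F1 over multi-label predictions."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_true = sum(len(t) for t in true_sets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: two records with hand-made labels
true = [{3, 16}, {10}]
ml = [{3}, {10}]          # ML layer misses SDG 16 on the first record
kw = [{16}, {5}]          # keyword layer finds 16 but adds a wrong label
both = [combine(m, k) for m, k in zip(ml, kw)]

ml_scores = micro_prf(true, ml)
both_scores = micro_prf(true, both)
```

In this toy case precision drops from 1.0 to 0.75 while recall rises from 2/3 to 1.0, the same direction of change as in the table above.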

Dashboard Visualizations: http://52.3.119.223/

The Aftermath Part 1
● Proof of concept algorithm delivered last October
● Demonstrated and implemented among various project
partners

The Aftermath Part 2
● Team from Xerox brought in to build v2 of algorithm
○ The 17 main SDG categories contain 169 additional sub-goals
○ ICAAD wants to classify UPR records using these sub-goals
○ Army of volunteer lawyers doing a lot of manual labeling
● Something concrete by next summer!

What We Learned (Parting Shots)
1. Easier = better
2. Small data is hard
3. Simple Boolean logic works surprisingly well
4. Data scientists are paid (and sometimes not) to do the
dirty work

DataCorps Team
Ben Cohen: Software Engineer @ Warby Parker
Rebecca Wei: PhD Student @ Northwestern
Karry Lu: Senior Data Scientist @ Plated

Project Repo
https://github.com/karry-lu/datakind-icaad-model

How do we find evidence?
How do we communicate evidence?
How do we use evidence?

Evidence gap
Respondents from the UK conservation community indicated a desire to
use evidence, but lacked a support framework to quickly sort and
evaluate it.
[Figure: experience-based vs. evidence-based decision making; modified from Pullin et al. 2004]

Theory of change
Need for knowledge on effectiveness
Evidence-based decision making: using findings to inform actions
[Figure: cycle linking RESEARCHERS and PRACTITIONERS. Labels: identify knowledge gaps; synthesize knowledge gluts; research project; communicate findings; determine indicators; monitor and evaluate progress and outcomes; adjust actions; desired outcomes achieved]

The need
Practitioners need standardized
storage and access to research
insights from academic and grey
literature for evidence-based
decision making
Researchers need a framework to
follow to create these resources
[Figure: evidence-based decision-making. Labels: best science; expert opinion; society's needs and preferences]

Systematic mapping process
Systematic Map

Problem #1: interactivity
Thorn, Jessica PR, et al. "What evidence exists for the
effectiveness of on-farm conservation land
management strategies for preserving ecosystem
services in developing countries? A systematic map."
Environmental Evidence 5.1 (2016): 13.

Problem #2: Manual screening

System 1: relevance ranking
• Citations are ranked by expected relevance
depending on the availability and number of
user-labeled examples
– First, it uses the search terms from review planning:
it computes the amount of overlap between those
terms and each citation's title + abstract + keywords
– Second, once enough examples have been labeled, it
uses distributional word vectors (word2vec) as
features for a support vector classifier that predicts
inclusion or exclusion; the confidence of that
classification serves as the expected relevance
• Citations are randomly sampled each time, to avoid
hasty generalization
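
The first ranking stage (term overlap) can be sketched in plain Python. This is an illustrative approximation, not colandr's implementation: the field names, the tokenizer, and the use of a raw overlap count as the score are all assumptions. The second stage (word2vec features with a support vector classifier) is not shown.

```python
import re


def overlap_score(search_terms, citation):
    """Count how many search terms appear in title + abstract + keywords."""
    text = " ".join(citation.get(f, "") for f in ("title", "abstract", "keywords"))
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    terms = {t.lower() for t in search_terms}
    return len(terms & tokens)


def rank_citations(search_terms, citations):
    """Sort citations by descending overlap with the review's search terms."""
    return sorted(
        citations,
        key=lambda c: overlap_score(search_terms, c),
        reverse=True,
    )


# Hypothetical citations for illustration only
citations = [
    {"id": "a", "title": "Urban traffic models", "abstract": "", "keywords": ""},
    {"id": "b", "title": "Forest conservation", "abstract": "Livelihoods near forest reserves", "keywords": "conservation"},
    {"id": "c", "title": "Conservation finance", "abstract": "", "keywords": ""},
]
ranked = rank_citations(["conservation", "forest", "livelihoods"], citations)
ranked_ids = [c["id"] for c in ranked]
```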

Relevance is re-learned every 10 screened citations,
and the unscreened documents are re-sorted

System 2: extraction and tagging
• A better methodology might be to use the training
data to find sentences in the document that might
indicate a label. (provide provenance)
• We can train the system to over-predict (predict
sentences from a large number of the labels), so
that the system can focus on recall, while human
annotators can focus on precision
• For locations we can use a "Named Entity
Recognition" system to find mentioned locations
in the document, and suggest these as labels
• For other metadata, we can train a model which
predicts the relevance of sentences to a label
• We show the sentences that best predict labels to
the user, who can then use that information to
pick the correct labels
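
The over-predict idea can be sketched as a low score threshold plus a supporting sentence for each suggestion. The per-sentence scores below are hypothetical stand-ins for a trained model's output, and the function name and default threshold are assumptions.

```python
def suggest_labels(sentence_scores, threshold=0.3):
    """Suggest every label whose best sentence score reaches the threshold.

    sentence_scores: list of (sentence, {label: score}) pairs.
    Returns {label: (score, sentence)} so each suggestion carries its provenance.
    A low threshold favors recall; the human annotator supplies precision.
    """
    best = {}
    for sentence, scores in sentence_scores:
        for label, score in scores.items():
            if score < threshold:
                continue
            if label not in best or score > best[label][0]:
                best[label] = (score, sentence)
    return best


# Hypothetical model output for two sentences of one document
scored = [
    ("Study sites were in Kenya.", {"Kenya": 0.9, "forest": 0.2}),
    ("Forest cover increased.", {"forest": 0.45}),
]
suggestions = suggest_labels(scored, threshold=0.3)
```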

Data extraction
This process also learns
relevance; it waits for 50 reviews
before it presents confidence

www.natureandpeopleevidence.org

Interaction with the data portal
• Output CSV file with individual citations and factor tags
• CSV ingested by receiving system
http://natureandpeopleevidence.org/
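
A minimal sketch of that hand-off using Python's standard `csv` module. The column names (`citation`, `tags`) and the semicolon separator for tags are assumptions; the actual export format is not given in the slides.

```python
import csv
import io


def citations_to_csv(citations):
    """Serialize citations and their factor tags to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["citation", "tags"])
    writer.writeheader()
    for c in citations:
        writer.writerow({"citation": c["citation"], "tags": ";".join(c["tags"])})
    return buf.getvalue()


def ingest_csv(csv_text):
    """Parse the CSV text back into citation records (receiving side)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {"citation": r["citation"], "tags": r["tags"].split(";") if r["tags"] else []}
        for r in rows
    ]


# Hypothetical records for illustration
records = [
    {"citation": "Doe, J. (2015). Forests, people and policy.", "tags": ["forest", "policy"]},
    {"citation": "Roe, A. (2016). Untagged study.", "tags": []},
]
round_trip = ingest_csv(citations_to_csv(records))
```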

Measuring evidence synthesis and
dissemination on the “T” impact model
Sector (diffuse) impact as
measured by
• Access and operability
• Common vs. uncommon solution
• Dissemination framework
Organization (deep) impact as
measured by
• Operational efficiency
• Increased productivity
• Expanded service

colandr
Two weeks of soft launch: 16 reviews, 13 review leads, 28 users
Two virtual trainings conducted

Data Portal
8 months: ~1,400 sessions, 47 registered users
Multiple in-person trainings

“Evidence Based Conservation”
Articles on it: 139
Articles citing them: 2100
How many people use evidence in decision making?
