Kboom phenoday-2016
1. k-BOOM
A Bayesian approach to ontology structure inference,
with applications in disease ontology construction
Chris Mungall
Lawrence Berkeley Laboratory
PhenoDay 2016
@monarchinit
@chrismungall
2. Building a cohesive, complete disease
ontology
Objective
• Combine existing disease
classifications and lists into
a unified, cohesive
framework
• Best of all worlds
• Integrate data from multiple
resources
Challenges
• Current resources
developed independently,
different perspectives
• Mappings are imprecise
OMIM Orphanet DO MESH NCIT Decipher ICD SNOMED
Combined, coherent view
3. Disease classifications and why
mappings are not enough
• Given N disease lists
– Where each provides cross-references
(xrefs) to up to N-1 others
– Up to (N^2)-N sets of mappings
• Even more with 3rd party mappings
– These are frequently
• Inconsistent (directly or indirectly)
• Different meanings and levels of specificity
• Incomplete
• Stale
• Difficult to computationally verify
• Fundamental issue
– Xrefs lack semantics
– Explicit semantics would enable
computational checks
[Figure: Ont1, Ont2, Ont3, Ont4, Ont5, Ont6]
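A quick check of the (N^2)-N count above, as a minimal sketch. The resource names are the ones listed on the earlier slide; the function name `directed_mapping_sets` is hypothetical and used only for illustration. Each ordered pair (source, target) stands for one set of xrefs published by the source.

```python
from itertools import permutations

def directed_mapping_sets(resources):
    """Return every ordered (source, target) pair of distinct resources."""
    return list(permutations(resources, 2))

# Resource names taken from the slide; the pairing itself is illustrative only.
RESOURCES = ["OMIM", "Orphanet", "DO", "MESH", "NCIT", "Decipher", "ICD", "SNOMED"]

pairs = directed_mapping_sets(RESOURCES)
n = len(RESOURCES)

# Each of the N resources can point at N-1 others: N * (N - 1) = N^2 - N.
print(len(pairs), n**2 - n)
```

With these eight resources that is already 56 possible mapping sets, before counting any third-party mappings.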
7. Objective: Coherent OWL Ontology
Merging (OOM)
• Criteria for OOM
– Merged
• Combines multiple lists and classifications (terminologies
and lists treated as ‘degenerate’ ontologies), presented as a
single ontology
• Equivalent classes merged
– Logically Connected
• OWL/Description Logic constructs
– e.g. SubClassOf, EquivalentClass, SomeValuesFrom
• Not xrefs
– Coherent
• Logically coherent: no unsatisfiable classes
• Biologically coherent: makes biological and clinical sense
8. Our previous approach, applied to
phenotypes: L-DOOM
Logical Definition based OWL Ontology Merging
Mungall, C. J., Gkoutos, G., Smith, C., Haendel, M., Lewis, S., & Ashburner, M. (2010). Integrating phenotype ontologies across multiple species. Genome Biology, 11(1), R2.
doi:10.1186/gb-2010-11-1-r2
Köhler, S., Doelken, S. C., Ruef, B. J., Bauer, S., Washington, N., Westerfield, M., … Mungall, C. J. (2013). Construction and accessibility of a cross-species phenotype ontology
along with gene annotations for biomedical research. F1000Research, 1–12. doi:10.3410/f1000research.2-30.v1
Application to diseases?
• Works well for compositional classes (e.g. many cancer terms)
• Less well for genetic diseases, complex syndromes
1. Assign Logical Definitions
(OWL equivalence axioms) to
classes in each ontology
• Can be assigned manually or
semi-automatically (Obol)
HP:0002180 Neurodegeneration
Equiv: degenerate AND inheres-in SOME neuron (CL:0000540)
MP:0000876 Purkinje cell degeneration
Equiv: degenerate AND inheres-in SOME Purkinje cell (CL:0000121)
2. Using reasoning to infer logical axioms (SubClassOf)
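The inference step can be illustrated with a toy structural check. This is a sketch only, not how an OWL reasoner such as Elk works internally. It assumes each logical definition has the form "quality AND inheres-in SOME bearer", and that "Purkinje cell SubClassOf neuron" is asserted in CL. The names `LOGICAL_DEFS`, `SUBCLASS_OF`, `ancestors` and `infer_subclass_axioms` are hypothetical.

```python
from itertools import permutations

# Assumed asserted axiom in CL: Purkinje cell SubClassOf neuron.
SUBCLASS_OF = {"CL:0000121": {"CL:0000540"}}

# Logical definitions from the slide: class -> (quality, bearer of inheres-in).
LOGICAL_DEFS = {
    "HP:0002180": ("degenerate", "CL:0000540"),  # Neurodegeneration
    "MP:0000876": ("degenerate", "CL:0000121"),  # Purkinje cell degeneration
}

def ancestors(cls, subclass_of):
    """Reflexive, transitive closure of the asserted SubClassOf links."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def infer_subclass_axioms(defs, subclass_of):
    """Infer (child, parent) pairs: same quality, child bearer under parent bearer."""
    inferred = []
    for (a, (qual_a, bearer_a)), (b, (qual_b, bearer_b)) in permutations(sorted(defs.items()), 2):
        if qual_a == qual_b and bearer_b in ancestors(bearer_a, subclass_of):
            inferred.append((a, b))
    return inferred

inferred = infer_subclass_axioms(LOGICAL_DEFS, SUBCLASS_OF)
print(inferred)
```

Under these assumptions the only inferred axiom links the mouse phenotype term (MP) under the human phenotype term (HP), which is the cross-ontology link that L-DOOM relies on.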
9. Probabilistic Ontology OP = <A,H>
BOOM Bayes OWL Ontology Merging:
Finds the set of hypothetical axioms that maximises P(OP)
[Figure: BOOM workflow. Labels: Ontology 1, Ontology 2, ..., Ontology n; mapping tool; Inter-Ontology Mappings; mapping curation; Axiom Weight Estimator; Weight Curation; Hypothetical Logical Axioms plus Weights (H); Elk Reasoner; Merge equivalent classes; Merged Coherent OWL Ontology; Next iteration]
11. k-BOOM algorithm for finding the most
likely merged ontology
Goal: find values for H that maximise P.
h_i: boolean representing the truth value of hypothetical axiom H_i
Problem: 2^N possible ontologies
1. Factorize the calculation by dividing the combined
axioms into k modules (k-BOOM)
Algorithm:
i. Assert all hypothetical axioms to be true
ii. Make a module from each equivalence clique
2. Use a greedy algorithm; start with the
most likely hypothetical axioms in O_k
3. Test each configuration using an OWL
reasoner (Elk) for satisfiability
(unsat => Pr=0), and calculate the posterior probability
4. Repeat until the number of tests
exceeds a threshold
5. Return the most likely configuration for O_k
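Steps 2 to 5 can be sketched for a single module as a bounded search. This is an illustration under stated assumptions, not the k-BOOM implementation. It assumes each hypothetical axiom is an independent boolean with a prior, and it scores a configuration by the unnormalised product of those priors. The reasoner is replaced by a caller-supplied `is_satisfiable` function. All names here are hypothetical.

```python
from itertools import combinations
from math import prod

def config_probability(priors, config):
    """Unnormalised probability: p_i if axiom i is asserted, else 1 - p_i."""
    return prod(p if h else 1.0 - p for p, h in zip(priors, config))

def candidate_configs(priors):
    """Start at the most likely assignment, then flip 1, 2, ... axioms."""
    n = len(priors)
    start = tuple(p >= 0.5 for p in priors)
    yield start
    for k in range(1, n + 1):
        for flipped in combinations(range(n), k):
            yield tuple((not h) if i in flipped else h for i, h in enumerate(start))

def most_likely_config(priors, is_satisfiable, max_tests=100):
    """Return the best satisfiable configuration seen within max_tests checks."""
    best, best_p = None, 0.0
    for tests, config in enumerate(candidate_configs(priors)):
        if tests >= max_tests:
            break
        if not is_satisfiable(config):  # unsat => Pr = 0, so skip it
            continue
        p = config_probability(priors, config)
        if p > best_p:
            best, best_p = config, p
    return best, best_p

# Toy stand-in for the reasoner: axioms 0 and 1 cannot both hold.
def toy_reasoner(config):
    return not (config[0] and config[1])

priors = [0.9, 0.8, 0.3]
best, score = most_likely_config(priors, toy_reasoner, max_tests=8)
print(best, round(score, 3))
```

Because the search starts from the most likely assignment, a small `max_tests` budget tends to find good configurations early, at the cost of possibly missing the optimum in large modules.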
12. Probability guided curator workflow:
A little knowledge goes a long way
• Run cycle
• Examine results for modules
with:
– low posterior probability
– low confidence (top ranked
solution has similar P to next
ranked)
– Pr(H_i = true) << threshold
• Apply biological/clinical
knowledge
• Override auto-generated
hypothetical axiom weights with
curated ones
– Feed issues back to source
ontologies
• Repeat
[Figure: dialog between MonDO curator and external ontology curator]
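The triage step in this workflow could look something like the sketch below. It assumes each module comes with a list of posterior probabilities for its candidate solutions. The thresholds, the margin test and the name `modules_needing_review` are hypothetical choices for illustration, not values from k-BOOM.

```python
def modules_needing_review(modules, min_posterior=0.5, min_margin=2.0):
    """Flag modules whose best solution is weak or barely beats the runner-up.

    modules: dict mapping a module id to the posterior probabilities of its
    candidate solutions (any order). Returns flagged rows, weakest first.
    """
    flagged = []
    for module_id, posteriors in modules.items():
        if not posteriors:
            continue  # nothing to rank for this module
        ranked = sorted(posteriors, reverse=True)
        top = ranked[0]
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        low_posterior = top < min_posterior
        # Low confidence: top-ranked solution has similar P to the next ranked.
        low_confidence = runner_up > 0 and top / runner_up < min_margin
        if low_posterior or low_confidence:
            flagged.append((module_id, top, runner_up))
    return sorted(flagged, key=lambda row: row[1])

# Made-up posteriors for four modules, purely for demonstration.
example = {
    "m1": [0.90, 0.05],
    "m2": [0.45, 0.40],
    "m3": [0.30, 0.10],
    "m4": [0.60, 0.35],
}
review_queue = modules_needing_review(example)
print([module_id for module_id, _, _ in review_queue])
```

Sorting weakest-first puts the modules where curated knowledge is likely to change the outcome at the top of the curator's queue.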
13. Application: merging diseases into
MonDO
https://github.com/monarch-initiative/monarch-disease-ontology
“Ontology” | Classes (before merge, after merge) | SubClass axioms | Xrefs
Inputs:
DOID 6878 6012 7082 36656
MESH (D) 11314 4152 19036
OMIM (D) 7783 7783 0 31242
Orphanet (D) 8740 4683 15182 20326
OMIA 4833 4833 3120 355
DC 209 208 310 316
Medic 0 8630 3435
Output:
MonDO 39757 27617 44837
Held back: NCIT, SNOMED, ICD9, GARD
15. Example failed resolution – due to
ontology error
https://github.com/monarch-initiative/monarch-disease-ontology/issues/99
https://github.com/DiseaseOntology/HumanDiseaseOntology/issues/164
16. Example failed resolution – due to
MESH duplicates
https://github.com/monarch-initiative/monarch-disease-ontology/issues/81
17. Evaluating results of disease merger
• No gold standard for multiple ontology merger
– Partial evaluation using held-back Orphanet NTBT/E calls:
• 6977/7986 (87% agreement)
• Ad-hoc evaluation by curator
– Approach: use posterior probabilities to rank modules requiring
attention
– This is the killer-app feature
– Iteratively refine curated probabilities
• https://github.com/monarch-initiative/monarch-disease-ontology/issues/
• Results
– Manual inspection and use of MonDO
– Detection of errors in source ontologies
• E.g. duplicates in MESH
• Incorrect xrefs in DO, e.g.
– https://github.com/DiseaseOntology/HumanDiseaseOntology/issues - issues #164, #163,
#156, #154, #151, #150, #149, #140, #135
18. Next Steps
• Integrate hypothetical axiom weight estimation into
Bayesian model
• Apply Markov Chain Monte Carlo (MCMC) methods for
estimating most likely graph
– E.g. Metropolis-Hastings
• Integrate other knowledge
– Logical Definitions (Phenotypes)
– Molecular knowledge
• Improve Evaluation
– Test k-BOOM on task where we have gold standard, e.g.
neuroanatomy/uberon
– Formal comparison with EFO, MedGen, …
19. Discussion
• Retrospective merging vs prospective
development
– Better to work together from outset (OBO model)
– However, current state of affairs is such that
expert knowledge is distributed across resources
– We want to preserve that rather than reinvent
– Coherent merging of molecular knowledge with
classical top-down knowledge will be required
moving forward
21. Acknowledgments
k-BOOM
• Ian Holmes
• Sebastian Köhler
• Jim Balhoff
• Peter Robinson
• Melissa Haendel
Curation
• Nicole Vasilevsky (MonDO, DC)
• Sue Bello (DC)
• Elvira Mitraka (DO)
• Lynn Schriml (DO)
FUNDING: NIH Office of Director: 1R24OD011883; NIH-UDP:
HHSN268201300036C
Editor's notes
20 minutes. Sat July 9. 9.40am
TODO: Make data integration
https://github.com/monarch-initiative/monarch-disease-ontology/issues/90
Note the two subgraphs; little overlap in the upper areas
Note Typical (top left) and Atypical are connected
We treat every resource as an ontology, even the degenerate case where it’s a flat list (e.g. OMIM). Pink = novel
Heuristic/ad-hoc
Fig. 2. Module resolution graph exported by kBOOM; Initial input is nodes plus solid arrows (SubClassOf axioms in ORDO). Dotted lines are supplied mappings (no logical interpretation). Figure shows inferred most likely configuration. equivalence=red, subclass=blue, with prior probabilities written as edge labels (thick lines more probable). Enclosing boxes denote equivalence cliques, which can be merged to a single class, yielding a grouping class with two children.
TODO: Example of dupes in MESH
Highlight flipping example