The document discusses the BioSamples Database (BioSD), which provides a reference system for searching and browsing information about biological samples used in biomedical experiments. It focuses on the sample context independently of specific assay types or technologies. BioSD allows for consistency in sample annotations and common interfaces to access sample information and links to other data repositories. By modeling BioSD as linked data, it enables integration with related datasets, exploitation of ontologies for standardization, and enhanced modeling of sample attributes. This can support applications and new ways of querying the data using SPARQL.
BioSamples Database Linked Data, SWAT4LS Tutorial
1. BioSamples Database Linked Data
Marco Brandizi, Functional Genomics Team
SWAT4LS Tutorial, Dec 9th, 2013
Find this presentation at http://tiny.cc/bsdswt13
2. Why a BioSamples Database (aka BioSD)?
• A reference system for searching and browsing information about biological
samples used, or usable, in biomedical experiments
• Focused on the sample context (i.e., independent of the specific assay
type/technology)
• Supports heterogeneous experiments
– A single place that assay repositories can link to (reference samples,
authoritative source for repositories like
Metagenomics/ENA/ArrayExpress)
– A single place for searches and for related-to or same-as relationships
(e.g., see the 'myEquivalents' project)
• Allows for consistency/standardisation of sample attributes/annotations
• Common IT interfaces to access sample information and links to specific
data/repositories (e.g., web, XML/REST, RDF)
3. Why Linked Data for BioSD?
• Yet another type of interface, potentially useful to application developers
and Linked Data tools
• Integration with similar/related data sets (see the example queries below)
• Exploitation of ontologies (see below)
– Standardisation
– A little semantics goes a long way
• Enhanced modelling of certain aspects
– e.g., numbers, intervals, dates and units are detected from string value
labels and triplified
• Who knows?
– Apps!
– See the Hackathon ideas below
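The slide on Linked Data says that numbers, intervals, dates and units are detected from string value labels. The sketch below shows roughly what such detection could look like. It is not BioSD's actual implementation: the regular expressions, the function name parse_value_label and the returned dictionary layout are all assumptions made for illustration.

```python
import re
from datetime import date

# Hypothetical patterns, not the rules BioSD really applies.
_NUM = r"[-+]?\d+(?:\.\d+)?"
_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
_INTERVAL = re.compile(r"^(" + _NUM + r")\s*(?:-|to)\s*(" + _NUM + r")\s*(\S.*)?$")
_SINGLE = re.compile(r"^(" + _NUM + r")\s*(\S.*)?$")


def parse_value_label(label):
    """Classify a value label as date, interval, number or plain string."""
    text = label.strip()

    # Dates are checked first, otherwise 2013-12-09 would look like an interval.
    if _DATE.match(text):
        try:
            return {"kind": "date", "value": date.fromisoformat(text)}
        except ValueError:
            return {"kind": "string", "value": text}

    m = _INTERVAL.match(text)
    if m:
        return {
            "kind": "interval",
            "low": float(m.group(1)),
            "high": float(m.group(2)),
            "unit": m.group(3),  # None when no unit follows the numbers
        }

    m = _SINGLE.match(text)
    if m:
        return {"kind": "number", "value": float(m.group(1)), "unit": m.group(2)}

    return {"kind": "string", "value": text}


for example in ["37.5 C", "10-20 mg", "2013-12-09", "homo sapiens"]:
    print(example, "->", parse_value_label(example))
```

Each detected kind could then be emitted as its own set of triples (value, low/high bounds, unit), which is what makes numeric and interval queries possible on the RDF side. The heuristics are deliberately crude: for instance, scientific notation such as 1E-9 is not handled.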
4. The BioSD Model
Sample Groups
Submission
External links
Samples
http://www.ebi.ac.uk/biosamples
5. The BioSD Model
Group's (or Submission's) samples
Sample's (or Group's) attribute types and values
External links
6. BioSD Data (External Data Sources)
SPARQL Source: http://tinyurl.com/o95xa5v
Tag Cloud made with http://www.wordle.net
SPARQL Source: http://tinyurl.com/ocyb2ld
7. BioSD Data (Common Attribute Types)
SPARQL Source: http://tinyurl.com/pjgdtzs
Tag Cloud made with http://www.wordle.net
8. BioSD Linked Data Model (Main Entities)
Please have a look at:
http://tinyurl.com/lo33ncc
9. BioSD Linked Data Model (Sample Attributes)
Please have a look at:
http://tinyurl.com/n5oyvyd
11. Find Samples and attributes
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>
SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE
{
  ?smp
    a biosd-terms:Sample;
    biosd-terms:has-bio-characteristic | sio:SIO_000332 ?pv. # is about
  ?pv
    rdfs:label ?pvLabel;
    biosd-terms:has-bio-characteristic-type ?pvType.
  ?pvType
    rdfs:label ?propTypeLabel.
}
• Exercise: use FILTER()/REGEX() to find samples with organism = homo sapiens
• Exercise: find the sample provenance repositories and their links
– Hint: explore the sample's links (?smp) and see what a RepositoryWebRecord
looks like
Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql
Exercise solution: see the examples on that page
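One possible way to approach the first exercise from a script is sketched below. It builds a FILTER()/REGEX() variant of the query on this slide and the GET URL defined by the SPARQL 1.1 Protocol (a query parameter on the endpoint). The function names, the row limit and the assumption that the attribute type is labelled "organism" are illustrative choices; the solution published on the endpoint page may differ.

```python
import re
from urllib.parse import urlencode

ENDPOINT = "http://www.ebi.ac.uk/rdf/services/biosamples/sparql"

# @ORGANISM@ and @LIMIT@ are placeholder tokens used only by this sketch.
QUERY_TEMPLATE = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE {
  ?smp a biosd-terms:Sample;
       biosd-terms:has-bio-characteristic ?pv.
  ?pv rdfs:label ?pvLabel;
      biosd-terms:has-bio-characteristic-type ?pvType.
  ?pvType rdfs:label ?propTypeLabel.
  FILTER ( REGEX ( ?propTypeLabel, "^organism$", "i" ) )
  FILTER ( REGEX ( ?pvLabel, "^@ORGANISM@$", "i" ) )
}
LIMIT @LIMIT@
"""


def organism_query(organism, limit=20):
    """Return a SPARQL query for samples whose organism label matches."""
    # Only letters separated by single spaces are accepted, so the name
    # cannot break out of the string literal or alter the regex.
    if not re.fullmatch(r"[A-Za-z]+(?: [A-Za-z]+)*", organism):
        raise ValueError("unsupported organism name: %r" % organism)
    return (QUERY_TEMPLATE
            .replace("@ORGANISM@", organism)
            .replace("@LIMIT@", str(int(limit))))


def query_url(query):
    """Build a SPARQL 1.1 Protocol GET URL for the query."""
    return ENDPOINT + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(organism_query("homo sapiens"))
    print(query_url(organism_query("homo sapiens")))
```

The printed query can be pasted into the endpoint's web form, or the URL can be fetched with any HTTP client. The "i" flag makes both matches case-insensitive, which matters because labels are free text.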
12. Samples about a given organism
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE {
  ?smp biosd-terms:has-bio-characteristic ?pv.
  ?pv biosd-terms:has-bio-characteristic-type ?pvType;
    rdfs:label ?pvLabel.
  ?pvType a ?pvTypeClass.
  # Listeria
  ?pvTypeClass
    rdfs:label ?propTypeLabel;
    # '*' gives you the transitive closure, even when inference is disabled
    rdfs:subClassOf* <http://purl.obolibrary.org/obo/NCBITaxon_1637>
}
• Exercise: use the BioPortal service to first find all subclasses of 'alcohol' (obo:CHEBI_30879)
and then search for samples annotated with such subclasses
– Hint: use SERVICE <http://sparql.bioontology.org/ontologies/sparql/?apikey=KEY>
Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql
Exercise solution: see one of the examples on that page
13. Geo-located Samples/Sample Groups
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>
SELECT DISTINCT ?item ?latVal ?longVal WHERE {
  ?item biosd-terms:has-bio-characteristic ?latPv, ?longPv.
  ?latPv
    biosd-terms:has-bio-characteristic-type [ rdfs:label ?latLabel ];
    sio:SIO_000300 ?latVal. # sio:has value
  FILTER ( REGEX ( ?latLabel, "latitude", "i" ) ).
  ?longPv
    biosd-terms:has-bio-characteristic-type [ rdfs:label ?longLabel ];
    sio:SIO_000300 ?longVal. # sio:has value
  FILTER ( REGEX ( ?longLabel, "longitude", "i" ) ).
}
• Exercise: find all samples having an attribute of type temperature, with a numerical value and a unit
specified. Hint: use sio:SIO_000221 (has unit), sio:SIO_000300 (has value)
• Exercise: find samples/groups annotated with intervals, which use the properties
biosd-terms:has-low-value and has-high-value and optionally have a unit.
Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql
Exercise solutions: see the examples on that page
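When the geo-location query is run from a program, the endpoint can typically return the standard SPARQL 1.1 Query Results JSON Format. The sketch below turns such a document into plain rows with numeric coordinates. The payload is invented for illustration (including the example.org sample URI) and is not real BioSD output; the function names are likewise assumptions.

```python
import json

XSD = "http://www.w3.org/2001/XMLSchema#"
NUMERIC_TYPES = {XSD + "double", XSD + "float", XSD + "decimal", XSD + "integer"}

# Invented payload in the SPARQL JSON results layout, for illustration only.
SAMPLE_RESULTS = """{
  "head": {"vars": ["item", "latVal", "longVal"]},
  "results": {"bindings": [
    {"item": {"type": "uri", "value": "http://example.org/sample/1"},
     "latVal": {"type": "literal",
                "datatype": "http://www.w3.org/2001/XMLSchema#double",
                "value": "51.5"},
     "longVal": {"type": "literal",
                 "datatype": "http://www.w3.org/2001/XMLSchema#double",
                 "value": "-0.12"}}
  ]}
}"""


def term_value(term):
    """Convert one RDF term of a binding into a Python value."""
    # Older endpoints may report typed literals as "typed-literal".
    is_literal = term.get("type") in ("literal", "typed-literal")
    if is_literal and term.get("datatype") in NUMERIC_TYPES:
        return float(term["value"])
    return term["value"]


def result_rows(payload):
    """Return one dict per solution, keyed by variable name."""
    doc = json.loads(payload)
    names = doc["head"]["vars"]
    return [
        {name: term_value(binding[name]) for name in names if name in binding}
        for binding in doc["results"]["bindings"]
    ]


for row in result_rows(SAMPLE_RESULTS):
    print(row["item"], row["latVal"], row["longVal"])
```

Unbound variables are simply absent from a row, which mirrors how optional parts of a query (for example an optional unit) show up in the JSON format.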
14. Expressed Genes and Samples
• For http://purl.uniprot.org/uniprot/P04637 (P53 in Human)
• Find the EFO classes for which it is up-regulated in the Atlas (p-value < 1E-9)
• And show the Atlas expression value label. Hints:
– Start from the example http://tinyurl.com/kvvhw6b
– Use the Atlas endpoint: http://www.ebi.ac.uk/rdf/services/atlas/sparql
• Find the samples having attributes that are instances of such EFO classes
• Which come from a repository other than 'ArrayExpress'
• Hints:
– Use SERVICE <http://www.ebi.ac.uk/rdf/services/biosamples/sparql> and a sub-query
– Search for property values linked to property types that are instances of the
experimental factors found by the Atlas
– Then link to the samples, the samples to the submissions, the submissions to the web
records
• Or just have a look: http://tinyurl.com/ln3m7nv (will take a while...)
15. Ideas for the Hackathon
• Refer to http://tinyurl.com/mo7wgye for details
• From geo-located samples (samples annotated with latitude/longitude) to Google
maps, e.g., by using Exhibit (http://www.simile-widgets.org/exhibit/)
• Take similar datasets (e.g., MAASTRO, Breast Cancer Data, your data), unify the
schemas (e.g., using CONSTRUCT), define federated queries
• Use the Shape or OpenPHACTS validator to define sensible rules for BioSD and
similar data sets, e.g., must contain an organism, should have a treatment
• Design/build an App (or Web widget) that asks for eligibility criteria, i.e., pairs of
attribute value/type, and translates them into a SPARQL query (or a more complex
search based on SPARQL) to find samples
– Use common ontologies for auto-completion over property types
– Use string-based auto-completion for values
– Consider numerical values, intervals, units
– Do approximate matching, i.e., matching 8/10 of the specified pairs is good.
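The approximate-matching idea in the last bullet can be sketched as a simple score: the fraction of requested (attribute type, value) pairs that a sample satisfies, with a threshold such as 0.8 for "8 out of 10". This is only one possible reading of the idea. The function names and the sample data are invented, and a real app would fetch the attributes through SPARQL rather than from a dictionary.

```python
def normalise(text):
    """Lower-case and collapse whitespace so labels compare loosely."""
    return " ".join(text.lower().split())


def match_score(criteria, attributes):
    """Fraction of (type, value) criteria that the attributes satisfy."""
    if not criteria:
        return 0.0
    known = {normalise(t): normalise(v) for t, v in attributes.items()}
    hits = sum(1 for t, v in criteria if known.get(normalise(t)) == normalise(v))
    return hits / len(criteria)


def find_samples(criteria, samples, threshold=0.8):
    """Return (sample id, score) pairs at or above the threshold, best first."""
    scored = ((sid, match_score(criteria, attrs)) for sid, attrs in samples.items())
    return sorted((p for p in scored if p[1] >= threshold),
                  key=lambda p: (-p[1], p[0]))


# Invented eligibility criteria and samples, for illustration only.
CRITERIA = [
    ("organism", "Homo sapiens"),
    ("sex", "female"),
    ("organism part", "liver"),
    ("disease state", "normal"),
    ("age", "40"),
]
SAMPLES = {
    "s1": {"Organism": "homo sapiens", "Sex": "female", "Organism Part": "liver",
           "Disease State": "normal", "Age": "40"},
    "s2": {"Organism": "homo sapiens", "Sex": "female", "Organism Part": "liver",
           "Disease State": "normal", "Age": "41"},
    "s3": {"Organism": "homo sapiens", "Sex": "male", "Organism Part": "liver",
           "Disease State": "normal", "Age": "41"},
}

print(find_samples(CRITERIA, SAMPLES))
```

Exact string comparison is the weakest part of this sketch. The other bullets on this slide (ontology-based auto-completion, numerical values, intervals, units) suggest where a real implementation would replace it with something smarter.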
16. Acknowledgements
• BioSD Team - Alvis Brazma, Tony Burdett, Adam
Faulconbridge, Mike Gostev, Helen Parkinson, Rui Perreria,
Ugis Sarkans, Drashtti Vasant
• Tony Burdett for the help with Zooma
• Simon Jupp, Andy Jenkinson, James Malone, for their great
help with developing and setting up BioSD/RDF
– The rest of the Linked Data team @EBI
(http://www.ebi.ac.uk/rdf)
• BiomedBridges FP7 project (http://www.biomedbridges.eu), for
funding us
17. And you all!
Sorry, we have 2.7M samples, but not all of them...
(Source: http://en.wikipedia.org/wiki/File:Assorted_computer_mice_-_MfK_Bern.jpg)
Contact info:
www.ebi.ac.uk/biosamples
www.marcobrandizi.info
19. Main Ontologies used in BioSD / Linked Data
• biosd-terms (http://tiny.cc/biosd_terms)
– a small application ontology defining specific classes and properties, e.g.,
sample, sample group, has-knowledgeable-person
• Experimental Factor Ontology (EFO)
– mainly to define/annotate sample attributes
• Ontology for Biomedical Investigations (OBI)
• Information Artifact Ontology (IAO)
• Semanticscience Integrated Ontology (SIO)
– to define main classes in BioSD/RDF
• Bibliographic Ontology (BIBO)
– We link publications about submissions/sample sets
• Dublin Core, schema.org, FOAF
– for general categories and in the Linked Data spirit
• Linked automatically by Zooma: many more (e.g., CHEBI, NCBI-Tax, GO)