Cape Town - Bioschemas workshop before the Bioinformatics Education Summit.
Explains schema.org, Bioschemas, the TeSS case study, and the tools and implementation techniques adopters can use.
3. Expected Learning Outcomes
• Understand what schema.org is and how it can be applied to a project
• Understand what Bioschemas is, how it differs from schema.org, and what vocabularies are available
• Know the benefits and limitations of using schema.org
• Gain an understanding of how to apply (bio/)schema.org to your site
4. Workshop style
• Please do interrupt me if:
  - You have any questions
  - You have difficulty reading the slides
  - I'm not speaking clearly enough
  - I am going too fast or too slow
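8. Search Engines
[Diagram: search engines use "signals" and algorithms to guess which pages match a query]
• Labels on the diagram: User Information, Connect
• Signals shown: query text, demographic, location, device type, document content, web traffic, link count, freshness
• 21 "signals"
• Algorithms to guess matches: text matching, named entity recognition, TF-IDF, NLP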
9. Take out some of the guesswork…
• Search engines need to predict what a page is about…
• What if, instead, search engines allowed information providers to explicitly define their pages' contents?
• Rather than relying on algorithmic guesswork!
11. Schema.org
• A lightweight way of structuring data online
• Created by a consortium of search engines to improve experience and search efficacy
• Thousands of different vocabularies to describe information online
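As a rough illustration of what "structuring data" means here, the sketch below builds a schema.org description of a training event as a Python dictionary and serialises it as JSON-LD. The event details are invented for illustration; only the keys (`@context`, `@type`, `name`, `startDate`, `location`) are schema.org / JSON-LD terms.

```python
import json

# A hypothetical training event described with schema.org terms.
# The values are made up; only the keys come from schema.org / JSON-LD.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Introduction to Bioschemas",
    "startDate": "2019-05-13",
    "location": {"@type": "Place", "name": "Cape Town"},
}


def to_jsonld(description: dict) -> str:
    """Serialise a schema.org description as a JSON-LD string."""
    return json.dumps(description, indent=2)


if __name__ == "__main__":
    print(to_jsonld(event))
```

A search engine or aggregator that reads this no longer has to guess the event's name, date or place from the surrounding text.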
23. Schema.org is community made
• Schema.org is made up of decentralized extensions from different industries
24. Schema.org is community made
• Extensions that see good usage get "folded-in" to the core schema.org vocabularies
25. Schema.org is community made
• To take advantage of schema.org for Bioinformatics, we need to make our own community
[Diagram label: Bioinformatics / Life science Community]
27. Bioschemas
See: "The FAIR Guiding Principles for scientific data management and stewardship", Mark D Wilkinson et al, 2016
28. Schema.org is community made
• … Bioschemas is a community that proposes life science specifications to schema.org
[Diagram label: Bioinformatics / Life science Community]
29. Bioschemas
• Bioschemas is a community project which:
  - Creates Types for Life science resources
    • Proteins, Samples, Beacons, Tools, Training, etc.
  - Creates Profiles to refine and enhance Types
    • Marginality
    • Cardinality
    • Controlled Vocabularies
  - Creates tools to make Bioschemas markup easier to create, validate, and extract
30. Types
• Types = New vocabularies to propose to schema.org
  - Some are Biological Types
  - Some are Generic Types that are useful to Life scientists
  - These new types will be hosted at bio.schema.org
  - Currently at: http://bio.sdo-bioschemas-227516.appspot.com
34. Profiles
• Profiles = Refinement & Interoperability Layer
  - Because every industry and domain shares in these specifications…
  - Every domain includes its own properties
  - So we inherit lots of properties we don't care about
Schema.org is messy!
35. Profiles - Tidying up Schema.org
• For example:
  - Dataset inherits from schema.org/CreativeWork
  - CreativeWork (and therefore Dataset) contains properties for:
    • character
    • isFamilyFriendly
    • material (e.g. leather, wool, cotton, paper)
    • genre
• Bioschemas offers an indication of how relevant / recommended each property is, by grouping properties into:
• Minimum | Recommended | Optional
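One way to picture the Minimum | Recommended | Optional grouping is as a lookup from property name to level, which can then be used to report what a piece of markup is missing. This is a minimal sketch: `EXAMPLE_PROFILE` and its groupings are hypothetical and are not the published Bioschemas Dataset profile.

```python
# Hypothetical profile: property name -> marginality level.
# These groupings are illustrative, NOT the published Bioschemas Dataset profile.
EXAMPLE_PROFILE = {
    "name": "minimum",
    "description": "minimum",
    "url": "minimum",
    "keywords": "recommended",
    "license": "recommended",
    "genre": "optional",
}


def missing_by_level(markup: dict, profile: dict) -> dict:
    """Group the profile properties absent from `markup` by marginality level."""
    missing = {"minimum": [], "recommended": [], "optional": []}
    for prop, level in profile.items():
        if prop not in markup:
            missing[level].append(prop)
    return missing


if __name__ == "__main__":
    dataset = {
        "@type": "Dataset",
        "name": "Example dataset",
        "url": "https://example.org/datasets/1",
    }
    print(missing_by_level(dataset, EXAMPLE_PROFILE))
```

A site owner could treat anything reported under "minimum" as an error and anything under "recommended" as a warning.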
36. Profiles
• Profiles = Refinement & Interoperability Layer
  - schema.org's generality means it does not recommend which ontologies to annotate with
  - Lack of restrictions on cardinality makes it difficult to parse the data (if you're not a huge search engine)
Schema.org is not great for interoperability!
37. Profiles - Improving interoperability
• Bioschemas profiles include cardinality restrictions and controlled vocabularies tailored to our use-cases
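To make cardinality restrictions and controlled vocabularies concrete, here is a small sketch of a checker. The rule table `EXAMPLE_RULES`, the property choices and the vocabulary terms are all hypothetical; real profiles define their own.

```python
# Hypothetical rules: property -> (cardinality, allowed values or None).
# "ONE" means a single value; "MANY" means one value or a list of values.
EXAMPLE_RULES = {
    "name": ("ONE", None),
    "keywords": ("MANY", None),
    "audience": ("MANY", {"beginner", "intermediate", "advanced"}),
}


def check_property(prop, value, rules):
    """Return a list of problems found for one property value."""
    cardinality, vocabulary = rules[prop]
    problems = []
    is_list = isinstance(value, list)
    if cardinality == "ONE" and is_list:
        problems.append(f"{prop}: expected one value, got a list")
    values = value if is_list else [value]
    if vocabulary is not None:
        for v in values:
            if v not in vocabulary:
                problems.append(f"{prop}: '{v}' is not in the controlled vocabulary")
    return problems


if __name__ == "__main__":
    print(check_property("name", ["A", "B"], EXAMPLE_RULES))
    print(check_property("audience", ["beginner", "expert"], EXAMPLE_RULES))
```

With rules like these, a consumer knows in advance whether to expect one value or several, and which terms it may encounter.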
39. Profile Development process
• Determining the schema is a process of empirical surveying and expert opinion.
• We do a cross-walk to find which fields are missing and use this to gauge marginality
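The empirical half of a cross-walk can be sketched as counting how many surveyed resources expose each field. The site names, field names and the 80% / 40% thresholds below are assumptions made for illustration; in practice the counts only inform the discussion and expert opinion decides the final level.

```python
from collections import Counter


def crosswalk(surveyed_resources: dict) -> Counter:
    """Count how many surveyed resources use each metadata field."""
    counts = Counter()
    for fields in surveyed_resources.values():
        counts.update(set(fields))  # count each field once per resource
    return counts


def suggest_marginality(counts, total, minimum_at=0.8, recommended_at=0.4):
    """Suggest a level per field from the share of resources using it.

    The thresholds are illustrative defaults, not Bioschemas policy.
    """
    suggestion = {}
    for field, n in counts.items():
        share = n / total
        if share >= minimum_at:
            suggestion[field] = "minimum"
        elif share >= recommended_at:
            suggestion[field] = "recommended"
        else:
            suggestion[field] = "optional"
    return suggestion


if __name__ == "__main__":
    survey = {  # hypothetical sites and the fields they publish
        "site_a": ["title", "date", "venue"],
        "site_b": ["title", "date"],
        "site_c": ["title", "cost"],
    }
    print(suggest_marginality(crosswalk(survey), total=len(survey)))
```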
40. Profile Development process
[Diagram: questions asked for each row of the cross-walk spreadsheet; the slide labels them with Column F, G, H and I]
• Go through each attribute (row) of the schema
• If we already have it: do we want to keep it?
• If we don't have it: do we want to include it?
• Is the description provided okay? Do we want to rewrite it?
• Should it be Minimum / Recommended / Optional?
• Should there be one or many of them?
• Should values be restricted to a controlled vocab?
• Agree on answers for each of these questions
46. TeSS
• A training portal that indexes metadata from across the web.
• Presents a wide selection of openly available training resources across the bioinformatics discipline.
• Displays these in a navigable, easy-to-find manner, in a feature-rich environment.
49. TeSS Features
Search and Filter
• 270+ upcoming events
• 800+ training materials
• Filter with 10+ different facets
Institutional Login
Login with ELIXIR AAI using your institutional or Google credentials with 1-click sign-on, to:
• Favourite resources
• Add new events & materials
• Create new training workflows
Events
Stay informed about upcoming events of interest
• E-mail subscription
• Import into calendar applications
50. TeSS Features
Link with other registries
• Training events and materials can be linked with resources from other registries:
  - Tools & Data services from bio.tools
  - Databases, standards & policies from fairsharing.org
Ontological Classification
• The BioPortal Annotator web service predicts topics of resources added to TeSS. These can be approved/rejected easily by our curation group
Events map
• View filtered events plotted on a map to find the most accessible & relevant events
51. Content sourcing
• Rely on community to register resources?
  - Community needs to be moderated (to avoid spammers)
  - Hard to get critical mass of community involvement
• Rely on curators to enter content?
  - Curators need to be paid / incentivized
  - Data entry is boring
• A drop in curation/moderation attention can lead to inaccurate, malevolent, or insufficient content
• Instead develop a solution that
  - Takes metadata directly from sources
  - Adds any resources to TeSS as they appear
  - Updates any resources that have changed
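The "add new, update changed" behaviour can be sketched as a sync step that compares a fingerprint of each scraped record with the one already stored. This is an illustrative sketch, not how TeSS is implemented; the catalogue structure, the `fingerprint` helper and the use of the page URL as a key are all assumptions.

```python
import hashlib
import json


def fingerprint(metadata: dict) -> str:
    """Stable hash of a metadata record, used to detect changes."""
    canonical = json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sync(catalogue: dict, scraped: dict) -> dict:
    """Add new records and update changed ones; return the action taken per URL."""
    actions = {}
    for url, metadata in scraped.items():
        new_print = fingerprint(metadata)
        existing = catalogue.get(url)
        if existing is None:
            actions[url] = "added"
        elif existing["fingerprint"] != new_print:
            actions[url] = "updated"
        else:
            actions[url] = "unchanged"
            continue  # nothing to write
        catalogue[url] = {"metadata": metadata, "fingerprint": new_print}
    return actions


if __name__ == "__main__":
    catalogue = {}
    page = "https://example.org/events/1"  # hypothetical source page
    print(sync(catalogue, {page: {"name": "Intro course"}}))
    print(sync(catalogue, {page: {"name": "Intro course (new date)"}}))
```

Run on a schedule, a step like this keeps the catalogue current without anyone re-entering data by hand.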
52. How TeSS works
[Diagram: TeSS architecture]
• Online Training Resources: the source websites
• Automated Aggregator: custom scrapers extract metadata from training material and events pages
• User enters form data
• Back End: Metadata Catalogue (Events, Materials, Workflows)
• Front End: Search Interface (finds relevant resources) and Workflow Viewer (training workflows)
53. Automatic extraction techniques
• There are several techniques we can use to extract metadata from content provider websites. Which one applies depends on what's on the site.
• Interface with an API
  - Handy but rare, and difficult for websites to implement
  - Content aggregators must write a bespoke API client for each
• Structured data already embedded in the page (RSS, ICS)
  - Limited amount of data
• HTML scraping
  - Fragile technique that can break when there are changes to the website
54. Trade-off between ease of adopting and usefulness to aggregators
[Chart axes: ease to implement on a website; usefulness to aggregator]
55. Content Provider extraction technique statistics
Technique                  Events  Materials  Total
Schema.org / Bioschemas         9          6     15
HTML                            3          5      8
XML/JSON/YAML/CSV               4          3      7
iCal                            5         --      5
JSON API                       --          2      2
RSS                             1         --      1
Total                                            38
56. Content aggregation via Bioschemas
[Diagram: TeSS architecture with a Schema.org scraper]
• Online Training Resources: the source websites
• Automated Aggregator: a Schema.org scraper alongside custom scrapers extracts metadata from training material and events pages
• Back End: Metadata Catalogue (Events, Materials, Workflows)
• Front End: Search Interface (finds relevant resources) and Workflow Viewer (training workflows)
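A generic schema.org scraper can be much simpler than a per-site HTML scraper, because it only has to find the JSON-LD blocks in a page. The sketch below uses Python's standard `html.parser` module; it is an illustration of the idea rather than the TeSS scraper, and the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than fail the whole page


def extract_jsonld(html: str) -> list:
    """Return every parsed JSON-LD block found in an HTML document."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

The same function works on any page that embeds JSON-LD, which is why one scraper can replace several custom ones.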
58. Technique for adding Bioschemas to a website
• 1. Identify an appropriate schema (or schemas) for your content type
• 1a. If it doesn't exist, e-mail the W3C mailing list or add an issue to the GitHub issue tracker
Issue tracker: https://github.com/BioSchemas/specifications
Mailing list: https://www.w3.org/community/bioschemas/
59. Technique for adding Bioschemas to a website
• 2. Draw a table and write down your metadata fields on the left-hand side and the schema.org properties on the right.
• Map the ones that correlate
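The table from step 2 translates naturally into a dictionary. In this sketch the left-hand field names (`title`, `summary`, `starts`, `ends`, `room_booking_ref`) are hypothetical site fields; the right-hand names are schema.org properties, with `None` marking a field that has no counterpart.

```python
# Left-hand side: hypothetical field names from a site's own database.
# Right-hand side: the schema.org property each maps to (None = no match).
FIELD_MAP = {
    "title": "name",
    "summary": "description",
    "starts": "startDate",
    "ends": "endDate",
    "room_booking_ref": None,
}


def apply_mapping(record: dict, field_map: dict):
    """Rename mapped fields to schema.org properties; list the ones left over."""
    mapped, unmapped = {}, []
    for field, value in record.items():
        target = field_map.get(field)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, unmapped


if __name__ == "__main__":
    record = {"title": "Intro course", "starts": "2019-05-13", "room_booking_ref": "B42"}
    print(apply_mapping(record, FIELD_MAP))
```

Fields that end up in the unmapped list are candidates for raising with the community if they seem generally useful.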
60. Technique for adding Bioschemas to a website
• 3a. Use the Bioschemas Generator to create a JSON-LD snippet that you can (hopefully) copy and paste into your site. (This would mean creating one for every new schema.org record you want to add.)
http://www.macs.hw.ac.uk/SWeL/BioschemasGenerator/
61. Technique for adding Bioschemas to a website
• 3b. If you can modify your site, paste in the JSON-LD template of the schema (from 3a), and render the metadata variables as the values of its keys
[Diagram label: Mapping]
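Step 3b amounts to keeping the JSON-LD keys fixed and filling in the values from each record when the page is rendered. A real site would normally do this in its own template language; the Python below is a stand-in sketch, and the record fields (`title`, `summary`, `page_url`) and the choice of the `Course` type are assumptions.

```python
import json


def render_jsonld_script(course: dict) -> str:
    """Render a site record as a <script> block of schema.org JSON-LD."""
    template = {
        "@context": "https://schema.org",
        "@type": "Course",
        "name": course["title"],           # site field -> schema.org property
        "description": course["summary"],
        "url": course["page_url"],
    }
    # Escape "</" so a value can never close the script element early;
    # "\/" is a valid JSON escape for "/".
    body = json.dumps(template, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    record = {
        "title": "Intro to Bioschemas",
        "summary": "A half-day workshop.",
        "page_url": "https://example.org/courses/1",
    }
    print(render_jsonld_script(record))
```

Because the values come from the same record that renders the visible page, the markup stays in step with the content.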
62. Technique for adding Bioschemas to a website
• 3c. If your site is using a CMS such as WordPress or Drupal, explore whether there is an appropriate schema.org plugin you can use (or ask on the Bioschemas mailing list)
63. Tutorials
• Bioschemas Training Portal
  - There is a step-by-step tutorial on there for adding schema.org to Jekyll pages / GitHub Pages sites.
  - Hopefully there will be more to come
https://bioschemas.gitbook.io/training-portal
64. Tools
• Bioschemas Generator
  - Form-based tool to generate valid Bioschemas JSON-LD
  - http://www.macs.hw.ac.uk/SWeL/BioschemasGenerator/
• Validata [under construction]
  - Web application for validating Bioschemas markup
https://bioschemas.org/software/
65. Tools
• GoCrawlt
  - JSON-LD schema.org extractor
• Buzzbang [on hold]
  - Search engine that crawls the web for Bioschemas JSON-LD
https://bioschemas.org/software/
66. Freebies from Schema.org
• Google Search Console
  - Shows you what schema.org data Google is picking up from your site, any errors, and advice on how to fix them
  - https://search.google.com/search-console
67. Freebies from Schema.org
• Google Structured Data Testing Tool
  - Extracts the schema.org markup from a given web page or from a code snippet, validates it, and shows you what errors there are
  - https://search.google.com/structured-data/testing-tool
68. Freebies from Schema.org ecosystem
• 3rd party plug-ins
  - Lots available to help add schema.org to your framework
Collection of schemas can be used to describe online objects
Schema.org very lightweight
Going clockwise from top right, we have international organizations, communities surrounding technologies, national institutions, and other academic institutions.
All output training events and/or materials and share via their own websites. Many, many opportunities in many, many locations.