July's Connected Data London meetup was hosted by Valtech and featured Augustine Kwanashie of the BBC talking about his experience of working on one of the world's leading linked data platforms.
At the BBC we store data about entities such as people, places, events and organisations that matter to our audiences (and appear in our programmes and online content) in an RDF store. These entities are then used to tag BBC content, and the resulting RDF graphs help to power audience-facing apps and websites. By connecting BBC content in this way, we enhance content discovery, grouping/aggregation and navigation, as well as personalisation and recommendations.
Augustine will talk about the architecture of our linked data system in terms of resilience, monitoring, performance, tooling, data quality and validation. He will also share some of our plans for opening up the platform and making the data accessible to the general public.
7. Tagging BBC Content
<http://www.bbc.co.uk/things/2b7ba3ca-32ca…>
    a cwork:CreativeWork, cwork:NewsItem ;
    cwork:title "In the future I will be better…" ;
    cwork:about <http://www.bbc../things/4bdbf2-d1ad…> .
<http://www.bbc.co.uk/things/4bdbf2-d1ad…>
    a core:Organisation, sport:SportingOrganisation ;
    core:label "Manchester City"@en-gb ;
    core:sameAs <http://www.wikidata.org/entity/Q50602> .
8. Tagging helps to:
o Group/aggregate content.
o Enhance content discovery.
o Enhance navigation.
o Improve personalisation and recommendation.
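The grouping benefit can be sketched with plain tuples standing in for triples. This is only an illustration: the identifiers are shortened placeholders rather than real BBC URIs, and `group_by_tag` is a hypothetical helper, not part of the platform.

```python
from collections import defaultdict

# Triples as (subject, predicate, object); all identifiers are placeholders.
TRIPLES = [
    ("cwork:1", "cwork:about", "things:mancity"),
    ("cwork:2", "cwork:about", "things:mancity"),
    ("cwork:3", "cwork:about", "things:bbc"),
    ("cwork:1", "cwork:title", "In the future I will be better"),
]


def group_by_tag(triples):
    """Map each tagged Thing to the sorted list of CreativeWorks about it."""
    groups = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == "cwork:about":
            groups[obj].append(subject)
    return {thing: sorted(works) for thing, works in groups.items()}


groups = group_by_tag(TRIPLES)
```

Looking up `groups["things:mancity"]` then gives every piece of content tagged with that Thing, which is the basis for an aggregation page.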
13. Custom APIs vs. SPARQL endpoints
Custom APIs
o Can ensure performance.
o Ideal for rarely changing use-cases.
o Can validate writes.
SPARQL endpoints
o Complete flexibility with queries.
o Ideal for varied/changing use-cases.
14. Write APIs
Validation: Applies a set of validation rules
Security: Authenticates via SSL certificate whitelists
Content-Types: Accepts Turtle, RDF+XML
Persistence: Writes asynchronously to the triplestore
PUT: https://ldp-writer.int.api.bbci.co.uk/creative-works
Content-Type: text/turtle
Body:
<http://www.bbc.co.uk/things/2b7ba3ca-32ca…>
    a cwork:CreativeWork, cwork:NewsItem ;
    cwork:title "Pep Guardiola…" ;
    cwork:about <http://www.bbc../things/4bd…> .
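A client call of this shape could be prepared as below. This is a minimal sketch using only the Python standard library: the URL, the `cwork:` namespace and the resource IRIs are placeholders invented for the example, and the request is built but not sent, since a real call would also need the client certificate the slide mentions.

```python
import urllib.request

# Hypothetical endpoint, not the real writer API.
WRITER_URL = "https://writer.example.org/creative-works"

# Placeholder namespace and IRIs, for illustration only.
TURTLE_BODY = """\
@prefix cwork: <http://example.org/cwork#> .

<http://example.org/things/work-1>
    a cwork:CreativeWork, cwork:NewsItem ;
    cwork:title "Example title" ;
    cwork:about <http://example.org/things/thing-1> .
"""


def build_put(url, turtle):
    """Build (but do not send) a PUT request carrying a Turtle document."""
    return urllib.request.Request(
        url,
        data=turtle.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/turtle"},
    )


request = build_put(WRITER_URL, TURTLE_BODY)
```

Because the slide says writes are persisted asynchronously, a caller should not assume the data is queryable as soon as the request returns.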
15. Read APIs
Filters & Mixins: Restrict returned data by type, domain, etc.
Search: Full-text search on labels.
Content-Types: Produces TriG, JSON+LD, JSON, HTML
Security: Authenticates via SSL certificates
GET: https://things.api.bbc.com/things
?type = core:Person
&label_search = Theresa
&mixin = pol
Accept: json+ld
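The query string above can be assembled programmatically. The sketch below uses the parameter names from the slide (`type`, `label_search`, `mixin`) but a placeholder host, and it only builds the URL; `build_things_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Hypothetical base URL standing in for the Things read API.
BASE_URL = "https://things.example.org/things"


def build_things_url(thing_type, label_search, mixin):
    """Return a GET URL that filters by type, searches labels and adds a mixin."""
    query = urlencode(
        {"type": thing_type, "label_search": label_search, "mixin": mixin}
    )
    return BASE_URL + "?" + query


url = build_things_url("core:Person", "Theresa", "pol")
```

Note that `urlencode` percent-encodes the colon in `core:Person`, so the prefixed name survives transport unchanged once the server decodes it.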
16. Documenting APIs
<urn:api:things:documentation> {
    <urn:api:things:get-multiple:covered-by> a api:Filter ;
        api:collectionFormat "multi"^^xsd:string ;
        api:description "Filter for Things with a matching bbc:coveredBy relationship."^^xsd:string ;
        api:in "query"^^xsd:string ;
        api:name "covered_by"^^xsd:string ;
        api:required "false"^^xsd:boolean ;
        api:type "array"^^xsd:string .
}
21. Some Validation Rules
Cannot delete a Thing that is used to tag a CreativeWork
things:635 core:preferredLabel "Manchester City" ;
    sport:competesIn things:834 .
cwork:345 a cwork:CreativeWork ;
    tagging:about things:635 .
DELETE: https://ldp-writer.bbc.com?guid=things:635
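The rule can be expressed as a simple check over the stored triples. This is a sketch of the idea only, with tuples in place of a triplestore and a hypothetical `can_delete` helper; the data mirrors the slide.

```python
# Triples as (subject, predicate, object), copied from the slide's example.
TRIPLES = [
    ("things:635", "core:preferredLabel", "Manchester City"),
    ("things:635", "sport:competesIn", "things:834"),
    ("cwork:345", "rdf:type", "cwork:CreativeWork"),
    ("cwork:345", "tagging:about", "things:635"),
]


def can_delete(thing, triples):
    """A Thing may be deleted only if no CreativeWork is tagged with it."""
    return not any(
        predicate == "tagging:about" and obj == thing
        for _, predicate, obj in triples
    )


delete_635_allowed = can_delete("things:635", TRIPLES)  # tagged, so rejected
delete_834_allowed = can_delete("things:834", TRIPLES)  # untagged, so allowed
```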
22. Some Validation Rules
Cannot update a ThingGraph managed by another CMS
context:02 {
    things:635 sport:competesIn things:834 .
    context:02 prov:managedBy cms:LDM .
}
PUT: https://ldp-writer.bbc.com?guid=things:635
X-ManagedBy: VIVO
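A check of this kind might compare the manager recorded on the graph with the value of the `X-ManagedBy` header. The sketch below assumes the header carries the local name of the managing CMS (for example `LDM` for `cms:LDM`); that mapping and the `can_update` helper are assumptions, not something the slide states.

```python
# Which CMS manages each ThingGraph, as in the slide's example.
MANAGED_BY = {"context:02": "cms:LDM"}


def can_update(context, managed_by_header, managers=MANAGED_BY):
    """Allow the update only when the requesting CMS manages the graph."""
    manager = managers.get(context)
    if manager is None:
        return True  # assumption: an unmanaged graph may be claimed by any CMS
    return manager == "cms:" + managed_by_header


vivo_allowed = can_update("context:02", "VIVO")  # the slide's rejected request
ldm_allowed = can_update("context:02", "LDM")
```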
28. Tagging out of Context
core:label "The Presidents of the United States of America" ;
core:disambiguationHint "Music Group" ;
Hard to identify and prevent!
34. Optimise SPARQL queries
Before:
SELECT ?subject ?predicate ?object WHERE {
    ?subject ?predicate ?object .
    {
        SELECT ?subject WHERE {
            OPTIONAL {
                ?subject prov:createdBy ?created .
            }
        }
        GROUP BY (?subject)
        HAVING BOUND (?subject)
    }
}
After:
SELECT ?subject ?predicate ?object WHERE {
    ?subject ?predicate ?object .
    {
        SELECT ?subject WHERE {
            ?subject prov:createdBy ?created .
        }
        GROUP BY (?subject)
    }
}
35. Load-test against future demand
o 100% increase in the number of CreativeWorks by 2019
o 60% increase in 99th percentile response times by 2019
o 21m requests to the CreativeWorks API daily
o 94m triples in the triplestore
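To put these figures in context, the daily request count can be turned into an average rate, and a 99th percentile can be computed from load-test samples. The sketch assumes "21m" means 21 million and uses the nearest-rank percentile definition; the slide does not say how the BBC measures either.

```python
REQUESTS_PER_DAY = 21_000_000  # assumption: "21m" means 21 million
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate only; real traffic peaks will be higher than this.
average_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


# Illustrative response times in milliseconds (1..100), not measured data.
p99 = percentile(range(1, 101), 99)
```

The average works out at roughly 243 requests per second, so a load test aimed at future demand would need to exceed that comfortably.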
37. Auto-scaling on API Instances
1 ELB sends metrics
2 Instances send metrics
3 Alarms trigger autoscaling action
4 New instance is created
5 Instance is added to pool
40. Queue-based write pipeline
Queued writes across multiple clusters
[Diagram: Writer API; Consumer and Primary GraphDB Cluster; Consumer and Replica GraphDB Cluster]
41. Event-based write pipeline
Event-based writes improve resilience
[Diagram: Writer API; Consumer; Replica GraphDB Cluster; Primary GraphDB Cluster; Event store API; RDS; Notification Topics]
42. Backup and Recovery
o 26GB per backup
o 20 mins recovery time
o 16 full backups per day
OpsWorks recipes to:
o Switch Primary and Replica cluster roles.
o Schedule backups.
o Restore backup to cluster.
S3:
o Stores backups by date/time.
o Retires old backups to Glacier.
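The backup figures imply a schedule and a storage volume that are easy to work out. The calculation below assumes the 16 daily backups are evenly spaced, which the slide does not confirm.

```python
BACKUP_SIZE_GB = 26
BACKUPS_PER_DAY = 16

# Assumption: backups are evenly spaced across 24 hours.
minutes_between_backups = 24 * 60 / BACKUPS_PER_DAY

# Volume written to S3 per day before old backups are retired to Glacier.
daily_backup_volume_gb = BACKUP_SIZE_GB * BACKUPS_PER_DAY
```

Under that assumption a backup is taken every 90 minutes and about 416GB is written per day, which helps explain why old backups are moved to Glacier.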
52. Main points
o Separating content from metadata
o APIs powered by Linked Data
o Monitoring and reacting to incidents
o Performance for present and future
57. Handling Data
Scala libraries to enable easy RDF manipulation
Connections-RDF
o Import/Export
o Create triples
o Compare Graphs
o Navigate Graphs
o Manage Datasets
[Diagram: Connections-RDF between two sets of serialisations (Trig, Turtle, etc.)]
58. Handling Data
RDF DSL in Scala
val rdfGraph = (
Iri("http://…") >> Rdf.`type` >>> Core.Thing
>> Sport.`type` >>> Sport.Organisation
>> BBC.coveredBy >>> Iri("urn:bbc:news")
>> Core.label >>> "Manchester City"
)
val label = (rdfGraph Core.label).get[String]
59. Some Validation Rules
things:635 core:preferredLabel "Manchester City" ;
cms:locator <urn:bbc:cps:asset:39715040>,
<urn:bbc:cps:asset:39715040> .
Thing locators must be unique
things:635 core:preferredLabel "Manchester City" ;
cms:locator <urn:bbc:cps:asset:01>,
<urn:bbc:cps:asset:02> .
<urn:bbc:cps:asset:01> a cms:CPSLocator .
<urn:bbc:cps:asset:02> a cms:CPSLocator .
Locator Types must be unique
60. Some Validation Rules
things:635 cms:locator <urn:bbc:cps:asset:01> .
things:636 cms:locator <urn:bbc:cps:asset:01> .
Multiple Things with the same locator
things:635 cms:sameAs dbpedia:01 .
things:636 cms:sameAs dbpedia:01 .
Multiple Things with the same sameAs
things:635 core:label "Manchester City" ;
    rdf:type owl:Class .
Blacklisted URIs present
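The first two rules on this slide share one shape: no two Things may point at the same value for a given predicate. A minimal sketch, with tuples in place of a triplestore and a hypothetical `shared_values` helper:

```python
from collections import defaultdict

# Triples copied from the slide's examples.
TRIPLES = [
    ("things:635", "cms:locator", "urn:bbc:cps:asset:01"),
    ("things:636", "cms:locator", "urn:bbc:cps:asset:01"),
    ("things:635", "cms:sameAs", "dbpedia:01"),
    ("things:636", "cms:sameAs", "dbpedia:01"),
    ("things:637", "cms:locator", "urn:bbc:cps:asset:02"),
]


def shared_values(triples, predicate):
    """Return values of `predicate` that more than one Thing points at."""
    owners = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            owners[obj].add(subject)
    return {
        value: sorted(things)
        for value, things in owners.items()
        if len(things) > 1
    }


locator_clashes = shared_values(TRIPLES, "cms:locator")
same_as_clashes = shared_values(TRIPLES, "cms:sameAs")
```

Any non-empty result would be reported as a validation failure before the write is accepted.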
61. Ordering Thing Updates Correctly
Events: create:1, update:2, update:3, delete:4
[Diagram: Event store; Document Writer; Primary GraphDB Cluster; Retry queue]
1 Fetch events from Event store
2 Execute task on Triplestore (only if task id is newer)
3 Errors? Put on Retry queue
4 Fetch and process events from Retry queue
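The "only if task id is newer" guard is what keeps retried events from overwriting later updates. The sketch below simulates that with dictionaries standing in for the triplestore and a deliberately flaky executor; all names are hypothetical and the failure is simulated.

```python
from collections import deque

last_applied = {}  # Thing -> id of the newest task applied so far
triplestore = {}   # Thing -> current state (stand-in for the real store)
retry_queue = deque()
failed_once = set()


def execute(event):
    """Step 2: apply the task only if its id is newer than the last one applied."""
    task_id, thing, action = event
    if task_id <= last_applied.get(thing, 0):
        return False  # stale: a newer task has already been applied
    if action == "delete":
        triplestore.pop(thing, None)
    else:
        triplestore[thing] = "%s:%d" % (action, task_id)
    last_applied[thing] = task_id
    return True


def flaky_execute(event):
    """Simulate a one-off triplestore error on task 2."""
    if event[0] == 2 and event[0] not in failed_once:
        failed_once.add(event[0])
        raise IOError("simulated triplestore outage")
    return execute(event)


events = [(1, "things:635", "create"), (2, "things:635", "update"),
          (3, "things:635", "update")]

for event in events:  # steps 1 to 3
    try:
        flaky_execute(event)
    except IOError:
        retry_queue.append(event)

retry_results = []
while retry_queue:  # step 4
    retry_results.append(flaky_execute(retry_queue.popleft()))
```

Here update:2 fails and is retried after update:3 has been applied; the guard skips it, so the Thing keeps the state from update:3.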
63. Search: Creating an Index
[Diagram: example graph used to build the index]
:manutd type :fclub
:manutd label "Manchester United"
:manutd locatedIn :manc
:fclub label "Football Club"
:manc label "Manchester"
:manc locatedIn :uk
:uk label "United Kingdom"
RDF Module for :manutd
RDF Module for :manc
64. Search: Full & Incremental Re-index
INSERT DATA {
luc:labelIndex luc:addToIndex <http://www.bbc.co.uk/things/2b7ba3ca-32ca…> .
}
Run incremental re-index after each Thing update
INSERT DATA {
luc:labelIndex luc:updateIndex _:b1 .
}
Run full re-index once daily
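The two update strings can be generated from application code. The sketch below only builds the SPARQL text shown on the slide (without prefix declarations, which the slide omits); the helper name and the IRI safety check are assumptions added for illustration.

```python
# Full re-index statement, as shown on the slide (run once daily).
FULL_REINDEX = "INSERT DATA {\n  luc:labelIndex luc:updateIndex _:b1 .\n}"


def incremental_reindex(thing_iri):
    """Build the incremental re-index update for one Thing."""
    # Reject characters that could break out of the <...> IRI delimiters.
    if any(ch in thing_iri for ch in '<>" \n\t'):
        raise ValueError("unsafe IRI: %r" % thing_iri)
    return "INSERT DATA {\n  luc:labelIndex luc:addToIndex <%s> .\n}" % thing_iri


# Placeholder IRI, not a real Thing.
update = incremental_reindex("http://example.org/things/thing-1")
```

The incremental form would be issued after each Thing update, with the full form left to a scheduled job.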
65. Search: Full Text Search Query
SELECT ?thing ?score WHERE {
?thing a tagging:TagConcept .
?thing luc:score ?score .
?thing luc:labelIndex " (Manchester OR *Manchester OR *Manchester*) " .
}
o Index available during the re-index process