1. Technical challenges in resource discovery
Paul Walk
paul@paulwalk.net
@paulwalk
http://www.paulwalk.net
2. Contents
1. a general consideration:
  • open or closed
2. a particular challenge:
  • synchronisation in an open world
3. the ‘nothing new’, but doing it better
  • APIs that work and can be trusted
4. open and closed worlds
• I’m not talking about licensing or access to data
• open
  • unbounded - like the Web
• closed
  • bounded - like most collections management systems, aggregations etc.
• formally, much of what we do is underpinned by ‘open/closed world’ assumptions:
  • open world assumption: any statement not known to be true is unknown
  • closed world assumption: any statement not known to be true is false
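The difference between the two assumptions can be shown in a few lines. This is only a minimal sketch: the statements in `KNOWN_FACTS` and both function names are hypothetical, invented for illustration.

```python
from typing import Optional, Tuple

Statement = Tuple[str, str, str]

# Hypothetical statements that this system knows to be true.
KNOWN_FACTS = {
    ("item-42", "heldBy", "library-a"),
}


def holds_closed_world(statement: Statement) -> bool:
    """Closed world: any statement not known to be true is false."""
    return statement in KNOWN_FACTS


def holds_open_world(statement: Statement) -> Optional[bool]:
    """Open world: any statement not known to be true is unknown (None)."""
    return True if statement in KNOWN_FACTS else None


question = ("item-42", "heldBy", "library-b")
closed_answer = holds_closed_world(question)  # False
open_answer = holds_open_world(question)      # None, i.e. unknown
```

A bounded system such as a collections management system can reasonably answer "no" here; a system integrating data across the Web can only answer "not known".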
7. judging where to apply each
• we need our infrastructure (especially integration technology between systems) to be open and relatively unbounded
• the Web is still the best available foundation for this
• however, we still need to manage our resources, maintain quality and honour complex rights management commitments
• we probably need to recognise that users’ experience is often enhanced through the application of a more focussed, targeted and context-aware approach
9. synchronisation
• how is the state of the resource maintained across an infrastructure of ‘federated’ repositories?
• if a resource is changed or deleted, how does the right-hand side aggregation know?
• note - this is based on our existing ‘harvesting’ or ‘pull’ approach
[Diagram: Resources held in Collections are harvested into Aggregations. Caption: multiple harvest routes, multiple copies]
10. ResourceSync
• a joint project of NISO and OAI, led by Herbert Van de Sompel of Los Alamos
• a light-weight mechanism to allow the state of web resources to be communicated between web systems
• developing a spec which builds on the sitemap specification, allowing content providers to publish changesets
• draft: http://bit.ly/WYhTz2
• Jisc have funded UK participation in this
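To illustrate why published changesets help an aggregation stay in step with its sources, here is a minimal sketch of applying a changeset to a local copy. The `Change` record and the three change kinds are assumptions made for this example; they are not the format defined in the ResourceSync draft.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    """Hypothetical changeset entry; not the ResourceSync wire format."""
    uri: str
    kind: str  # "created", "updated" or "deleted"
    content: Optional[str] = None


def apply_changes(copy: Dict[str, str], changes: List[Change]) -> Dict[str, str]:
    """Return a new local copy with the published changes applied in order."""
    synced = dict(copy)
    for change in changes:
        if change.kind == "deleted":
            synced.pop(change.uri, None)
        elif change.kind in ("created", "updated"):
            synced[change.uri] = change.content or ""
        else:
            raise ValueError(f"unknown change kind: {change.kind}")
    return synced


# The aggregation's copy before the provider publishes a changeset.
aggregation = {"http://example.org/a": "v1", "http://example.org/b": "v1"}
changes = [
    Change("http://example.org/a", "updated", "v2"),
    Change("http://example.org/b", "deleted"),
    Change("http://example.org/c", "created", "v1"),
]
aggregation = apply_changes(aggregation, changes)
```

Without a changeset, the aggregation would have to re-harvest everything to discover that `b` had been deleted.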
11. The sun shone, having no alternative, on the nothing new.
Murphy, Samuel Beckett
12. A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable
Leslie Lamport
13. a common ‘anti-pattern’
• as a developer, I have no reason to trust that these APIs are any good.
• after all, the service provider doesn’t seem to trust them for their own application...
[Diagram: "some aggregated data of broad interest and potential usefulness", with the provider's own UI and end-user, and APIs offered to future 3rd-party devs with their own future UIs and end-users. Legend: certainty, belief, speculation]
14. a better pattern
• As a developer, I’m more likely to trust this pattern.
• the content provider is using their own API to deliver their own application.
• they have a vested interest!
[Diagram: "some aggregated data of broad interest and potential usefulness" behind a single API, used by a focussed app and a 3rd-party app, each with its own UI and end-user. Legend: certainty, belief, speculation]
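The better pattern can be sketched as a provider application and a 3rd-party application that both depend on the same API. Everything here (`CollectionAPI`, `search`, the sample records) is hypothetical and only illustrates the shape of the dependency.

```python
from typing import Dict, List


class CollectionAPI:
    """Hypothetical public API over the provider's aggregated data."""

    def __init__(self, records: List[Dict[str, str]]) -> None:
        self._records = records

    def search(self, term: str) -> List[Dict[str, str]]:
        """Case-insensitive title search."""
        term = term.lower()
        return [r for r in self._records if term in r["title"].lower()]


def provider_app(api: CollectionAPI, term: str) -> str:
    """The provider's own focussed application, built only on the public API."""
    titles = [r["title"] for r in api.search(term)]
    return "Results: " + ", ".join(titles)


def third_party_app(api: CollectionAPI, term: str) -> int:
    """A 3rd-party application using exactly the same API."""
    return len(api.search(term))


api = CollectionAPI([
    {"title": "Maps of London"},
    {"title": "Maps of Bath"},
    {"title": "Letters"},
])
```

Because `provider_app` has no private route to the data, any defect in `search` hurts the provider's own service first, which is the vested interest the slide refers to.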
15. APIs are not best thought of as machine-to-machine interfaces
APIs are interfaces for developers
16. messages from developers to content-providers
• These are from yesterday’s developer day held here at the BL in support of this summit:
• please don’t build elaborate APIs which do not allow us to see all of the data, or its extent. It’s not that we simply want to download all the data - but we do need to see what we’re dealing with
• if you give us access to incomplete data (perhaps because you’re worried about revealing poor data quality), then we will tend to either abandon our attempts to use it or we will ‘fill in the gaps’ with data from elsewhere. So offering an API which delivers incomplete data is usually self-defeating
• the implicit bargain, made explicit:
  • give us access to the data as soon as possible and we will do some of the work to process it so that it is fit for some new purpose - and we will happily share this code with you
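One way an API can let developers "see the extent" of the data is to report the total size alongside every page of results. The following is a minimal sketch under that assumption; `list_records`, `fetch_all` and the response field names are hypothetical.

```python
from typing import Any, Dict, List


def list_records(records: List[str], offset: int = 0, limit: int = 2) -> Dict[str, Any]:
    """Return one page of records plus the overall extent of the data."""
    page = records[offset:offset + limit]
    has_more = offset + limit < len(records)
    return {
        "total": len(records),  # lets the developer see what they are dealing with
        "offset": offset,
        "items": page,
        "next_offset": offset + limit if has_more else None,
    }


def fetch_all(records: List[str]) -> List[str]:
    """Client side: walk every page until the API reports there is no more."""
    items: List[str] = []
    offset = 0
    while offset is not None:
        page = list_records(records, offset)
        items.extend(page["items"])
        offset = page["next_offset"]
    return items


data = ["a", "b", "c", "d", "e"]
first_page = list_records(data)
```

A developer can read `total` from the first response to judge the size of the dataset without downloading all of it.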
17. Questions for the parallel sessions
1. Which emerging technologies do we need to focus on in 2013?
2. Do we still need to aggregate?
3. What does data quality stop us doing?
18. Which emerging technologies do we need to focus on in 2013?
• Graphs: context, not content, is king
• both Facebook and Google are betting heavily on graph technologies
• closer to home - so are content providers like the BBC
• linking these is an interesting challenge
• databases based on a graph model give the potential for a richer understanding about entities (users!)
• instrumentation in personal devices makes more context available (e.g. geo-location).
21. Do we still need to aggregate?
yes.
• to address systems/network latency - provide a cache
• to showcase!
• for ‘Web Scale concentration’
  • network effects if user-facing services are also developed
• to create middleman business opportunities
• as infrastructure to support locally developed services
• as an approach to preservation
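The first reason above, an aggregation acting as a cache against network latency, can be sketched as a simple read-through cache. The class name, the `fetch` callback and the five-minute time-to-live are assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class AggregationCache:
    """Read-through cache: serve a local copy, fetching from the source on a miss."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 300.0) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, uri: str) -> str:
        now = time.monotonic()
        cached = self._store.get(uri)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh local copy, no network round trip
        value = self._fetch(uri)  # miss or expired copy: go back to the source
        self._store[uri] = (now, value)
        return value


calls: List[str] = []


def slow_remote_fetch(uri: str) -> str:
    """Stand-in for a slow request to a remote repository."""
    calls.append(uri)
    return f"record for {uri}"


cache = AggregationCache(slow_remote_fetch)
first = cache.get("http://example.org/item/1")
second = cache.get("http://example.org/item/1")  # served from the cache
```

The time-to-live is the crude part: until it expires, the cache cannot know that the source has changed or deleted the resource, which is the synchronisation problem raised earlier in the talk.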
22. What does data quality stop us doing?
• interpreted as: “what does a concern for data quality stop us doing?”
  • it stops us from releasing data early
• interpreted as: “what does poor/uncertain data quality stop us doing?”
  • it erodes trust, which impacts the likelihood of someone doing something worthwhile with our data
• reconciling these concerns is a major challenge for us.
23. thank you!
Paul Walk
paul@paulwalk.net
@paulwalk
http://www.paulwalk.net