Open science is driving active efforts to make research data available for broader use. But data often carry restrictions (privacy or sensitivity concerns, whether regulated by statute or otherwise) that limit how broadly they can be shared. In this talk we argue that the spectrum of data sharing options includes alternative approaches that offer more control over data than full sharing yet contribute more than no sharing at all. We offer the controlled compute environment, or capsule, as a viable new approach for computational analysis of restricted data. The capsule widens the range of possibilities for facilitating science through data reuse, an objective of open science. This talk frames the capsule and reports experience with one such capsule used in HathiTrust for research with copyrighted materials.
1. Capsule computing: safe open science
Beth Plale
Professor, Indiana University Bloomington
On loan to National Science Foundation
*Opinions expressed herein are those of the author alone and do not represent
the views of the National Science Foundation
Binghamton University December 04, 2018
3. • Data has value beyond the use for which
it is originally collected
• Open science is a broad-based global
effort to make data emerging from
research available for wider use
• Open science is thus acknowledgement
of inherent value of scientific data
independent of published scientific or
scholarly outcome
4. Open Science encourages from researcher:
• More thoughtful research processes;
• Thought to uses of data and code beyond original intent
• Attention to reproducibility / replicability of work
The Upturned Microscope by Nik Papageorgiou is licensed under CC BY NC ND
5. Tension of Open Science
Much data resulting from externally funded
research can be made open, but some data
simply cannot nor will ever be completely and
freely open
Data should be made open - open access, open
use, open license and perhaps made open by
default, but there are important cases where
controls on the data must remain
6. More options needed for
restricted data reuse on
spectrum between
completely open (Open
Access) and completely
hidden
7. Possible way forward suggested by
principle: Open as possible, closed as
necessary*
Principle articulated in "Guidelines on FAIR Data Management in
Horizon 2020", EU Horizon 2020 programme
8. Forms of data availability on spectrum
between pure open access and fully
hidden
9. Capsule framework
The controlled compute environment, or capsule
framework, is a viable approach to accessing and
sharing restricted data: it
satisfies the goal of sharing while protecting data from
unintended use or use prohibited by law
10. Capsule framework
Implemented through combination of
policy, processes, and software services,
to protect the data and
make the software infrastructure as easy to use
as possible.
plale@Indiana.edu
12. Trust Model
Threat model: high-level articulation of tradeoffs during
design of the system. Not an implementation guide.
Policy decisions influenced by situation of use:
restrictions on the data;
assumptions of use; and
limits of software services.
Our major tradeoff: how much trust must you place with
the user versus a locked down (and relatively unusable)
system
14. HT Mission and Purpose
To contribute to research, scholarship, and the
common good by collaboratively collecting, organizing,
preserving, communicating, and sharing the record of
human knowledge.
• A trusted digital preservation service enabling the broadest
possible access worldwide.
• An organization with over 100 research libraries partnering to
develop its programs.
• A range of transformative programs enabled by working at a
very large scale.
15. Current Major Cooperative Initiatives
• Distributed manual copyright reviews.
• Establishing a distributed shared print
monograph archive.
• Expanding and enhancing access to US Federal
Government Documents.
• Expanding services of the HathiTrust Research
Center.
16. Scale of the HathiTrust Collection
• 16,639,076 total volumes
– 8,075,459 book titles
– 446,580 serial titles
– 5,823,676,600 pages
– 746 terabytes
• 6,256,362 open volumes (~38% of total)
Collection includes (mostly) published materials
in bound form, digitized from research and
academic library collections.
17. Example use: how influenced is a writer by
time spent at Iowa Writers Workshop?
• Assemble corpus of works by authors affiliated with
renowned Iowa Writers Workshop
• Perform analysis to determine whether a Workshop
style exists and what the characteristics of such a
style might be.
• Collect metrics such as
vocabulary size,
sentence length, or
even frequency of
male and female
pronouns
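The metrics named above can be sketched in a few lines of Python. This is a minimal illustration only, not the analysis used in the Iowa Writers Workshop study: the function name `text_metrics`, the pronoun lists, and the naive sentence splitter are all assumptions made for the example. Note that the output is a set of aggregate counts, which is the kind of result a non-consumptive analysis would produce.

```python
import re
from collections import Counter

# Illustrative pronoun lists; the slides do not say which forms were counted.
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}


def text_metrics(text: str) -> dict:
    """Compute simple aggregate style metrics for one text (hypothetical helper)."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased word tokens; apostrophes are kept inside words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "vocabulary_size": len(counts),
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "male_pronouns": sum(counts[w] for w in MALE_PRONOUNS),
        "female_pronouns": sum(counts[w] for w in FEMALE_PRONOUNS),
    }


sample = "She wrote every morning. He read her drafts. They argued about his edits!"
metrics = text_metrics(sample)
print(metrics)
```

Run over a whole corpus, the same function would be applied per volume and only the resulting numbers carried forward for comparison.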
18. Controlled Compute Environment
(or remote secure enclave) provides
researchers with remote analytical
access to a data collection that has
restrictions (legal, privacy) on use,
and because of its size, requires
compute to come to the data and
not vice versa
Beth Plale, Inna Kouper, Samitha Liayanage, Yu Ma, Robert McDonald, and John Walsh,
Capsule Computing: Safe Open Science, under review, 2019.
19. Capsule Framework, as a controlled
compute environment, is
implemented through policy,
processes, and software services
working together to protect the data
while making the software
infrastructure as easy to use as
possible.
DataPASS
20. Policies in place for HathiTrust
Human facing
• Non-consumptive Use Research Policy
• Terms of Use
Infrastructure facing
• HathiTrust Rights Database
• Trust (threat) model
Export review
• Human review of results exported from
Capsule
[Figure: human-facing policy and infrastructure-facing “service agreement”, mutually reinforcing]
21. Overriding policy is that of Non-consumptive
Use Research Policy
Research in which computational analysis is
performed on one or more volumes but not
research in which researcher reads or displays
substantial portions of an in-copyright or
rights-restricted volume to understand expressive content
presented within that volume.
Examples: text extraction, automated translation, image
analysis, file manipulation, OCR correction, and indexing and
search.
https://www.hathitrust.org/htrc_ncup
22. Terms of Use
Agreement between HTRC/HT and individual intending to
use HTRC Data Capsule service. Top 4 terms:
1. Read and comply with Non-Consumptive Use Research
Policy.
2. Use their Capsule for non-consumptive research
purposes only as defined in Section 1 of the policy.
3. Prior to first use, User submits form indicating
intended use and expected forms of outputs.
4. By using HTRC Data Capsule service, User
acknowledges that information about their activities
while in Capsule may be reviewed in manner consistent
with HathiTrust privacy policy.
https://www.hathitrust.org/htrc_dc_tou
23. Rights Database
● Database for storing and tracking rights
information for each digitized volume in HathiTrust
● At core of system is algorithm that considers a)
copyright status and/or explicit access controls
associated with the volume, b) volume's digitizing
agent (e.g., Google or the University of Chicago),
and c) identity of user (if known) in order to
determine access rights.
● How used: demo capsule uses only public domain
content.
https://www.hathitrust.org/rights_database
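The three inputs listed above (copyright status, digitizing agent, user identity) can be combined in a decision function along the following lines. This is a hedged sketch, not the HathiTrust rights algorithm: the status values, the access levels, the `AGENTS_REQUIRING_KNOWN_USER` set, and every rule in `access_level` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Invented for illustration: agents whose public-domain scans
# need a known user before download is allowed.
AGENTS_REQUIRING_KNOWN_USER = {"example_agent"}


@dataclass(frozen=True)
class Volume:
    volume_id: str
    copyright_status: str  # hypothetical values: "public_domain", "in_copyright"
    digitizing_agent: str  # hypothetical identifier for who scanned the volume


@dataclass(frozen=True)
class User:
    user_id: str
    signed_terms_of_use: bool


def access_level(volume: Volume, user: Optional[User]) -> str:
    """Return a hypothetical level: 'download', 'view', 'compute_only' or 'none'."""
    if volume.copyright_status == "public_domain":
        # (b) digitizing agent and (c) user identity refine access to open volumes.
        if volume.digitizing_agent in AGENTS_REQUIRING_KNOWN_USER and user is None:
            return "view"
        return "download"
    # (a) restricted volumes: computational access only, and only for known
    # users who have signed the terms of use.
    if user is not None and user.signed_terms_of_use:
        return "compute_only"
    return "none"


analyst = User("analyst-1", signed_terms_of_use=True)
restricted = Volume("vol-3", "in_copyright", "other_agent")
print(access_level(restricted, analyst))
```

Keeping the decision in one pure function makes each rule easy to audit against the written policy.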
24. Threat model
● Threat model: structured representation of all
information that affects security of an
application.
● Two most relevant clauses:
○ Analysts are themselves considered to act in good
faith, but this does not preclude possibility of them
unwittingly allowing system to be compromised.
■ Reasonable assumption and motivates why analysts are
required to sign use agreement.
25. Capsule Framework
k*N user VMs running in back end layer; managed by a hypervisor. All
software implementing Capsule framework is open source.
26. Mode one: Maintenance mode
Access to
Internet
permitted;
Channel to
restricted
data closed
HT DL
27. Mode two: Secure mode
Access to
Internet
denied;
Channel to
restricted
data open
HT DL
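The two modes can be read as a small policy table in which the Internet channel and the restricted-data channel are never open at the same time. The sketch below illustrates that invariant only; the class and attribute names (`Mode`, `ChannelPolicy`, `Capsule`) are hypothetical and are not taken from the Capsule software.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"


@dataclass(frozen=True)
class ChannelPolicy:
    internet_allowed: bool
    restricted_data_open: bool


# One policy per mode, as described on the two slides above.
POLICY = {
    Mode.MAINTENANCE: ChannelPolicy(internet_allowed=True, restricted_data_open=False),
    Mode.SECURE: ChannelPolicy(internet_allowed=False, restricted_data_open=True),
}


class Capsule:
    """Hypothetical stand-in for one analyst VM and its current mode."""

    def __init__(self) -> None:
        self.mode = Mode.MAINTENANCE  # assumed starting mode for this sketch

    def switch_to(self, mode: Mode) -> None:
        self.mode = mode

    def can_reach_internet(self) -> bool:
        return POLICY[self.mode].internet_allowed

    def can_read_restricted_data(self) -> bool:
        return POLICY[self.mode].restricted_data_open


capsule = Capsule()
capsule.switch_to(Mode.SECURE)
print(capsule.can_reach_internet(), capsule.can_read_restricted_data())
```

In a real deployment the enforcement would sit outside the VM, in the host's network and data access services, rather than in code the analyst can modify.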
28. Threat Model implementation in 7 easy
steps
● The threat model for the Capsule framework
implementation in HathiTrust is built on the
assumed existence of a Trusted Computing
Base (TCB), in which the totality of
security mechanisms within a secure system
resides
● Threat model implementation (7 statements)
29. Threat Model implementation
1. An analyst accesses restricted data through
remotely accessed VMs that read data from a
network-accessed data service.
2. The VM that is given to the analyst for use is not
part of the TCB. The remaining support is within the
TCB: the Virtual Machine Manager (VMM), the host
that the VMM runs on, and the system services that
enforce network and data access policies for the
virtual machines. Data storage is included within the
TCB.
30. Threat Model implementation
3. Users may inadvertently install malware; there may be
other remotely initiated attacks on the VM. These attacks
could potentially compromise the entire operating system
and install a rootkit, both of which are undetectable to the
end user.
Analysts are themselves considered to act in good faith,
but this does not preclude the possibility of them
unwittingly allowing the system to be compromised.
Analysts are required to sign a use agreement before using
the system. Results are reviewed before being made available to
the user for download.
31. Threat Model implementation
4. Users have VNC access to their virtual machines in non-secure
mode to give them a desktop interface to the machine. They also
have SSH access in non-secure mode so that they can upload data
sets and install software more easily. However, VNC access
represents a channel for potential data leak; through use of a use
agreement and profile, HT is comfortable that the analyst is acting
in good faith. An analyst must refrain from sharing their virtual
machine.
5. A potential threat is that of covert channels between virtual
machines that run on the same host machine. A solution requires
using two physically separated systems, one that only runs VMs in
secure mode and another that runs VMs only in maintenance
mode. HT currently performs routine host port scanning.
32. Threat Model implementation
6. Analysts have complete freedom to access the Internet
from their Capsule to upload/download material while in
non-secure mode. After switching to secure mode, the
analyst has direct access to the restricted materials.
While in secure mode, Internet access is prohibited, as is
copying from the Capsule to the desktop.
7. The analyst’s state is retained in a Capsule across
sessions of work, but when an analyst completes her work
and wants to pull data out of the Capsule, she must store
results off to a special drive. The contents of this drive are
manually reviewed before results are made available to
the analyst.
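The export step in statement 7 can be pictured as a review queue: results written to the special drive stay pending until a human reviewer approves them. The following is a minimal sketch under that reading; `ExportDrive`, `ExportRequest`, and the method names are hypothetical and do not describe the actual HTRC services.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ExportRequest:
    analyst: str
    filename: str
    status: ReviewStatus = ReviewStatus.PENDING


@dataclass
class ExportDrive:
    """Hypothetical model of the special drive that holds results for review."""
    requests: List[ExportRequest] = field(default_factory=list)

    def submit(self, analyst: str, filename: str) -> ExportRequest:
        # Analyst stores a result; it is not yet downloadable.
        request = ExportRequest(analyst, filename)
        self.requests.append(request)
        return request

    def review(self, request: ExportRequest, approve: bool) -> None:
        # A human reviewer records the decision.
        request.status = ReviewStatus.APPROVED if approve else ReviewStatus.REJECTED

    def releasable(self, analyst: str) -> List[str]:
        # Only approved results are made available to the analyst.
        return [r.filename for r in self.requests
                if r.analyst == analyst and r.status is ReviewStatus.APPROVED]


drive = ExportDrive()
counts = drive.submit("analyst-1", "pronoun_counts.csv")
pages = drive.submit("analyst-1", "page_text_dump.txt")
drive.review(counts, approve=True)
drive.review(pages, approve=False)
print(drive.releasable("analyst-1"))
```

The default status is pending, so a result that no one has reviewed is never released.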
33. Policy / Infrastructure tradeoff
Human facing policies
• Non-consumptive Use Research Policy
• Terms of Use
Infrastructure facing
• HathiTrust Rights Database
• Trust (threat) model
Export review
• Human review of results exported from Capsule
Tradeoff: how much trust must you place with the
user versus a locked down (and relatively unusable)
system?
[Figure: human-facing policy and infrastructure-facing “service agreement”, mutually reinforcing]
34. Takeaways
• HT Capsule implementation heavily driven by
“non-consumptive research” and wide-open
research modes to the HathiTrust collection
– By Authors Guild et al. v HathiTrust et al., 11 Civ
6351 (S.D.N.Y Sep 12 2011), research must be
non-consumptive. That is, “no eyeballs on texts”
– Massive collection where no single set of tools
satisfies needs. Researcher needs freedom to
install own tools
35. Takeaways
• Recently released toolkit comes by default in
every VM. Helps connect to restricted data
and import other data resources (user’s
workset) into Capsule. Proven to reduce
programming burden.
• Running on physical servers at Indiana
University Bloomington.
36. Resources
• Non-Consumptive Use Research Policy
https://www.hathitrust.org/htrc_ncup
• HathiTrust Rights Database
https://www.hathitrust.org/rights_database
• Trust (threat) model (somewhat outdated)
– Plale, Beth; Prakash, Atul; McDonald, Robert (2015). The
Data Capsule for Non-Consumptive Research: Final Report.
Available from http://hdl.handle.net/2022/19277
• Terms of Use https://www.hathitrust.org/htrc_dc_tou
• HTRC Data Capsule accessible at
https://analytics.hathitrust.org
37. Please feel free to reach out to me for
more information
Beth Plale
plale@indiana.edu