Capsule computing: safe open science
Beth Plale
Professor, Indiana University Bloomington
On loan to National Science Foundation
*Opinions expressed herein are those of the author alone and do not represent
the views of the National Science Foundation
Binghamton University December 04, 2018
Data: a foundation of science
plale@Indiana.edu
Weather simulation
• Data has value beyond the use for which
it is originally collected
• Open science is a broad-based global
effort to make data emerging from
research available for wider use
• Open science is thus acknowledgement
of inherent value of scientific data
independent of published scientific or
scholarly outcome
Open science encourages researchers toward:
• More thoughtful research processes
• Thought to uses of data and code beyond the original intent
• Attention to reproducibility / replicability of work
The Upturned Microscope by Nik Papageorgiou is licensed under CC BY NC ND
Tension of Open Science
Much data resulting from externally funded research can be made open,
but some data cannot and never will be completely and freely open.
Data should be made open (open access, open use, open license), and
perhaps made open by default, but there are important cases where
controls on the data must remain.
More options needed for restricted data reuse on the spectrum between
completely open (Open Access) and completely hidden
Possible way forward suggested by
principle: Open as possible, closed as
necessary*
Principle articulated in "Guidelines on FAIR Data Management in
Horizon 2020", EU Horizon 2020 programme
Forms of data availability on spectrum
between pure open access and fully
hidden
Capsule framework
A controlled compute environment, the capsule framework, is a viable
approach to accessing and sharing restricted data: it satisfies the
need to share while protecting data from unintended use or use
prohibited by law
Capsule framework
Implemented through combination of
policy, processes, and software services,
to protect the data and
make the software infrastructure as easy to use
as possible.
Capsule technical design
Trust Model
Threat model: high-level articulation of the tradeoffs made during
design of the system. Not an implementation guide
Policy decisions influenced by situation of use:
restrictions on the data;
assumptions of use; and
limits of software services.
Our major tradeoff: how much trust must you place in the user
versus building a locked-down (and relatively unusable) system
The motivating need for capsule computing is the
HathiTrust (HT) shared digital repository
HT Mission and Purpose
To contribute to research, scholarship, and the
common good by collaboratively collecting, organizing,
preserving, communicating, and sharing the record of
human knowledge.
• A trusted digital preservation service enabling the broadest
possible access worldwide.
• An organization with over 100 research libraries partnering to
develop its programs.
• A range of transformative programs enabled by working at a
very large scale.
Current Major Cooperative Initiatives
• Distributed manual copyright reviews.
• Establishing a distributed shared print
monograph archive.
• Expanding and enhancing access to US Federal
Government Documents.
• Expanding services of the HathiTrust Research
Center.
Scale of the HathiTrust Collection
• 16,639,076 total volumes
– 8,075,459 book titles
– 446,580 serial titles
– 5,823,676,600 pages
– 746 terabytes
• 6,256,362 open volumes (~38% of total)
Collection includes (mostly) published materials
in bound form, digitized from research and
academic library collections.
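As a quick arithmetic check on the figures above, the short sketch below recomputes the open share and the average pages per volume. The variable names are illustrative only; the numbers are the ones quoted on the slide. The page total works out to exactly 350 pages per volume, which may indicate that the page count is an estimate rather than a measured total.

```python
# Figures quoted on the slide
total_volumes = 16_639_076
open_volumes = 6_256_362
total_pages = 5_823_676_600

# Share of the collection that is open
open_share = open_volumes / total_volumes

# Average pages per volume implied by the quoted totals
pages_per_volume = total_pages / total_volumes

print(f"open share: {open_share:.1%}")
print(f"pages per volume: {pages_per_volume:.0f}")
```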
Example use: how influenced is a writer by
time spent at Iowa Writers Workshop?
• Assemble corpus of works by authors affiliated with
renowned Iowa Writers Workshop
• Perform analysis to determine whether a Workshop
style exists and what the characteristics of such a
style might be.
• Collect metrics such as
vocabulary size,
sentence length, or
even frequency of
male and female
pronouns
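The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used by any HathiTrust researcher: the function name `text_metrics`, the pronoun lists, and the naive sentence splitting are all assumptions made for the example. The point is that the function returns only aggregate counts, never passages of the text.

```python
import re
from collections import Counter

# Assumed pronoun lists; a real study would choose these deliberately
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def text_metrics(text):
    """Return aggregate statistics only; no passage of the text is echoed back."""
    # Naive sentence split on terminal punctuation (an assumption of this sketch)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "vocabulary_size": len(counts),
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "male_pronouns": sum(counts[w] for w in MALE),
        "female_pronouns": sum(counts[w] for w in FEMALE),
    }

sample = "She wrote daily. He read her drafts. They argued about his edits!"
metrics = text_metrics(sample)
print(metrics)
```

Because the output is a handful of numbers per corpus, results of this kind are the sort of derived data that can plausibly leave a restricted environment after review.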
Controlled Compute Environment
(or remote secure enclave) provides
researchers with remote analytical
access to a data collection that has
restrictions (legal, privacy) on use,
and because of its size, requires
compute to come to the data and
not vice versa
Beth Plale, Inna Kouper, Samitha Liyanage, Yu Ma, Robert McDonald, and John Walsh,
Capsule Computing: Safe Open Science, under review, 2019.
Capsule Framework, as a controlled
compute environment, is
implemented through policy,
processes, and software services
working together to protect the data
while making the software
infrastructure as easy to use as
possible.
DataPASS
Policies in place for HathiTrust
Human facing
• Non-consumptive Use Research Policy
• Terms of Use
Infrastructure facing
• HathiTrust Rights Database
• Trust (threat) model
Export review
• Human review of results exported from
Capsule
Human-facing policy and infrastructure-facing “service agreement”: mutually reinforcing
Overriding policy is the Non-consumptive Use Research Policy
Non-consumptive research: research in which computational analysis is
performed on one or more volumes, but not research in which a
researcher reads or displays substantial portions of an in-copyright or
rights-restricted volume to understand the expressive content
presented within that volume.
Examples: text extraction, automated translation, image
analysis, file manipulation, OCR correction, and indexing and
search.
https://www.hathitrust.org/htrc_ncup
Terms of Use
Agreement between HTRC/HT and individual intending to
use HTRC Data Capsule service. Top 4 terms:
1. Read and comply with Non-Consumptive Use Research
Policy.
2. Use their Capsule for non-consumptive research
purposes only as defined in Section 1 of the policy.
3. Prior to first use, User submits form indicating
intended use and expected forms of outputs.
4. By using HTRC Data Capsule service, User
acknowledges that information about their activities
while in Capsule may be reviewed in manner consistent
with HathiTrust privacy policy.
https://www.hathitrust.org/htrc_dc_tou
Rights Database
● Database for storing and tracking rights
information for each digitized volume in HathiTrust
● At core of system is algorithm that considers a)
copyright status and/or explicit access controls
associated with the volume, b) volume's digitizing
agent (e.g., Google or the University of Chicago),
and c) identity of user (if known) in order to
determine access rights.
● How used: demo capsule uses only public domain
content.
https://www.hathitrust.org/rights_database
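The three inputs named above (copyright status or explicit access controls, digitizing agent, and user identity) can be combined as in the sketch below. This is not the HathiTrust Rights Database algorithm: the status codes, the agent name `agent_a`, the rule ordering, and the returned labels are all invented here to show the shape of such a decision.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical: agents assumed (for this sketch only) to limit anonymous use
RESTRICTED_AGENTS = {"agent_a"}

@dataclass(frozen=True)
class Volume:
    volume_id: str
    copyright_status: str                  # assumed codes: "public_domain", "in_copyright"
    digitizing_agent: str
    access_override: Optional[str] = None  # explicit access control, if any

@dataclass(frozen=True)
class User:
    user_id: str
    signed_terms_of_use: bool

def access_rights(volume: Volume, user: Optional[User] = None) -> str:
    """Return an access label from (a) status/controls, (b) agent, (c) user."""
    # (a) an explicit access control takes precedence over copyright status
    if volume.access_override is not None:
        return volume.access_override
    if volume.copyright_status == "public_domain":
        # (b) the digitizing agent may narrow access for unidentified users
        if volume.digitizing_agent in RESTRICTED_AGENTS and user is None:
            return "read_online_only"
        return "open"
    # (c) in-copyright volumes: only known users who accepted the terms
    if user is not None and user.signed_terms_of_use:
        return "non_consumptive_only"
    return "no_access"

pd_vol = Volume("v1", "public_domain", "agent_b")
ic_vol = Volume("v2", "in_copyright", "agent_a")
analyst = User("u1", signed_terms_of_use=True)
print(access_rights(pd_vol), access_rights(ic_vol, analyst), access_rights(ic_vol))
```

Keeping the decision in one pure function makes it straightforward to audit each rule against the written policy.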
Threat model
● Threat model: structured representation of all
information that affects security of an
application.
● Two most relevant clauses:
○ Analysts are themselves considered to act in good
faith, but this does not preclude possibility of them
unwittingly allowing system to be compromised.
■ Reasonable assumption and motivates why analysts are
required to sign use agreement.
Capsule Framework
k*N user VMs running in back end layer; managed by a hypervisor. All
software implementing Capsule framework is open source.
Mode one: Maintenance mode
Access to Internet permitted; channel to restricted data (HT DL) closed
Mode two: Secure mode
Access to Internet denied; channel to restricted data (HT DL) open
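The two modes amount to a small state machine in which Internet access and the restricted-data channel are never open at the same time. The sketch below illustrates only that invariant; the class and property names are assumptions, and the real enforcement happens in the hypervisor and network services, not in user-level code like this.

```python
from enum import Enum

class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"

class Capsule:
    """Toy model of a capsule VM's mode; names are illustrative only."""

    def __init__(self):
        # Assumed starting mode for this sketch
        self.mode = Mode.MAINTENANCE

    @property
    def internet_allowed(self):
        # Internet is reachable only in maintenance mode
        return self.mode is Mode.MAINTENANCE

    @property
    def restricted_channel_open(self):
        # Restricted data is reachable only in secure mode
        return self.mode is Mode.SECURE

    def switch(self, mode):
        self.mode = mode

capsule = Capsule()
states = [(capsule.internet_allowed, capsule.restricted_channel_open)]
capsule.switch(Mode.SECURE)
states.append((capsule.internet_allowed, capsule.restricted_channel_open))
print(states)
```

Deriving both permissions from a single mode field, rather than storing two independent flags, means no sequence of calls can leave both channels open.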
Threat Model implementation in 7 easy
steps
● The threat model for the Capsule framework
implementation in HathiTrust is built on the
assumed existence of a Trusted Computing
Base (TCB), in which the totality of
security mechanisms within a secure system
resides
● Threat model implementation (8 statements)
Threat Model implementation
1. An analyst accesses restricted data through
remotely accessed VMs that read data from a
network-accessed data service.
2. The VM that is given to the analyst for use is not
part of the TCB. The remaining support is within the
TCB: the Virtual Machine Manager (VMM), the host
that the VMM runs on, and the system services that
enforce network and data access policies for the
virtual machines. Data storage is included within the
TCB.
Threat Model implementation
3. Users may inadvertently install malware; there may be
other remotely initiated attacks on the VM. These attacks
could potentially compromise the entire operating system
and install a rootkit, both of which are undetectable to the
end user.
Analysts are themselves considered to act in good faith,
but this does not preclude the possibility of them
unwittingly allowing the system to be compromised.
Analysts are required to sign a use agreement before using
the system. Results are reviewed before being made available to
the user for download.
Threat Model implementation
4. Users have VNC access to their virtual machines in non-secure
mode to give them a desktop interface to the machine. They also
have SSH access in non-secure mode so that they can upload data
sets and install software more easily. However, VNC access
represents a channel for potential data leak; through use of a use
agreement and profile, HT is comfortable that the analyst is acting
in good faith. An analyst must refrain from sharing their virtual
machine.
5. A potential threat is that of covert channels between virtual
machines that run on the same host machine. A solution requires
using two physically separated systems, one that only runs VMs in
secure mode and another that runs VMs only in maintenance
mode. HT currently performs routine host port scanning.
Threat Model implementation
6. Analysts have complete freedom to access the Internet
from their Capsule to upload/download material while in
non-secure mode. After switching to secure mode, the
analyst has direct access to the restricted materials.
While in secure mode, Internet access is prohibited, as is
copying from the Capsule to the desktop.
7. The analyst’s state is retained in a Capsule across
sessions of work, but when an analyst completes her work
and wants to pull data out of the Capsule, she must save
results to a special drive. The contents of this drive are
manually reviewed before results are made available to
the analyst.
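The export step in item 7 can be pictured as a review queue: results start as pending and become downloadable only after a human reviewer approves them. The sketch below is a hypothetical illustration; the class names, status values, and file names are invented and do not describe the HTRC implementation.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class ExportRequest:
    analyst: str
    filename: str
    status: Status = Status.PENDING

class ReviewQueue:
    """Holds results saved to the export drive until a human decides."""

    def __init__(self):
        self._requests = []

    def submit(self, analyst, filename):
        # Every submission starts as pending; nothing is released automatically
        request = ExportRequest(analyst, filename)
        self._requests.append(request)
        return request

    def decide(self, request, approve):
        # Records the human reviewer's decision
        request.status = Status.APPROVED if approve else Status.REJECTED

    def downloadable(self, analyst):
        # Only approved results are visible to the analyst for download
        return [r.filename for r in self._requests
                if r.analyst == analyst and r.status is Status.APPROVED]

queue = ReviewQueue()
counts = queue.submit("analyst1", "word_counts.csv")
pages = queue.submit("analyst1", "full_pages.txt")
before_review = queue.downloadable("analyst1")
queue.decide(counts, approve=True)
queue.decide(pages, approve=False)
after_review = queue.downloadable("analyst1")
print(before_review, after_review)
```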
Policy / Infrastructure tradeoff
Human facing policies
• Non-consumptive Use Research Policy
• Terms of Use
Infrastructure facing
• HathiTrust Rights Database
• Trust (threat) model
Export review
• Human review of results exported from Capsule
Tradeoff: how much trust must you place in the
user versus building a locked-down (and relatively
unusable) system?
Human-facing policy and infrastructure-facing “service agreement”: mutually reinforcing
Takeaways
• HT Capsule implementation heavily driven by
“non-consumptive research” and wide-open
research modes to the HathiTrust collection
– By Authors Guild et al. v HathiTrust et al., 11 Civ
6351 (S.D.N.Y Sep 12 2011), research must be
non-consumptive. That is, “no eyeballs on texts”
– Massive collection where no single set of tools
satisfies needs. Researcher needs freedom to
install own tools
Takeaways
• Recently released toolkit comes by default in
every VM. Helps connect to restricted data
and import other data resources (user’s
workset) into Capsule. Proven to reduce
programming burden.
• Running on physical servers at Indiana
University Bloomington.
Resources
• Non-Consumptive Use Research Policy
https://www.hathitrust.org/htrc_ncup
• HathiTrust Rights Database
https://www.hathitrust.org/rights_database
• Trust (threat) model (somewhat outdated)
– Plale, Beth; Prakash, Atul; McDonald, Robert (2015). The
Data Capsule for Non-Consumptive Research: Final Report.
Available from http://hdl.handle.net/2022/19277
• Terms of Use https://www.hathitrust.org/htrc_dc_tou
• HTRC Data Capsule accessible at
https://analytics.hathitrust.org
Please feel free to reach out to me for
more information
Beth Plale
plale@indiana.edu

Mais conteúdo relacionado

Mais procurados

What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?Robert Grossman
 
Trust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceTrust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceBeth Plale
 
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...ICPSR
 
AD_LABX_BRO_19Nov2014__1_
AD_LABX_BRO_19Nov2014__1_AD_LABX_BRO_19Nov2014__1_
AD_LABX_BRO_19Nov2014__1_Leonard Cibelli
 
Providing support and services for researchers in good data governance
Providing support and services for researchers in good data governanceProviding support and services for researchers in good data governance
Providing support and services for researchers in good data governanceRobin Rice
 
20160523 23 Research Data Things
20160523 23 Research Data Things20160523 23 Research Data Things
20160523 23 Research Data ThingsKatina Toufexis
 
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinaiDataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinaidatascienceiqss
 
20160719 23 Research Data Things
20160719 23 Research Data Things20160719 23 Research Data Things
20160719 23 Research Data ThingsKatina Toufexis
 
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSBROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSMicah Altman
 
Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...
Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...
Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...ICPSR
 
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCESBROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCESMicah Altman
 
2012 Fall Data Management Planning Workshop
2012 Fall Data Management Planning Workshop2012 Fall Data Management Planning Workshop
2012 Fall Data Management Planning WorkshopLizzy_Rolando
 
Trust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADTrust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADBeth Plale
 
The Future of Open Science
The Future of Open ScienceThe Future of Open Science
The Future of Open SciencePhilip Bourne
 
Privacy Preserving Data Mining
Privacy Preserving Data MiningPrivacy Preserving Data Mining
Privacy Preserving Data MiningVrushali Malvadkar
 
A Successful Academic Medical Center Must be a Truly Digital Enterprise
A Successful Academic Medical Center Must be a Truly Digital EnterpriseA Successful Academic Medical Center Must be a Truly Digital Enterprise
A Successful Academic Medical Center Must be a Truly Digital EnterprisePhilip Bourne
 
Data Citation Implementation Guidelines By Tim Clark
Data Citation Implementation Guidelines By Tim ClarkData Citation Implementation Guidelines By Tim Clark
Data Citation Implementation Guidelines By Tim Clarkdatascienceiqss
 
Data Sharing & Data Citation
Data Sharing & Data CitationData Sharing & Data Citation
Data Sharing & Data CitationMicah Altman
 

Mais procurados (20)

What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
 
Trust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceTrust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail Science
 
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
 
AD_LABX_BRO_19Nov2014__1_
AD_LABX_BRO_19Nov2014__1_AD_LABX_BRO_19Nov2014__1_
AD_LABX_BRO_19Nov2014__1_
 
Providing support and services for researchers in good data governance
Providing support and services for researchers in good data governanceProviding support and services for researchers in good data governance
Providing support and services for researchers in good data governance
 
20160523 23 Research Data Things
20160523 23 Research Data Things20160523 23 Research Data Things
20160523 23 Research Data Things
 
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinaiDataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
 
20160719 23 Research Data Things
20160719 23 Research Data Things20160719 23 Research Data Things
20160719 23 Research Data Things
 
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSBROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
 
Levine - Data Curation; Ethics and Legal Considerations
Levine - Data Curation; Ethics and Legal ConsiderationsLevine - Data Curation; Ethics and Legal Considerations
Levine - Data Curation; Ethics and Legal Considerations
 
Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...
Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...
Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...
 
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCESBROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
 
2012 Fall Data Management Planning Workshop
2012 Fall Data Management Planning Workshop2012 Fall Data Management Planning Workshop
2012 Fall Data Management Planning Workshop
 
Trust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADTrust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEAD
 
Digital Curation 101 - Taster
Digital Curation 101 - TasterDigital Curation 101 - Taster
Digital Curation 101 - Taster
 
The Future of Open Science
The Future of Open ScienceThe Future of Open Science
The Future of Open Science
 
Privacy Preserving Data Mining
Privacy Preserving Data MiningPrivacy Preserving Data Mining
Privacy Preserving Data Mining
 
A Successful Academic Medical Center Must be a Truly Digital Enterprise
A Successful Academic Medical Center Must be a Truly Digital EnterpriseA Successful Academic Medical Center Must be a Truly Digital Enterprise
A Successful Academic Medical Center Must be a Truly Digital Enterprise
 
Data Citation Implementation Guidelines By Tim Clark
Data Citation Implementation Guidelines By Tim ClarkData Citation Implementation Guidelines By Tim Clark
Data Citation Implementation Guidelines By Tim Clark
 
Data Sharing & Data Citation
Data Sharing & Data CitationData Sharing & Data Citation
Data Sharing & Data Citation
 

Semelhante a Capsule Computing: Safe Open Science

public truthfulness assessment for shared active cloud data storage with grou...
public truthfulness assessment for shared active cloud data storage with grou...public truthfulness assessment for shared active cloud data storage with grou...
public truthfulness assessment for shared active cloud data storage with grou...Ijripublishers Ijri
 
Data Management and Horizon 2020
Data Management and Horizon 2020Data Management and Horizon 2020
Data Management and Horizon 2020Sarah Jones
 
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus PosterNIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus PosterGlobus
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data HadoopApache Apex
 
A Distributed Architecture for Sharing Ecological Data Sets with Access and U...
A Distributed Architecture for Sharing Ecological Data Sets with Access and U...A Distributed Architecture for Sharing Ecological Data Sets with Access and U...
A Distributed Architecture for Sharing Ecological Data Sets with Access and U...Javier González
 
eCitizen Sensible-Data Design Challenge
eCitizen Sensible-Data Design ChallengeeCitizen Sensible-Data Design Challenge
eCitizen Sensible-Data Design Challengehopbeat
 
Architecture and Standards
Architecture and StandardsArchitecture and Standards
Architecture and StandardsARDC
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
 
Access and Secure Storage Based Block Chain Scheme with IPFS Implemented in E...
Access and Secure Storage Based Block Chain Scheme with IPFS Implemented in E...Access and Secure Storage Based Block Chain Scheme with IPFS Implemented in E...
Access and Secure Storage Based Block Chain Scheme with IPFS Implemented in E...NAUMAN MUSHTAQ
 
Nicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchNicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchGigaScience, BGI Hong Kong
 
Preventing Mirror Problem And Privacy Issues In Multistorage Area With Dimens...
Preventing Mirror Problem And Privacy Issues In Multistorage Area With Dimens...Preventing Mirror Problem And Privacy Issues In Multistorage Area With Dimens...
Preventing Mirror Problem And Privacy Issues In Multistorage Area With Dimens...IJSRED
 
Whitepaper- User Behavior-Based Anomaly Detection for Cyber Network Security
Whitepaper- User Behavior-Based Anomaly Detection for Cyber Network SecurityWhitepaper- User Behavior-Based Anomaly Detection for Cyber Network Security
Whitepaper- User Behavior-Based Anomaly Detection for Cyber Network SecurityHappiest Minds Technologies
 
Ieeepro techno solutions 2011 ieee dotnet project -secure role based data
Ieeepro techno solutions   2011 ieee dotnet project -secure role based dataIeeepro techno solutions   2011 ieee dotnet project -secure role based data
Ieeepro techno solutions 2011 ieee dotnet project -secure role based dataASAITHAMBIRAJAA
 
Ieeepro techno solutions 2011 ieee java project -secure role based data
Ieeepro techno solutions   2011 ieee java project -secure role based dataIeeepro techno solutions   2011 ieee java project -secure role based data
Ieeepro techno solutions 2011 ieee java project -secure role based datahemanthbbc
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
 
Exploring New Methods for Protecting and Distributing Confidential Research ...
Exploring New Methods for Protecting and Distributing Confidential Research ...Exploring New Methods for Protecting and Distributing Confidential Research ...
Exploring New Methods for Protecting and Distributing Confidential Research ...Bryan Beecher
 

Semelhante a Capsule Computing: Safe Open Science (20)

public truthfulness assessment for shared active cloud data storage with grou...
public truthfulness assessment for shared active cloud data storage with grou...public truthfulness assessment for shared active cloud data storage with grou...
public truthfulness assessment for shared active cloud data storage with grou...
 
Shadow Data Exposed
Shadow Data ExposedShadow Data Exposed
Shadow Data Exposed
 
Data Management and Horizon 2020
Data Management and Horizon 2020Data Management and Horizon 2020
Data Management and Horizon 2020
 
Intro to RDM
Intro to RDMIntro to RDM
Intro to RDM
 
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus PosterNIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
A Distributed Architecture for Sharing Ecological Data Sets with Access and U...
A Distributed Architecture for Sharing Ecological Data Sets with Access and U...A Distributed Architecture for Sharing Ecological Data Sets with Access and U...
A Distributed Architecture for Sharing Ecological Data Sets with Access and U...
 
eCitizen Sensible-Data Design Challenge
eCitizen Sensible-Data Design ChallengeeCitizen Sensible-Data Design Challenge
eCitizen Sensible-Data Design Challenge
 
Architecture and Standards
Architecture and StandardsArchitecture and Standards
Architecture and Standards
 
BLC & Digital Science: Mark Hahnel, Figshare
BLC & Digital Science: Mark Hahnel, FigshareBLC & Digital Science: Mark Hahnel, Figshare
BLC & Digital Science: Mark Hahnel, Figshare
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
Access and Secure Storage Based Block Chain Scheme with IPFS Implemented in E...
Access and Secure Storage Based Block Chain Scheme with IPFS Implemented in E...Access and Secure Storage Based Block Chain Scheme with IPFS Implemented in E...
Access and Secure Storage Based Block Chain Scheme with IPFS Implemented in E...
 
Nicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchNicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do research
 
Preventing Mirror Problem And Privacy Issues In Multistorage Area With Dimens...
Preventing Mirror Problem And Privacy Issues In Multistorage Area With Dimens...Preventing Mirror Problem And Privacy Issues In Multistorage Area With Dimens...
Preventing Mirror Problem And Privacy Issues In Multistorage Area With Dimens...
 
Whitepaper- User Behavior-Based Anomaly Detection for Cyber Network Security
Whitepaper- User Behavior-Based Anomaly Detection for Cyber Network SecurityWhitepaper- User Behavior-Based Anomaly Detection for Cyber Network Security
Whitepaper- User Behavior-Based Anomaly Detection for Cyber Network Security
 
Ieeepro techno solutions 2011 ieee dotnet project -secure role based data
Ieeepro techno solutions   2011 ieee dotnet project -secure role based dataIeeepro techno solutions   2011 ieee dotnet project -secure role based data
Ieeepro techno solutions 2011 ieee dotnet project -secure role based data
 
Ieeepro techno solutions 2011 ieee java project -secure role based data
Ieeepro techno solutions   2011 ieee java project -secure role based dataIeeepro techno solutions   2011 ieee java project -secure role based data
Ieeepro techno solutions 2011 ieee java project -secure role based data
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 
Trial io pcori doc v1
Trial io pcori doc v1Trial io pcori doc v1
Trial io pcori doc v1
 
Exploring New Methods for Protecting and Distributing Confidential Research ...
Exploring New Methods for Protecting and Distributing Confidential Research ...Exploring New Methods for Protecting and Distributing Confidential Research ...
Exploring New Methods for Protecting and Distributing Confidential Research ...
 

Mais de Beth Plale

Trustworthy AI and Open Science
Trustworthy AI and Open ScienceTrustworthy AI and Open Science
Trustworthy AI and Open ScienceBeth Plale
 
Open science as roadmap to better data science research
Open science as roadmap to better data science researchOpen science as roadmap to better data science research
Open science as roadmap to better data science researchBeth Plale
 
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID TestbedTowards FAIR Open Science with PID Kernel Information: RPID Testbed
Capsule Computing: Safe Open Science

  • 1. Capsule computing: safe open science Beth Plale Professor, Indiana University Bloomington On loan to National Science Foundation *Opinions expressed herein are those of the author alone and do not represent the views of the National Science Foundation Binghamton University December 04, 2018
  • 3. • Data has value beyond the use for which it is originally collected • Open science is a broad-based global effort to make data emerging from research available for wider use • Open science is thus acknowledgement of inherent value of scientific data independent of published scientific or scholarly outcome December 04, 2018
  • 4. Open Science encourages from researcher: • More thoughtful research processes; • Thought to uses of data and code beyond original intent • Attention to reproducibility / replicability of work The Upturned Microscope by Nik Papageorgiou is licensed under CC BY NC ND
  • 5. Tension of Open Science Much data resulting from externally funded research can be made open, but some data simply cannot nor will ever be completely and freely open Data should be made open - open access, open use, open license and perhaps made open by default, but there are important cases where controls on the data must remain
  • 6. More options needed for restricted data reuse on spectrum between completely open (Open Access) and completely hidden
  • 7. Possible way forward suggested by principle: Open as possible, closed as necessary* Principle articulated in "Guidelines on FAIR Data Management in Horizon 2020", EU Horizon 2020 programme
  • 8. Forms of data availability on spectrum between pure open access and fully hidden
  • 9. Capsule framework Controlled compute environment, capsule framework, is viable approach to accessing and sharing restricted data that satisfies sharing while protecting data from unintended use or use prohibited by law
  • 10. Capsule framework Implemented through combination of policy, processes, and software services, to protect the data and make the software infrastructure as easy to use as possible. plale@Indiana.edu
  • 12. Trust Model Threat model: high-level articulation of tradeoffs during design of the system. Not an implementation guide. Policy decisions influenced by situation of use: restrictions on the data; assumptions of use; and limits of software services. Our major tradeoff: how much trust must you place with the user versus a locked down (and relatively unusable) system?
  • 13. plale@Indiana.edu The motivating need for capsule computing is the HathiTrust (HT) shared digital repository
  • 14. HT Mission and Purpose To contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge. • A trusted digital preservation service enabling the broadest possible access worldwide. • An organization with over 100 research libraries partnering to develop its programs. • A range of transformative programs enabled by working at a very large scale.
  • 15. Current Major Cooperative Initiatives • Distributed manual copyright reviews. • Establishing a distributed shared print monograph archive. • Expanding and enhancing access to US Federal Government Documents. • Expanding services of the HathiTrust Research Center.
  • 16. Scale of the HathiTrust Collection • 16,639,076 total volumes – 8,075,459 book titles – 446,580 serial titles – 5,823,676,600 pages – 746 terabytes • 6,256,362 open volumes (~38% of total) Collection includes (mostly) published materials in bound form, digitized from research and academic library collections.
  • 17. Example use: how influenced is a writer by time spent at Iowa Writers Workshop? • Assemble corpus of works by authors affiliated with renowned Iowa Writers Workshop • Perform analysis to determine whether a Workshop style exists and what the characteristics of such a style might be. • Collect metrics such as vocabulary size, sentence length, or even frequency of male and female pronouns
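The metrics named in this example use can be computed without anyone reading the text. Below is a minimal sketch, assuming a plain-text volume is already available inside the Capsule; the function name `style_metrics`, the pronoun lists, and the simple regular-expression tokenizer are illustrative assumptions, not part of the HTRC toolkit.

```python
import re
from collections import Counter

# Hypothetical pronoun lists chosen for illustration only.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}


def style_metrics(text: str) -> dict:
    """Return aggregate style metrics for one text (no passages are displayed)."""
    # Naive sentence split on terminal punctuation; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "vocabulary_size": len(counts),
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
        "male_pronouns": sum(counts[w] for w in MALE),
        "female_pronouns": sum(counts[w] for w in FEMALE),
    }


if __name__ == "__main__":
    print(style_metrics("She wrote daily. He read her drafts."))
```

Only the aggregate numbers leave the function, which is the kind of output a non-consumptive analysis is expected to produce.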
  • 18. Controlled Compute Environment (or remote secure enclave) provides researchers with remote analytical access to a data collection that has restrictions (legal, privacy) on use, and because of its size, requires compute to come to the data and not vice versa SUNY Binghamton, Dec 2018 Beth Plale, Inna Kouper, Samitha Liayanage, Yu Ma, Robert McDonald, and John Walsh, Capsule Computing: Safe Open Science, under review, 2019.
  • 19. Capsule Framework, as a controlled compute environment, is implemented through policy, processes, and software services working together to protect the data while making the software infrastructure as easy to use as possible. DataPASS
  • 20. Policies in place for HathiTrust Human facing • Non-consumptive Use Research Policy • Terms of Use Infrastructure facing • HathiTrust Rights Database • Trust (threat) model Export review • Human review of results exported from Capsule Human facing policy Infrastructure facing “service agreement” mutually reinforcing
  • 21. Overriding policy is that of Non-consumptive Use Research Policy Research in which computational analysis is performed on one or more volumes but not research in which researcher reads or displays substantial portions of an in-copyright or rights-restricted volume to understand expressive content presented within that volume. Examples: text extraction, automated translation, image analysis, file manipulation, OCR correction, and indexing and search. https://www.hathitrust.org/htrc_ncup
  • 22. Terms of Use Agreement between HTRC/HT and individual intending to use HTRC Data Capsule service. Top 4 terms: 1. Read and comply with Non-Consumptive Use Research Policy. 2. Use their Capsule for non-consumptive research purposes only as defined in Section 1 of the policy. 3. Prior to first use, User submits form indicating intended use and expected forms of outputs. 4. By using HTRC Data Capsule service, User acknowledges that information about their activities while in Capsule may be reviewed in manner consistent with HathiTrust privacy policy. https://www.hathitrust.org/htrc_dc_tou
  • 23. Rights Database ● Database for storing and tracking rights information for each digitized volume in HathiTrust ● At core of system is algorithm that considers a) copyright status and/or explicit access controls associated with the volume, b) volume's digitizing agent (e.g., Google or the University of Chicago), and c) identity of user (if known) in order to determine access rights. ● How used: demo capsule uses only public domain content. https://www.hathitrust.org/rights_database plale@Indiana.edu
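As a rough illustration of the "how used" note above, a demo capsule restricted to public domain content could filter volumes on a rights attribute. This is a sketch under stated assumptions: the record layout, the field name `rights`, the value `"pd"`, and the volume identifiers are all hypothetical here, and the real rights algorithm also weighs the digitizing agent and the identity of the user, which this sketch ignores.

```python
def demo_capsule_volumes(volumes):
    """Return ids of volumes a public-domain-only demo capsule may read.

    `volumes` is a list of dicts with hypothetical keys "id" and "rights".
    Any volume whose rights value is missing or not "pd" is excluded,
    so the default is to deny access.
    """
    return [v["id"] for v in volumes if v.get("rights") == "pd"]


if __name__ == "__main__":
    catalog = [
        {"id": "vol-001", "rights": "pd"},   # public domain (hypothetical record)
        {"id": "vol-002", "rights": "ic"},   # in copyright (hypothetical record)
        {"id": "vol-003"},                   # rights unknown
    ]
    print(demo_capsule_volumes(catalog))
```

Denying by default when the rights value is absent mirrors the cautious stance of the policy: a volume is exposed only when its status is positively known.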
  • 24. Threat model ● Threat model: structured representation of all information that affects security of an application. ● Two most relevant clauses: ○ Analysts are themselves considered to act in good faith, but this does not preclude possibility of them unwittingly allowing system to be compromised. ■ Reasonable assumption and motivates why analysts are required to sign use agreement.
  • 25. Capsule Framework k*N user VMs running in back end layer; managed by a hypervisor. All software implementing Capsule framework is open source.
  • 26. Mode one: Maintenance mode Access to Internet permitted; Channel to restricted data closed HT DL
  • 27. Mode two: Secure mode Access to Internet denied; Channel to restricted data open HT DL
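The two modes are mutually exclusive: at no time are the Internet and the restricted data channel both reachable. A minimal sketch of that invariant follows, assuming a hypothetical `Capsule` class; the actual framework enforces this at the hypervisor and network layers, not in user-level code.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"  # Internet permitted, data channel closed
    SECURE = "secure"            # Internet denied, data channel open


@dataclass
class Capsule:
    """Hypothetical model of one analyst VM and its current mode."""
    mode: Mode = Mode.MAINTENANCE

    @property
    def internet_allowed(self) -> bool:
        return self.mode is Mode.MAINTENANCE

    @property
    def data_channel_open(self) -> bool:
        return self.mode is Mode.SECURE

    def switch(self, mode: Mode) -> None:
        # Both permissions derive from the single mode field,
        # so they can never be true at the same time.
        self.mode = mode


if __name__ == "__main__":
    c = Capsule()
    print(c.mode.value, c.internet_allowed, c.data_channel_open)
    c.switch(Mode.SECURE)
    print(c.mode.value, c.internet_allowed, c.data_channel_open)
```

Deriving both permissions from one field, rather than storing two independent flags, is the design choice that makes the "never both" property hold by construction.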
  • 28. Threat Model implementation in 7 easy steps ● The threat model for the Capsule framework implementation in HathiTrust is built on the assumed existence of a Trusted Computing Base (TCB), where the totality of security mechanisms within a secure system resides ● Threat model implementation (8 statements)
  • 29. Threat Model implementation 1. An analyst accesses restricted data through remotely accessed VMs that read data from a network-accessed data service. 2. The VM that is given to the analyst for use is not part of the TCB. The remaining support is within the TCB: the Virtual Machine Manager (VMM), the host that the VMM runs on, and the system services that enforce network and data access policies for the virtual machines. Data storage is included within the TCB.
  • 30. Threat Model implementation 3. Users may inadvertently install malware; there may be other remotely initiated attacks on the VM. These attacks could potentially compromise the entire operating system and install a rootkit, both of which are undetectable to the end user. Analysts are themselves considered to act in good faith, but this does not preclude the possibility of them unwittingly allowing the system to be compromised. Analysts are required to sign a use agreement before using the system. Results are reviewed before being made available to the user for download.
  • 31. Threat Model implementation 4. Users have VNC access to their virtual machines in non-secure mode to give them a desktop interface to the machine. They also have SSH access in non-secure mode so that they can upload data sets and install software more easily. However, VNC access represents a channel for potential data leak; through use of a use agreement and profile, HT is comfortable that the analyst is acting in good faith. An analyst must refrain from sharing their virtual machine. 5. A potential threat is that of covert channels between virtual machines that run on the same host machine. A solution requires using two physically separated systems, one that only runs VMs in secure mode and another that runs VMs only in maintenance mode. HT currently performs routine host port scanning.
  • 32. Threat Model implementation 6. Analysts have complete freedom to access the Internet from their Capsule to upload/download material while in a non-secure mode. Once switching to a secure mode, the analyst has direct access to the restricted materials. While in secure mode, Internet access is prohibited, as is copying from the Capsule to the desktop. 7. The analyst’s state is retained in a Capsule across sessions of work, but when an analyst completes her work and wants to pull data out of the Capsule, she must store results off to a special drive. The contents of this drive are manually reviewed before results are made available to the analyst.
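Statement 7 describes a gate: nothing written to the special drive reaches the analyst until a human reviewer approves it. The flow can be sketched as below, assuming hypothetical names (`ExportRequest`, `ReviewQueue`) and an in-memory list in place of the real drive and review tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ExportRequest:
    analyst: str
    filename: str
    status: str = "pending"  # pending -> approved | rejected


@dataclass
class ReviewQueue:
    """Hypothetical stand-in for the special results drive plus manual review."""
    requests: list = field(default_factory=list)

    def submit(self, analyst: str, filename: str) -> ExportRequest:
        # The analyst stores a result; it starts out unavailable for download.
        req = ExportRequest(analyst, filename)
        self.requests.append(req)
        return req

    def review(self, req: ExportRequest, approve: bool) -> None:
        # A human reviewer records the decision.
        req.status = "approved" if approve else "rejected"

    def downloadable(self, analyst: str) -> list:
        # Only approved results are released to the analyst.
        return [r.filename for r in self.requests
                if r.analyst == analyst and r.status == "approved"]


if __name__ == "__main__":
    q = ReviewQueue()
    a = q.submit("analyst1", "word_counts.csv")
    b = q.submit("analyst1", "full_text_dump.txt")
    q.review(a, approve=True)
    q.review(b, approve=False)
    print(q.downloadable("analyst1"))
```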
  • 33. Policy / Infrastructure tradeoff Human facing policies • Non-consumptive Use Research Policy • Terms of Use Infrastructure facing • HathiTrust Rights Database • Trust (threat) model Export review • Human review of results exported from Capsule Tradeoff: how much trust must you place with the user versus a locked down (and relatively unusable) system? Human facing policy Infrastructure facing “service agreement” mutually reinforcing
  • 34. Takeaways • HT Capsule implementation heavily driven by “non-consumptive research” and wide-open research modes to the HathiTrust collection – By Authors Guild et al. v HathiTrust et al., 11 Civ 6351 (S.D.N.Y Sep 12 2011), research must be non-consumptive. That is, “no eyeballs on texts” – Massive collection where no single set of tools satisfies needs. Researcher needs freedom to install own tools
  • 35. Takeaways • Recently released toolkit comes by default in every VM. Helps connect to restricted data and import other data resources (user’s workset) into Capsule. Proven to reduce programming burden. • Running on physical servers at Indiana University Bloomington.
  • 36. Resources • Non-Consumptive Use Research Policy https://www.hathitrust.org/htrc_ncup • HathiTrust Rights Database https://www.hathitrust.org/rights_database • Trust (threat) model (somewhat outdated) – Plale, Beth; Prakash, Atul; McDonald, Robert (2015). The Data Capsule for Non-Consumptive Research: Final Report. Available from http://hdl.handle.net/2022/19277 • Terms of Use https://www.hathitrust.org/htrc_dc_tou • HTRC Data Capsule accessible at https://analytics.hathitrust.org
  • 37. Please feel free to reach out to me for more information Beth Plale plale@indiana.edu