I explained how we overcame the technical challenges preventing our depositors from sharing bigger files: we added HTML5 upload to our DSpace repository (Edinburgh DataShare), allowing upload of files up to 15 GB, and even 20 GB in some circumstances. These slides also explain our plans to link up an FTP server to Edinburgh DataShare to make it convenient for users to download filesets of around 10 GB and possibly even bigger. Unfortunately I hit my 7-minute limit before I had a chance to get to the FTP bit. Thank you Slideshare for allowing everybody to find out what was in the rest of my talk!
The code for the HTML5 upload is at https://github.com/edina/DSpace/tree/xml-html5-upload . This 7-minute presentation ("24x7") was given at Open Repositories 2016 on the 14th of June 2016. The full abstract is at: https://www.conftool.com/or2016/index.php?page=browseSessions&form_session=126
N.B. The 20 GB file which I have uploaded, and which is our current upper size limit, is 20 GB in the binary sense (strictly speaking, 20 GiB), i.e. 20 x 1,024 x 1,024 x 1,024 bytes. Linux displays this as 21,474,836,480 bytes, whereas Windows usually expresses the size in KB, i.e. 20,971,520 KB.
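The arithmetic behind those two figures can be checked with a few lines of Python. This is only a worked example of the unit conversion; the variable names are illustrative.

```python
# 20 GB in the binary sense: 20 x 1,024 x 1,024 x 1,024 bytes
GIB = 1024 ** 3   # bytes in one binary gigabyte
KIB = 1024        # bytes in one binary kilobyte

size_bytes = 20 * GIB          # the figure Linux shows
size_kb = size_bytes // KIB    # the figure Windows usually shows

print(f"{size_bytes:,} bytes")  # 21,474,836,480 bytes
print(f"{size_kb:,} KB")        # 20,971,520 KB
```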
Growing Open Data: Making the sharing of XXL-sized research data files online a reality, using Edinburgh DataShare
1. GROWING OPEN DATA: MAKING
THE SHARING OF XXL-SIZED
RESEARCH DATA FILES ONLINE
A REALITY, USING EDINBURGH
DATASHARE
PAULINE WARD: PAULINE.WARD@ED.AC.UK @PAULINEDATAWARD
GEORGE HAMILTON
2. THE CHALLENGE
• Researchers are generating bigger files. At University of
Edinburgh all researchers are entitled to 500 GB storage.
3. THE CHALLENGE
• Researchers need to be able to share their data online.
• For impact.
• For discoverability.
• For reproducibility.
• For compliance.
4. THE CHALLENGE
• DataShare is the Institutional Repository for research data for
staff and students at the University of Edinburgh:
datashare.is.ed.ac.uk .
• Previous file size limit of 2.1 GB.
• Largest file we’ve been asked to share: 20 GB – split into
smaller files.
• Largest fileset we’ve been asked to share: 226 GB – split into
smaller filesets.
5. THE CHALLENGE
• Some files had to be imported via time-consuming batch
import process because too big / too numerous for web
deposit.
• Some files still waiting to be shared because they are too big
for users to be able to conveniently download them.
• These files are generated from a wide range of disciplines and
wide range of methods.
6. THE SOLUTION
• Getting the files from the depositors: address upload
• Allowing users to get the files: address download
8. THE SOLUTION: UPLOAD
• EDINA’s code for implementing HTML5 upload in DSpace is on
GitHub:
https://github.com/edina/DSpace/tree/xml-html5-upload
• Uses resumable.js
• This was the XMLUI re-write of functionality that was available
for DSpace 5.0 JSPUI. See
https://jira.duraspace.org/browse/DS-1562 for further details.
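To illustrate the general idea behind a chunked, resumable upload, here is a minimal Python sketch. It is not the resumable.js algorithm or the DataShare code: the function names, the 1 MiB chunk size and the set of chunk numbers already held by the server are all assumptions made for illustration. The point is that the file is cut into numbered byte ranges, and after an interruption only the ranges the server does not yet have need to be sent.

```python
from typing import Iterator, List, Set, Tuple

Chunk = Tuple[int, int, int]  # (chunk number starting at 1, start byte, end byte exclusive)


def chunk_ranges(file_size: int, chunk_size: int) -> Iterator[Chunk]:
    """Yield the numbered byte ranges that together cover the whole file."""
    number = 1
    start = 0
    while start < file_size:
        end = min(start + chunk_size, file_size)
        yield number, start, end
        number += 1
        start = end


def chunks_to_send(file_size: int, chunk_size: int,
                   already_on_server: Set[int]) -> List[Chunk]:
    """Return only the chunks the server has not yet received (hypothetical helper)."""
    return [c for c in chunk_ranges(file_size, chunk_size)
            if c[0] not in already_on_server]


# A 15 GB (binary) file in assumed 1 MiB chunks
total_chunks = sum(1 for _ in chunk_ranges(15 * 1024 ** 3, 1024 ** 2))
print(total_chunks)  # 15360
```

Resuming an interrupted upload then amounts to asking the server which chunk numbers it already holds and passing that set to `chunks_to_send`.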
9.
10.
11. THE SOLUTION: UPLOAD
• Testing shows files up to 15 GB upload successfully.
• (cf figshare 5 GB file size limit, Zenodo 2 GB)
• 20 GB file upload has been done in testing, but generates an error
message in the browser, and the user must find and Resume the
submission from the Submissions page
• Multiple files can be uploaded by drag’n’drop.
12. THE SOLUTION: DOWNLOAD
We wanted a mechanism, which DSpace doesn’t provide, of
zipping up files for download.
• BitTorrent was one possible approach: could be added at a later
date
• Other approaches possible (Rsync, Secure Copy (SCP))
13. THE SOLUTION: DOWNLOAD
• FTP download: agreed
• Tried and tested technology that we are confident we can put in place
and will work well
• All files will be accessed from the FTP server anonymously
• Users can still download files in their browser, via FTP
• Users who wish can use an FTP client, allowing them to resume a
download
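As a rough sketch of what resuming a download in an FTP client involves, the following Python uses the standard-library `ftplib`. The host name, remote path and local path passed in are placeholders, not real DataShare addresses, and the function names are invented for this example. The client checks how many bytes it already has, then asks the server to restart the transfer from that offset.

```python
import os
from ftplib import FTP


def resume_offset(local_path: str) -> int:
    """Return the number of bytes already downloaded (0 if nothing is there yet)."""
    return os.path.getsize(local_path) if os.path.exists(local_path) else 0


def resume_download(host: str, remote_path: str, local_path: str) -> None:
    """Download remote_path over anonymous FTP, continuing any partial local file."""
    offset = resume_offset(local_path)
    with FTP(host) as ftp:
        ftp.login()  # no arguments means an anonymous login
        # Open in append mode so new bytes go after the part we already have
        with open(local_path, "ab") as out:
            # rest=offset makes ftplib send a REST command before RETR,
            # so the server starts sending from that byte position
            ftp.retrbinary(f"RETR {remote_path}", out.write, rest=offset or None)
```

Calling `resume_download` again after a dropped connection picks up where the partial file ends, which is the behaviour a browser download typically cannot offer for very large files.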
14. THE SOLUTION: DOWNLOAD
• Specification:
• All files will still be required to have appropriate metadata stored in
DSpace
• All filesets will now be downloadable as a zip file (previous 5.2 GB limit)
• Move DSpace assetstore to a location where more storage available
• Statistics (i.e. numbers) of file downloads by SFTP will be added to
DSpace statistics
15. THE SOLUTION: DOWNLOAD
• This is a replacement for our current on-the-fly zip file
creation of Item bitstreams.
• Will mitigate potential performance issues, because it will use
fewer server resources (Java threads and RAM)
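One way to picture the alternative to on-the-fly zipping is to build each fileset's zip once, ahead of time, and then serve it as a static file. The sketch below uses Python's standard `zipfile` module and is only an illustration under that assumption: the slides do not say how the zip files will actually be produced, and the function name and paths are hypothetical.

```python
import zipfile
from pathlib import Path


def build_fileset_zip(source_dir: str, zip_path: str) -> int:
    """Write every file under source_dir into zip_path and return the file count.

    zip_path should live outside source_dir so the archive does not include itself.
    """
    root = Path(source_dir)
    count = 0
    # ZIP_STORED skips compression (cheap on CPU for already-compressed data);
    # allowZip64 permits archives and members larger than 4 GiB
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the fileset root
                zf.write(path, arcname=path.relative_to(root).as_posix())
                count += 1
    return count
```

Because the archive is created once rather than per request, each download only has to stream an existing file instead of holding an application thread and memory while the zip is assembled.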
16. SUMMARY
• We have implemented HTML5 upload in the DataShare (DSpace)
web interface to allow depositors to easily and quickly deposit
individual files up to 15 GB.
• We are working on integrating an SFTP server to allow users to
retrieve filesets larger than our current 20 GB limit. Storage
rather than network/browser timeout will become the limiting
factor on fileset size. We anticipate making numerous filesets
around 100 GB available in this way in the medium term.