‘Archiving and managing a
million or more data files on
BiG Grid’

Peter Doorn, Data Archiving and Networked
Services (DANS)
With Jan Just Keijser (NIKHEF)
BiG Grid & Beyond, Amsterdam, 26/9/2012
Contents
• Promises and ideas at the kick-off of BiG Grid in 2007: what
  became of them?
    ➢ In NL
    ➢ SSH in UK, DE, ESFRI
• Two sub-projects of BiG Grid with DANS
    ➢ Analyzing and visualizing big humanities data (briefly)
    ➢ Archiving and managing a million or so humanities files
• Beyond BiG Grid: next requirements and challenges for the
  future of SSH research and infrastructure
    ➢ An example of analysis of Big Social Science Data (GPS
      traces) from Italy
    ➢ Challenges for data infrastructure
From the original BiG Grid proposal:

• “BIG GRID is crucial to the success and continuity of
  many Dutch research communities, covering
  important areas such as life sciences, astronomy,
  particle physics, meteorology, and climate
  research, water management, to name just a few.
• However, the very nature of the new infrastructure, a
  multidimensional collaboration enabler and
  accelerator, allows for direct participation of also
  social sciences, humanities, and even addresses
  communities in administrative domains, like digital
  academic repositories.”

Promises…
ESFRI projects in SSH about grid

CESSDA: grid technologies for facilitating the merging of
  distributed data sources
DARIAH:
    • grid services for an open semantic architecture facilitating arts
      and humanities research
    • need for ‘easy’ interfaces for humanities scholars, services need
      to be usable without the complexities of the grid infrastructure …
CLARIN: grid technology for
    • access to guidance and advice through distributed knowledge centres
    • access to repositories of data with standardized descriptions, processing
      tools ready to operate on standardized data

Promises…
Tools for processing, analysing,
annotating, editing and publishing text data

• Grid-enabled workbench to process, analyse,
  annotate, edit and publish XML-encoded textual
  data for academic research
• Connect to the D-Grid Integration Platform (DGI)
  via TextGrid-specific middleware components
• Demonstrate the efficiency of the grid-enabled tools
  in the areas of publishing, processing, retrieval, and
  linking
• Semantic TextGrid: semantic methods for
  processing text assets, and for interweaving texts
  and dictionaries

Promises…
Germany: TextGrid

But also results…
TextGrid VRE: Repository + Lab

But also results…
UK: e-Social Science

• “The National Centre for e-Social Science
  (NCeSS) investigates how innovative and
  powerful computer-based infrastructure and
  tools, developed under the UK e-Science
  programme, can benefit the social science
  research community”
Examples of grid projects:
    ➢ Mixed Media Grid (MiMeG): generate tools and
      techniques for social scientists to analyse audio-visual
      qualitative data and related materials collaboratively
    ➢ SABRE software has been specifically designed for the
      statistical analysis of multi-process random effect
      response data, using parallel processing

Promises…
UK e-Social Science discontinued…

Is no more…
Dutch example from humanities

➢ Subject: organization of knowledge
➢ Comparison of a designed classification system (UDC) with
  a socially grown knowledge system (Wikipedia)
➢ Multidisciplinary research group, including DANS
  researcher Andrea Scharnhorst
➢ Big data set (dump of Wikipedia: 2.8 TB)
    • Mine the data to extract the page and category link
      changes over time
    • Create complex visualizations
➢ Computational support by BiG Grid team: Tom Visser,
  Coen Schrijvers and Ammar Benabadelkader
Archiving experiments since 2007

• Grid middleware not very suitable for our archiving
  purposes
• Use case:
    ➢ How can you be sure that what you store on the grid
      is valid?
    ➢ Giving proof of data integrity is a requirement of ISO
      standard 16363 for trusted digital archives
• Advantages of grid storage:
    ➢ Fast access to grid worker node
    ➢ Hierarchical storage manager: e.g. efficient automated
      backup procedures
    ➢ Shared facility is efficient and economically attractive
Large numbers of datasets and files

➢ More than 23,000 datasets in DANS archives
➢ Every dataset consists of 1+ data files, sometimes 1000+
➢ Most datasets are small (98% < 1 GB)
➢ For example, the entire population census of 1960 (>11 million
  records) fits on one CD-ROM (< 700 MB)
➢ Total number of files >1 million
➢ Total storage volume ca. 70 TB
➢ Long processing times with large numbers of datasets and files
➢ Management operations on the whole archive: slow and
  problematic on normal servers
    • Mass conversions (e.g. thumbnails of images)
    • Data integrity control (checksums)
    • Compressing the data
➢ Copying of the whole archive to the grid is not trivial
Datasets in DANS EASY (Sept. 2012)

➢ 1.8% of datasets > 2 GB
➢ 2.8% of datasets > 1 GB

23,560 datasets
1,693,413 files
The experiment

• Experiment with five digital archives (not in EASY),
  containing a total of 290,341 files, grouped into
  1,695 'tar' files of 5 GB each (c. 8.5 TB)
• Carried out by Jan Just Keijser (Nikhef)
• Three-phase workflow
DANS Workflow phase 1:
• Create checksums
• Create tarballs (.tar files)
• Upload tarballs to the grid

[Diagram: 1) md5sum → 2) tar → 3) Upload to grid storage]
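
The first two steps of phase 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool used in the experiment: the function names and the manifest layout (one "digest, two spaces, relative path" line per file, as md5sum prints) are hypothetical. The upload in step 3 depends on the grid site's storage client and is deliberately left out.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_dir, manifest_path):
    """Step 1: write one 'digest  relative/path' line per file; return the file count."""
    archive_dir = Path(archive_dir)
    files = sorted(p for p in archive_dir.rglob("*") if p.is_file())
    with open(manifest_path, "w", encoding="utf-8") as out:
        for path in files:
            rel = path.relative_to(archive_dir).as_posix()
            out.write(f"{md5_of_file(path)}  {rel}\n")
    return len(files)


def make_tarball(archive_dir, tar_path):
    """Step 2: pack every regular file into an uncompressed .tar file."""
    archive_dir = Path(archive_dir)
    with tarfile.open(tar_path, "w") as tar:
        for path in sorted(archive_dir.rglob("*")):
            if path.is_file():
                tar.add(path, arcname=path.relative_to(archive_dir).as_posix())
    return tar_path

# Step 3 (upload to grid storage) is site-specific and not shown here.
```

The manifest should be written outside the archive directory, so that it is not itself checksummed or packed. Empty directories are not preserved by this sketch.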
DANS Workflow phase 2:
• Download .tar file
• Compress it to a .tar.gz file
• Upload compressed tarball

[Diagram: grid storage → 1) Download → worker node: 2) Compress → 3) Upload]
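
The compression step of phase 2 could look like the sketch below. It assumes the .tar file has already been downloaded to the worker node's local disk; the function name is hypothetical, and the download and upload steps are omitted because they depend on the grid storage client in use.

```python
import gzip
import shutil


def compress_tarball(tar_path, gz_path=None):
    """Stream a .tar file through gzip without loading it fully into memory."""
    gz_path = gz_path or f"{tar_path}.gz"
    with open(tar_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        # Copy in 1 MiB blocks so multi-gigabyte tarballs stay within memory limits.
        shutil.copyfileobj(src, dst, length=1 << 20)
    return gz_path
```

Streaming matters here because each tarball in the experiment is about 5 GB, which may exceed the memory available to a single job.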
DANS Workflow phase 3:
• Download .tar.gz file
• Unpack it
• Calculate checksums
• Send checksums back and compare

[Diagram: grid storage → 1) Download → worker node: 2) Unpack → 3) md5sum → 4) Compare]
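
The verification in phase 3 can be sketched as below. Two assumptions apply: the checksum manifest uses md5sum-style "digest, two spaces, path" lines, and, as a simplification, the sketch reads each member straight from the .tar.gz instead of unpacking it to disk first. All function names are hypothetical.

```python
import hashlib
import tarfile


def checksums_from_tarball(gz_path):
    """Return {member name: md5 hex digest} for every regular file in a .tar.gz."""
    result = {}
    with tarfile.open(gz_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            digest = hashlib.md5()
            stream = tar.extractfile(member)
            for chunk in iter(lambda: stream.read(1 << 20), b""):
                digest.update(chunk)
            result[member.name] = digest.hexdigest()
    return result


def read_manifest(manifest_path):
    """Parse 'digest  path' lines into {path: digest}."""
    expected = {}
    with open(manifest_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                digest, _, name = line.partition("  ")
                expected[name] = digest
    return expected


def compare(expected, actual):
    """Return the sorted names that are missing, unexpected, or differ in digest."""
    names = set(expected) | set(actual)
    return sorted(n for n in names if expected.get(n) != actual.get(n))
```

An empty list from `compare` means every file came back from grid storage unchanged; any name in the list marks a file to investigate.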
Results

➢ The tool works
➢ One checksum mismatch detected: disk
  failure on grid worker node!
SSH: big data challenges

➢ Data generated by people tend to be small
➢ Data generated by social processes (Twitter,
  Facebook), transactions (financial),
  administrations and by devices (GSM, GPS) tend
  to be big
➢ More analytical projects of big data in SSH (but
  few in NL)
    • Millions of digitized books (“Culturomics”)
    • Sentiment analysis of Twitter feeds to predict
      markets and economic trends
    • Traffic flows using GPS
An example from Italy
GPS traces
17K private cars
one week of ordinary mobility
200K trips (trajectories)
Milan, Italy

From presentation by Dino Pedreschi, Pisa
Data donated by OCTO Telematics

    Where is traffic concentrated between midnight and 2 a.m.?
    (red = most intense)

    Where is traffic concentrated between 6 p.m. and 8 p.m.?

    Select only trips that start in the city centre (orange) and move
    to the North-West

    Where are people between 6 p.m. and 8 p.m. on Wednesday,
    April 4th?

    Where are people between 8 p.m. and 10 p.m. on Wednesday,
    April 4th? (A high-density spot appeared)

    Where are people between 10 p.m. and midnight on Wednesday,
    April 4th? (The dense spot disappeared. What happened?)

    Focus on the high-density spot: it is centred on the parking lots of the
    stadium, where a football match took place...
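
A question such as "where is traffic concentrated between 6 p.m. and 8 p.m.?" amounts to filtering GPS points by a time window and counting them per map cell. The sketch below illustrates that idea only. It assumes each point is a (timestamp, latitude, longitude) tuple, and the function name and the 0.01-degree cell size are illustrative choices, not taken from the presentation or its dataset.

```python
import math
from collections import Counter
from datetime import datetime


def density(points, start, end, cell_deg=0.01):
    """Count GPS points per grid cell for timestamps in the window [start, end)."""
    counts = Counter()
    for timestamp, lat, lon in points:
        if start <= timestamp < end:
            # Snap the coordinate to a square cell of cell_deg degrees.
            cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
            counts[cell] += 1
    return counts
```

Calling `density(points, start, end).most_common(1)` returns the busiest cell in the window. Comparing consecutive windows is one way to notice a dense spot that appears and then disappears.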
SSH Research beyond BiG Grid

➢ Acceptance of grid technology by SSH community is low
  and slow: “my laptop has enough processing power”
➢ Grid is still perceived as “complicated”
➢ Researchers are not aware of:
    • data management issues
    • the research potential of “Big SSH Data”
➢ Demonstrator projects are still needed:
    • Social scientists need to focus more on the analytical
      potential of “Big Social Data”
    • “Culturomics” in humanities
➢ DANS can help to make that accessible, although we are
  not only driven by data, but also by… demand!
Archiving beyond BiG Grid

➢ Storage capacity: joining forces with other parties: 3TU
  Data Centre, National Coalition for Digital Preservation
  (NCDD with Royal Library, National Archives, Institute for
  Sound and Vision, museum sector), Roadmap projects
➢ Archiving is more than storage: archival management
  requires repeated operations on masses of files, many
  small, but also big (e.g. audio/visual)
➢ Set of procedures to support archival management
➢ Continuity of grid infrastructure is prerequisite
➢ Is cloud the answer?
    • Public cloud is not without risk
    • Costs are not yet attractive enough
    • Private community cloud is attractive
Thank you for your attention



 peter.doorn@dans.knaw.nl
     janjust@nikhef.nl



    www.dans.knaw.nl

How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 

Recently uploaded (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 

Managing and Archiving Over 1 Million Humanities Data Files

  • 1. ‘Archiving and managing a million or more data files on BiG Grid’ Peter Doorn, Data Archiving and Networked Services (DANS) With Jan Just Keijser (NIKHEF) BiG Grid & Beyond, Amsterdam, 26/9/2012
  • 2. Contents
      - Promises and ideas at the kick-off of BiG Grid in 2007: what became of them?
          - In NL
          - SSH in UK, DE, ESFRI
      - Two sub-projects of BiG Grid with DANS
          - Analyzing and visualizing big humanities data (briefly)
          - Archiving and managing a million or so humanities files
      - Beyond BiG Grid: next requirements and challenges for the future of SSH research and infrastructure
          - An example of analysis of Big Social Science Data (GPS traces) from Italy
          - Challenges for data infrastructure
  • 3. From the original Big Grid proposal:
      - “BIG GRID is crucial to the success and continuity of many Dutch research communities, covering important areas such as life sciences, astronomy, particle physics, meteorology, and climate research, water management, to name just a few.
      - However, the very nature of the new infrastructure, a multidimensional collaboration enabler and accelerator, allows for direct participation of also social sciences, humanities, and even addresses communities in administrative domains, like digital academic repositories.”
      [Stamp across the slide: “Promises…”]
  • 4. ESFRI projects in SSH about grid
      CESSDA: grid technologies for facilitating the merging of distributed data sources
      DARIAH:
          - grid services for an open semantic architecture facilitating arts and humanities research
          - need for ‘easy’ interfaces for humanities scholars, services need to be usable without the complexities of the grid infrastructure …
      CLARIN: grid technology for
          - access to guidance and advice through distributed knowledge centres
          - access to repositories of data with standardized descriptions, processing tools ready to operate on standardized data
      [Stamp across the slide: “Promises…”]
  • 5. Tools for processing, analysing, annotating, editing and publishing text data
      - Grid-enabled workbench to process, analyse, annotate, edit and publish XML-encoded textual data for academic research
      - Connect to the D-Grid Integration Platform (DGI) via TextGrid-specific middleware components
      - Demonstrate the efficiency of the grid-enabled tools in the areas of publishing, processing, retrieval, and linking
      - Semantic TextGrid: semantic methods for processing text assets, and for interweaving texts and dictionaries
      [Stamp across the slide: “Promises…”]
  • 6. Germany: TextGrid [Stamp across the slide: “But also results…”]
  • 7. TextGrid VRE: Repository + Lab [Stamp across the slide: “But also results…”]
  • 8. UK: e-Social Science
      - “The National Centre for e-Social Science (NCeSS) investigates how innovative and powerful computer-based infrastructure and tools, developed under the UK e-Science programme, can benefit the social science research community”
      Examples of grid projects:
      - Mixed Media Grid (MiMeG): generate tools and techniques for social scientists to analyse audio-visual qualitative data and related materials collaboratively
      - SABRE software has been specifically designed for the statistical analysis of multi-process random effect response data, using parallel processing
      [Stamp across the slide: “Promises…”]
  • 9. UK e-Social Science discontinued… [Stamp across the slide: “Is no more…”]
  • 10. Dutch example from humanities
      - Subject: organization of knowledge
      - Comparison of designed classification system (UDC) with a socially grown knowledge system (Wikipedia)
      - Multidisciplinary research group, including DANS researcher Andrea Scharnhorst
      - Big data set (dump of Wikipedia: 2.8 TB)
      - Mine the data to extract the page and category link changes over time
      - Create complex visualizations
      - Computational support by BiG Grid team: Tom Visser, Coen Schrijvers and Ammar Benabadelkader
  • 11.
  • 12. Archiving experiments since 2007
      - Grid middleware not very suitable for our archiving purposes
      - Use case:
          - How can you be sure that what you store on the grid is valid?
          - Giving proof of data integrity is a requirement of ISO standard 16363 for trusted digital archives
      - Advantages of grid storage:
          - Fast access to grid worker node
          - Hierarchical storage manager: e.g. efficient automated backup procedures
          - Shared facility is efficient and economically attractive
  • 13. Large numbers of datasets and files
      - > 23,000 data sets in DANS archives
      - Every data set consists of 1+ data files, sometimes 1000+
      - Most data sets are small (98% < 1 GB)
          - For example, the entire population census of 1960 (> 11 million records) fits on one CD-ROM (< 700 MB)
      - Total number of files > 1 million
      - Total storage volume ca. 70 TB
      - Long processing times with large numbers of datasets and files
      - Management operations on the whole archive: slow and problematic on normal servers
          - Mass conversions (e.g. thumbnails of images)
          - Data integrity control (checksums)
          - Compressing the data
      - Copying of the whole archive to the grid is not trivial
  • 14. Datasets in DANS EASY (Sept. 2012)
      - 1.8% of datasets > 2 GB
      - 2.8% of datasets > 1 GB
      - 23,560 datasets
      - 1,693,413 files
  • 15. The experiment
      - Experiment with five digital archives (not in EASY), containing a total of 290,341 files, grouped into a total of 1695 'tar' files of 5 GB each (c. 8.5 TB)
      - Carried out by Jan Just Keijser (Nikhef)
      - Three-phase workflow
  • 16. Workflow phase 1:
      - Create checksums
      - Create tarballs (.tar files)
      - Upload tarballs to the grid
      [Diagram: DANS: 1) md5sum, 2) tar, 3) upload to grid storage]
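The first two steps of phase 1 can be sketched in Python as below. This is a minimal illustration, not the tool used in the experiment: the function names `md5_of_file`, `build_manifest` and `make_tarball` are hypothetical, and the upload step is left out because the slides do not say which grid storage commands were used.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of one file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def make_tarball(root, tar_path):
    """Pack all files under root into an uncompressed .tar file."""
    root = Path(root)
    with tarfile.open(tar_path, "w") as archive:
        for p in sorted(root.rglob("*")):
            if p.is_file():
                # Store relative names so they match the manifest keys.
                archive.add(p, arcname=p.relative_to(root).as_posix())
    return tar_path
```

The manifest is kept on the archive side; it is the reference against which checksums computed on the grid are later compared. The tarball should be written outside `root`, otherwise it would be picked up as one of the files to pack.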
  • 17. Workflow phase 2:
      - Download .tar file
      - Compress it to a .tar.gz file
      - Upload compressed tarball
      [Diagram: DANS; worker node: 1) download from grid storage, 2) compress, 3) upload to grid storage]
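The compression step that runs on the worker node could look like the following sketch. It assumes the .tar file has already been downloaded to local disk; `compress_tarball` is a hypothetical name, and the download and upload steps are omitted because they depend on grid middleware the slides do not specify.

```python
import gzip
import shutil


def compress_tarball(tar_path, gz_path=None):
    """Stream a .tar file into a gzip-compressed .tar.gz file next to it."""
    gz_path = gz_path or f"{tar_path}.gz"
    with open(tar_path, "rb") as source, gzip.open(gz_path, "wb") as target:
        # copyfileobj copies in blocks, so a 5 GB tarball is never held in memory.
        shutil.copyfileobj(source, target)
    return gz_path
```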
  • 18. Workflow phase 3:
      - Download .tar.gz file
      - Unpack it
      - Calculate checksums
      - Send checksums back and compare
      [Diagram: worker node: 1) download from grid storage, 2) unpack, 3) md5sum; DANS: 4) compare]
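The verification in phase 3 can be sketched as follows. As a simplification, this version reads each member straight from the .tar.gz instead of unpacking it to disk first; the names `checksums_from_targz` and `compare_manifests` are hypothetical, and the expected manifest is assumed to be a dictionary of relative path to MD5 digest produced when the tarball was created.

```python
import hashlib
import tarfile


def checksums_from_targz(gz_path, chunk_size=1 << 20):
    """Compute the MD5 digest of every regular file inside a .tar.gz archive."""
    sums = {}
    with tarfile.open(gz_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            digest = hashlib.md5()
            stream = archive.extractfile(member)
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)
            sums[member.name] = digest.hexdigest()
    return sums


def compare_manifests(expected, actual):
    """Return {name: (expected_md5, actual_md5)} for every file that differs or is missing."""
    problems = {}
    for name in sorted(set(expected) | set(actual)):
        if expected.get(name) != actual.get(name):
            problems[name] = (expected.get(name), actual.get(name))
    return problems
```

An empty result from `compare_manifests` means every file survived the round trip unchanged; any entry points at a file to investigate.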
  • 19. Results
      - The tool works
      - One checksum mismatch detected: disk failure on grid worker node!
  • 20. SSH: big data challenges
      - Data generated by people tend to be small
      - Data generated by social processes (Twitter, Facebook), transactions (financial), administrations and by devices (GSM, GPS) tend to be big
      - More analytical projects of big data in SSH (but few in NL)
          - Millions of digitized books (“Culturomics”)
          - Sentiment analysis of Twitter feeds to predict markets and economic trends
          - Traffic flows using GPS
  • 21. An example from Italy
      - GPS traces
      - 17K private cars
      - One week of ordinary mobility
      - 200K trips (trajectories)
      - Milan, Italy
      - From presentation by Dino Pedreschi (Pisa)
      - Data donated by OCTO Telematics
  • 22. Where is traffic concentrated between midnight and 2 a.m.? (red = most intense)
  • 23. Where is traffic concentrated between 6 p.m. and 8 p.m.?
  • 24. Select only trips that start in the city centre (orange) and move to North-West
  • 25. Where are people between 6pm and 8pm on Wednesday, April 4th?
  • 26. Where are people between 8pm and 10pm on Wednesday, April 4th? (A high-density spot appeared)
  • 27. Where are people between 10pm and midnight on Wednesday, April 4th? (The dense spot disappeared. What happened?)
  • 28. Focus on the high-density spot: Centered on the parking lots of the stadium, a football match took place there...
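The kind of question asked on the preceding slides (where is activity concentrated in a given time window?) can be illustrated with a small sketch. The slides do not describe how the original analysis was done, so everything here is an assumption: the function name `densest_cells`, the input format of `(hour, x, y)` tuples, and the simple square-grid binning.

```python
from collections import Counter


def densest_cells(points, start_hour, end_hour, cell_size=1.0, top=3):
    """Count GPS points per square grid cell within [start_hour, end_hour).

    points: iterable of (hour, x, y) tuples; x and y are planar coordinates
    in the same unit as cell_size. Returns the `top` busiest cells as
    ((cell_x, cell_y), count) pairs, busiest first.
    """
    counts = Counter()
    for hour, x, y in points:
        if start_hour <= hour < end_hour:
            cell = (int(x // cell_size), int(y // cell_size))
            counts[cell] += 1
    return counts.most_common(top)
```

Running this for consecutive two-hour windows and comparing the top cells is one simple way to notice a dense spot that appears and then disappears, as with the stadium parking lots.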
  • 29. SSH Research beyond BiG Grid
      - Acceptance of grid technology by SSH community is low and slow: “my laptop has enough processing power”
      - Grid is still perceived as “complicated”
      - Researchers are not aware of:
          - data management issues
          - the research potential of “Big SSH Data”
      - Demonstrator projects are still needed:
          - Social scientists need to focus more on the analytical potential of “Big Social Data”
          - “Culturomics” in humanities
      - DANS can help to make that accessible, although we are not only driven by data, but also by… demand!
  • 30. Archiving beyond BiG Grid
      - Storage capacity: joining forces with other parties: 3TU Data Centre, National Coalition for Digital Preservation (NCDD with Royal Library, National Archives, Institute for Sound and Vision, museum sector), Roadmap projects
      - Archiving is more than storage: archival management requires repeated operations on masses of files, many small, but also big (e.g. audio/visual)
          - Set of procedures to support archival management
          - Continuity of grid infrastructure is prerequisite
      - Is cloud the answer?
          - Public cloud is not without risk
          - Costs are not yet attractive enough
          - Private community cloud is attractive
  • 31.
  • 32. Thank you for your attention peter.doorn@dans.knaw.nl janjust@nikhef.nl www.dans.knaw.nl