The Use of Big Data Techniques for Digital Archiving
Sven Schlarb, Austrian Institute of Technology
Tuesday 15th March 2016, Cambridge
OUTLINE
• E-ARK Project Overview
• Technical Background
• Integrated Prototype
• Data Mining Use Cases
Project Overview
The E-ARK project is co-funded by the European Commission under the ICT-PSP Programme
www.eark-project.eu
Advisory Boards
Archival
• Archives of Emilia-Romagna, Italy
• Directorate-General of the Book, of Archives & of Libraries, Portugal
• EC Archives & Records Management
• EC Historical Archives
• German Federal Archives
• National Archives of Bulgaria
• National Archives of Finland
• National Archives of France
• National Archives of Sweden
• National Archives of the Netherlands
• Polish Data Archive
• Queensland State Archives
• Swiss Federal Archives
• UK National Archives
• UK Parliamentary Archives
Commercial
Technical
• Arkivum
• ARMA Europe
• DigitalForever
• Discovery Garden
• Microsoft Research
• Open Preservation Foundation
• Open Text Initiative
• Preservica
• Versity
Data Providers
• Danish Agency for Digitisation
• Estonian Ministry of Economic Affairs & Communication
• Estonian Unemployment Insurance Fund
• James Lappin, RM Consultant
Project mission
• Improve access to the archived records of European Archives
• Create guidelines and recommended practices
• Cover relational databases, record management systems, and geographical data
• Create open source implementation evaluated in several pilots
Outcomes
Standardisation of available best practices
• Common terminology (Knowledge Center)
• SIP, AIP and DIP format specifications
• Pre-ingest, ingest and access workflows
Open source tools
• Scalable, modular, and reusable implementation of specifications
• Individual deployments (Pilots) and an integrated reference implementation
Technical Background
Hadoop Cluster
Task Trackers
Data Nodes
Job Tracker
Name Node
Hadoop = MapReduce + HDFS
Distributed processing (MapReduce)
Distributed Storage (HDFS)
Example: 2 x quad-core CPUs:
10 Map (parallelisation)
4 Reduce (aggregation)
Example: 4 x 1 TB hard disks (replication factor 3):
approx. 1.33 TB usable
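The storage figure in the example follows from dividing the raw disk capacity by the replication factor. A minimal sketch of that arithmetic; the function name is illustrative, and the calculation ignores space reserved for the operating system or other overhead.

```python
def usable_hdfs_capacity_tb(disks: int, disk_size_tb: float, replication_factor: int) -> float:
    """Rough usable capacity: raw capacity divided by the number of copies kept per block."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    raw_tb = disks * disk_size_tb
    return raw_tb / replication_factor


# The example from the slide: 4 x 1 TB disks, replication factor 3.
capacity = usable_hdfs_capacity_tb(disks=4, disk_size_tb=1.0, replication_factor=3)
print(f"approx. {capacity:.2f} TB")  # prints "approx. 1.33 TB"
```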
Map/Reduce in a nutshell
[Figure: MapReduce data flow in Hadoop. The input data is divided into input splits 1 to 3 holding records 1 to 9 (three records per split). Tasks 1 to 3 run the map step, the map output is sorted, shuffled and merged, and the reduce step writes the aggregated results as output data.]
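The map, shuffle and reduce stages can be imitated in a few lines of plain Python. This is only a single-process sketch of the data flow, not Hadoop code: the file names are made up, and in a real cluster the map tasks would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain

# Nine hypothetical records, three per input split, mirroring the diagram.
input_splits = [
    ["doc1.pdf", "img1.tiff", "doc2.pdf"],
    ["img2.tiff", "doc3.pdf", "data1.xml"],
    ["data2.xml", "doc4.pdf", "img3.tiff"],
]


def map_task(split):
    """Map: emit one (key, value) pair per record; here (file extension, 1)."""
    return [(name.rsplit(".", 1)[-1], 1) for name in split]


def shuffle(mapped_pairs):
    """Sort/shuffle/merge: bring all values for the same key together."""
    groups = defaultdict(list)
    for key, value in sorted(mapped_pairs):
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


mapped = chain.from_iterable(map_task(split) for split in input_splits)
grouped = shuffle(mapped)
output = dict(reduce_task(key, values) for key, values in grouped.items())
print(output)  # {'pdf': 4, 'tiff': 3, 'xml': 2}
```

Counting files per format is chosen here because it is a typical characterisation question for an archive; any function that emits key/value pairs would fit the same pattern.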
E-ARK Integrated Prototype
Architecture & Implementation
Base technology stack
E-ARK Web
“Integrated” Prototype?
SIP to AIP
AIP to DIP
Hadoop Distributed
File System
NAS
Working area
Search and Access
Lily Repository
DIP Delivery
Workers
Celery
Information Package processing &
Access Repository
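The slide names Celery workers for information package processing. As a rough illustration of the worker pattern only, the sketch below uses a thread pool from Python's standard library as a stand-in for a Celery worker pool. The package identifiers and both conversion functions are hypothetical placeholders, not E-ARK code.

```python
from concurrent.futures import ThreadPoolExecutor


def sip_to_aip(sip_id: str) -> str:
    """Hypothetical placeholder for the SIP to AIP conversion step."""
    return sip_id.replace("sip", "aip", 1)


def aip_to_dip(aip_id: str) -> str:
    """Hypothetical placeholder for the AIP to DIP conversion step."""
    return aip_id.replace("aip", "dip", 1)


submitted = ["sip-0001", "sip-0002", "sip-0003"]  # assumed identifiers

# Packages are independent of each other, so a pool of workers can convert
# them concurrently; pool.map returns results in submission order.
with ThreadPoolExecutor(max_workers=2) as pool:
    aips = list(pool.map(sip_to_aip, submitted))

# A DIP is derived from a stored AIP in a separate step.
dip = aip_to_dip(aips[0])
print(aips, dip)
```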
Access Repository - Interfaces
[Figure: interfaces of the access repository, in two areas, "Ingest and Preservation" and "Access". Labelled components: Archival records; Content and Records Management Systems; SIP Creation Tools; E-ARK SIP; SIP – AIP Conversion; E-ARK AIP; Digital preservation systems; CMIS Interface; Data Mining Interface; Scalable Computation; AIP - DIP Conversion; E-ARK DIP; Archival Search, Access and Display Tools; Content and Records Management Systems.]
Data Mining
Showcase
E-ARK Data Mining
Geographical/timeline search
Peripleo - PELAGIOS Project
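A geographical/timeline search combines a spatial filter with a date-range filter. A minimal sketch, assuming each record carries a latitude, a longitude and a year; the record fields and sample values are invented for illustration and do not reflect Peripleo's data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str
    lat: float
    lon: float
    year: int


def search(records, bbox, years):
    """Return records inside bbox (min_lat, min_lon, max_lat, max_lon)
    whose year lies in the inclusive range (first_year, last_year)."""
    min_lat, min_lon, max_lat, max_lon = bbox
    first_year, last_year = years
    return [
        r for r in records
        if min_lat <= r.lat <= max_lat
        and min_lon <= r.lon <= max_lon
        and first_year <= r.year <= last_year
    ]


# Illustrative sample data (coordinates are approximate city locations).
records = [
    Record("Land register, Vienna", 48.21, 16.37, 1890),
    Record("Census sheet, Tallinn", 59.44, 24.75, 1922),
    Record("Parish map, Cambridge", 52.21, 0.12, 1850),
]

# Bounding box roughly covering central Europe, years 1800 to 1900.
hits = search(records, bbox=(45.0, 5.0, 55.0, 20.0), years=(1800, 1900))
print([r.title for r in hits])  # ['Land register, Vienna']
```

The simple bounding-box test does not handle boxes that cross the 180° meridian; a production index would normally delegate this to a spatial search engine.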
Text mining: Text classification
Training
• Train classifier using annotated text corpus
• SVM – based on statistical features
Classification
• Scan for texts during ingest (or run a MapReduce job afterwards)
• Text category estimation
Search
• Add category as a searchable field to Lily index
• Full-text search using Lily's Solr search interface
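The three steps above (training, classification, adding the category to the index) can be sketched end to end in plain Python. This is an illustration under loud assumptions: the corpus, the two categories and the index field names are made up, the features are simple term counts, and the classifier is a binary linear SVM without bias term trained by batch sub-gradient descent on the hinge loss. A real deployment would more likely rely on an established SVM library.

```python
from collections import Counter

# Tiny annotated corpus (made up for illustration): text -> category label.
corpus = [
    ("budget tax revenue report", "finance"),
    ("tax payment invoice budget", "finance"),
    ("patient hospital treatment report", "health"),
    ("hospital nurse patient care", "health"),
]
POSITIVE, NEGATIVE = "finance", "health"


def features(text):
    """Statistical features: raw term counts (bag of words)."""
    return Counter(text.lower().split())


def score(weights, feats):
    """Linear decision value w . x; unknown terms contribute nothing."""
    return sum(weights.get(term, 0.0) * count for term, count in feats.items())


def train_linear_svm(samples, epochs=200, learning_rate=0.1, reg=0.01):
    """Batch sub-gradient descent on the L2-regularised hinge loss (no bias term)."""
    data = [(features(text), 1.0 if label == POSITIVE else -1.0) for text, label in samples]
    weights = {}
    for _ in range(epochs):
        # Regularisation part of the sub-gradient.
        grad = {term: reg * w for term, w in weights.items()}
        # Hinge part: only samples with margin below 1 contribute.
        for feats, y in data:
            if y * score(weights, feats) < 1.0:
                for term, count in feats.items():
                    grad[term] = grad.get(term, 0.0) - y * count / len(data)
        for term, g in grad.items():
            weights[term] = weights.get(term, 0.0) - learning_rate * g
    return weights


def classify(weights, text):
    return POSITIVE if score(weights, features(text)) >= 0.0 else NEGATIVE


weights = train_linear_svm(corpus)

# Classification during ingest; the category then becomes an extra index field.
# The field names "text" and "category" are assumptions, not the Lily schema.
ingested = "invoice tax"
index_document = {"text": ingested, "category": classify(weights, ingested)}
print(index_document)  # {'text': 'invoice tax', 'category': 'finance'}
```

Once the category is stored as its own field, a search interface can combine a full-text query with a filter on that field.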
OLAP (Online Analytical Processing)
• Database archiving and re-use (SIARD2)
• Normalization - OLAP/Oracle Data Warehouse
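Once archived relational data has been loaded into a warehouse-style schema, OLAP queries aggregate a fact table along its dimensions. A minimal sketch using Python's built-in sqlite3 module as a stand-in for the Oracle data warehouse named on the slide; the tables, columns and rows are invented for illustration and are not taken from a SIARD2 archive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_payments (region_id INTEGER, year INTEGER, amount REAL);
    INSERT INTO dim_region VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_payments VALUES
        (1, 2014, 100.0), (1, 2014, 50.0), (1, 2015, 70.0),
        (2, 2014, 30.0), (2, 2015, 40.0), (2, 2015, 60.0);
    """
)

# Roll-up: total amount per region and year.
rollup = conn.execute(
    """
    SELECT r.name, f.year, SUM(f.amount)
    FROM fact_payments AS f
    JOIN dim_region AS r ON r.region_id = f.region_id
    GROUP BY r.name, f.year
    ORDER BY r.name, f.year
    """
).fetchall()

for row in rollup:
    print(row)
conn.close()
```

Grouping by fewer columns (for example by region only) gives a coarser roll-up of the same fact table.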
Thank you!
• http://www.eark-project.eu
• https://github.com/eark-project

How to convert PDF to text with Nanonets
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

The Use of Big Data Techniques for Digital Archiving

  • 1. The Use of Big Data Techniques for Digital Archiving Sven Schlarb, Austrian Institute of Technology Tuesday 15th March 2016, Cambridge
  • 2. OUTLINE • E-ARK Project Overview • Technical Background • Integrated Prototype • Data Mining Use Cases
  • 4. THE E-ARK PROJECT IS CO-FUNDED BY THE EUROPEAN COMMISSION UNDER THE ICT-PSP PROGRAMME www.eark-project.eu
  • 5. Advisory Boards Archival • Archives of Emilia-Romagna, Italy • Directorate-General of the Book, of Archives & of Libraries, Portugal • EC Archives & Records Management • EC Historical Archives • German Federal Archives • National Archives of Bulgaria • National Archives of Finland • National Archives of France • National Archives of Sweden • National Archives of the Netherlands • Polish Data Archive • Queensland State Archives • Swiss Federal Archives • UK National Archives • UK Parliamentary Archives Commercial/Technical • Arkivum • ARMA Europe • DigitalForever • Discovery Garden • Microsoft Research • Open Preservation Foundation • Open Text Initiative • Preservica • Versity Data Providers • Danish Agency for Digitisation • Estonian Ministry of Economic Affairs & Communication • Estonian Unemployment Insurance Fund • James Lappin, RM Consultant
  • 6. Project mission • Improve access to the archived records of European Archives • Create guidelines and recommended practices • Cover relational databases, record management systems, and geographical data • Create open source implementation evaluated in several pilots
  • 7. Outcomes Standardisation of available best practices • Common terminology (Knowledge Center) • SIP, AIP and DIP format specifications • Pre-ingest, ingest and access workflows Open source tools • Scalable, modular, and reusable implementation of specifications • Individual deployments (Pilots) and an integrated reference implementation
  • 9. Hadoop Cluster Task Trackers Data Nodes Job Tracker Name Node
  • 10. Hadoop = MapReduce + HDFS • Distributed processing (MapReduce), example: 2 x quad-core CPUs: 10 Map (parallelisation), 4 Reduce (aggregation) • Distributed storage (HDFS), example: 4 x 1 TB hard disks (replication factor 3): approx. 1.33 TB usable
  • 11. Map/Reduce in a nutshell • Input data: input split 1 (records 1-3), input split 2 (records 4-6), input split 3 (records 7-9) • Map: tasks 1-3 • Sort, shuffle, merge • Reduce • Output data: aggregated results
  • 15. Information Package processing & Access Repository • SIP to AIP • AIP to DIP • Celery workers • NAS working area • Hadoop Distributed File System • Lily Repository • Search and Access • DIP Delivery
  • 16. Access Repository - Interfaces
  • 17. Ingest and Preservation Access E-ARK SIP SIP Creation Tools Archival records Content and Records Management Systems SIP – AIP Conversion E-ARK AIP CMIS Interface Data Mining Interface Digital preservation systems AIP – DIP Conversion Scalable Computation E-ARK DIP Archival Search, Access and Display Tools Content and Records Management Systems Data Mining Showcase
  • 24. Text mining: Text classification Training • Train classifier using annotated text corpus • SVM – based on statistical features Classification • Scan for texts during ingest (or run MR after) • Text category estimation Search • Add category as a searchable field to Lily index • Full-text search using Lily's SolR search interface
  • 25. OLAP (Online Analytical Processing) • Database archiving and re-use (SIARD2) • Normalization - OLAP/Oracle Data Warehouse
  • 26. Thank you! • http://www.eark-project.eu • https://github.com/eark-project
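
The Map/Reduce flow on slide 11 (input splits, map tasks, sort/shuffle/merge, reduce, aggregated output) can be illustrated with a small single-process sketch. This is not Hadoop code and not the E-ARK implementation: it uses only the Python standard library, and the function names (split_input, map_task, shuffle, reduce_task, run_job) as well as the word-count example are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from itertools import islice


def split_input(records, split_size):
    """Cut the input into fixed-size splits (hypothetical helper)."""
    it = iter(records)
    while True:
        chunk = list(islice(it, split_size))
        if not chunk:
            return
        yield chunk


def map_task(split):
    """Map phase: emit a (word, 1) pair for every word in every record."""
    for record in split:
        for word in record.lower().split():
            yield word, 1


def shuffle(mapped_pairs):
    """Sort/shuffle/merge phase: group all values by key, keys sorted."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key, values):
    """Reduce phase: aggregate the values collected for one key."""
    return key, sum(values)


def run_job(records, split_size=3):
    """Run all phases sequentially; a real cluster runs map tasks in parallel."""
    mapped = []
    for split in split_input(records, split_size):
        mapped.extend(map_task(split))
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    records = ["sip aip dip", "aip dip", "dip"]
    print(run_job(records, split_size=2))
```

On a cluster, each split would be handled by a separate map task on a different node, and the shuffle step would move data over the network; here everything runs in one process so that the data flow stays visible.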

Editor's Notes

  1. Purpose: assess contributions to and from the project; open to interested parties; meetings of these groups; gather information and contribute to a knowledge base (maintained by the DLM Forum).
  2. Technologies: Hadoop MapReduce, SolR, HDFS, Lily Repository, ESSArch Preservation Platform, E-ARK Web. Vertical integration: [MapReduce] works atop [HDFS]; [SolR] indexes [Lily] records. Horizontal integration: [MapReduce] is used to build the [SolR] index; [HDFS] is used to store [Lily] content; packages ingested via the [EPP] UI are searched and accessed via the [E-ARK Web] UI.