Updates on the BHL Global Cluster
 biodiversity heritage library
   anthony goddard         phil cryer
Us?
  o We do this talk a lot... generally our shirts match.
What is the BHL?

  • BHL - The Biodiversity Heritage Library
    o digitization component of the Encyclopedia of Life
    o a consortium of global partners
    o aims to share historic biodiversity literature
    o provides open access to all content
    o free for all
Why do we need a cluster?

• All BHL data is at the Internet Archive in San Francisco
  o no redundancy
  o single point of failure (earthquake risk)
  o limited in how we could serve
  o no easy way to analyze data

• First global BHL cluster gives us
   o redundancy
   o no single point of failure
   o various new serving options
   o new ways to run analytics



  #win!
Use Linux and open source software running on
commodity hardware to create a scalable, distributed filesystem.
software
hardware
http://whbhl01.ubio.org/ganglia
# ls -lh /mnt/glusterfs/www/a/actasocietatissc26suom
total 649M
-rwxr-xr-x 1 www-data www-data 19M 2009-07-10 01:55    actasocietatissc26suom_abbyy.gz
-rwxr-xr-x 1 www-data www-data 28M 2009-07-10 06:53    actasocietatissc26suom_bw.pdf
-rwxr-xr-x 1 www-data www-data 1.3K 2009-06-12 10:21   actasocietatissc26suom_dc.xml
-rwxr-xr-x 1 www-data www-data 18M 2009-07-10 03:05    actasocietatissc26suom.djvu
-rwxr-xr-x 1 www-data www-data 1.3M 2009-07-10 06:54   actasocietatissc26suom_djvu.txt
-rwxr-xr-x 1 www-data www-data 14M 2009-07-10 02:08    actasocietatissc26suom_djvu.xml
-rwxr-xr-x 1 www-data www-data 4.4K 2009-12-14 04:42   actasocietatissc26suom_files.xml
-rwxr-xr-x 1 www-data www-data 20M 2009-07-09 18:57    actasocietatissc26suom_flippy.zip
-rwxr-xr-x 1 www-data www-data 285K 2009-07-09 18:52   actasocietatissc26suom.gif
-rwxr-xr-x 1 www-data www-data 193M 2009-07-09 18:51   actasocietatissc26suom_jp2.zip
-rwxr-xr-x 1 www-data www-data 5.7K 2009-06-12 10:21   actasocietatissc26suom_marc.xml
-rwxr-xr-x 1 www-data www-data 2.0K 2009-06-12 10:21   actasocietatissc26suom_meta.mrc
-rwxr-xr-x 1 www-data www-data 416 2009-06-12 10:21    actasocietatissc26suom_metasource.xml
-rwxr-xr-x 1 www-data www-data 2.2K 2009-12-01 12:20   actasocietatissc26suom_meta.xml
-rwxr-xr-x 1 www-data www-data 279K 2009-12-14 04:42   actasocietatissc26suom_names.xml
-rwxr-xr-x 1 www-data www-data 324M 2009-07-09 13:28   actasocietatissc26suom_orig_jp2.tar
-rwxr-xr-x 1 www-data www-data 34M 2009-07-10 04:35    actasocietatissc26suom.pdf
-rwxr-xr-x 1 www-data www-data 365K 2009-07-09 13:28   actasocietatissc26suom_scandata.xml
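
The listing above suggests that each item lives in a directory named after its identifier, grouped under a one-letter directory. A minimal sketch of that layout follows, assuming the one-letter directory is simply the first character of the identifier (the listing shows only one example, so this is a guess). The function names `item_directory` and `derived_file` are hypothetical.

```python
from pathlib import PurePosixPath


def item_directory(identifier, root="/mnt/glusterfs/www"):
    """Return the assumed storage directory for an item identifier.

    Assumption: items are grouped under a directory named after the
    first character of the identifier, as in the single example above.
    """
    if not identifier:
        raise ValueError("identifier must not be empty")
    return PurePosixPath(root) / identifier[0] / identifier


def derived_file(identifier, suffix, root="/mnt/glusterfs/www"):
    """Return the path of one file of an item, e.g. suffix='_djvu.txt'."""
    return item_directory(identifier, root) / f"{identifier}{suffix}"
```

For example, `derived_file("actasocietatissc26suom", "_djvu.txt")` points at the OCR text file shown in the listing.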
initial population
the plan
• Internet2 - woohoo
   o “This will take forever” (it took longer)
   o “We need more space” (not 24TB)
   o “something’s overloading the network” (oops)
   o “this checksum is wrong” (what the...)

• Lessons learned: would we do it again? Probably not.
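
The "this checksum is wrong" problem is the kind of thing a post-transfer check can catch. The sketch below is not the authors' code. It assumes each item ships a manifest such as `_files.xml` containing `<file name="..."><md5>...</md5></file>` entries, and the function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_checksums(manifest_xml):
    """Map file name to MD5 from the manifest (assumed layout, see above)."""
    root = ET.fromstring(manifest_xml)
    return {
        entry.get("name"): entry.findtext("md5")
        for entry in root.iter("file")
        if entry.findtext("md5")
    }


def find_mismatches(item_dir, manifest_xml):
    """List files that are missing or whose MD5 differs from the manifest."""
    bad = []
    for name, expected in expected_checksums(manifest_xml).items():
        target = Path(item_dir) / name
        if not target.is_file() or md5_of(target) != expected:
            bad.append(name)
    return sorted(bad)
```

Files reported by `find_mismatches` would be candidates for re-download rather than silent acceptance.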
code: grabbyd




        Internet Archive, San Francisco  --(1)--  BHL Global, Woods Hole
code: grabbyd_reporting




           http://cluster.biodiversitylibrary.org/
code: bhl-sync

        Open source Dropbox model

                   inotify

                   lsyncd

                 OpenSSH

                   rsync
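
The stack listed above ends in an rsync transfer over OpenSSH once a file change has been noticed. As a rough illustration of that last step only, the sketch below builds such a command. It is not the bhl-sync code: the function name, host name and paths are hypothetical. The flags used (`--archive`, `--compress`, `--rsh`, `--delete`) are standard rsync options.

```python
def build_rsync_command(source_dir, remote_host, remote_dir, delete=False):
    """Build an rsync-over-SSH command that mirrors source_dir to a remote node."""
    command = [
        "rsync",
        "--archive",   # preserve permissions, times and directory structure
        "--compress",  # compress data in transit
        "--rsh=ssh",   # run the transfer through OpenSSH
    ]
    if delete:
        # Remove remote files that no longer exist locally.
        command.append("--delete")
    # Trailing slash: copy the directory's contents, not the directory itself.
    command.append(source_dir.rstrip("/") + "/")
    command.append(f"{remote_host}:{remote_dir.rstrip('/')}/")
    return command
```

The returned list can be passed to `subprocess.run`. Leaving `delete` off by default is a deliberate choice here, since propagating deletions is one of the replication questions the talk raises later.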
all of our created code is open sourced
       and available at bit.ly/bhl-bits
http://bit.ly/09-bhl-sync
Replication | Replication
BHL content distribution


                                    1                                    ?
  Internet Archive, San Francisco           BHL Global, Woods Hole             BHL China, Beijing




                                        2               2            ?




         BHL, St. Louis                      BHL Europe, London              BHL Australia, Melbourne
BHL content + local data



  Internet Archive, San Francisco           BHL Global, Woods Hole     BHL China, Beijing




                              Content sourced from China, scanned by
                            Internet Archive, replicated into BHL Global
BHL content + regional data



  Internet Archive, San Francisco       BHL Global, Woods Hole




                                    ?




       BHL Europe, Paris                 BHL Europe, London      BHL Europe, Berlin




              Content sourced from BHL Europe partners may, or may
              not, be passed back to Internet Archive and BHL Global
other replication challenges

• deleting content - "going dark"
• new content coming in from other sources (localization of content)
• distributing modified content 
fedora-commons integration
Repository platform
• storage, access and management of digital content
• a base for software developers to build tools for sharing
• free, community supported, open source software

• Maintains a persistent, stable, digital archive
  o provides backup, redundancy and disaster recovery
  o complements existing architecture by incorporating open standards
  o stores data in a neutral manner
  o shares data via OAI
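
Assuming "OAI" here means the OAI-PMH harvesting protocol, a harvester pulls records from the repository with plain HTTP requests. The sketch below only builds a `ListRecords` request URL. The base URL is a placeholder and the function name is hypothetical. The `verb`, `metadataPrefix`, `from` and `resumptionToken` arguments come from the OAI-PMH specification.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Selective harvesting: only records changed since this date.
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"
```

A node that harvests regularly would pass `from_date` to fetch only what changed since its last run, and follow `resumption_token` values until the repository stops returning one.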
BHL content distribution



  Internet Archive, San Francisco                    BHL Global, Woods Hole                        Fedora-commons




                                    BHL, St. Louis                            BHL Europe, London
BHL content distribution



  Internet Archive, San Francisco              BHL Global, Woods Hole                    Fedora-commons




                                                                        OAI




                                    BHL node                            Fedora-commons
computational services
thanks.
     anthony goddard                           phil cryer




                all code available at bit.ly/bhl-bits
          presentation slides on slidesha.re/bhl-slides

Updates on the BHL Global Cluster

  • 1. Updates on the BHL Global Cluster biodiversity heritage library anthony goddard phil cryer
  • 2. Us? o We do this talk a lot.. generally our shirts match.
  • 3. What is the BHL? • BHL - The Biodiversity Heritage Library o digitization component of the Encyclopedia of Life o a consortium of global partners o aims to share historic biodiversity literature texts o provide open access to all content o free for all
  • 4. Why do we need a cluster? • All BHL data is at the Internet Archive in San Francisco o no redundancy o single point of failure (earthquake risk) o limited in how we could serve o no easy way to analyze data • First global BHL cluster gives us o redundancy o no single point of failure o various new serving options o new ways to run analytics #win!
  • 5.
  • 6.
  • 7. Use Linux and open source software running on commodity hardware to create a scalable, distributed filesystem.
  • 11. # ls -lh /mnt/glusterfs/www/a/actasocietatissc26suom
    total 649M
    -rwxr-xr-x 1 www-data www-data 19M 2009-07-10 01:55 actasocietatissc26suom_abbyy.gz
    -rwxr-xr-x 1 www-data www-data 28M 2009-07-10 06:53 actasocietatissc26suom_bw.pdf
    -rwxr-xr-x 1 www-data www-data 1.3K 2009-06-12 10:21 actasocietatissc26suom_dc.xml
    -rwxr-xr-x 1 www-data www-data 18M 2009-07-10 03:05 actasocietatissc26suom.djvu
    -rwxr-xr-x 1 www-data www-data 1.3M 2009-07-10 06:54 actasocietatissc26suom_djvu.txt
    -rwxr-xr-x 1 www-data www-data 14M 2009-07-10 02:08 actasocietatissc26suom_djvu.xml
    -rwxr-xr-x 1 www-data www-data 4.4K 2009-12-14 04:42 actasocietatissc26suom_files.xml
    -rwxr-xr-x 1 www-data www-data 20M 2009-07-09 18:57 actasocietatissc26suom_flippy.zip
    -rwxr-xr-x 1 www-data www-data 285K 2009-07-09 18:52 actasocietatissc26suom.gif
    -rwxr-xr-x 1 www-data www-data 193M 2009-07-09 18:51 actasocietatissc26suom_jp2.zip
    -rwxr-xr-x 1 www-data www-data 5.7K 2009-06-12 10:21 actasocietatissc26suom_marc.xml
    -rwxr-xr-x 1 www-data www-data 2.0K 2009-06-12 10:21 actasocietatissc26suom_meta.mrc
    -rwxr-xr-x 1 www-data www-data 416 2009-06-12 10:21 actasocietatissc26suom_metasource.xml
    -rwxr-xr-x 1 www-data www-data 2.2K 2009-12-01 12:20 actasocietatissc26suom_meta.xml
    -rwxr-xr-x 1 www-data www-data 279K 2009-12-14 04:42 actasocietatissc26suom_names.xml
    -rwxr-xr-x 1 www-data www-data 324M 2009-07-09 13:28 actasocietatissc26suom_orig_jp2.tar
    -rwxr-xr-x 1 www-data www-data 34M 2009-07-10 04:35 actasocietatissc26suom.pdf
    -rwxr-xr-x 1 www-data www-data 365K 2009-07-09 13:28 actasocietatissc26suom_scandata.xml
  • 13.
  • 14.
  • 15. the plan • Internet2 - woohoo o “This will take forever” (it took longer) o “We need more space” (not 24TB) o “something’s overloading the network” (oops) o “this checksum is wrong” (what the...) • Lessons learned: would we do it again? Probably not.
  • 16. code: grabbyd 1 Internet Archive, San Francisco BHL Global, Woods Hole
  • 17. code: grabbyd_reporting http://cluster.biodiversitylibrary.org/
  • 18. code: bhl-sync Open source Dropbox model inotify lsyncd OpenSSH rsync
  • 19. all of our created code is open sourced and available at bit.ly/bhl-bits
  • 22. BHL content distribution 1 ? Internet Archive, San Francisco BHL Global, Woods Hole BHL China, Beijing 2 2 ? BHL, St. Louis BHL Europe, London BHL Australia, Melbourne
  • 23. BHL content + local data Internet Archive, San Francisco BHL Global, Woods Hole BHL China, Beijing Content sourced from China, scanned by Internet Archive, replicated into BHL Global
  • 24. BHL content + regional data Internet Archive, San Francisco BHL Global, Woods Hole ? BHL Europe, Paris BHL Europe, London BHL Europe, Berlin Content sourced from BHL Europe partners may, or may not, be passed back to Internet Archive and BHL Global
  • 25. other replication challenges • deleting content - "going dark" • new content coming in from other sources (localization of content) • distributing modified content 
  • 26. fedora-commons integration Repository platform • storage, access and management of digital content • a base for software developers to build tools for sharing • free, community supported, open source software
  • 27. fedora-commons integration Repository platform • storage, access and management of digital content • a base for software developers to build tools for sharing • free, community supported, open source software • Maintains a persistent, stable, digital archive o provides backup, redundancy and disaster recovery o complements existing architecture by incorporating open standards o stores data in a neutral manner o shares data via OAI
  • 28. BHL content distribution Internet Archive, San Francisco BHL Global, Woods Hole Fedora-commons BHL, St. Louis BHL Europe, London
  • 29. BHL content distribution Internet Archive, San Francisco BHL Global, Woods Hole Fedora-commons OAI BHL node Fedora-commons
  • 30. BHL content distribution Internet Archive, San Francisco BHL Global, Woods Hole Fedora-commons OAI BHL node Fedora-commons
  • 32. thanks. anthony goddard phil cryer all code available bit.ly/bhl-bits presentation slides on slidesha.re/bhl-slides
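
The bhl-sync slide describes a Dropbox-style model built on inotify, lsyncd, OpenSSH and rsync: something notices a change on one node and kicks off a sync to the others. A minimal sketch of that idea in Python follows. It is not the bhl-sync code itself. It replaces inotify with a simple directory snapshot comparison so that it runs anywhere, and the function names, the remote target in the usage note, and the choice of rsync options (including `--delete`) are all assumptions made for illustration.

```python
import os
from typing import Dict, List, Set


def snapshot(root: str) -> Dict[str, float]:
    """Map each file under root (as a relative path) to its modification time."""
    state: Dict[str, float] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            state[os.path.relpath(full, root)] = os.stat(full).st_mtime
    return state


def changed_paths(before: Dict[str, float], after: Dict[str, float]) -> Set[str]:
    """Return paths that were added, modified, or removed between two snapshots."""
    added_or_modified = {p for p, mtime in after.items() if before.get(p) != mtime}
    removed = set(before) - set(after)
    return added_or_modified | removed


def rsync_command(source_dir: str, remote: str) -> List[str]:
    """Build an rsync invocation that mirrors source_dir to remote over SSH.

    -a keeps permissions and timestamps, -z compresses in transit,
    --delete removes files on the remote that no longer exist locally
    (an assumption here; a real deployment may not want this).
    """
    # A trailing slash tells rsync to copy the directory's contents,
    # not the directory itself.
    source = source_dir.rstrip("/") + "/"
    return ["rsync", "-az", "--delete", "-e", "ssh", source, remote]
```

In use, a loop would take a snapshot, wait, take another, and pass `rsync_command(...)` to `subprocess.run` only when `changed_paths` is non-empty, for example with a hypothetical target such as `bhl-node:/mnt/glusterfs/www/`. Event-driven watching through inotify avoids rescanning a large tree, which matters at this data volume.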

Editor's Notes

  1. PHIL/ANT Ohai! This is Phil, this is Ant.. Since last year’s TDWG we’ve taken this talk on the road, here’s an update for those who were here last year, an introduction for those who weren’t.
  2. ANT if anyone does not know, BHL is a digitization component of the Encyclopedia of Life, a consortium of global libraries and natural history museums; it aims to digitize and share historic biodiversity literature texts, open access and free for all
  3. PHIL currently all BHL data is stored at IA, which is bad for many reasons - with our own cluster we have control and new options for how we use and store our data
  4. PHIL so how did we get from a proof of concept, put together with some various, outdated hardware
  5. PHIL to our formal production cluster
  6. PHIL with our first global cluster, our concept was to create a scalable storage system to store and serve our data, that others *could* emulate, using open source software
  7. PHIL our systems run Debian Linux, using the latest filesystem, ext4, which supports far larger file systems (up to 1 exabyte) and file sizes (16TB). We use the GlusterFS distributed/networked filesystem to handle the replication
  8. ANT cluster contains 6 boxes like this, 24 hard drives per, broken up and mirrored via a networked clustered storage. This gives us 216TB raw space, or 108TB of usable, mirrored data.
  9. ANT it’s a cluster.. six boxes but we see it as one giant machine.. 64GB RAM, 100TB Hard drive, 48 processors
  10. PHIL just an example of a record type that we store, all of the derivative files of a book can range anywhere from 200MB to over 3TB. Here’s an average record, and it’s about 650MB, the size of a standard cd-rom (our mirror currently has over 80k such records)
  11. ANT we looked at different ways of transferring the files from Internet archive to our own cluster
  12. ANT after considering all the options, it was decided to download the data from IA
  13. ANT this shows some of the downloading in progress (250MB/sec), all told we have downloaded 74TB so far - had some problems...
  14. ANT talk about the 1st, 2nd and 3rd one PHIL do the 4th, and the ‘lessons learned’
  15. PHIL parts of the code that did the initial download have been reworked to be an ongoing process; grabbyd will handle downloading new items from IA to the cluster weekly
  16. PHIL we currently have reporting to give updates on download progress, overall size of the data and transfer rates. This will be expanded as we go forward
  17. PHIL to keep various nodes in sync we’ve written a backend ‘open source’ Dropbox-like server application. Using other software we can have a service listening for any changes and kicking off the syncing scripts
  18. ANT all of our code that we write is available as open source software, hosted on the BHL code repository
  19. ANT We have begun initial speed and sync tests within the US and to London, work will be starting on these tests to Australia shortly
  20. ANT the global aspect of BHL has become more clear now after last week’s global meeting, with Egypt, Brazil joining others like China, AU, and EU
  21. ANT There are many options for syncing, due to the degree of control we require, we chose to use IA as a point of data ingestion, Woods Hole as a master site to seed data from
  22. ANT but, in the case of China, we ingest data into IA and then sync that data to our cluster in Woods Hole - so our model is flexible
  23. ANT In the case of BHL-Europe, content may or may not be ingested via IA, depending on the desire for BHL-Europe to take advantage of IA services such as OCR
  24. PHIL there are other challenges such as deleting content or content “going dark”, localization of content and especially how to deal with modified or annotated content
  25. PHIL to track changes to the content we’re using Fedora-commons, which provides access and management of digital content, and is a base to build other apps on to use the data in other ways
  26. PHIL Fedora maintains a persistent archive, used for backup and disaster recovery for the files, and complements the existing architecture by using open standards and not requiring anything of the existing system. It offers more sharing options via OAI
  27. PHIL as seen in the mix, Fedora runs independently of the system
  28. PHIL while it could provide a conduit to share metadata about the archive
  29. PHIL and can even talk 1:1 with another Fedora instance
  30. ANT we have this hardware, and we intend to use it for computation services such as taxon name finding and text mining. PHIL we have tested running Hadoop on our cluster, and statistical jobs in R have been run in Missouri; we’re looking to integrate the cluster for this
  31. PHIL/ANT in closing, while the BHL global cluster serves a certain purpose, we’d like to highlight that anyone could build a similar cluster in many ways, and for almost no money; contact us for any advice or assistance. Thanks
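
The capacity figures in note 8 (six boxes, 24 drives each, 216TB raw, 108TB usable once mirrored) can be checked with a few lines of arithmetic. The sketch below does only that. The replica count of 2 is our reading of "mirrored", and the per-drive size it prints is implied by the other numbers, not stated anywhere in the notes.

```python
BOXES = 6            # from note 8
DRIVES_PER_BOX = 24  # from note 8
RAW_TB = 216         # from note 8
REPLICAS = 2         # assumption: "mirrored" means two copies of every file


def drive_count(boxes: int, drives_per_box: int) -> int:
    """Total number of drives across the cluster."""
    return boxes * drives_per_box


def usable_tb(raw_tb: float, replicas: int) -> float:
    """Usable space when every file is stored `replicas` times."""
    return raw_tb / replicas


def implied_drive_tb(raw_tb: float, drives: int) -> float:
    """Per-drive capacity implied by the raw total (derived, not from the notes)."""
    return raw_tb / drives


if __name__ == "__main__":
    drives = drive_count(BOXES, DRIVES_PER_BOX)
    print(f"drives: {drives}")
    print(f"usable: {usable_tb(RAW_TB, REPLICAS)} TB")
    print(f"implied drive size: {implied_drive_tb(RAW_TB, drives)} TB")
```

The usable figure matches the 108TB quoted in the notes, which is consistent with two-way mirroring across the six boxes.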