5. Access
• Web portal just one way into
the BHL data
• Can search OCR text for
species names
• Full text search (coming soon)
• Users can develop own tools
to access data
• CiteBank & BioStor are two
examples
• BHL data available using open
standards and linked to semantic web
An elementary manual of New Zealand entomology
London,West, Newman & Co.,1892.
biodiversitylibrary.org/item/34950
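The idea of searching OCR text for species names can be illustrated with a deliberately naive sketch. This is not how any real name-finding service works; the genus list, the function name and the sample sentence are all assumptions made up for illustration, and a real tool would draw on a large taxonomic name index.

```python
import re

# Hypothetical, tiny genus list; a real name finder relies on a large
# taxonomic name index rather than a few hand-picked entries.
KNOWN_GENERA = {"Apteryx", "Hemideina", "Trochilus"}

# A capitalised word followed by a lower-case word, e.g. "Apteryx australis".
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]+)\b")


def find_species_names(ocr_text, known_genera=KNOWN_GENERA):
    """Return candidate Latin binomials whose genus is in known_genera."""
    names = []
    for match in BINOMIAL.finditer(ocr_text):
        genus, epithet = match.groups()
        # Filtering on the genus discards ordinary phrases such as "Specimens of".
        if genus in known_genera:
            names.append(f"{genus} {epithet}")
    return names


sample = "Specimens of Apteryx australis were collected near Hemideina thoracica."
print(find_species_names(sample))
```

Filtering on a known genus keeps the pattern from reporting every capitalised word that happens to be followed by a lower-case one.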
6. Copyright
• Majority of works in Public Domain
• Public Domain considered to be
> 90 years from publication
• What’s in the public domain
stays in the public domain.
• Permission sought by contributors
for in copyright material.
• Use of in copyright content
licensed
L'histoire naturelle des estranges poissons marins
A Paris :De l'imprimerie de Regnaud Chaudiere,1551.
biodiversitylibrary.org/page/4748789
7. Audience
• Scientific Community
• Need authoritative reference texts
• Need to find data
• General interest
• Want to see what’s interesting
and attractive
• Need to know what to look for
• Web developers
• Need a challenge!
• Need open data standards.
A monograph of the Trochilidæ, or family of humming-birds /.
London :Printed by Taylor and Francis ;1861 [i.e. 1849-1861].
biodiversitylibrary.org/page/34843253
9. Digitisation Workflow
• Images
• Captured as RAW
• Converted to 8 bit uncompressed TIFF
• Processed TIFFs saved as
archival copy
• Compressed JPEG 2000
uploaded to IA
• Metadata
• Bibliographic record as MODS
• Page metadata exported from
Macaw as XML
Ornithological miscellany V.1
London :Trübner and Co., Bernard Quaritch, R.H. Porter,1876-1878
biodiversitylibrary.org/item/108982
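As a rough illustration of the "bibliographic record as MODS" step, the sketch below builds a minimal MODS fragment for the book in the caption above using only the Python standard library. The function name and the choice of fields are assumptions; a production record would carry many more elements than this.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"  # MODS version 3 namespace
ET.register_namespace("mods", MODS_NS)


def build_mods(title, place, publisher, date_issued):
    """Build a minimal MODS element with title and origin information."""

    def tag(name):
        return f"{{{MODS_NS}}}{name}"

    mods = ET.Element(tag("mods"))
    title_info = ET.SubElement(mods, tag("titleInfo"))
    ET.SubElement(title_info, tag("title")).text = title

    origin = ET.SubElement(mods, tag("originInfo"))
    place_el = ET.SubElement(origin, tag("place"))
    ET.SubElement(place_el, tag("placeTerm"), type="text").text = place
    ET.SubElement(origin, tag("publisher")).text = publisher
    ET.SubElement(origin, tag("dateIssued")).text = date_issued
    return mods


record = build_mods(
    title="Ornithological miscellany",
    place="London",
    publisher="Trübner and Co., Bernard Quaritch, R.H. Porter",
    date_issued="1876-1878",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```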
11. Macaw
• Developed by Smithsonian Libraries ….
• Simple metadata creation for book
and page items
• Upload directly to
Internet Archive
• … Hacked by MV
• Simplified workflow
• New styling
• Multiple contributor upload
13. Thank you
Joe Coleman
jcoleman@museum.vic.gov.au
http://bhl.ala.org.au
Editor's Notes
Welcome slide
Introduction. Every scientist stands on the shoulders of those who have gone before. In particular, taxonomy and bioinformatics are disciplines that involve as much time in the library as in the laboratory. The study of the species that make up the world’s biodiversity requires reference to a large body of biological literature, much of it spanning centuries of research.
Libraries. Academic libraries house large collections of this literature, and natural history museums in particular hold very focused catalogues of literature pertaining to the scope of their collections. A terrific resource these may be, but they have one major drawback: the library collections are usually not located where the biologist would most like to be, out collecting in the field. Furthermore, as anyone familiar with Murphy’s law would agree, the one book we most desire when undertaking research is the one that is missing, on loan, or was deemed not to fit the acquisition policies devised by the management of the day.
Introducing BHL. The Biodiversity Heritage Library was begun as a way of solving some of these problems. It started as a consortium of some of the heavyweight natural history museums, herbaria and academic libraries of the UK and North America. The goal of the Biodiversity Heritage Library has been to digitise as much of the literature relating to the biological sciences as possible and make it accessible online. Underlying this endeavour is the philosophy that the body of knowledge contained within this literature makes up the legacy of human understanding about the world we live in, and as such should be freely available to everyone.
Partners. With founding members such as the Smithsonian Institution, the Natural History Museum in London, Kew Gardens and the Woods Hole marine institute, a fair portion of the literature (at least from the Northern Hemisphere) has already been scanned. There are currently around 103 thousand volumes online and the number is growing. The project aims to be truly global in its reach and has since expanded to include collections from Continental Europe (in affiliation with the Europeana project), China, Brazil, Australasia and, most recently, a partnership in Africa. Each affiliated project contributes technical services, content or both.
Internet Archive. Central to the success of the BHL has been a partnership with the Internet Archive. The Archive is a not-for-profit organisation tasked with just the small mission of providing universal online access to recorded knowledge. It already hosts a vast amount of digitised literature, has the resources and expertise to host the content of the BHL, and has assisted with much of the scanning. Once a book has been uploaded, the Archive’s internal processes create the OCR text and derivatives for download or online viewing. In reality, the BHL comprises two distinct projects sharing the same objective of open access to knowledge: one project that digitises literature, and an online project that develops the systems to deliver the content to its audience.
Access. With such a large amount of data online, the issue of access changes from a question of how to get hold of a resource to one of how to find the information I want and how to share it with others. For many users, the first contact with the BHL might be through the various portal sites, which reflect the regional contributions. In addition to standard title, subject and author searches, these search functions are tuned to the requirements of the local scientific community. For example, the search results for a particular region may place an emphasis on species or publications from that area. Built in to the BHL is a link to uBio’s TaxonFinder resource, which scours the OCR text of the BHL for species names and can return results based on a plant or animal’s Latin name. A new feature under development by the Australian node’s partners at the CSIRO is a full text search of the entire collection. It uses an enhanced Lucene search to produce results very quickly from the complete text, and it’s scheduled to be available on the Australian site in beta form in the next couple of weeks. Further tools for the researcher are available through the CiteBank service, which generates customised bibliographies based on species citations and allows the user to build up a personalised library. The BioStor project is a service developed by a BHL user to find the extents of articles within journals and link back to them using OpenURLs. The BHL is committed to sharing data using open web standards. It’s important for the success of the project that the data held by the BHL is used widely and in creative ways. Book metadata is published via OAI-PMH, and queries return metadata in either Dublin Core or MODS format. The books themselves can be referenced by DOI or persistent URL, and individual pages can also be accessed by OpenURLs.
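To give a feel for what harvesting metadata over OAI-PMH involves, here is a minimal sketch that only builds a request URL. The `ListRecords` verb, the `metadataPrefix` argument and the `oai_dc` prefix for Dublin Core come from the OAI-PMH protocol itself; the base URL is a placeholder, the `mods` prefix is an assumed name for the MODS format, and the function name is made up for this example. The real endpoint and supported prefixes should be taken from the developer documentation.

```python
from urllib.parse import urlencode

# Placeholder endpoint, not a real BHL address: substitute the one
# given in the developer documentation.
OAI_BASE_URL = "https://example.org/oai"

# "oai_dc" is the protocol's standard Dublin Core prefix; "mods" is an
# assumed prefix name for MODS records.
SUPPORTED_PREFIXES = ("oai_dc", "mods")


def list_records_url(metadata_prefix="oai_dc", base_url=OAI_BASE_URL):
    """Build an OAI-PMH ListRecords request URL for the chosen format."""
    if metadata_prefix not in SUPPORTED_PREFIXES:
        raise ValueError(f"unsupported metadata prefix: {metadata_prefix}")
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


print(list_records_url("mods"))
```

Fetching the URL returns an XML response that a harvester would then page through; that part is left out here to keep the sketch free of network calls.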
Copyright. Before I go any further, I had better mention the C word. The BHL walks the copyright tightrope pretty carefully. It has to, because no library wants to be embroiled in a copyright suit with publishers, and the BHL relies on the goodwill of its contributors to succeed. The ‘Heritage’ part of the Biodiversity Heritage Library indicates the historical nature of the library’s collection. Some of the digitised titles in the collection date back to the fifteenth century, and about 70 per cent is over a hundred years old and in the public domain. We get into murky territory with local differences in the extent of public domain, so as a precaution the default limit has been 90 years from the date of publication. The remainder has either been published copyright free or the contributing institution has explicitly sought the permission of the copyright holder to put the material online. We want people to be able to access the combined literary resources of some of the world’s great natural history collections, and it is hoped that they will be able to put this resource to good use. To this end, a memorandum of understanding has been signed between all participants to the effect that all material currently in the public domain that has been made available by the Biodiversity Heritage Library remains in the public domain, and no party shall claim intellectual property rights over the original or any derivative version. Documents currently in copyright for which permission has been granted for digital representation are made available under a Creative Commons Non-Commercial, Share Alike 3.0 license. Certainly, the more recent publications there are online in the BHL, the more useful it can be to scientific research, so we’re actively engaged in negotiating with the copyright holders to gain permission to extend the holdings of some of our targeted content.
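The 90-year default can be written down as a one-line screening check. This is only a sketch of the precautionary rule described above, not a legal test; the function and parameter names are assumptions for illustration.

```python
DEFAULT_LIMIT_YEARS = 90  # precautionary default from the notes above


def is_public_domain_by_default(publication_year, current_year):
    """Return True when a work was published more than 90 years ago."""
    return current_year - publication_year > DEFAULT_LIMIT_YEARS


# An 1892 title checked in 2012 is 120 years old, so it passes the default.
print(is_public_domain_by_default(1892, 2012))
```

Anything that fails the check is not necessarily in copyright; it simply falls outside the default and needs its status confirmed or permission sought.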
Audience. So who is the audience for the BHL? Without a doubt, the largest segment of users of the BHL is the scientific community; many researchers in the field of taxonomy depend on the BHL as a major resource. These people need access to authoritative bibliographies and first descriptions of species, and digital access saves an enormous amount of time and headaches. A second segment of the BHL’s audience is the people who appreciate the art of the scientific illustrations, and the bibliophiles for whom such a deep collection of historical books represents hours of fascination.
Developers. Finally, another group we would like to encourage is the developers and metadata junkies who can creatively reuse and mash up the bibliographic data and the book content to develop their own technology projects. These are the people who add value to the collection, who surprise us by re-imagining the knowledge residing in the historical literature and presenting it in novel ways. To aid these people, and those who wish to mine the dataset for research, the BHL has a published API which allows access to metadata and content using open web standards. Documentation can be found on the developer page on the BHL portals.
BHL Australia. In this part of the world, the local branch of the BHL was set up to provide the literature service for the Atlas of Living Australia and is being coordinated from Melbourne by Museum Victoria. Since the middle of last year, we have put in place a local portal to the collection, developed software to facilitate our regional content contribution, and designed the workflow for digitisation of books from local libraries. The focus so far has been on accessing Australian collections, but I’m keen on expanding our scope to include literature from elsewhere in the region, especially New Zealand.
Small scale scanning. Our scanning operation is intentionally small scale: thousands of titles have been scanned already by the big libraries, including many publications from Australia and New Zealand. We don’t have the resources to digitise large volumes of material, but we can target our operation to fill the gaps that the big guys have left. To direct our scanning effort, we’ve developed a website where we encourage our users within the scientific community to nominate titles and vote on the priority of our scanning list. We initially seeded our database with a bibliography obtained from the Australian government species registers and ordered the list according to number of citations. The initial list came in at over 7 thousand titles, but we’ve had to pull out quite a few duplicates. We allow users to vote with a simple ‘like’ type system to add a weighting to a title and move it up the list. Similarly, titles added by users carry a greater weight than those on the seed list, so hopefully our scanning list reflects something close to the preferences of our community of users. In general, we’re focussing on completing the runs of locally published serial titles that are represented in the BHL but have incomplete holdings. Then we’re targeting the small niche publications, such as those put out by the amateur naturalist societies.
We are also hoping to digitise some of the beautiful rare books in our collections, which may be hard to obtain, have hand-coloured illustrations or are in some way unique. We began digitising titles from Museum Victoria’s library using our Bookdrive Pro copy stand just before Christmas, but only really began in earnest in February when we put in place a volunteer programme to do the image capture and post-processing. In that time we’ve digitised about seventy volumes, and fifty of those have been uploaded to the BHL so far. This is tiny compared to what we have on our bid list, but we’re making headway.
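The nomination and voting scheme for the scanning list described above can be sketched as a simple scoring function. The actual weights used on the website are not given in these notes, so the two constants below, the class and function names, and the sample titles are all hypothetical.

```python
from dataclasses import dataclass

VOTE_WEIGHT = 1.0            # hypothetical weight added per 'like'
USER_NOMINATION_BONUS = 5.0  # hypothetical extra weight for user-added titles


@dataclass
class Title:
    name: str
    citations: int = 0            # citation count from the seed bibliography
    votes: int = 0                # number of 'likes' from users
    user_nominated: bool = False  # True if a user added the title

    def score(self):
        """Combine citations, votes and the nomination bonus into one number."""
        bonus = USER_NOMINATION_BONUS if self.user_nominated else 0.0
        return self.citations + VOTE_WEIGHT * self.votes + bonus


def scanning_order(titles):
    """Return titles sorted from highest to lowest priority."""
    return sorted(titles, key=lambda t: t.score(), reverse=True)


queue = [
    Title("Seed title A", citations=3),
    Title("Seed title B", citations=1, votes=4),
    Title("Nominated title C", votes=1, user_nominated=True),
]
for title in scanning_order(queue):
    print(title.name, title.score())
```

With these made-up weights, a user-nominated title with a single vote outranks a seed title with three citations, which is the behaviour the notes describe.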
Digitisation at MV. When I started out on the digitisation project, I didn’t realise just what a manual process it was going to be. The image capture is very hands on and quite physical, but on a good day we can get through about 1000 pages in an hour. When we were selecting a digitisation platform, we chose the Bookdrive system because it allowed pages to be photographed flat without unbinding the volume. This is particularly important for our conservators if we are to digitise our rare books. Once we’ve imaged a book, the files are batch processed out of Adobe Bridge from camera Raw to TIFF. Each file is then opened in Photoshop and individually cropped and straightened. I’ve evaluated a number of different solutions for the image processing, but from what I’ve experienced so far, the best results and the most efficient process have come from doing the post-processing in Photoshop. This stage takes by far the longest, so we have two workstations set up processing files from the one capture rig.
Volunteers. Since the beginning of February, most of the digitisation has been carried out by volunteers from the Museum’s volunteer programme. We have six people who have committed to the project until the end of July, and they have been operating the image capture system and carrying out the post-processing. When they started, they all had only basic computer skills and were a bit overwhelmed by the amount they had to learn. We provided training and, to begin with, supervised them closely until they became familiar with the process, but as their confidence in operating the machinery has grown, so has their output. Lately, on average, we have been getting through about four books per volunteer a day, and we have two volunteers each day for three days a week. They’re a terrific group of people from very different backgrounds, but all have become very passionate about their contribution to the BHL and are keen to continue with the project. Owing to the fairly physical and repetitive aspect of the imaging, we have established a buddy system where they are paired up and swap jobs at regular intervals. Each pair is responsible for digitising their allocated books for the day, and they usually exceed their targets. I haven’t even had to bribe them, but to keep them interested we have regular morning teas as well as special viewings of the rare books and collection areas of the Museum.
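The volunteer figures above imply a weekly throughput that is easy to work out. The sketch below just multiplies the three averages quoted in the notes; the function name is made up, and the result is a recent average rather than a guaranteed rate.

```python
BOOKS_PER_VOLUNTEER_PER_DAY = 4  # "about four books per volunteer a day"
VOLUNTEERS_PER_DAY = 2           # "two each day"
DAYS_PER_WEEK = 3                # "three days a week"


def weekly_throughput(books=BOOKS_PER_VOLUNTEER_PER_DAY,
                      volunteers=VOLUNTEERS_PER_DAY,
                      days=DAYS_PER_WEEK):
    """Return the approximate number of books digitised per week."""
    return books * volunteers * days


print(weekly_throughput())  # 24
```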
Future. So what of the future for the BHL? Unfortunately, in many parts of the world money is getting hard to come by for further digitisation projects. Right now, the digitisation operations are winding down among most of the US contributors and in Europe. But this doesn’t mean that the collection will remain static. The Smithsonian is the exception to the rule, and I believe that they are continuing scanning at a cracking pace, while many of the recent contributions have been coming from China. In the US, the technical team has just received a grant to develop crowdsourcing tools to correct OCR text and to identify and describe the illustrations from the collection, so there will still be plenty of development going on. The Australasian branch runs out of money toward the end of this year, and at the moment we are seeking new sources of funding. In spite of this, we are well positioned to continue our contribution. With the volunteer programme in place, and once the Macaw portal is set up, I am hopeful that we will continue digitising and uploading content even if other parts of the project are scaled back.
Conclusion. The world’s biosphere is changing at a rate unprecedented in human experience, and yet new species are still being discovered. If scientists are to understand the life that exists on earth today, they must have access to the documentary legacy of the research that has gone before. The Biodiversity Heritage Library plays an important role in giving access to this literature for the benefit of science. The more complete the library, the greater use it will be to scientists in the future.