Rose Holley, Manager of Trove and Australian Newspapers at the National Library of Australia, provided an overview of the Australian Newspapers service and the digitization process. The presentation highlighted that over 20 million newspaper articles from major titles in each state and territory, published between 1803 and 1954, have been digitized and are fully text searchable online. Volunteers have helped improve the digitized text by correcting over 17 million lines. The service has been widely used and praised for making historic Australian newspaper archives accessible online.
Overview of the Australian Newspapers Digital Collection August 2010
1. An overview of the Australian Newspapers service, August 2010. Presenter: Rose Holley, Manager - Trove and Australian Newspapers, National Library of Australia. Australian Law Librarians Association Event, 28 August 2010: Legal research using digitised historic Australian Newspapers
46. [email_address] Rose The site you manage is a nightmare! It’s addictive. Keeps me awake at night. Congratulations! Mary Questions?
Editor's Notes
Every state and territory library in Australia is involved in this national program. By 2011 we will have digitised 4 million newspaper pages, which is about 40 million articles. The Sydney Morning Herald will comprise 600,000 pages of this. Each state has selected a major daily newspaper to begin with. We are working in collaboration with the state and territory libraries to digitise these first 4 million pages, since many of them own the microfilm copies of the newspapers from which we create the digital images. Regional newspapers will be included from this year onwards. Regional titles are being contributed from libraries around Australia.
The purpose of the ANDP is to provide access to historic Australian newspapers in an online environment. Key features of this service will be that it is accessible online, access is free, and the service is full-text searchable.
At this stage, the ANDP was approved by the Minister in November 2006. A contract was signed in March 2007, with processing beginning in July. By December 2007, 500,000 pages had been digitised, with another 2.5 million pages to be digitised over the next 3 years.
In addition to the titles selected for the program, the National Library has received a $1 million grant from the Vincent Fairfax Family Foundation to digitise The Sydney Morning Herald through to 1954. This project will run concurrently with the ANDP, and the pages will be included in the delivery system being developed for the ANDP.
This is the home page of Australian Newspapers beta. Users can either keyword search or browse by date, title or state. The service is being heavily used with around 28,000 keyword searches per day and an unknown number of browses. We have not widely publicised the service as yet since we are still in a beta version.
This is the article view. Users can zoom in or out and choose to view the article in the context of the entire page. They can also navigate to any other page within the newspaper issue. The electronically generated text created through the OCR process is displayed on the left hand side. This is also where users can use the 3 enhancement features. They can drag the viewing pane to see more or less of the text. Users can tag the article with keywords and they can write comments and notes about the article. If users log in they will be able to choose to make their tags and comments public or private, so they can share their comments with all users or add their own private research notes that only they can access. One feature that we believe is innovative, and not available in any other online newspaper service, is the ability for the user to correct the electronically generated text. There are a number of reasons why the electronically created text is not always 100% accurate, mainly the quality of the original newspaper that the image was created from. Users can correct the text by clicking on the ‘Help fix this text’ button. We will now use these features on this article. The article we are looking at is the first report in an Australian paper of the sinking of the Titanic. It’s in the Northern Territory Times of 19 April 1912.
I want to tag the article with ‘Titanic sinking’. If a user does not log in when they first enter the service, then the first time they want to enhance an article they will be offered the option to log in. At this point they can either log in or enter the captcha to verify they are human (and not a robot attempting to do something undesirable). Once logged in or verified with the captcha, a user can enter their tags.
Now I want to add a comment. Those of you who read this article may have noticed that it reported that all passengers were safely rescued from the Titanic and the weather was calm. I’ll just add a comment to say this was unfortunately not the case.
Now I have zoomed in on the image and if the OCR text was inaccurate I would edit it in the box on the left. This is what we call the power edit mode. In this article the text is actually very accurate so has either OCR’d very well, or already been corrected by someone else.
Now we can review the article with all the enhancements we have made showing on the left: tags, comments and corrections. We can view the history of all the enhancements (both ours and other people’s).
One of the innovative features in the first release was the ability for members of the public to correct or enhance the OCR text. When digitising old newspapers, the process is to convert a digital image into full text by use of Optical Character Recognition (OCR) software. This works well on new, clear documents, but on old newspapers, where the font and paper are of poor quality and microfilms may be out of focus, the output is often gibberish. After investigating every possible technical way of improving this, we came to the conclusion that the best way was by hand and human eye. We could not possibly afford to pay contractors to do this ‘re-keying’, so the lead programmer Kent Fitch suggested we open it up for the public to do. If the text was made accurate, searching would be instantly improved for everyone, since the search works over the OCR text.
Several people can correct the same article. All corrections are saved and viewable in the history of the article. All versions of the corrections are searched. It is the last correction that is visible in the left hand pane. Articles are corrected by many users when they are either very long, very significant, or very illegible. For example, this article is in the first Australian newspaper, the Sydney Gazette and NSW Advertiser, of March 1803. Around 20 people have made corrections to this article. It is particularly challenging because of its use of the long s (ſ), which looks like an f, in place of an s.
This is the text correction history of this article, showing all the different users and what parts they corrected.
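The correction behaviour described above (every correction saved, the last one displayed, all versions searchable) can be illustrated with a minimal sketch. This is not the National Library's implementation: the `Article` and `Correction` classes, their fields, and the sample text are hypothetical names chosen for illustration, and searching the original OCR text alongside the corrections is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Correction:
    user: str      # hypothetical user name
    line_no: int   # index of the OCR line that was corrected
    text: str      # the corrected text for that line


@dataclass
class Article:
    ocr_lines: List[str]
    history: List[Correction] = field(default_factory=list)

    def correct(self, user: str, line_no: int, text: str) -> None:
        """Append a correction; earlier versions are kept, never overwritten."""
        self.history.append(Correction(user, line_no, text))

    def visible_text(self) -> List[str]:
        """Text shown to readers: the last correction of each line wins."""
        lines = list(self.ocr_lines)
        for c in self.history:  # history is in chronological order
            lines[c.line_no] = c.text
        return lines

    def matches(self, term: str) -> bool:
        """Search every saved version (assumption: original OCR included)."""
        term = term.lower()
        versions = self.ocr_lines + [c.text for c in self.history]
        return any(term in v.lower() for v in versions)


# Made-up sample data, two users correcting the same line
article = Article(["Tbe fhip arrived", "at Sydncy Cove"])
article.correct("user_a", 0, "The ship arived")
article.correct("user_b", 0, "The ship arrived")
print(article.visible_text())
print(article.matches("arived"))
```

Keeping the history append-only is what lets the article's history page show who changed what, and lets a search still find an article through an earlier version of a line.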
In response to numerous requests we instigated the ‘hall of fame’. The top 5 correctors show on the home page as well as in the hall of fame. Originally the hall of fame only showed the top 10, but users wanted to see more, so now it lists anyone who has corrected more than 5000 lines per month. Users are still asking for entire league tables, however, so they can see where they are in the big picture. This is a motivating factor for them. During development it was suggested that we would need to use gaming technologies to encourage people to correct text, but this has so far not proved necessary!
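As a rough sketch of the hall of fame rule described above, the monthly totals can be ranked and filtered as follows. Only the two numbers (top 5 on the home page, more than 5000 lines per month for the hall of fame) come from the talk; the function names, the input format, and the sample data are assumptions made for illustration.

```python
from collections import Counter

HALL_OF_FAME_THRESHOLD = 5000  # lines per month, as stated in the talk
HOME_PAGE_SLOTS = 5            # top correctors shown on the home page


def rank_correctors(events):
    """Total the corrected lines per user from (user, lines) pairs for one month."""
    totals = Counter()
    for user, lines in events:
        totals[user] += lines
    # most_common() returns (user, total) pairs, highest total first
    return totals.most_common()


def hall_of_fame(ranked):
    """Everyone who corrected more than the threshold this month."""
    return [(user, n) for user, n in ranked if n > HALL_OF_FAME_THRESHOLD]


def home_page(ranked):
    """The top correctors shown on the home page."""
    return ranked[:HOME_PAGE_SLOTS]


# Made-up sample month
events = [("a", 3000), ("b", 30000), ("a", 2500), ("c", 5000), ("d", 100)]
ranked = rank_correctors(events)
print(hall_of_fame(ranked))
print(home_page(ranked))
```

The full `ranked` list is the ‘league table’ users have been asking for; the hall of fame and the home page are just two views over it.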
In the first 6 months a total of 2 million lines in 100,000 articles had been corrected. The top 5 correctors had consistently remained in the top 5 each month and were working up to 45 hours per week on text correction. Top correctors are correcting up to 30,000 lines per month. We had many users saying that text correction is proving to be an ‘addictive’ or compulsive activity. They sat down to fix a few words for 5 minutes and before they knew it 3 hours had passed. This was very interesting.
Julie, the top corrector, has featured in the media and become a star. She loves correcting articles about Bendigo murders.
The recent interactions users have made are also displayed on the homepage for everyone to see. You can see the number of searches in the last hour, newspaper article corrections so far today, works merged or split this week, items tagged this week, and comments this month.
Thank you for listening to me today. I am happy to take questions.