The Australian Newspapers Digitisation Program: Helping Communities Access and Explore their Newspaper Heritage - Keynote. 2007
1. Helping communities access and explore
their newspaper heritage.
Rose Holley – Manager, Newspaper Digitisation Program
http://www.nla.gov.au/ndp rholley@nla.gov.au
Australian Media Traditions Conference
23 November 2007, Charles Sturt University, Bathurst
2. Status of the Program
• November 2006: Minister for Arts and Sports approval
• Budget approval: $8 million for 3 million pages over 4 years
• Contracts signed with digitisation suppliers
• April 2007: program pilot phase commences
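As a rough back-of-envelope check on the budget figures above, the sketch below derives an average cost per page and the page throughput per year. It assumes the $8 million is in Australian dollars and that work is spread evenly over the 4 years; neither is stated on the slide. The budget presumably also covers infrastructure and software, so the per-page figure is an overall average, not a scanning price.

```python
# Figures taken from the slide; the even spread over 4 years is an assumption.
budget_dollars = 8_000_000   # total approved budget
total_pages = 3_000_000      # pages to be digitised
program_years = 4

cost_per_page = budget_dollars / total_pages   # average across all program costs
pages_per_year = total_pages / program_years   # assumes an even yearly spread

print(f"Average cost per page: ${cost_per_page:.2f}")
print(f"Pages per year: {pages_per_year:,.0f}")
```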
3. Content and Coverage
National Content
• Initially a title from each state
• Focus on major titles from each state first
• Anticipated that ‘regional’ titles may be contributed later
• Coverage: published between 1803 – 1954 (out of copyright)
Titles shown on slide: Northern Territory Times, Courier Mail, West Australian, Advertiser, Sydney Gazette, Canberra Times, Argus, Mercury
4. First Newspaper
• First page of the first Australian newspaper ever published
The Sydney Gazette and New South Wales Advertiser, Saturday March 5 1803
5. Through 150 years
• Up to 1954 (copyright applies after that date), and later by agreement with publishers.
The Argus, 22 August 1945
7. Keep Up to Date with Progress
• Website: http://www.nla.gov.au/ndp/
8. National Help
• NLA working with State and Territory
Libraries as part of ANPLAN.
• Libraries suggest titles and dates and
provide microfilm for digitising.
• ANPLAN members and other stakeholders
will provide feedback on the search and
delivery prototype.
• Developing model for national contribution
of regional newspapers.
9. Process in brief
• National sourcing of selected newspaper microfilm masters.
• Masters scanned to TIFF files by contractor in Sydney.
• NLA performs quality assurance and adds metadata.
• Contractor in India processes TIFF files: OCR, zoning, XML markup.
• NLA quality-assures the files, ingests them into the system and creates derivatives for delivery.
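The five steps above can be sketched as an ordered list of stages, which makes the hand-offs between the NLA and its contractors explicit. This is only an illustration: the `Stage` class, the `handoffs` helper and the wording of each field are hypothetical and are not part of the NLA's workflow software.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    owner: str    # who carries out the step
    task: str     # what is done
    output: str   # what is passed to the next step

# Stage descriptions paraphrase the slide; the field wording is illustrative.
PIPELINE = [
    Stage("Libraries (national sourcing)", "select and supply microfilm masters", "microfilm masters"),
    Stage("Contractor (Sydney)", "scan masters", "TIFF files"),
    Stage("NLA", "quality assurance, add metadata", "checked TIFF batches"),
    Stage("Contractor (India)", "OCR, zoning, XML markup", "XML files"),
    Stage("NLA", "quality assurance, ingest, create derivatives", "delivery derivatives"),
]

def handoffs(pipeline):
    """Return (from_owner, to_owner) pairs where work changes hands."""
    return [(a.owner, b.owner)
            for a, b in zip(pipeline, pipeline[1:])
            if a.owner != b.owner]

for sender, receiver in handoffs(PIPELINE):
    print(f"{sender} -> {receiver}")
```

Every step passes work to a different party, so the pipeline has four hand-offs, and the NLA checks the material after each contractor step.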
11. Six-Month Progress
• IT Infrastructure and storage implemented at NLA
• Content management and ingest software developed by
NLA to support workflow
• Quality assurance and production software developed by
US/India contractor
• Pilot data sent to contractors to test workflows, systems
and software against agreed project spec.
12. Next 6 months
• Acceptance of pilot data then commence
production phase (3 million pages)
• Development of search and delivery prototype
• Public launch of service with a good body of
content in 2008
• Progressive addition of content – national
program ongoing
13. Technology – internal NLA
Old newspapers being processed and delivered
using latest digital technology
• NLA developing in-house:
– Ingest and storage system
– Workflow and content management system including
quality assurance module
– Search and delivery system
• NLA providing:
– System Infrastructure
(storage, backup, disaster recovery)
14. Infrastructure and Storage
Online Storage – 70 TB:
• Working space for images in processing: 40 TB for 1 million pages
• Search and delivery derivatives: 30 TB for 3 million pages
• XML files, database systems and indexes: 1 TB
Offline Storage – unlimited for master images on tape.
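The storage figures above imply rough per-page sizes, worked out in the sketch below. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the slide does not specify. The three listed components sum to 71 TB, so the 70 TB headline is presumably a rounded figure.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes

# Figures from the slide.
working_tb, working_pages = 40, 1_000_000     # images in processing
delivery_tb, delivery_pages = 30, 3_000_000   # search and delivery derivatives
index_tb = 1                                  # XML files, databases, indexes

working_mb_per_page = working_tb * TB / working_pages / MB
delivery_mb_per_page = delivery_tb * TB / delivery_pages / MB
total_online_tb = working_tb + delivery_tb + index_tb

print(f"Working space per page: {working_mb_per_page:.0f} MB")
print(f"Delivery derivatives per page: {delivery_mb_per_page:.0f} MB")
print(f"Sum of listed components: {total_online_tb} TB")
```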
19. Quality Assurance at NLA
• Use 2 widescreen monitors placed vertically. Can view complete page within context of issue.
• Add metadata, sort out missing and duplicate pages within an issue.
• Prepare batches to send for OCR.
23. Technology - external
Software developed to:
• Zone areas and articles on a page
• Flag continuing articles across multiple pages
• Categorise articles on a page
• OCR text on a page
• Re-key headings and first 4 lines of text.
• Deliver XML files (ALTO) and METS/MODS
files.
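To illustrate what the ALTO deliverable makes possible, the sketch below reads the OCR text out of a small ALTO-style fragment with Python's standard library. The sample XML is invented for illustration and is not output from the program. It uses the standard ALTO element names (`TextBlock`, `TextLine`, `String` with a `CONTENT` attribute, `WC` for word confidence) but omits the namespace and most attributes. The helper names `local` and `block_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample, heavily simplified from what a real ALTO file contains.
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="SYDNEY" WC="0.95"/>
            <SP/>
            <String CONTENT="GAZETTE" WC="0.90"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Shipping" WC="0.80"/>
            <SP/>
            <String CONTENT="News" WC="0.85"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

def local(tag):
    """Strip an XML namespace prefix such as '{uri}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]

def block_lines(xml_text):
    """Map each TextBlock ID to its list of text lines."""
    root = ET.fromstring(xml_text)
    blocks = {}
    for block in root.iter():
        if local(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block:
            if local(line.tag) != "TextLine":
                continue
            words = [s.get("CONTENT", "") for s in line if local(s.tag) == "String"]
            lines.append(" ".join(words))
        blocks[block.get("ID")] = lines
    return blocks

print(block_lines(SAMPLE_ALTO))
```

Matching on the local tag name lets the same code read files with or without a namespace declaration, which differs between ALTO versions.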
30. Prototype Development
Under discussion:
• Derivative sizes and zoom technology
testing
• Search and Browse features
• Results and refinement of results
• User interaction with source (web 2.0)
• Interface design
31. Digital Newspaper Searching
• Newspapers full text searchable
• Image captions searchable
• Search across multiple papers, e.g. by a person's name.
• Refine searching by:
– Date
– Newspaper title
– State published
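The search-and-refine behaviour listed above can be sketched as a full-text match followed by optional filters on date, newspaper title and state. This is a toy in-memory illustration, not the NLA's search system: the `Article` class, the `search` function and the three sample records are hypothetical, and the article texts are invented placeholders.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Article:
    title: str        # newspaper title
    state: str        # state of publication
    published: date
    text: str         # OCR full text

def search(articles, query, *, title=None, state=None, start=None, end=None):
    """Case-insensitive substring search with optional refinements."""
    q = query.lower()
    hits = []
    for a in articles:
        if q not in a.text.lower():
            continue
        if title is not None and a.title != title:
            continue
        if state is not None and a.state != state:
            continue
        if start is not None and a.published < start:
            continue
        if end is not None and a.published > end:
            continue
        hits.append(a)
    return hits

# Invented placeholder records, not real articles.
ARTICLES = [
    Article("The Argus", "Victoria", date(1945, 8, 22), "Private J. Smith returned home."),
    Article("The Mercury", "Tasmania", date(1901, 1, 1), "Mrs. Smith of Hobart won first prize."),
    Article("The Argus", "Victoria", date(1930, 5, 1), "Wool prices rose sharply."),
]

all_smiths = search(ARTICLES, "smith")                       # across all papers
tasmanian = search(ARTICLES, "smith", state="Tasmania")      # refined by state
postwar = search(ARTICLES, "smith", start=date(1940, 1, 1))  # refined by date
print(len(all_smiths), len(tasmanian), len(postwar))
```

A production system would use a full-text index instead of a linear scan. The sketch only shows how the refinements narrow a result set.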
32. Refine search by categories
• News
• Advertising
• Birth, death and marriage notices
• Obituaries
• Editorial commentary and letters
• Shipping News
• Arts and leisure
• Detailed lists, results, guides
34. Browsing and Viewing
• Browse papers page by page
• Zoom in and out of image
– to read small text
– to view context of article within page layout
• Print article or entire page or issue
38. Other features
Under discussion:
• OCR correction by users
• Personal annotation of articles by users
• Tagging results
• Creating public sets (for historical events)
• Clustering results
• Searching across other relevant resources (paid
subscription services, international resources,
other digital resources)
39. Prototype release
• To be released to stakeholders who have
given microfilm content
• Stakeholders able to view their data
• Feedback on data quality and search
functionality
• Amendments made and then ‘search and
delivery version 1’ released to a wider
group for testing and feedback before
public launch in 2008.
40. Pilot Data
• Canberra Times
• Sydney Gazette
• Northern Territory Times
• South Australia Advertiser
• Hobart Town Gazette, Courier, Colonial, Mercury
• Melbourne Argus
• Perth Gazette
• West Australian
• Brisbane Courier Mail
(12 titles: 8,000 issues, 50,000 pages, 500,000 articles)
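The pilot totals give a feel for the shape of the data. The sketch below derives average ratios from the slide's figures, which are presumably rounded. The last line extrapolates the pilot's articles-per-page ratio to the full 3 million pages. That extrapolation is an assumption made here for illustration, not a figure from the program.

```python
# Pilot totals from the slide (presumably rounded).
titles = 12
issues = 8_000
pages = 50_000
articles = 500_000

pages_per_issue = pages / issues
articles_per_page = articles / pages
issues_per_title = issues / titles

# Assumption: the pilot ratio holds across the 3 million page production run.
projected_articles = articles_per_page * 3_000_000

print(f"Pages per issue: {pages_per_issue}")
print(f"Articles per page: {articles_per_page}")
print(f"Issues per title: {issues_per_title:.0f}")
print(f"Projected articles for 3 million pages: {projected_articles:,.0f}")
```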