Digitizing Newspapers: Part III – Outsourcing
I just got off a long conversation with our newspaper digitization vendor and thought I should do something cathartic … like continuing our running discussion of newspaper digitization.
Washington’s National Digital Newspaper Program (outsourcing):
We last talked about the labor-intensive process of digitizing newspapers with a consumer-grade film scanner and the article-level indexing we do in-house for our Pioneer Newspaper collection.
In 2008 we were awarded an NDNP grant and began researching proposals to outsource the scanning and text conversion of 100,000 newspaper pages. OCR (optical character recognition) and scanning technology had come a long way since we began the Pioneer Newspaper collection, so we were excited to see the results from our initial test scans. However, while outsourcing a large-scale digitization project has its advantages, it also shares some of the challenges already discussed and produces a few unique ones:
Communication and Coordination: Working with people in other organizations spread all over the globe requires some coordination. Luckily, many of the decisions regarding NDNP scanning and metadata specifications have been made and documented by the Library of Congress. The challenge then becomes execution: figuring out how best to comply with the specification, implementing a workflow from start to finish, and walking the line between requirements and guidelines (e.g., is using film at or below a 20x reduction ratio a requirement or a guideline?).
Another communication challenge can be the “black box” factor. When we aren’t intimately aware of the whole process we sometimes feel in the dark. This can lead to those moments when we realize (usually much later) that if we’d had a more holistic view we could have improved or changed things before problems snowballed.
Storage and Access: The sheer size of the newspaper files multiplied by the number of images creates storage and access issues. The NDNP grant work produces 4 large files per page: an 8-bit grayscale TIFF (master image), a PDF file (derivative), a JPEG 2000 file for web access (derivative), and a METS/ALTO formatted XML file (OCR converted text). It also produces other METS formatted XML files of descriptive and administrative metadata used for ingestion and display in Chronicling America.
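With four files expected for every page, one simple check on a delivered batch is whether any page is missing a file. The following is only a minimal sketch, not the Library of Congress validation software. It assumes the four files for a page share a base name and use the extensions .tif, .pdf, .jp2, and .xml; the function name and folder layout are hypothetical. Issue-level METS files would show up as incomplete "pages" under this scheme and would need to be excluded.

```python
from collections import defaultdict
from pathlib import Path

# Assumed extensions for the four per-page files; a real batch may differ.
EXPECTED_EXTENSIONS = {".tif", ".pdf", ".jp2", ".xml"}


def find_incomplete_pages(batch_dir):
    """Return {page: sorted missing extensions} for pages lacking a file."""
    found = defaultdict(set)
    for path in Path(batch_dir).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in EXPECTED_EXTENSIONS:
            # Group by relative path without the extension, e.g. "issue1/0001".
            page = path.with_suffix("").relative_to(batch_dir).as_posix()
            found[page].add(suffix)
    return {
        page: sorted(EXPECTED_EXTENSIONS - extensions)
        for page, extensions in found.items()
        if extensions != EXPECTED_EXTENSIONS
    }
```

Calling find_incomplete_pages on a batch folder returns an empty dictionary when every page has all four files, and otherwise lists what each incomplete page is missing.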
The grant does not support the same article-level output created during the Pioneer Newspapers project but instead produces page-level, searchable text and images for 50,000 pages per year. This leaves us with the challenge of integrating two types of digital newspaper collections into one interface where users can browse, search, and access the images.
Curation and Quality Control: Also related to the issue of quantity is the difficulty of assuring quality and sustained curation of the digital files. The Library of Congress distributes software that aids in validating the files (i.e., checking their structural integrity), but image and data quality remain a challenge that requires a carefully planned workflow and lots of time.
The scanning and OCR process our vendor employs produces an average accuracy of around 90% (with good film). While we don’t correct OCR, we do scrutinize the descriptive metadata of each page (e.g., date, volume, issue, and page information). So you can imagine the time involved when dealing with 50,000 pages a year.
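To put a rough number on that time, here is a back-of-the-envelope sketch. The 50,000 pages come from the grant; the 30 seconds per page is purely a hypothetical pace chosen for illustration, not a figure we measured.

```python
def review_hours(pages_per_year, seconds_per_page):
    """Hours per year needed to check metadata at a fixed pace per page."""
    return pages_per_year * seconds_per_page / 3600


PAGES_PER_YEAR = 50_000          # yearly page count under the grant
ASSUMED_SECONDS_PER_PAGE = 30    # hypothetical pace, not a measured figure

hours = review_hours(PAGES_PER_YEAR, ASSUMED_SECONDS_PER_PAGE)
print(f"{hours:.0f} hours per year")  # prints "417 hours per year"
```

Changing the assumed pace scales the result linearly, so doubling the time spent per page doubles the yearly hours.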
Despite these new challenges, we are excited to see Washington’s newspapers in Chronicling America, giving researchers the ability to search across multiple collections of newspapers from around the U.S.
For more information about Chronicling America or Washington’s digital newspaper collections, contact Laura Robinson, Washington’s National Digital Newspaper Program manager, at [email protected] or (360) 570-5568.