WA Secretary of State Blogs

Over 5.2 million pages strong… and counting

Tuesday, November 6th, 2012 Posted in Articles, Digital Collections, For Libraries, For the Public, News, State Library Collections, Technology and Resources


The Torch Bearer at the Library of Congress
Interior of the Library of Congress

From the futuristic desk of Shawn Schollmeyer.

With 100,000 pages contributed each two-year grant cycle from over 30 states, and with the goal of participation by all 50 states, the National Digital Newspaper Program (NDNP) is the biggest digital newspaper project in U.S. history. It is sponsored by the National Endowment for the Humanities (NEH) and the Library of Congress (LC). Each of those 5.2 million pages needs related lines of code and metadata along with the page image. Title, city, and date, as well as Optical Character Recognition (OCR) files that turn an image into machine-readable text, allow users to search newspaper content on the Chronicling America website.

That’s a lot of files! Who manages all these files? Fewer than a dozen people at the Library of Congress support the websites and wikis, upload files, and help project managers learn the NDNP digitization process. Here in Washington State, we rely on this handful of people to guide us on best practices for digitization and image standards for our participation in the program. In September, all the participating states gathered to meet our sponsors, advisors, and fellow awardees to discuss the great ways people are using the content from this project. At the end of the three-day conference, our heads were filled with practical knowledge of processes, resources, and exciting new ideas. While I was there I had the rare opportunity to meet the magicians behind the curtain…

Our main contact for the National Digital Newspaper Program in Washington, DC is Chris Ehrman. Nearly a librarian by birth (his parents are both librarians), Chris began his newspaper experience in the University of Utah Ski Archives, uploading photos and video of America’s favorite winter sport before moving on to the NDNP program in Montana. There he honed his technical expertise learning the selection and upload process for Montana’s newspaper collection, becoming a great candidate for the Library of Congress’ Digital Conversion Specialist position. Chris is our “go-to” man when we have questions about how to resolve the challenges of working with so many files and so much metadata. If the data checks out OK, Chris prepares the scripts to load files for the automatic ingestion process so the newspaper images will appear in the Chronicling America database. He also supports the LC’s NDNP website.

There are four Digital Conversion Specialists who evaluate and help load our submitted batches of files to the website. Missing pages, cataloging conflicts, or date misprints are among the situations that may flag a batch for further review.  These four take turns validating batches from all awardees for final approval in addition to their specialized tasks, which include validation tool support and digitizing from LC’s own historic newspaper collection.  Chris estimates that they see 150,000-180,000 pages per month, translating to about six terabytes. One of their biggest challenges is to keep the workflow moving and avoid bottlenecks in the system.
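To put Chris's estimate in perspective, here is a rough back-of-the-envelope calculation of the average storage per page. It is only a sketch: the page counts and the six-terabyte figure come from the estimate above, while the use of decimal terabytes (1 TB = 10**12 bytes) and the function name are assumptions made for illustration.

```python
# Figures quoted in the paragraph above.
PAGES_PER_MONTH_LOW = 150_000
PAGES_PER_MONTH_HIGH = 180_000

# Assumption: "six terabytes" means decimal terabytes (10**12 bytes each).
BYTES_PER_MONTH = 6 * 10**12


def megabytes_per_page(total_bytes: int, pages: int) -> float:
    """Average size of one page in megabytes (10**6 bytes)."""
    return total_bytes / pages / 10**6


# Fewer pages in the same six terabytes means more data per page.
largest = megabytes_per_page(BYTES_PER_MONTH, PAGES_PER_MONTH_LOW)
smallest = megabytes_per_page(BYTES_PER_MONTH, PAGES_PER_MONTH_HIGH)

print(f"Roughly {smallest:.0f} to {largest:.0f} MB per page")
# prints "Roughly 33 to 40 MB per page"
```

That average covers everything submitted for a page, which would include the image files along with the OCR text and metadata.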

Robin Butterhof is another LC specialist. Friendly and energetic, Robin supports the NDNP wiki page that contains the technical specifications, trainings, tools, deliverables, and state-by-state project information. She is a woman of many talents, having held several different library jobs, including book publishing, reference work, non-profit work, and consulting, all while attending classes as a library student. Excellent training for the many tasks she juggles daily at LC.

Chris, Shawn & Robin with “batch_wa_lacamas”

Pulling all the teams, awardees, conversion specialists, NEH contacts, and LC resources together is the NDNP Coordinator, Deborah Thomas. Deb has a long history of working with digital collections in our national library, most notably the American Memory project, a multimedia collection of American history and culture with over nine million items. In my short interview with the team, she really helped put the national project into context for me. One of the most significant challenges is managing “a sustainable collection of significant scale produced by many organizations,” which includes careful planning for maintaining access and managing the data and processes over the long term. She reminds us that “Digital objects are not just pictures. For newspapers, they are pictures of pages and machine-readable text from those pages and metadata that describes the pages and the relationships between pages.” In order to help people find what they’re looking for we need to figure out “how to make the cream rise to the top.” These millions of pages of newspapers would be pretty overwhelming to wade through without text search capabilities at the page level. Creating standards for metadata and text recognition software (OCR) is only a piece of making these pages accessible. Each state has its own workflow, software vendors, page- or article-level OCR, and file storage systems, and even multiple languages, all of which need to be filtered and standardized.
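Deb's description of a digital object can be pictured as a small data structure. The sketch below is purely illustrative and is not how NDNP or Chronicling America actually store their data: the class names, field names, file names, and sample newspaper are all made up for this example. It shows pages that carry an image and OCR text, an issue that carries the descriptive metadata and ties its pages together, and a simple page-level text search.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Page:
    sequence: int    # position of the page within its issue
    image_file: str  # the picture of the page
    ocr_text: str    # machine-readable text produced by OCR


@dataclass
class Issue:
    title: str       # newspaper title
    city: str
    date: str        # publication date, written as YYYY-MM-DD
    pages: List[Page] = field(default_factory=list)  # relationship between pages


def search(issues: List[Issue], query: str) -> List[Tuple[str, str, int]]:
    """Return (title, date, page number) for every page whose text contains the query."""
    needle = query.lower()
    hits = []
    for issue in issues:
        for page in issue.pages:
            if needle in page.ocr_text.lower():
                hits.append((issue.title, issue.date, page.sequence))
    return hits


# Hypothetical sample data, invented for this example.
issues = [
    Issue(
        title="Example Gazette",
        city="Exampleville",
        date="1900-01-01",
        pages=[
            Page(1, "0001.tif", "City council meets on Tuesday"),
            Page(2, "0002.tif", "Public LIBRARY opens new reading room"),
        ],
    )
]

print(search(issues, "library"))
# prints [('Example Gazette', '1900-01-01', 2)]
```

Without the OCR text, the only way to find that second page would be to know the title and date in advance and read through the images.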

When I asked the team what they enjoy most about their work, Robin admitted she loves how “something wacky pops up every day,” referring to the many series of cartoons, entertaining articles, and sometimes sensational headlines. Chris agreed and mentioned his favorites are the illustrations of the future, which led to a discussion of Deb’s favorite article from the December 20, 1908, New-York Tribune, “Public Library of the Future.”

Unlike the library vision in the article, we may not be sending facsimiles of our newspapers and important manuscripts through pneumatic tubes to our Congressional Library, but we will be sending a dozen or so hard drives with thousands of files of newspaper pages to real people, the people I met in the James Madison Building. These are the people who will be helping us create the new digital libraries of a very real future where we can still have “a library in every hotel, train, trolley car and steamship!”