Chronicling America and Navigating Newspapers

Through multiple National Digital Newspaper Program grants from the National Endowment for the Humanities in conjunction with the Library of Congress, the Washington State Library has contributed over 300,000 pages of digitized Washington newspapers to Chronicling America (chroniclingamerica.loc.gov) since 2008.

The contributions from the Washington State Library are part of the more than 16 million searchable newspaper pages from 48 states and two territories made freely available on Chronicling America. Ben Lee is working to extend the usability of these digitized newspapers through his Newspaper Navigator project. He was kind enough to answer a few questions about Newspaper Navigator.

The following is an interview between Ben and our Washington Digital Newspaper Program Assistant Caitlin Patterson.

Can you give a quick overview of the Newspaper Navigator project?

Definitely! Newspaper Navigator is a project that I am carrying out as an Innovator-in-Residence at the Library of Congress, as well as a Ph.D. student in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. I’m extremely fortunate to be able to work on Newspaper Navigator with LC Labs, the National Digital Newspaper Program, and IT Design & Development at the Library of Congress, as well as with my advisor, Professor Daniel Weld, at the University of Washington.

The central goal of Newspaper Navigator is to re-imagine how the American public explores Chronicling America by utilizing emerging machine learning techniques to extract, categorize, and search over the visual content and headlines in Chronicling America’s 16.3 million pages of digitized historic newspapers. Newspaper Navigator was both inspired and directly enabled by the Library of Congress’s Beyond Words crowdsourcing initiative. Launched by LC Labs in 2017, Beyond Words engages the American public by asking volunteers to draw boxes around photographs, illustrations, maps, comics, and editorial cartoons on World War I-era pages in Chronicling America, note the visual content category, and transcribe the relevant textual information such as titles and captions. Newspaper Navigator directly builds on Beyond Words by utilizing these annotations, as well as additional annotations of headlines and advertisements, to train a machine learning model to detect visual content in historic newspapers and subsequently process all 16.3 million pages.

The first seven months of the project were devoted to building out the pipeline for processing all of the pages in Chronicling America. The pipeline was successfully run over 19 days in late March and early April. In early May, we publicly released the Newspaper Navigator dataset. I’m now working on building a search user interface to make it easier to explore the Newspaper Navigator dataset. It will launch in late summer!

All code for the project is in the public domain and is available here, and more information on the construction of the Newspaper Navigator dataset can be found here.

A collage showing maps of the Civil War extracted from the Newspaper Navigator dataset.

How does Newspaper Navigator tie into Chronicling America and the National Digital Newspaper Program as a whole?

I see Newspaper Navigator as a part of the wonderful genealogy of work with historic American newspapers started by the National Digital Newspaper Program. The program has reached so many people and inspired so many exciting projects with Chronicling America. I see Newspaper Navigator and the National Digital Newspaper Program as having similar goals, in terms of facilitating access and exciting the American public!

What audiences do you think will benefit most from Newspaper Navigator?

My hope is that a wide range of audiences – including researchers, teachers, genealogists, and curious members of the American public – will find Newspaper Navigator useful!

What first sparked your interest in Chronicling America?

I first discovered Chronicling America in 2017 through the Beyond Words crowdsourcing initiative that I mentioned above. When the call for Innovator-in-Residence concept papers was released last summer, my mind jumped to Chronicling America for a number of reasons. First, it’s such a wonderfully rich resource that captures American history in a unique way. Second, the National Digital Newspaper Program has done such an incredible job of making Chronicling America accessible via an API, which makes it a great collection to study with computational tools. Third, Chronicling America has so many users doing so many cool things!

Has anything surprised you about the Newspaper Navigator dataset?

It’s been a ton of fun to explore the visual content in the dataset, and I continue to be surprised by what I find. I’ve particularly enjoyed looking through the maps and also the data visualizations scattered throughout, which are fascinating. It’s also been a lot of fun to explore the visual content in newspapers from my new home state of Washington and learn some local history in the process. LC Labs and I hosted a public data jam a few weeks ago, and it was so exciting to see all of the inventive ways that participants used the dataset. I’m hoping for more surprises as people continue to use the dataset!

Lastly, if you have any questions or comments about Newspaper Navigator, you are more than welcome to contact me at belee@loc.gov!
