Digitizing Newspapers: Part III – Outsourcing

I just got off a long call with our newspaper digitization vendor and thought I should do something cathartic ... like continuing our running discussion of newspaper digitization.

Washington's National Digital Newspaper Program (outsourcing):

[Image: OCR example from the Pullman Herald at 100% resolution]

We last talked about the human-resource-intensive process of digitizing newspapers with a consumer-grade film scanner and the article-level indexing we do in-house for our Pioneer Newspaper collection. In 2008 we were awarded an NDNP grant and began researching proposals to outsource the scanning and text conversion of 100,000 newspaper pages. OCR (optical character recognition) and scanning technology had come a long way since we began the Pioneer Newspaper collection, so we were excited to see the results from our initial test scans. However, while outsourcing a large-scale digitization project has its advantages, it also shares some of the challenges already discussed and produces a few unique ones:

Communication and Coordination: Working with people in other organizations spread all over the globe requires some coordination. Luckily, many of the decisions regarding NDNP scanning and metadata specifications have been made and documented by the Library of Congress. The challenge then becomes execution: figuring out how best to comply with the specification, implementing a workflow from start to finish, and walking the line between requirements and guidelines (e.g., is using film at or below a 20x reduction ratio a requirement or a guideline?).

Another communication challenge can be the "black box" factor. When we aren't intimately aware of the whole process, we sometimes feel in the dark. This can lead to those moments when we realize (usually much later) that if we'd had a more holistic view we could have improved or changed things before problems snowballed.

Storage and Access: The sheer size of the newspaper files multiplied by the number of images creates storage and access issues.
The NDNP grant work produces four large files per page: an 8-bit grayscale TIFF (master image), a PDF file (derivative), a JPEG 2000 file for web access (derivative), and a METS/ALTO-formatted XML file (OCR-converted text). It also produces other METS-formatted XML files of descriptive and administrative metadata used for ingestion and display in Chronicling America. The grant does not support the same article-level output created during the Pioneer Newspapers project but instead produces page-level, searchable text and images for 50,000 pages per year. This leaves us with the challenge of integrating two types of digital newspaper collections into one interface where users can browse, search, and access the images.

Curation and Quality Control: Also related to the issue of quantity is the difficulty of assuring quality and sustained curation of the digital files. The Library of Congress distributes software that aids in the validation (i.e., structural integrity) of the files, but image and data quality remain a challenge that requires a carefully planned workflow and lots of time. The scanning and OCR process our vendor employs produces an average accuracy of around 90% (with good film). And while we don't correct OCR, we do scrutinize the descriptive metadata of each page (e.g., date, volume, issue, page information). So you can imagine the time involved when dealing with 50,000 pages a year.

Despite these new challenges, we are excited to see Washington's newspapers in Chronicling America, giving researchers the ability to search across multiple collections of newspapers from around the U.S. For more information about Chronicling America or Washington's digital newspaper collections, contact Laura Robinson, Washington’s National Digital Newspaper Program manager, at [email protected] or (360) 570-5568.
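
For readers curious what a first-pass check on the four-files-per-page output described above might look like, here is a minimal sketch in Python. It is not the Library of Congress validation software and not our actual workflow. The file extensions, the idea that all four files for a page share one base name, and the function name `find_incomplete_pages` are all assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePath

# Assumed extensions for the four per-page deliverables (master TIFF, PDF,
# JPEG 2000, and ALTO XML). Real vendor batches may name files differently.
REQUIRED_EXTENSIONS = {".tif", ".pdf", ".jp2", ".xml"}


def find_incomplete_pages(filenames):
    """Return {page base name: sorted list of missing extensions}.

    Assumes (hypothetically) that every file for a page shares one base
    name, e.g. 0001.tif, 0001.pdf, 0001.jp2, 0001.xml.
    """
    found = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        # Group files by their name without the extension.
        found[str(path.with_suffix(""))].add(path.suffix.lower())
    return {
        stem: sorted(REQUIRED_EXTENSIONS - extensions)
        for stem, extensions in found.items()
        if not REQUIRED_EXTENSIONS <= extensions
    }


if __name__ == "__main__":
    listing = [
        "0001.tif", "0001.pdf", "0001.jp2", "0001.xml",
        "0002.tif", "0002.jp2",
    ]
    # Page 0002 is missing its PDF and XML files.
    print(find_incomplete_pages(listing))  # prints {'0002': ['.pdf', '.xml']}
```

A check like this only confirms that the expected files are present. It says nothing about image quality or the accuracy of the page metadata, which is where most of the review time goes. Other METS XML files with different base names would also show up as "incomplete pages" and would need to be filtered out first.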

