Digitizing and parsing semi-structured historical administrative documents from the G.I. Bill mortgage guarantee program

Lafia, Sara; Bleckley, David A.; Alexander, J. Trent

dc.contributor.author	Lafia, Sara
dc.contributor.author	Bleckley, David A.
dc.contributor.author	Alexander, J. Trent
dc.date.accessioned	2023-05-10T14:44:30Z
dc.date.available	2023-05-10T14:44:30Z
dc.date.issued	2023-06-13
dc.identifier.uri	https://hdl.handle.net/2027.42/176363	en
dc.description.abstract	Many libraries and archives maintain collections of research documents, such as administrative records, in paper-based formats that limit access to in-person use. Digitization transforms paper-based collections into more accessible and analyzable formats. As collections are digitized, there is an opportunity to incorporate deep learning techniques, such as Document Image Analysis (DIA), into workflows to increase the usability of information extracted from archival documents. This paper describes our approach using digital scanning, optical character recognition (OCR), and deep learning to create a digital archive of administrative records related to the mortgage guarantee program of the Servicemen’s Readjustment Act of 1944, also known as the G.I. Bill. We used a collection of 25,744 semi-structured paper-based records from the administration of G.I. Bill mortgages from 1946 to 1954 to develop a digitization and processing workflow. These records include the name and city of the mortgagor, the amount of the mortgage, the location of the Reconstruction Finance Corporation agent, one or more identification numbers, and the name and location of the bank handling the loan. We extracted structured information from these scanned historical records in order to create a tabular data file and link the records to other authoritative individual-level data sources. We compared the flexible character accuracy of five OCR methods. We then compared the character error rate of three text extraction approaches (regular expressions, document image analysis, and named entity recognition). We obtained the highest quality structured text output using DIA with the Layout Parser toolkit combined with regular-expression post-processing. Through this project, we demonstrate how DIA can improve the digitization of administrative records to automatically produce a structured data resource for researchers and the public.
Our workflow is readily transferable to other archival digitization projects. Through the use of digital scanning, OCR, and DIA processes, we created the first digital microdata file of administrative records related to the G.I. Bill mortgage guarantee program available to researchers and the general public. These records offer research insights into the lives of veterans who benefited from loans, the impacts on the communities built by the loans, and the institutions that implemented them.	en_US
dc.description.sponsorship	Michigan Institute for Data Science (MIDAS) Propelling Original Data Science (PODS) Grant	en_US
dc.language.iso	en_US	en_US
dc.publisher	Journal of Documentation
dc.rights	Attribution-NonCommercial 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0/	*
dc.subject	archives, digitization, document image analysis, historical records, OCR, workflows	en_US
dc.title	Digitizing and parsing semi-structured historical administrative documents from the G.I. Bill mortgage guarantee program	en_US
dc.type	Preprint	en_US
dc.subject.hlbsecondlevel	Social Sciences (General)
dc.subject.hlbtoplevel	Social Sciences
dc.contributor.affiliationum	Institute for Social Research (ISR)	en_US
dc.contributor.affiliationum	ICPSR	en_US
dc.contributor.affiliationumcampus	Ann Arbor	en_US
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/176363/1/GI Bill digitization technical paper.pdf
dc.identifier.doi	https://dx.doi.org/10.7302/7212
dc.identifier.doi	10.1108/JD-03-2023-0055
dc.identifier.orcid	0000-0002-5896-7295	en_US
dc.identifier.orcid	0000-0001-7715-4348	en_US
dc.identifier.orcid	0000-0003-2161-4709	en_US
dc.rights.license	Creative Commons Attribution Non-commercial International License 4.0 (CC BY-NC 4.0)
dc.description.depositor	SELF	en_US
dc.identifier.name-orcid	Lafia, Sara; 0000-0002-5896-7295	en_US
dc.identifier.name-orcid	Bleckley, David; 0000-0001-7715-4348	en_US
dc.identifier.name-orcid	Alexander, Trent; 0000-0003-2161-4709	en_US
dc.owningcollname	Institute for Social Research (ISR)
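
The abstract compares text extraction approaches by character error rate. As a minimal sketch of that metric, assuming the common definition (Levenshtein edit distance between the extracted text and a hand-checked reference, divided by the reference length), the calculation could look like the following. The function names and the sample strings are hypothetical illustrations and are not taken from the paper or its code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted character in a four-character field.
cer = character_error_rate("1946", "1948")
```

Under this definition a value of 0.0 means the extracted text matches the reference exactly, and values can exceed 1.0 when the extracted text is much longer than the reference.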

Files in this item

Name: license_rdf
Size: 914 bytes
Format: application/rdf+xml

Name: GI Bill digitization technical ...
Size: 1.235 MB
Format: PDF

Except where otherwise noted, this item's license is described as Attribution-NonCommercial 4.0 International