Digitizing and parsing semi-structured historical administrative documents from the G.I. Bill mortgage guarantee program

dc.contributor.author: Lafia, Sara
dc.contributor.author: Bleckley, David A.
dc.contributor.author: Alexander, J. Trent
dc.date.accessioned: 2023-05-10T14:44:30Z
dc.date.available: 2023-05-10T14:44:30Z
dc.date.issued: 2023-06-13
dc.identifier.uri: https://hdl.handle.net/2027.42/176363 [en]
dc.description.abstract: Many libraries and archives maintain collections of research documents, such as administrative records, with paper-based formats that limit their access to in-person use. Digitization transforms paper-based collections into more accessible and analyzable formats. As collections are digitized, there is an opportunity to incorporate deep learning techniques, such as Document Image Analysis (DIA), into workflows to increase the usability of information extracted from archival documents. This paper describes our approach using digital scanning, optical character recognition (OCR), and deep learning to create a digital archive of administrative records related to the mortgage guarantee program of the Servicemen’s Readjustment Act of 1944, also known as the G.I. Bill. We used a collection of 25,744 semi-structured paper-based records from the administration of G.I. Bill Mortgages from 1946 to 1954 to develop a digitization and processing workflow. These records include the name and city of the mortgagor, the amount of the mortgage, the location of the Reconstruction Finance Corporation agent, one or more identification numbers, and the name and location of the bank handling the loan. We extracted structured information from these scanned historical records in order to create a tabular data file and link them to other authoritative individual-level data sources. We compared the flexible character accuracy of five OCR methods. We then compared the character error rate of three text extraction approaches (regular expressions, document image analysis, and named entity recognition). We were able to obtain the highest quality structured text output using DIA with the Layout Parser toolkit by post-processing with regular expressions. Through this project, we demonstrate how DIA can improve the digitization of administrative records to automatically produce a structured data resource for researchers and the public.
Our workflow is readily transferable to other archival digitization projects. Through the use of digital scanning, OCR, and DIA processes, we created the first digital microdata file of administrative records related to the G.I. Bill mortgage guarantee program available to researchers and the general public. These records offer research insights into the lives of veterans who benefited from loans, the impacts on the communities built by the loans, and the institutions that implemented them. [en_US]
dc.description.sponsorship: Michigan Institute for Data Science (MIDAS) Propelling Original Data Science (PODS) Grant [en_US]
dc.language.iso: en_US [en_US]
dc.publisher: Journal of Documentation
dc.rights: Attribution-NonCommercial 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/ [*]
dc.subject: archives, digitization, document image analysis, historical records, OCR, workflows [en_US]
dc.title: Digitizing and parsing semi-structured historical administrative documents from the G.I. Bill mortgage guarantee program [en_US]
dc.type: Preprint [en_US]
dc.subject.hlbsecondlevel: Social Sciences (General)
dc.subject.hlbtoplevel: Social Sciences
dc.contributor.affiliationum: Institute for Social Research (ISR) [en_US]
dc.contributor.affiliationum: ICPSR [en_US]
dc.contributor.affiliationumcampus: Ann Arbor [en_US]
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/176363/1/GI Bill digitization technical paper.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/7212
dc.identifier.doi: 10.1108/JD-03-2023-0055
dc.identifier.orcid: 0000-0002-5896-7295 [en_US]
dc.identifier.orcid: 0000-0001-7715-4348 [en_US]
dc.identifier.orcid: 0000-0003-2161-4709 [en_US]
dc.rights.license: Creative Commons Attribution Non-commercial International License 4.0 (CC BY-NC 4.0)
dc.description.depositor: SELF [en_US]
dc.identifier.name-orcid: Lafia, Sara; 0000-0002-5896-7295 [en_US]
dc.identifier.name-orcid: Bleckley, David; 0000-0001-7715-4348 [en_US]
dc.identifier.name-orcid: Alexander, Trent; 0000-0003-2161-4709 [en_US]
dc.working.doi: 10.7302/7212 [en_US]
dc.owningcollname: Institute for Social Research (ISR)


Except where otherwise noted, this item's license is described as Attribution-NonCommercial 4.0 International
