Digitizing and parsing semi-structured historical administrative documents from the G.I. Bill mortgage guarantee program
dc.contributor.author | Lafia, Sara | |
dc.contributor.author | Bleckley, David A. | |
dc.contributor.author | Alexander, J. Trent | |
dc.date.accessioned | 2023-05-10T14:44:30Z | |
dc.date.available | 2023-05-10T14:44:30Z | |
dc.date.issued | 2023-06-13 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/176363 | en |
dc.description.abstract | Many libraries and archives maintain collections of research documents, such as administrative records, in paper-based formats that restrict access to in-person use. Digitization transforms paper-based collections into more accessible and analyzable formats. As collections are digitized, there is an opportunity to incorporate deep learning techniques, such as Document Image Analysis (DIA), into workflows to increase the usability of information extracted from archival documents. This paper describes our approach using digital scanning, optical character recognition (OCR), and deep learning to create a digital archive of administrative records related to the mortgage guarantee program of the Servicemen’s Readjustment Act of 1944, also known as the G.I. Bill. We used a collection of 25,744 semi-structured paper-based records from the administration of G.I. Bill mortgages from 1946 to 1954 to develop a digitization and processing workflow. These records include the name and city of the mortgagor, the amount of the mortgage, the location of the Reconstruction Finance Corporation agent, one or more identification numbers, and the name and location of the bank handling the loan. We extracted structured information from these scanned historical records in order to create a tabular data file and link the records to other authoritative individual-level data sources. We compared the flexible character accuracy of five OCR methods. We then compared the character error rate of three text extraction approaches (regular expressions, document image analysis, and named entity recognition). We obtained the highest quality structured text output using DIA with the Layout Parser toolkit, followed by post-processing with regular expressions. Through this project, we demonstrate how DIA can improve the digitization of administrative records to automatically produce a structured data resource for researchers and the public.
Our workflow is readily transferable to other archival digitization projects. Through the use of digital scanning, OCR, and DIA processes, we created the first digital microdata file of administrative records related to the G.I. Bill mortgage guarantee program available to researchers and the general public. These records offer research insights into the lives of veterans who benefited from loans, the impacts on the communities built by the loans, and the institutions that implemented them. | en_US |
dc.description.sponsorship | Michigan Institute for Data Science (MIDAS) Propelling Original Data Science (PODS) Grant | en_US |
dc.language.iso | en_US | en_US |
dc.publisher | Journal of Documentation | |
dc.rights | Attribution-NonCommercial 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/4.0/ | * |
dc.subject | archives, digitization, document image analysis, historical records, OCR, workflows | en_US |
dc.title | Digitizing and parsing semi-structured historical administrative documents from the G.I. Bill mortgage guarantee program | en_US |
dc.type | Preprint | en_US |
dc.subject.hlbsecondlevel | Social Sciences (General) | |
dc.subject.hlbtoplevel | Social Sciences | |
dc.contributor.affiliationum | Institute for Social Research (ISR) | en_US |
dc.contributor.affiliationum | ICPSR | en_US |
dc.contributor.affiliationumcampus | Ann Arbor | en_US |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/176363/1/GI Bill digitization technical paper.pdf | |
dc.identifier.doi | https://dx.doi.org/10.7302/7212 | |
dc.identifier.doi | 10.1108/JD-03-2023-0055 | |
dc.identifier.orcid | 0000-0002-5896-7295 | en_US |
dc.identifier.orcid | 0000-0001-7715-4348 | en_US |
dc.identifier.orcid | 0000-0003-2161-4709 | en_US |
dc.rights.license | Creative Commons Attribution Non-commercial International License 4.0 (CC BY-NC 4.0) | |
dc.description.depositor | SELF | en_US |
dc.identifier.name-orcid | Lafia, Sara; 0000-0002-5896-7295 | en_US |
dc.identifier.name-orcid | Bleckley, David; 0000-0001-7715-4348 | en_US |
dc.identifier.name-orcid | Alexander, Trent; 0000-0003-2161-4709 | en_US |
dc.owningcollname | Institute for Social Research (ISR) |
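The abstract reports comparing the character error rate (CER) of three text extraction approaches. As a minimal sketch of how CER is commonly defined (edit distance between extracted text and a hand-checked reference, divided by the reference length), the following may help readers interpret that comparison. This record does not describe the authors' implementation, so the function names and the sample strings below are hypothetical and for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b, counting single-character
    insertions, deletions, and substitutions."""
    # Keep only two rows of the dynamic-programming table at a time.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion from a
                    current[j - 1] + 1,                   # insertion into a
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical strings, not taken from the actual collection.
    reference = "DETROIT, MICH."
    extracted = "DETR0IT, MICH"
    print(f"CER: {character_error_rate(reference, extracted):.3f}")
```

A lower CER means the extracted text is closer to the reference. Because the value is normalized by reference length, it can exceed 1.0 when the extracted text is much longer than the reference.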