A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature

dc.contributor.author: Lafia, Sara
dc.contributor.author: Fan, Lizhou
dc.contributor.author: Hemphill, Libby
dc.date.accessioned: 2022-11-09T21:17:54Z
dc.date.available: 2023-11-09 16:17:53 [en]
dc.date.available: 2022-11-09T21:17:54Z
dc.date.issued: 2022-10
dc.identifier.citation: Lafia, Sara; Fan, Lizhou; Hemphill, Libby (2022). "A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature." Proceedings of the Association for Information Science and Technology 59(1): 169-178.
dc.identifier.issn: 2373-9231
dc.identifier.uri: https://hdl.handle.net/2027.42/175085
dc.description.abstract: Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.
dc.publisher: John Wiley & Sons, Inc.
dc.subject.other: natural language processing
dc.subject.other: named entity recognition
dc.subject.other: bibliometrics
dc.subject.other: data citation
dc.subject.other: data metrics
dc.title: A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature
dc.type: Article
dc.rights.robots: IndexNoFollow
dc.subject.hlbsecondlevel: Information Science
dc.subject.hlbtoplevel: Social Sciences
dc.description.peerreviewed: Peer Reviewed
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/175085/1/pra2614.pdf
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/175085/2/pra2614_am.pdf
dc.identifier.doi: 10.1002/pra2.614
dc.identifier.source: Proceedings of the Association for Information Science and Technology
dc.working.doi: NO [en]
dc.owningcollname: Interdisciplinary and Peer-Reviewed

