A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature
dc.contributor.author | Lafia, Sara | |
dc.contributor.author | Fan, Lizhou | |
dc.contributor.author | Hemphill, Libby | |
dc.date.accessioned | 2022-11-09T21:17:54Z | |
dc.date.available | 2023-11-09 16:17:53 | en |
dc.date.available | 2022-11-09T21:17:54Z | |
dc.date.issued | 2022-10 | |
dc.identifier.citation | Lafia, Sara; Fan, Lizhou; Hemphill, Libby (2022). "A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature." Proceedings of the Association for Information Science and Technology 59(1): 169-178. | |
dc.identifier.issn | 2373-9231 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/175085 | |
dc.description.abstract | Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse. | |
dc.publisher | John Wiley & Sons, Inc. | |
dc.subject.other | natural language processing | |
dc.subject.other | named entity recognition | |
dc.subject.other | bibliometrics | |
dc.subject.other | data citation | |
dc.subject.other | data metrics | |
dc.title | A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature | |
dc.type | Article | |
dc.rights.robots | IndexNoFollow | |
dc.subject.hlbsecondlevel | Information Science | |
dc.subject.hlbtoplevel | Social Sciences | |
dc.description.peerreviewed | Peer Reviewed | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/175085/1/pra2614.pdf | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/175085/2/pra2614_am.pdf | |
dc.identifier.doi | 10.1002/pra2.614 | |
dc.identifier.source | Proceedings of the Association for Information Science and Technology | |
dc.identifier.citedreference | Montani, I., & Honnibal, M. ( 2018 ). Prodigy: A new annotation tool for radically efficient machine teaching. https://prodi.gy/ | |
dc.identifier.citedreference | Hemphill, L., Pienta, A., Lafia, S., Akmon, D., & Bleckley, D. ( 2021 ). How do properties of data, their curation, and their funding relate to reuse? Deep Blue. https://doi.org/10.7302/1639 | |
dc.identifier.citedreference | Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. ( 2020 ). spaCy: Industrial-strength natural language processing in python. https://spacy.io/ | |
dc.identifier.citedreference | Hook, D. W., Porter, S. J., & Herzog, C. ( 2018 ). Dimensions: Building Context for Search and Evaluation. Frontiers in Research Metrics and Analytics, 3. https://doi.org/10.3389/frma.2018.00023 | |
dc.identifier.citedreference | Jurgens, D., Kumar, S., Hoover, R., McFarland, D., & Jurafsky, D. ( 2018 ). Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6, 391 – 406. | |
dc.identifier.citedreference | King, G. ( 1995 ). Replication, Replication. PS, Political Science & Politics, 28 ( 3 ), 444 – 452. | |
dc.identifier.citedreference | Lafia, S., Ko, J.-W., Moss, E., Kim, J., Thomer, A., & Hemphill, L. ( 2021 ). Detecting informal data references in academic literature. Deep Blue. https://doi.org/10.7302/1671 | |
dc.identifier.citedreference | Lammey, R. ( 2015 ). CrossRef text and data mining services. Insight, 28 ( 2 ), 62 – 68. | |
dc.identifier.citedreference | Lo, K., Wang, L. L., Neumann, M., Kinney, R., & Weld, D. S. ( 2020 ). S2ORC: The Semantic Scholar Open Research Corpus. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4969 – 4983. | |
dc.identifier.citedreference | Lopez, P. ( 2009 ). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. Research and Advanced Technology for Digital Libraries, 473 – 474. | |
dc.identifier.citedreference | Mayo, C., Vision, T. J., & Hull, E. A. ( 2016 ). The location of the citation: changing practices in how publications cite original data in the Dryad Digital Repository. International Journal of Digital Curation, 11 ( 1 ), 150 – 155. | |
dc.identifier.citedreference | Mooney, H. ( 2011 ). Citing data sources in the social sciences: do authors do it? Learned Publishing: Journal of the Association of Learned and Professional Society Publishers, 24 ( 2 ), 99 – 108. | |
dc.identifier.citedreference | Moretti, F. ( 2000 ). Conjectures on world literature. New Left Review, 54 – 68. | |
dc.identifier.citedreference | Moss, E., Cave, C., & Lyle, J. ( 2015 ). Sharing and citing research data: A repository’s perspective. In H. K. P. Jayasuriya (Ed.), Big Data, Big Challenges in Evidence-based Policy Making (pp. 47 – 65 ). West Academic Publishing. | |
dc.identifier.citedreference | Moss, E., & Lyle, J. ( 2018 ). Opaque data citation: Actual citation practice and its implication for tracking data use. https://deepblue.lib.umich.edu/handle/2027.42/142393 | |
dc.identifier.citedreference | Nakov, P. I., Schwartz, A. S., Hearst, M., & Others. ( 2004 ). Citances: Citation sentences for semantic analysis of bioscience text. Proceedings of the SIGIR, 4, 81 – 88. | |
dc.identifier.citedreference | Park, H., You, S., & Wolfram, D. ( 2018 ). Informal data citation for data sharing and reuse is more common than formal data citation in biomedical fields. Journal of the Association for Information Science and Technology, 69 ( 11 ), 1346 – 1354. | |
dc.identifier.citedreference | Pasquetto, I. V., Randles, B. M., & Borgman, C. L. ( 2017 ). On the Reuse of Scientific Data. Data Science Journal, 16 ( 8 ). https://doi.org/10.5334/dsj-2017-008 | |
dc.identifier.citedreference | Peters, S., Ross, I., Czaplewski, J., Glassel, A., Husson, J., Syverson, V., Zaffos, A., & Livny, M. ( 2017 ). A New Tool for Deep-Down Data Mining. In Eos. https://doi.org/10.1029/2017eo082377 | |
dc.identifier.citedreference | Priem, J., Groth, P., & Taraborelli, D. ( 2012 ). The altmetrics collection. PloS One, 7 ( 11 ), e48753. | |
dc.identifier.citedreference | Sadvilkar, N., & Neumann, M. ( 2020 ). PySBD: Pragmatic Sentence Boundary Disambiguation. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2010.09657 | |
dc.identifier.citedreference | Yarkoni, T., Eckles, D., Heathers, J. A. J., Levenstein, M. C., Smaldino, P. E., & Lane, J. ( 2021 ). Enhancing and accelerating social science via automation: Challenges and opportunities. Harvard Data Science Review, 3 ( 2 ). https://doi.org/10.1162/99608f92.df2262f5 | |
dc.identifier.citedreference | Zimmerman, A. S. ( 2008 ). New Knowledge from Old Data: The Role of Standards in the Sharing and Reuse of Ecological Data. Science, Technology & Human Values, 33 ( 5 ), 631 – 652. | |
dc.identifier.citedreference | Acharya, A., Blackwell, M., & Sen, M. ( 2016 ). The Political Legacy of American Slavery. The Journal of Politics, 78 ( 3 ), 621 – 641. | |
dc.identifier.citedreference | Boland, K., Ritze, D., Eckert, K., & Mathiak, B. ( 2012 ). Identifying References to Datasets in Publications. Theory and Practice of Digital Libraries, 150 – 161. | |
dc.identifier.citedreference | Buneman, P., Dosso, D., Lissandrini, M., & Silvello, G. ( 2022 ). Data citation and the citation graph. Quantitative Science Studies, 2 ( 4 ), 1399 – 1422. | |
dc.identifier.citedreference | Chao, T. C. ( 2011 ). Disciplinary reach: Investigating the impact of dataset reuse in the earth sciences. In Proceedings of the American Society for Information Science and Technology (Vol. 48, Issue 1, pp. 1 – 8 ). https://doi.org/10.1002/meet.2011.14504801125 | |
dc.identifier.citedreference | Cousijn, H., Feeney, P., Lowenberg, D., Presani, E., & Simons, N. ( 2019 ). Bringing citations and usage metrics together to make data count. Data Science Journal, 18. https://doi.org/10.5334/dsj-2019-009 | |
dc.identifier.citedreference | Du, C., Cohoon, J., Lopez, P., & Howison, J. ( 2021 ). Softcite dataset: A dataset of software mentions in biomedical and economic research publications. Journal of the Association for Information Science and Technology, 72 ( 7 ), 870 – 884. | |
dc.identifier.citedreference | Fan, L., Lafia, S., Bleckley, D., Moss, E., Thomer, A., & Hemphill, L. ( 2022 ). Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature. In arXiv [cs.DL]. arXiv. http://arxiv.org/abs/2203.05112 | |
dc.identifier.citedreference | Fenner, M. ( 2019 ). Introducing the PID Graph. Front Matter. https://doi.org/10.53731/r79sf9h-97aq74v-ag4wp | |
dc.identifier.citedreference | Heddes, J., Meerdink, P., Pieters, M., & Marx, M. ( 2021 ). The Automatic Detection of Dataset Names in Scientific Articles. Data, 6 ( 8 ), 84. | |
dc.identifier.citedreference | Heidorn, P. B. ( 2008 ). Shedding Light on the Dark Data in the Long Tail of Science. Library Trends, 57 ( 2 ), 280 – 299. | |
dc.identifier.citedreference | He, L., & Han, Z. ( 2017 ). Do usage counts of scientific data make sense? An investigation of the Dryad repository. Library Hi Tech, 35 ( 2 ), 332 – 342. | |
dc.identifier.citedreference | Hellerstein, J. M., Sreekanti, V., Gonzalez, J. E., & Dalton, J. ( 2017 ). Ground: A Data Context Service. CIDR. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1071.9562&rep=rep1&type=pdf | |
dc.working.doi | NO | en |
dc.owningcollname | Interdisciplinary and Peer-Reviewed |