Effect of forename string on author name disambiguation

Kim, Jinseok; Kim, Jenna

dc.contributor.author	Kim, Jinseok
dc.contributor.author	Kim, Jenna
dc.date.accessioned	2020-07-02T20:33:15Z
dc.date.available	WITHHELD_13_MONTHS
dc.date.available	2020-07-02T20:33:15Z
dc.date.issued	2020-07
dc.identifier.citation	Kim, Jinseok; Kim, Jenna (2020). "Effect of forename string on author name disambiguation." Journal of the Association for Information Science and Technology 71(7): 839-855.
dc.identifier.issn	2330-1635
dc.identifier.issn	2330-1643
dc.identifier.uri	https://hdl.handle.net/2027.42/155924
dc.description.abstract	In author name disambiguation, author forenames are used to decide which name instances are disambiguated together and how likely they are to refer to the same author. Despite this crucial role, the effect of forenames on the performance of heuristic (string matching) and algorithmic disambiguation is not well understood. This study assesses the contributions of forenames in author name disambiguation using multiple labeled data sets under varying ratios and lengths of full forenames, reflecting real‐world scenarios in which an author is represented by forename variants (synonyms) and some authors share the same forenames (homonyms). The results show that increasing the ratio of full forenames substantially improves both heuristic and machine‐learning‐based disambiguation. Performance gains from algorithmic disambiguation are pronounced when many forenames are initialized or homonyms are prevalent. As the ratio of full forenames increases, however, these gains become marginal compared to those from string matching. Using a small portion of forename strings does not much reduce the performance of either heuristic or algorithmic disambiguation compared to using full‐length strings. These findings suggest practical measures, such as restoring initialized forenames to full strings via record linkage, for improving disambiguation performance.
dc.publisher	John Wiley & Sons, Inc.
dc.title	Effect of forename string on author name disambiguation
dc.type	Article
dc.rights.robots	IndexNoFollow
dc.subject.hlbsecondlevel	Information Science
dc.subject.hlbtoplevel	Social Sciences
dc.description.peerreviewed	Peer Reviewed
dc.description.bitstreamurl	https://deepblue.lib.umich.edu/bitstream/2027.42/155924/1/asi24298.pdf
dc.description.bitstreamurl	https://deepblue.lib.umich.edu/bitstream/2027.42/155924/2/asi24298_am.pdf
dc.identifier.doi	10.1002/asi.24298
dc.identifier.source	Journal of the Association for Information Science and Technology
dc.owningcollname	Interdisciplinary and Peer-Reviewed

Files in this item

Name:: asi24298.pdf
Size:: 3.832MB
Format:: PDF

Name:: asi24298_am.pdf
Size:: 2.027MB
Format:: PDF
