Effect of forename string on author name disambiguation

dc.contributor.author: Kim, Jinseok
dc.contributor.author: Kim, Jenna
dc.date.accessioned: 2020-07-02T20:33:15Z
dc.date.available: WITHHELD_13_MONTHS
dc.date.available: 2020-07-02T20:33:15Z
dc.date.issued: 2020-07
dc.identifier.citation: Kim, Jinseok; Kim, Jenna (2020). "Effect of forename string on author name disambiguation." Journal of the Association for Information Science and Technology 71(7): 839-855.
dc.identifier.issn: 2330-1635
dc.identifier.issn: 2330-1643
dc.identifier.uri: https://hdl.handle.net/2027.42/155924
dc.description.abstract: In author name disambiguation, author forenames are used to decide which name instances are disambiguated together and how likely they are to refer to the same author. Despite this crucial role, the effect of forenames on the performance of heuristic (string matching) and algorithmic disambiguation is not well understood. This study assesses the contributions of forenames in author name disambiguation using multiple labeled data sets under varying ratios and lengths of full forenames, reflecting real‐world scenarios in which an author is represented by forename variants (synonyms) and some authors share the same forenames (homonyms). The results show that increasing the ratio of full forenames substantially improves both heuristic and machine‐learning‐based disambiguation. Performance gains from algorithmic disambiguation are pronounced when many forenames are initialized or homonyms are prevalent. As the ratio of full forenames increases, however, these gains become marginal compared with those from string matching. Using only a small portion of each forename string does not much reduce the performance of either heuristic or algorithmic disambiguation compared with using full‐length strings. These findings suggest practical steps, such as restoring initialized forenames to a full‐string format via record linkage, to improve disambiguation performance.
dc.publisher: John Wiley & Sons, Inc.
dc.title: Effect of forename string on author name disambiguation
dc.type: Article
dc.rights.robots: IndexNoFollow
dc.subject.hlbsecondlevel: Information Science
dc.subject.hlbtoplevel: Social Sciences
dc.description.peerreviewed: Peer Reviewed
dc.description.bitstreamurl: https://deepblue.lib.umich.edu/bitstream/2027.42/155924/1/asi24298.pdf
dc.description.bitstreamurl: https://deepblue.lib.umich.edu/bitstream/2027.42/155924/2/asi24298_am.pdf
dc.identifier.doi: 10.1002/asi.24298
dc.identifier.source: Journal of the Association for Information Science and Technology
dc.owningcollname: Interdisciplinary and Peer-Reviewed
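
The abstract contrasts heuristic (string matching) disambiguation on full forenames with matching on initialized or shortened forenames. The following is a minimal sketch of that idea only, not the authors' method: the function names, the `prefix_len` parameter, and the sample name instances are all assumptions made for illustration.

```python
from collections import defaultdict

def match_key(surname, forename, prefix_len=None):
    """Build a matching key from the surname and the first `prefix_len`
    characters of the forename (None keeps the whole forename).
    Hypothetical helper, not taken from the article."""
    forename = forename.strip().lower()
    if prefix_len is not None:
        forename = forename[:prefix_len]
    return (surname.strip().lower(), forename)

def cluster_by_string_match(instances, prefix_len=None):
    """Group name instances whose matching keys are identical."""
    clusters = defaultdict(list)
    for record_id, surname, forename in instances:
        clusters[match_key(surname, forename, prefix_len)].append(record_id)
    return dict(clusters)

# Illustrative name instances (assumed data): (record id, surname, forename)
instances = [
    (1, "Kim", "Jinseok"),
    (2, "Kim", "Jenna"),
    (3, "Kim", "Jinseok"),
    (4, "Kim", "J"),
]

full_clusters = cluster_by_string_match(instances)                  # full forenames
initial_clusters = cluster_by_string_match(instances, prefix_len=1) # first initial only
prefix_clusters = cluster_by_string_match(instances, prefix_len=3)  # short prefix
```

With these assumed records, first-initial matching merges all four instances under one key, while full-forename matching and a three-character prefix both keep "Jinseok" and "Jenna" apart. The already initialized instance "J" stays unresolved on its own in the latter two cases, which is the kind of record the abstract suggests restoring to a full forename via record linkage.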

