Show simple item record

The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction

dc.contributor.author: Seliya, Naeem (en_US)
dc.contributor.author: Khoshgoftaar, Taghi M. (en_US)
dc.date.accessioned: 2011-11-10T15:39:37Z
dc.date.available: 2012-11-02T18:56:54Z (en_US)
dc.date.issued: 2011-09 (en_US)
dc.identifier.citation: Seliya, Naeem; Khoshgoftaar, Taghi M. (2011). "The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(5): 448-459. <http://hdl.handle.net/2027.42/87156> (en_US)
dc.identifier.issn: 1942-4787 (en_US)
dc.identifier.issn: 1942-4795 (en_US)
dc.identifier.uri: https://hdl.handle.net/2027.42/87156
dc.description.abstract: This empirical study investigates two commonly used decision tree classification algorithms in the context of cost‐sensitive learning. A review of the literature shows that the cost‐based performance of a software quality prediction model is usually determined after the model‐training process has been completed. In contrast, we incorporate cost‐sensitive learning during the model‐training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with or without a cost‐sensitive learning technique. The paper investigates six different cost‐sensitive learning techniques: AdaCost, Adc2, Csb2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data comprise 15 software measurement datasets obtained from several high‐assurance systems. In addition to providing a unique insight into the cost‐based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model‐training process. The practical appeal of this research is that it provides a software quality practitioner with a clear process for how to consider (during model training) and analyze (during model evaluation) the cost‐based performance of a defect prediction model. RUS is ranked as the best cost‐sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 448–459 DOI: 10.1002/widm.38 (en_US)
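Among the six techniques the abstract lists, Random Undersampling (RUS) is ranked best; it injects the cost preference at training time by rebalancing the data the tree learner sees before the model is grown. The following Python sketch is an illustration only, not the authors' implementation: scikit-learn's DecisionTreeClassifier stands in for the paper's C4.5, and the 1:1 post-sampling ratio, the synthetic data, and the function names are assumptions of this example.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def random_undersample(X, y, majority_label=0, seed=0):
        # Keep every minority-class row; keep a random same-sized
        # subset of majority-class rows (assumed 1:1 post-sampling ratio).
        rng = np.random.default_rng(seed)
        majority = np.flatnonzero(y == majority_label)
        minority = np.flatnonzero(y != majority_label)
        n_keep = min(len(majority), len(minority))
        kept = rng.choice(majority, size=n_keep, replace=False)
        idx = rng.permutation(np.concatenate([kept, minority]))
        return X[idx], y[idx]

    # Hypothetical stand-in for a software measurement dataset:
    # 8 metrics per module, roughly 10% of modules fault-prone (label 1).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 8))
    y = (rng.random(1000) < 0.1).astype(int)

    X_bal, y_bal = random_undersample(X, y, majority_label=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)

Because the undersampling changes the class distribution before the tree is grown, the misclassification-cost preference influences model training itself rather than only post-training evaluation, which is the shift in perspective the abstract describes.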
dc.publisher: John Wiley & Sons, Inc. (en_US)
dc.title: The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction (en_US)
dc.type: Article (en_US)
dc.rights.robots: IndexNoFollow (en_US)
dc.subject.hlbsecondlevel: Information Science (en_US)
dc.subject.hlbtoplevel: Social Sciences (en_US)
dc.description.peerreviewed: Peer Reviewed (en_US)
dc.contributor.affiliationum: Computer and Information Science, University of Michigan—Dearborn, Dearborn, MI, USA (en_US)
dc.contributor.affiliationother: Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA (en_US)
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/87156/1/38_ftp.pdf
dc.identifier.doi: 10.1002/widm.38 (en_US)
dc.identifier.source: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (en_US)
dc.identifier.citedreference: Khoshgoftaar TM, Cukic B, Seliya N. An empirical assessment on program module‐order models. Qual Technol Quant Manag 2007, 4: 171 – 190. (en_US)
dc.identifier.citedreference: Emam KE, Benlarbi S, Goel N, Rai SN. Comparing case‐based reasoning classifiers for predicting high‐risk software components. J Syst Softw 2001, 55: 301 – 320. (en_US)
dc.identifier.citedreference: Khoshgoftaar TM, Seliya N. Comparative assessment of software quality classification techniques: an empirical case study. Empir Softw Eng J 2004, 9: 229 – 257. (en_US)
dc.identifier.citedreference: Lessmann S, Baesens B, Mues C, Pietsch S. Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 2008, 34: 485 – 496. (en_US)
dc.identifier.citedreference: Liu Y, Khoshgoftaar TM, Seliya N. Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans Softw Eng 2010, 36: 852 – 864. (en_US)
dc.identifier.citedreference: Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco, CA: Morgan Kaufmann; 2005. (en_US)
dc.identifier.citedreference: Breiman L. Random forests. Mach Learn 2001, 45: 5 – 32. (en_US)
dc.identifier.citedreference: Fan W, Stolfo SJ, Zhang J, Chan PK. AdaCost: misclassification cost‐sensitive boosting. In: Proceedings of 16th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann; 1999, 97 – 105. (en_US)
dc.identifier.citedreference: Ting KM. A comparative study of cost‐sensitive boosting algorithms. In: Proceedings of 17th International Conference on Machine Learning. Stanford, CA: Morgan Kaufmann; 2000, 983 – 990. (en_US)
dc.identifier.citedreference: Sun Y, Kamel MS, Wong AKC, Wang Y. Cost‐sensitive boosting for classification of imbalanced data. Pattern Recognit 2007, 40: 3358 – 3378. (en_US)
dc.identifier.citedreference: Domingos P. MetaCost: a general method for making classifiers cost‐sensitive. In: Proceedings of Knowledge Discovery and Data Mining. New York: ACM Press; 1999, 155 – 164. (en_US)
dc.identifier.citedreference: Elkan C. The foundations of cost‐sensitive learning. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence. Vol. 2. San Francisco, CA: Morgan Kaufmann Publishers Inc.; 2001, 973 – 978. (en_US)
dc.identifier.citedreference: Jiang Y, Cukic B, Menzies T. Cost curve evaluation of fault prediction models. In: Proceedings of the 19th International Symposium on Software Reliability Engineering. Seattle, WA: IEEE Computer Society; 2008, 197 – 206. (en_US)
dc.identifier.citedreference: Khoshgoftaar TM, Seliya N, Herzberg A. Resource‐oriented software quality classification models. J Syst Softw 2005, 76: 111 – 126. (en_US)
dc.identifier.citedreference: Khoshgoftaar TM, Liu Y, Seliya N. A multi‐objective module‐order model for software quality enhancement. IEEE Trans Evolution Comput 2004, 8: 593 – 608. (en_US)
dc.identifier.citedreference: Drummond C, Holte RC. Cost curves: an improved method for visualizing classifier performance. Mach Learn 2006, 65: 95 – 130. (en_US)
dc.identifier.citedreference: Seliya N, Khoshgoftaar TM. Value‐based software quality modeling. In: SEKE. Skokie, IL: Knowledge Systems Institute Graduate School; 2009, 116 – 121. (en_US)
dc.identifier.citedreference: Freund Y, Schapire R. Experiments with a new boosting algorithm. In: Proceedings of 13th International Conference on Machine Learning. Bari: Morgan Kaufmann; 1996, 148 – 156. (en_US)
dc.identifier.citedreference: Breiman L. Bagging predictors. Mach Learn 1996, 24: 123 – 140. (en_US)
dc.identifier.citedreference: Van Hulse J, Khoshgoftaar TM, Napolitano A. Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning. New York: ACM Press; 2007, 935 – 945. (en_US)
dc.identifier.citedreference: Sayyad Shirabad J, Menzies TJ. The PROMISE repository of software engineering databases, School of Information Technology and Engineering, University of Ottawa, Canada; 2005. (en_US)
dc.identifier.citedreference: Khoshgoftaar TM, Zhong S, Joshi V. Noise elimination with ensemble‐classifier filtering for software quality estimation. Intell Data Anal: An Int J 2005, 9: 3 – 27. (en_US)
dc.identifier.citedreference: Seliya N, Khoshgoftaar TM. Software quality analysis of unlabeled program modules with semi‐supervised clustering. IEEE Trans Syst Man Cybern 2007, 37: 201 – 211. (en_US)
dc.identifier.citedreference: Khoshgoftaar TM, Allen EB. Logistic regression modeling of software quality. Int J Reliab Qual Saf Eng 1999, 6: 303 – 317. (en_US)
dc.identifier.citedreference: Berenson ML, Levine DM, Goldstein M. Intermediate Statistical Methods and Applications: A Computer Package Approach. Englewood Cliffs, NJ: Prentice‐Hall, Inc.; 1989. (en_US)
dc.owningcollname: Interdisciplinary and Peer-Reviewed


