The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction
dc.contributor.author | Seliya, Naeem | en_US |
dc.contributor.author | Khoshgoftaar, Taghi M. | en_US |
dc.date.accessioned | 2011-11-10T15:39:37Z | |
dc.date.available | 2012-11-02T18:56:54Z | en_US |
dc.date.issued | 2011-09 | en_US |
dc.identifier.citation | Seliya, Naeem; Khoshgoftaar, Taghi M. (2011). "The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(5): 448-459. <http://hdl.handle.net/2027.42/87156> | en_US |
dc.identifier.issn | 1942-4787 | en_US |
dc.identifier.issn | 1942-4795 | en_US |
dc.identifier.uri | https://hdl.handle.net/2027.42/87156 | |
dc.description.abstract | This empirical study investigates two commonly used decision tree classification algorithms in the context of cost‐sensitive learning. A review of the literature shows that the cost‐based performance of a software quality prediction model is usually determined after the model‐training process has been completed. In contrast, we incorporate cost‐sensitive learning during the model‐training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with, or without, any cost‐sensitive learning technique. The paper investigates six different cost‐sensitive learning techniques: AdaCost, Adc2, Csb2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data comprise 15 software measurement datasets obtained from several high‐assurance systems. In addition to providing a unique insight into the cost‐based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model‐training process. The practical appeal of this research is that it provides a software quality practitioner with a clear process for how to consider (during model training) and analyze (during model evaluation) the cost‐based performance of a defect prediction model. RUS is ranked as the best cost‐sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 448–459 DOI: 10.1002/widm.38 | en_US |
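Of the six cost‐sensitive techniques the abstract names, Random Undersampling (RUS), the technique the study ranks best, is the simplest to sketch. The following is a minimal, hypothetical illustration of RUS, not the authors' implementation: before training a classifier such as C4.5 or Random Forest, majority‐class (non‐defective) modules are randomly discarded until the classes are balanced, which implicitly raises the cost of misclassifying the rare defective class.

```python
import random


def random_undersample(samples, labels, minority_label, seed=0):
    """Balance a binary dataset by randomly discarding majority-class
    samples until both classes are the same size (Random Undersampling)."""
    rng = random.Random(seed)
    minority = [(x, y) for x, y in zip(samples, labels) if y == minority_label]
    majority = [(x, y) for x, y in zip(samples, labels) if y != minority_label]
    # Keep only as many majority samples as there are minority samples.
    kept = rng.sample(majority, len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    xs, ys = zip(*balanced)
    return list(xs), list(ys)
```

The balanced output would then be fed to any standard decision tree learner; the undersampling ratio here (1:1) is an illustrative assumption, and in practice it is often tuned to the misclassification cost ratio.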
dc.publisher | John Wiley & Sons, Inc. | en_US |
dc.title | The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction | en_US |
dc.type | Article | en_US |
dc.rights.robots | IndexNoFollow | en_US |
dc.subject.hlbsecondlevel | Information Science | en_US |
dc.subject.hlbtoplevel | Social Sciences | en_US |
dc.description.peerreviewed | Peer Reviewed | en_US |
dc.contributor.affiliationum | Computer and Information Science, University of Michigan—Dearborn, Dearborn, MI, USA | en_US |
dc.contributor.affiliationother | Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA | en_US |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/87156/1/38_ftp.pdf | |
dc.identifier.doi | 10.1002/widm.38 | en_US |
dc.identifier.source | Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Cukic B, Seliya N. An empirical assessment on program module‐order models. Qual Technol Quant Manag 2007, 4: 171 – 190. | en_US |
dc.identifier.citedreference | Emam KE, Benlarbi S, Goel N, Rai SN. Comparing case‐based reasoning classifiers for predicting high‐risk software components. J Syst Softw 2001, 55: 301 – 320. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Seliya N. Comparative assessment of software quality classification techniques: an empirical case study. Empir Softw Eng J 2004, 9: 229 – 257. | en_US |
dc.identifier.citedreference | Lessmann S, Baesens B, Mues C, Pietsch S. Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 2008, 34: 485 – 496. | en_US |
dc.identifier.citedreference | Liu Y, Khoshgoftaar TM, Seliya N. Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans Softw Eng 2010, 36: 852 – 864. | en_US |
dc.identifier.citedreference | Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco, CA: Morgan Kaufmann; 2005. | en_US |
dc.identifier.citedreference | Breiman L. Random forests. Mach Learn 2001, 45: 5 – 32. | en_US |
dc.identifier.citedreference | Fan W, Stolfo SJ, Zhang J, Chan PK. Adacost: misclassification cost‐sensitive boosting. In: Proceedings of 16th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann; 1999, 97 – 105. | en_US |
dc.identifier.citedreference | Ting KM. A comparative study of cost‐sensitive boosting algorithms. In: Proceedings of 17th International Conference on Machine Learning. Stanford, CA: Morgan Kaufmann; 2000, 983 – 990. | en_US |
dc.identifier.citedreference | Sun Y, Kamel MS, Wong AKC, Wang Y. Cost‐sensitive boosting for classification of imbalanced data. Pattern Recognit 2007, 40: 3358 – 3378. | en_US |
dc.identifier.citedreference | Domingos P. Metacost: a general method for making classifiers cost‐sensitive. In: Proceedings of Knowledge Discovery and Data Mining. New York: ACM Press; 1999, 155 – 164. | en_US |
dc.identifier.citedreference | Elkan C. The foundations of cost‐sensitive learning. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence. Vol. 2. San Francisco, CA: Morgan Kaufmann Publishers Inc.; 2001, 973 – 978. | en_US |
dc.identifier.citedreference | Jiang Y, Cukic B, Menzies T. Cost curve evaluation of fault prediction models. In: Proceedings of the 19th International Symposium on Software Reliability Engineering. Seattle, WA: IEEE Computer Society; 2008, 197 – 206. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Seliya N, Herzberg A. Resource‐oriented software quality classification models. J Syst Softw 2005, 76: 111 – 126. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Liu Y, Seliya N. A multi‐objective module‐order model for software quality enhancement. IEEE Trans Evolution Comput 2004, 8: 593 – 608. | en_US |
dc.identifier.citedreference | Drummond C, Holte RC. Cost curves: an improved method for visualizing classifier performance. Mach Learn 2006, 65: 95 – 130. | en_US |
dc.identifier.citedreference | Seliya N, Khoshgoftaar TM. Value‐based software quality modeling. In: SEKE. Skokie, IL: Knowledge Systems Institute Graduate School; 2009, 116 – 121. | en_US |
dc.identifier.citedreference | Freund Y, Schapire R. Experiments with a new boosting algorithm. In: Proceedings of 13th International Conference on Machine Learning. Bari: Morgan Kaufmann; 1996, 148 – 156. | en_US |
dc.identifier.citedreference | Breiman L. Bagging predictors. Mach Learn 1996, 24: 123 – 140. | en_US |
dc.identifier.citedreference | Van Hulse J, Khoshgoftaar TM, Napolitano A. Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning. New York: ACM Press; 2007, 935 – 945. | en_US |
dc.identifier.citedreference | Sayyad Shirabad J, Menzies TJ. The PROMISE repository of software engineering databases, School of Information Technology and Engineering, University of Ottawa, Canada; 2005. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Zhong S, Joshi V. Noise elimination with ensemble‐classifier filtering for software quality estimation. Intell Data Anal: An Int J 2005, 9: 3 – 27. | en_US |
dc.identifier.citedreference | Seliya N, Khoshgoftaar TM. Software quality analysis of unlabeled program modules with semi‐supervised clustering. IEEE Trans Syst Man Cybern 2007, 37: 201 – 211. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Allen EB. Logistic regression modeling of software quality. Int J Reliab Qual Saf Eng 1999, 6: 303 – 317. | en_US |
dc.identifier.citedreference | Berenson ML, Levine DM, Goldstein M. Intermediate Statistical Methods and Applications: A Computer Package Approach. Englewood Cliffs, NJ: Prentice‐Hall, Inc.; 1989. | en_US |
dc.owningcollname | Interdisciplinary and Peer-Reviewed |