The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction
dc.contributor.author | Seliya, Naeem | en_US |
dc.contributor.author | Khoshgoftaar, Taghi M. | en_US |
dc.date.accessioned | 2011-11-10T15:39:37Z | |
dc.date.available | 2012-11-02T18:56:54Z | en_US |
dc.date.issued | 2011-09 | en_US |
dc.identifier.citation | Seliya, Naeem; Khoshgoftaar, Taghi M. (2011). "The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(5): 448-459. <http://hdl.handle.net/2027.42/87156> | en_US |
dc.identifier.issn | 1942-4787 | en_US |
dc.identifier.issn | 1942-4795 | en_US |
dc.identifier.uri | https://hdl.handle.net/2027.42/87156 | |
dc.description.abstract | This empirical study investigates two commonly used decision tree classification algorithms in the context of cost‐sensitive learning. A review of the literature shows that the cost‐based performance of a software quality prediction model is usually determined after the model‐training process has been completed. In contrast, we incorporate cost‐sensitive learning during the model‐training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with, or without, any cost‐sensitive learning technique. The paper investigates six different cost‐sensitive learning techniques: AdaCost, Adc2, Csb2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data comprise 15 software measurement datasets obtained from several high‐assurance systems. In addition to providing a unique insight into the cost‐based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model‐training process. The practical appeal of this research is that it provides a software quality practitioner with a clear process for how to consider (during model training) and analyze (during model evaluation) the cost‐based performance of a defect prediction model. RUS is ranked as the best cost‐sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 448–459 DOI: 10.1002/widm.38 | en_US |
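Of the six cost‐sensitive techniques the abstract names, Random Undersampling (RUS), the technique the study ranks best, is the simplest to sketch. The following is a minimal, hypothetical illustration of RUS, not the authors' implementation: before training a classifier such as C4.5 or Random Forest, majority‐class (non‐defective) modules are randomly discarded until the classes are balanced, which implicitly raises the cost of misclassifying the rare defective class.

```python
import random


def random_undersample(samples, labels, minority_label, seed=0):
    """Balance a binary dataset by randomly discarding majority-class
    samples until both classes are the same size (Random Undersampling)."""
    rng = random.Random(seed)
    minority = [(x, y) for x, y in zip(samples, labels) if y == minority_label]
    majority = [(x, y) for x, y in zip(samples, labels) if y != minority_label]
    # Keep only as many majority samples as there are minority samples.
    kept = rng.sample(majority, len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    xs, ys = zip(*balanced)
    return list(xs), list(ys)
```

The balanced output would then be fed to any standard decision tree learner; the undersampling ratio here (1:1) is an illustrative assumption, and in practice it is often tuned to the misclassification cost ratio.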
dc.publisher | John Wiley & Sons, Inc. | en_US |
dc.title | The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction | en_US |
dc.type | Article | en_US |
dc.rights.robots | IndexNoFollow | en_US |
dc.subject.hlbsecondlevel | Information Science | en_US |
dc.subject.hlbtoplevel | Social Sciences | en_US |
dc.description.peerreviewed | Peer Reviewed | en_US |
dc.contributor.affiliationum | Computer and Information Science, University of Michigan—Dearborn, Dearborn, MI, USA | en_US |
dc.contributor.affiliationother | Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA | en_US |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/87156/1/38_ftp.pdf | |
dc.identifier.doi | 10.1002/widm.38 | en_US |
dc.identifier.source | Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Cukic B, Seliya N. An empirical assessment on program module‐order models. Qual Technol Quant Manag 2007, 4: 171 – 190. | en_US |
dc.identifier.citedreference | Emam KE, Benlarbi S, Goel N, Rai SN. Comparing case‐based reasoning classifiers for predicting high‐risk software components. J Syst Softw 2001, 55: 301 – 320. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Seliya N. Comparative assessment of software quality classification techniques: an empirical case study. Empir Softw Eng J 2004, 9: 229 – 257. | en_US |
dc.identifier.citedreference | Lessmann S, Baesens B, Mues C, Pietsch S. Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 2008, 34: 485 – 496. | en_US |
dc.identifier.citedreference | Liu Y, Khoshgoftaar TM, Seliya N. Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans Softw Eng 2010, 36: 852 – 864. | en_US |
dc.identifier.citedreference | Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco, CA: Morgan Kaufmann; 2005. | en_US |
dc.identifier.citedreference | Breiman L. Random forests. Mach Learn 2001, 45: 5 – 32. | en_US |
dc.identifier.citedreference | Fan W, Stolfo SJ, Zhang J, Chan PK. Adacost: misclassification cost‐sensitive boosting. In: Proceedings of 16th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann; 1999, 97 – 105. | en_US |
dc.identifier.citedreference | Ting KM. A comparative study of cost‐sensitive boosting algorithms. In: Proceedings of 17th International Conference on Machine Learning. Stanford, CA: Morgan Kaufmann; 2000, 983 – 990. | en_US |
dc.identifier.citedreference | Sun Y, Kamel MS, Wong AKC, Wang Y. Cost‐sensitive boosting for classification of imbalanced data. Pattern Recognit 2007, 40: 3358 – 3378. | en_US |
dc.identifier.citedreference | Domingos P. Metacost: a general method for making classifiers cost‐sensitive. In: Proceedings of Knowledge Discovery and Data Mining. New York: ACM Press; 1999, 155 – 164. | en_US |
dc.identifier.citedreference | Elkan C. The foundations of cost‐sensitive learning. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence. Vol. 2. San Francisco, CA: Morgan Kaufmann Publishers Inc.; 2001, 973 – 978. | en_US |
dc.identifier.citedreference | Jiang Y, Cukic B, Menzies T. Cost curve evaluation of fault prediction models. In: Proceedings of the 19th International Symposium on Software Reliability Engineering. Seattle, WA: IEEE Computer Society; 2008, 197 – 206. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Seliya N, Herzberg A. Resource‐oriented software quality classification models. J Syst Softw 2005, 76: 111 – 126. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Liu Y, Seliya N. A multi‐objective module‐order model for software quality enhancement. IEEE Trans Evolution Comput 2004, 8: 593 – 608. | en_US |
dc.identifier.citedreference | Drummond C, Holte RC. Cost curves: an improved method for visualizing classifier performance. Mach Learn 2006, 65: 95 – 130. | en_US |
dc.identifier.citedreference | Seliya N, Khoshgoftaar TM. Value‐based software quality modeling. In: SEKE. Skokie, IL: Knowledge Systems Institute Graduate School; 2009, 116 – 121. | en_US |
dc.identifier.citedreference | Freund Y, Schapire R. Experiments with a new boosting algorithm. In: Proceedings of 13th International Conference on Machine Learning. Bari: Morgan Kaufmann; 1996, 148 – 156. | en_US |
dc.identifier.citedreference | Breiman L. Bagging predictors. Mach Learn 1996, 24: 123 – 140. | en_US |
dc.identifier.citedreference | Van Hulse J, Khoshgoftaar TM, Napolitano A. Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning. New York: ACM Press; 2007, 935 – 945. | en_US |
dc.identifier.citedreference | Sayyad Shirabad J, Menzies TJ. The PROMISE repository of software engineering databases, School of Information Technology and Engineering, University of Ottawa, Canada; 2005. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Zhong S, Joshi V. Noise elimination with ensemble‐classifier filtering for software quality estimation. Intell Data Anal: An Int J 2005, 9: 3 – 27. | en_US |
dc.identifier.citedreference | Seliya N, Khoshgoftaar TM. Software quality analysis of unlabeled program modules with semi‐supervised clustering. IEEE Trans Syst Man Cybern 2007, 37: 201 – 211. | en_US |
dc.identifier.citedreference | Khoshgoftaar TM, Allen EB. Logistic regression modeling of software quality. Int J Reliab Qual Saf Eng 1999, 6: 303 – 317. | en_US |
dc.identifier.citedreference | Berenson ML, Levine DM, Goldstein M. Intermediate Statistical Methods and Applications: A Computer Package Approach. Englewood Cliffs, NJ: Prentice‐Hall, Inc.; 1989. | en_US |
dc.owningcollname | Interdisciplinary and Peer-Reviewed |