Cluster Validation Measures for Label Noise Filtering
2018 (English)
In: 9th International Conference on Intelligent Systems 2018: Theory, Research and Innovation in Applications, IS 2018 - Proceedings / [ed] JardimGoncalves, R; Mendonca, JP; Jotsov, V; Marques, M; Martins, J; Bierwolf, R, Institute of Electrical and Electronics Engineers Inc., 2018, p. 109-116
Conference paper, Published paper (Refereed)
Abstract [en]
Cluster validation measures are designed to find the partitioning that best fits the underlying data. In this paper, we show that these well-known and scientifically proven validation measures can also be used in a different context, i.e., for filtering mislabeled instances or class outliers prior to training in supervised learning problems. A technique, entitled CVI-based Outlier Filtering, is proposed in which mislabeled instances are identified and eliminated from the training set, and a classification hypothesis is then built from the set of remaining instances. The proposed approach assigns each instance several cluster validation scores representing its potential of being an outlier with respect to the clustering properties that the validation measures assess. We examine CVI-based Outlier Filtering and compare it against the LOF detection method on ten data sets from the UCI data repository using five well-known learning algorithms and three different cluster validation indices. In addition, we study two approaches for filtering mislabeled instances: local and global. Our results show that for most learning algorithms and data sets, the proposed CVI-based outlier filtering algorithm outperforms the baseline method (LOF). The greatest increase in classification accuracy has been achieved by combining at least two of the cluster validation indices used with global filtering of mislabeled instances. © 2018 IEEE.
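The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not name the three validation indices, the thresholds, or how local and global filtering differ. The sketch assumes a single index (the silhouette coefficient), treats class labels as cluster assignments, and uses a hypothetical cut-off of 0; the function names `silhouette_scores` and `filter_mislabeled` are illustrative only.

```python
import math


def silhouette_scores(X, y):
    """Per-instance silhouette score, treating class labels as clusters.

    Assumption: silhouette stands in for the (unnamed) validation indices.
    """
    labels = sorted(set(y))
    if len(labels) < 2:
        raise ValueError("at least two classes are required")
    scores = []
    for i in range(len(X)):
        # Accumulate distances from instance i to the members of each class.
        sums = {c: 0.0 for c in labels}
        counts = {c: 0 for c in labels}
        for j in range(len(X)):
            if i != j:
                sums[y[j]] += math.dist(X[i], X[j])
                counts[y[j]] += 1
        if counts[y[i]] == 0:
            scores.append(0.0)  # singleton class: silhouette is defined as 0
            continue
        a = sums[y[i]] / counts[y[i]]  # mean distance to own class
        b = min(sums[c] / counts[c]    # mean distance to nearest other class
                for c in labels if c != y[i] and counts[c] > 0)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def filter_mislabeled(X, y, threshold=0.0):
    """Drop instances whose score falls below the (hypothetical) threshold."""
    scores = silhouette_scores(X, y)
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    return [X[i] for i in keep], [y[i] for i in keep], scores


# Toy data: two well-separated groups, with the last point labeled 1
# although it lies in the middle of the class-0 group.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11), (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
X_clean, y_clean, scores = filter_mislabeled(X, y)
```

An instance that sits closer to another class than to its own receives a negative score and is removed before a classifier is trained on `X_clean` and `y_clean`. Combining several indices, as the abstract reports, would require a rule for merging per-instance scores, which is not specified here.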
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers Inc., 2018. p. 109-116
Keywords [en]
Class noise, Classification, Cluster validation measures, Label noise, Classification (of information), Intelligent systems, Learning algorithms, Statistics, Classification accuracy, Cluster validation, Clustering properties, Data repositories, Detection methods, Filtering algorithm, Learning problem, Clustering algorithms
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-18023
DOI: 10.1109/IS.2018.8710495
ISI: 000469337900017
Scopus ID: 2-s2.0-85065973083
ISBN: 9781538670972 (print)
OAI: oai:DiVA.org:bth-18023
DiVA, id: diva2:1324906
Conference
9th International Conference on Intelligent Systems, IS 2018; Funchal - Madeira; Portugal; 25 September 2018 through 27
Part of project
Bigdata@BTH - Scalable resource-efficient systems for big data analytics, Knowledge Foundation
2019-06-14, 2019-06-14, 2021-07-26
Bibliographically approved