Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-based Systems
2022 (English) In: Proceedings - 1st International Conference on AI Engineering - Software Engineering for AI, CAIN 2022, Institute of Electrical and Electronics Engineers (IEEE), 2022, p. 229-239
Conference paper, Published paper (Refereed)
Abstract [en]
High data quality is fundamental for today's AI-based systems. However, although data quality has been an object of research for decades, there is a clear lack of research on potential data quality issues (e.g., ambiguous, extraneous values). These kinds of issues are latent in nature and thus often not obvious. Nevertheless, they can be associated with an increased risk of future problems in AI-based systems (e.g., technical debt, data-induced faults). As a counterpart to code smells in software engineering, we refer to such issues as Data Smells. This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems. In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented. Moreover, the article outlines tool support for detecting data smells and presents the result of an initial smell detection on more than 240 real-world datasets.
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2022. p. 229-239
Keywords [en]
Data reduction, Odors, Code smell, Data engineering, Data quality, Data smell, On potentials, Quality issues, Technical debts, Three categories, Tool support, Understandability, Software engineering, artificial intelligence, data smells
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:bth-23541
DOI: 10.1145/3522664.3528590
Scopus ID: 2-s2.0-85133411277
ISBN: 9781450392754 (print)
OAI: oai:DiVA.org:bth-23541
DiVA, id: diva2:1687071
Conference
1st International Conference on AI Engineering - Software Engineering for AI, CAIN 2022, Pittsburgh, 16 May 2022 through 17 May 2022
Note
open access
2022-08-12 2022-08-12 2022-12-13 Bibliographically approved