Data Mining Approaches for Outlier Detection Analysis
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science. ORCID iD: 0000-0002-3010-8798
2020 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Outlier detection is studied and applied in many domains. Outliers arise for different reasons, such as fraudulent activities, structural defects, health problems, and mechanical issues. The detection of outliers is a challenging task that can reveal system faults and fraud, and can save people's lives. Outlier detection techniques are often domain-specific. The main challenge in outlier detection relates to modelling the normal behaviour in order to identify abnormalities. The choice of model is important, i.e., an unsuitable data model can lead to poor results. This requires a good understanding and interpretation of the data, the constraints, and the requirements of the domain problem. Outlier detection is largely an unsupervised problem because labeled data is often unavailable and expensive to obtain.

In this thesis, we study and apply a combination of both machine learning and data mining techniques to build data-driven and domain-oriented outlier detection models. We focus on three real-world application domains: maritime surveillance, district heating, and online media and sequence datasets. We show the importance of data preprocessing as well as feature selection in building suitable methods for data modelling. We take advantage of both supervised and unsupervised techniques to create hybrid methods. 

More specifically, we propose a rule-based anomaly detection system using open data for the maritime surveillance domain. We exploit sequential pattern mining for identifying contextual and collective outliers in online media data. We propose a minimum spanning tree clustering technique for detection of groups of outliers in online media and sequence data. We develop a few higher order mining approaches for identifying manual changes and deviating behaviours in the heating systems at the building level. The proposed approaches are shown to be capable of explaining the underlying properties of the detected outliers. This can help domain experts narrow down the scope of analysis and understand the reasons for such anomalous behaviours. We also investigate the reproducibility of the proposed models in similar application domains.

Place, publisher, year, edition, pages
Karlskrona: Blekinge Tekniska Högskola, 2020, p. 251
Series
Blekinge Institute of Technology Doctoral Dissertation Series, ISSN 1653-2090 ; 9
Keywords [en]
outlier detection, data modelling, machine learning, clustering analysis, data stream mining
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:bth-20454
ISBN: 9789172954090 (print)
OAI: oai:DiVA.org:bth-20454
DiVA, id: diva2:1474986
Public defence
2020-12-01, J1630, Karlskrona, 13:00 (English)
Funder
Knowledge Foundation, 20140032
Available from: 2020-10-16 Created: 2020-10-12 Last updated: 2020-12-14 Bibliographically approved
List of papers
1. Open Data for Anomaly Detection in Maritime Surveillance
2013 (English) In: Expert Systems with Applications, ISSN 0957-4174, Vol. 40, no 14, p. 5719-5729. Article in journal (Refereed) Published
Abstract [en]

Maritime Surveillance has received increased attention from a civilian perspective in recent years. Anomaly detection is one of many techniques available for improving the safety and security in this domain. Maritime authorities use confidential data sources for monitoring the maritime activities; however, a paradigm shift on the Internet has created new open sources of data. We investigate the potential of using open data as a complementary resource for anomaly detection in maritime surveillance. We present and evaluate a decision support system based on open data and expert rules for this purpose. We conduct a case study in which experts from the Swedish coastguard participate to conduct a real-world validation of the system. We conclude that the exploitation of open data as a complementary resource is feasible since our results indicate improvements in the efficiency and effectiveness of the existing surveillance systems by increasing the accuracy and covering unseen aspects of maritime activities.

Place, publisher, year, edition, pages
Elsevier, 2013
Keywords
Open data, Anomaly detection, Maritime security, Maritime domain awareness
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-6807 (URN); 10.1016/j.eswa.2013.04.029 (DOI); 000321089200029 (); oai:bth.se:forskinfoD455168E88392FDDC1257B6200290B99 (Local ID, Archive number, OAI)
Available from: 2013-12-17 Created: 2013-05-05 Last updated: 2021-03-26 Bibliographically approved
2. Outlier Detection for Video Session Data Using Sequential Pattern Mining
2018 (English) In: ACM SIGKDD Workshop On Outlier Detection De-constructed, 2018. Conference paper, Oral presentation only (Refereed)
Abstract [en]

The growth of Internet video and over-the-top transmission techniques has enabled online video service providers to deliver high quality video content to viewers. To maintain and improve the quality of experience, video providers need to detect unexpected issues that can highly affect the viewers’ experience. This requires analyzing massive amounts of video session data in order to find unexpected sequences of events. In this paper we combine sequential pattern mining and clustering to discover such event sequences. The proposed approach applies sequential pattern mining to find frequent patterns by considering contextual and collective outliers. In order to distinguish between the normal and abnormal behavior of the system, we initially identify the most frequent patterns. Then a clustering algorithm is applied on the most frequent patterns. The generated clustering model together with Silhouette Index are used for further analysis of less frequent patterns and detection of potential outliers. Our results show that the proposed approach can detect outliers at the system level.
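
The first step described in this abstract, separating frequent event patterns from less frequent ones, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy sessions, and the 0.5 support threshold are hypothetical, and contiguous event n-grams stand in for full sequential pattern mining, which normally also allows gaps between events.

```python
from collections import Counter

def pattern_support(sessions, length):
    """Return the fraction of sessions that contain each contiguous pattern."""
    counts = Counter()
    for events in sessions:
        # A set ensures each pattern is counted at most once per session.
        seen = {tuple(events[i:i + length])
                for i in range(len(events) - length + 1)}
        counts.update(seen)
    return {pattern: n / len(sessions) for pattern, n in counts.items()}

def split_by_support(support, min_support):
    """Split patterns into frequent ones and less frequent candidates."""
    frequent = {p for p, s in support.items() if s >= min_support}
    rare = set(support) - frequent
    return frequent, rare

# Hypothetical video sessions, each a sequence of event names.
sessions = [
    ["play", "buffer", "play", "stop"],
    ["play", "buffer", "play", "stop"],
    ["play", "stop"],
    ["play", "error", "error", "stop"],
]
support = pattern_support(sessions, length=2)
frequent, rare = split_by_support(support, min_support=0.5)
```

In the approach the abstract outlines, the frequent patterns would then be clustered and the less frequent ones assessed against that clustering model; that stage is left out of this sketch.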

Keywords
Cluster Analysis, Data Stream Mining, Outlier Detection, Sequential Pattern Mining
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-16944 (URN)
Conference
ACM SIGKDD Workshop On Outlier Detection De-constructed, London
Funder
Knowledge Foundation, 20140032
Available from: 2018-10-01 Created: 2018-10-01 Last updated: 2021-07-26 Bibliographically approved
3. A Minimum Spanning Tree Clustering Approach for Outlier Detection in Event Sequences
2018 (English) In: 2018 17TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA) / [ed] Wani M.A., Sayed-Mouchaweh M., Lughofer E., Gama J., Kantardzic M., IEEE, 2018, p. 1123-1130, article id 8614207. Conference paper, Published paper (Refereed)
Abstract [en]

Outlier detection has been studied in many domains. Outliers arise due to different reasons such as mechanical issues, fraudulent behavior, and human error. In this paper, we propose an unsupervised approach for outlier detection in a sequence dataset. The proposed approach combines sequential pattern mining, cluster analysis, and a minimum spanning tree algorithm in order to identify clusters of outliers. Initially, the sequential pattern mining is used to extract frequent sequential patterns. Next, the extracted patterns are clustered into groups of similar patterns. Finally, the minimum spanning tree algorithm is used to find groups of outliers. The proposed approach has been evaluated on two different real datasets, i.e., smart meter data and video session data. The obtained results have shown that our approach can be applied to narrow down the space of events to a set of potential outliers and facilitate domain experts in further analysis and identification of system level issues.
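
The final step of this pipeline, using a minimum spanning tree to find groups of outliers, can be sketched as follows. The abstract does not say how the tree is cut, so this is an assumption-laden illustration rather than the paper's method: the function names, the rule of cutting edges longer than `cut_factor` times the mean edge length, and the `max_group_size` limit are all hypothetical, and 2-D points stand in for the clustered patterns.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (w, i, j) edges."""
    # best[j] holds the cheapest known edge from the tree to outside node j.
    best = {j: (math.dist(points[0], points[j]), 0)
            for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        weight, i = best.pop(j)
        edges.append((weight, i, j))
        for k in best:
            d = math.dist(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges

def outlier_groups(points, cut_factor=2.0, max_group_size=2):
    """Cut long MST edges (assumed rule) and return the small components."""
    edges = minimum_spanning_tree(points)
    if not edges:
        return []
    mean_weight = sum(w for w, _, _ in edges) / len(edges)
    parent = list(range(len(points)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for weight, i, j in edges:
        if weight <= cut_factor * mean_weight:
            parent[find(i)] = find(j)
    groups = {}
    for index in range(len(points)):
        groups.setdefault(find(index), []).append(index)
    return sorted(g for g in groups.values() if len(g) <= max_group_size)

# Two dense clusters and one distant pair (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (30, 30), (30, 31)]
groups = outlier_groups(points)  # → [[8, 9]]
```

Removing the long edges leaves the two dense clusters as large components and the distant pair as a small one, which is reported as a candidate group of outliers for a domain expert to inspect.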

Place, publisher, year, edition, pages
IEEE, 2018
Keywords
Clustering, Minimum spanning tree, Outlier detection, Sequential pattern mining
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-17100 (URN); 10.1109/ICMLA.2018.00182 (DOI); 000463034400174 (); 9781538668047 (ISBN)
Conference
17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018; Orlando; United States; 17 December 2018 through 20 December
Funder
Knowledge Foundation, 20140032
Available from: 2018-10-09 Created: 2018-10-09 Last updated: 2021-07-26 Bibliographically approved
4. Trend analysis to automatically identify heat program changes
2017 (English) In: Energy Procedia, Elsevier, 2017, Vol. 116, p. 407-415. Conference paper, Published paper (Refereed)
Abstract [en]

The aim of this study is to improve the monitoring and controlling of heating systems located at customer buildings through the use of a decision support system. To achieve this, the proposed system applies a two-step classifier to detect manual changes of the temperature of the heating system. We apply data from the Swedish company NODA, active in energy optimization and services for energy efficiency, to train and test the suggested system. The decision support system is evaluated through an experiment and the results are validated by experts at NODA. The results show that the decision support system can detect changes within three days after their occurrence and only by considering daily average measurements.

Place, publisher, year, edition, pages
Elsevier, 2017
Series
Energy Procedia, ISSN 1876-6102 ; 116
Keywords
District heating, Trend analysis, Change detection, Smart automated system
National Category
Computer Systems
Identifiers
urn:nbn:se:bth-12894 (URN); 10.1016/j.egypro.2017.05.088 (DOI); 000406743000039 ()
Conference
15th International Symposium on District Heating and Cooling (DHC2016), Seoul
Funder
Knowledge Foundation, 20140032
Note

Open access

Available from: 2016-09-26 Created: 2016-07-13 Last updated: 2021-05-05 Bibliographically approved
5. District Heating Substation Behaviour Modelling for Annotating the Performance
2020 (English) In: Communications in Computer and Information Science / [ed] Cellier, P, Driessens, K, Springer, 2020, Vol. 1168, p. 3-11. Conference paper, Published paper (Refereed)
Abstract [en]

In this ongoing study, we propose a higher order data mining approach for modelling district heating (DH) substations’ behaviour and linking operational behaviour representative profiles with different performance indicators. We initially create substation’s operational behaviour models by extracting weekly patterns and clustering them into groups of similar patterns. The built models are further analyzed and integrated into an overall substation model by applying consensus clustering. The different operational behaviour profiles represented by the exemplars of the consensus clustering model are then linked to performance indicators. The labelled behaviour profiles are deployed over the whole heating season to derive diverse insights about the substation’s performance. The results show that the proposed method can be used for modelling, analyzing and understanding the deviating and sub-optimal DH substation’s behaviours. © 2020, Springer Nature Switzerland AG.

Place, publisher, year, edition, pages
Springer, 2020
Series
Communications in Computer and Information Science, ISSN 1865-0929
Keywords
Clustering analysis, District heating, Higher order mining, Outlier detection, Benchmarking, Cluster analysis, Machine learning, Behaviour modelling, Behaviour models, Consensus clustering, Heating season, Heating substations, Performance indicators, Similar pattern, Substation models, Data mining
National Category
Energy Engineering
Identifiers
urn:nbn:se:bth-19425 (URN); 10.1007/978-3-030-43887-6_1 (DOI); 000718590300001 (); 2-s2.0-85083637427 (Scopus ID); 9783030438869 (ISBN)
Conference
19th Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2019; Wurzburg; Germany; 16 September 2019 through 20 September 2019
Funder
Knowledge Foundation, 20140032
Available from: 2020-05-03 Created: 2020-05-03 Last updated: 2021-12-03 Bibliographically approved
6. Multi-view Clustering Analyses for District Heating Substations
2020 (English) In: DATA 2020 - Proceedings of the 9th International Conference on Data Science, Technology and Applications, 2020 / [ed] Hammoudi S., Quix C., Bernardino J., SciTePress, 2020, p. 158-168. Conference paper, Published paper (Refereed)
Abstract [en]

In this study, we propose a multi-view clustering approach for mining and analysing multi-view network datasets. The proposed approach is applied and evaluated on a real-world scenario for monitoring and analysing district heating (DH) network conditions and identifying substations with sub-optimal behaviour. Initially, geographical locations of the substations are used to build an approximate graph representation of the DH network. Two different analyses can further be applied in this context: step-wise and parallel-wise multi-view clustering. The step-wise analysis is meant to sequentially consider and analyse substations with respect to a few different views. At each step, a new clustering solution is built on top of the one generated by the previously considered view, which organizes the substations in a hierarchical structure that can be used for multi-view comparisons. The parallel-wise analysis, on the other hand, provides the opportunity to analyse substations with regard to two different views in parallel. Such analysis aims to represent and identify the relationships between substations by organizing them in a bipartite graph and analysing the substations’ distribution with respect to each view. The proposed data analysis and visualization approach gives domain experts the means for analysing DH network performance. In addition, it will facilitate the identification of substations with deviating operational behaviour based on comparative analysis with their closely located neighbours.

Place, publisher, year, edition, pages
SciTePress, 2020
Keywords
Data Mining, Multi-view Clustering, Multi-layer Clustering, Time Series, District Heating Substation
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-20452 (URN); 10.5220/0009780001580168 (DOI)
Conference
9th International Conference on Data Science, Technology and Applications, DATA 2020, Virtual, Online; France, 7 July 2020 through 9 July 2020
Funder
Knowledge Foundation, 20140032
Note

open access

Available from: 2020-09-22 Created: 2020-09-22 Last updated: 2021-07-31 Bibliographically approved
7. A Higher Order Mining Approach for the Analysis of Real-World Datasets
2020 (English) In: Energies, E-ISSN 1996-1073, Vol. 13, no 21, article id 5781. Article in journal (Refereed) Published
Abstract [en]

In this study, we propose a higher order mining approach that can be used for the analysis of real-world datasets. The approach can be used to monitor and identify the deviating operational behaviour of the studied phenomenon in the absence of prior knowledge about the data. The proposed approach consists of several different data analysis techniques, such as sequential pattern mining, clustering analysis, consensus clustering and the minimum spanning tree (MST). Initially, a clustering analysis is performed on the extracted patterns to model the behavioural modes of the studied phenomenon for a given time interval. The generated clustering models, which correspond to every two consecutive time intervals, can further be assessed to determine changes in the monitored behaviour. In cases in which significant differences are observed, further analysis is performed by integrating the generated models into a consensus clustering and applying an MST to identify deviating behaviours. The validity and potential of the proposed approach are demonstrated on a real-world dataset originating from a network of district heating (DH) substations. The obtained results show that our approach is capable of detecting deviating and sub-optimal behaviours of DH substations.

Place, publisher, year, edition, pages
MDPI, 2020
Keywords
outlier detection, fault detection, higher order mining, clustering analysis, minimum spanning tree, data mining, district heating substations
National Category
Energy Systems
Identifiers
urn:nbn:se:bth-20453 (URN); 10.3390/en13215781 (DOI); 000588863900001 ()
Funder
Knowledge Foundation, 20140032
Note

open access

Available from: 2020-09-22 Created: 2020-09-22 Last updated: 2023-08-28 Bibliographically approved

Open Access in DiVA

fulltext (17735 kB), 4629 downloads
File information
File name: FULLTEXT02.pdf
File size: 17735 kB
Checksum (SHA-512): 8ba764255452eafa6ef11c4f2b4af0436ba10d725adf217c49f83052f4b1e7370381afcc064d10782805cb3f99047f96833a8b16c2269caade75ddab857a8899
Type: fulltext
Mimetype: application/pdf

Authority records

Abghari, Shahrooz
Total: 4657 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.