In this study, we propose a higher order mining approach that can be used for the analysis of real-world datasets. The approach makes it possible to monitor and identify deviating operational behaviour of the studied phenomenon in the absence of prior knowledge about the data. The proposed approach consists of several different data analysis techniques, such as sequential pattern mining, clustering analysis, consensus clustering and the minimum spanning tree (MST). Initially, a clustering analysis is performed on the extracted patterns to model the behavioural modes of the studied phenomenon for a given time interval. The generated clustering models, which correspond to every two consecutive time intervals, can further be assessed to determine changes in the monitored behaviour. In cases in which significant differences are observed, further analysis is performed by integrating the generated models into a consensus clustering and applying an MST to identify deviating behaviours. The validity and potential of the proposed approach are demonstrated on a real-world dataset originating from a network of district heating (DH) substations. The obtained results show that our approach is capable of detecting deviating and sub-optimal behaviours of DH substations.
In this study, we propose a multi-view clustering approach for mining and analysing multi-view network datasets. The proposed approach is applied and evaluated on a real-world scenario for monitoring and analysing district heating (DH) network conditions and identifying substations with sub-optimal behaviour. Initially, geographical locations of the substations are used to build an approximate graph representation of the DH network. Two different analyses can further be applied in this context: step-wise and parallel-wise multi-view clustering. The step-wise analysis is meant to sequentially consider and analyse substations with respect to a few different views. At each step, a new clustering solution is built on top of the one generated by the previously considered view, which organizes the substations in a hierarchical structure that can be used for multi-view comparisons. The parallel-wise analysis, on the other hand, provides the opportunity to analyse substations with regard to two different views in parallel. Such an analysis aims to represent and identify the relationships between substations by organizing them in a bipartite graph and analysing the substations’ distribution with respect to each view. The proposed data analysis and visualization approach arms domain experts with means for analysing DH network performance. In addition, it will facilitate the identification of substations with deviating operational behaviour based on comparative analysis with their closely located neighbours.
In this ongoing study, we propose a higher order data mining approach for modelling district heating (DH) substations’ behaviour and linking operational behaviour representative profiles with different performance indicators. We initially create a substation’s operational behaviour models by extracting weekly patterns and clustering them into groups of similar patterns. The built models are further analyzed and integrated into an overall substation model by applying consensus clustering. The different operational behaviour profiles represented by the exemplars of the consensus clustering model are then linked to performance indicators. The labelled behaviour profiles are deployed over the whole heating season to derive diverse insights about the substation’s performance. The results show that the proposed method can be used for modelling, analyzing and understanding deviating and sub-optimal DH substation behaviours. © 2020, Springer Nature Switzerland AG.
We propose a higher order mining (HOM) approach for modelling, monitoring and analyzing district heating (DH) substations' operational behaviour and performance. HOM is concerned with mining over patterns rather than primary or raw data. The proposed approach uses a combination of different data analysis techniques such as sequential pattern mining, clustering analysis, consensus clustering and minimum spanning tree (MST). Initially, a substation's operational behaviour is modeled by extracting weekly patterns and performing clustering analysis. The substation's performance is monitored by assessing its modeled behaviour for every two consecutive weeks. If a significant difference is observed, further analysis is performed by integrating the built models into a consensus clustering and applying an MST for identifying deviating behaviours. The results of the study show that our method is robust in detecting deviating and sub-optimal behaviours of DH substations. In addition, the proposed method can facilitate domain experts in the interpretation and understanding of the substations' behaviour and performance by providing different data analysis and visualization techniques. © 2019 IEEE.
In this paper, we propose a Global Navigation Satellite System (GNSS) component activation model for mobile tracking devices that automatically detects indoor/outdoor environments using the radio signals received from Long-Term Evolution (LTE) base stations. We use an Inductive System Monitoring (ISM) technique to model environmental scenarios captured by a smart tracker via extracting clusters of corresponding value ranges from LTE base stations’ signal strength. The ISM-based model is built by using the tracker’s historical data labeled with GPS coordinates. The built model is further refined by applying it to additional data without GPS location collected by the same device. This procedure allows us to identify the clusters that describe semi-outdoor scenarios. In that way, the model discriminates between two outdoor environmental categories: open outdoor and semi-outdoor. The proposed ISM-based GNSS activation approach is studied and evaluated on a real-world dataset containing radio signal measurements collected by five smart trackers and their geographical locations in various environmental scenarios.
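The following is a minimal sketch, not the authors' ISM implementation, of the general idea described above: historical LTE signal-strength vectors are clustered, each cluster keeps per-feature value ranges, and a new measurement triggers GNSS activation only if it matches an "outdoor-like" cluster. The feature layout, the `rssi_history` and `is_outdoor` variables, and the number of clusters are all illustrative assumptions.

```python
# Hedged sketch: range-based clusters deciding GNSS activation (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
rssi_history = rng.normal(-95, 10, size=(500, 4))   # RSSI from 4 base stations (assumed)
is_outdoor = rssi_history.mean(axis=1) > -95         # placeholder GPS-derived label

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(rssi_history)
ranges = {c: (rssi_history[km.labels_ == c].min(axis=0),
              rssi_history[km.labels_ == c].max(axis=0)) for c in range(6)}
# Label each cluster by the majority label of its historical points.
outdoor_cluster = {c: is_outdoor[km.labels_ == c].mean() > 0.5 for c in range(6)}

def activate_gnss(measurement):
    """Return True if the measurement falls inside an outdoor-like cluster."""
    for c, (lo, hi) in ranges.items():
        if np.all(measurement >= lo) and np.all(measurement <= hi):
            return bool(outdoor_cluster[c])
    return False  # unseen scenario: stay conservative, keep GNSS off

print(activate_gnss(rssi_history[0]))
```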
The growth of Internet video and over-the-top transmission techniques has enabled online video service providers to deliver high-quality video content to viewers. To maintain and improve the quality of experience, video providers need to detect unexpected issues that can highly affect the viewers’ experience. This requires analyzing massive amounts of video session data in order to find unexpected sequences of events. In this paper we combine sequential pattern mining and clustering to discover such event sequences. The proposed approach applies sequential pattern mining to find frequent patterns by considering contextual and collective outliers. In order to distinguish between the normal and abnormal behavior of the system, we initially identify the most frequent patterns. Then a clustering algorithm is applied on the most frequent patterns. The generated clustering model, together with the Silhouette Index, is used for further analysis of less frequent patterns and detection of potential outliers. Our results show that the proposed approach can detect outliers at the system level.
Outlier detection has been studied in many domains. Outliers arise due to different reasons such as mechanical issues, fraudulent behavior, and human error. In this paper, we propose an unsupervised approach for outlier detection in a sequence dataset. The proposed approach combines sequential pattern mining, cluster analysis, and a minimum spanning tree algorithm in order to identify clusters of outliers. Initially, the sequential pattern mining is used to extract frequent sequential patterns. Next, the extracted patterns are clustered into groups of similar patterns. Finally, the minimum spanning tree algorithm is used to find groups of outliers. The proposed approach has been evaluated on two different real datasets, i.e., smart meter data and video session data. The obtained results have shown that our approach can be applied to narrow down the space of events to a set of potential outliers and facilitate domain experts in further analysis and identification of system level issues.
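As a rough illustration of the pipeline described above (frequent patterns, clustering, MST), the sketch below uses a simple contiguous-bigram count as a stand-in for a real sequential pattern miner, and synthetic event sequences. The cluster count, the outlier-edge threshold and all variable names are assumptions, not the published method.

```python
# Hedged sketch: pattern vectors -> k-means -> MST over centroids (synthetic data).
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
sequences = [list(rng.choice(list("ABCD"), size=8)) for _ in range(200)]

# 1) "Mine" patterns: here simply contiguous event bigrams per sequence.
vocab = sorted({(a, b) for s in sequences for a, b in zip(s, s[1:])})
rows = []
for s in sequences:
    counts = Counter(zip(s, s[1:]))
    rows.append([counts[p] for p in vocab])
X = np.array(rows)

# 2) Cluster the pattern vectors into groups of similar behaviour.
km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)

# 3) MST over cluster centroids; clusters attached by unusually long edges
#    are candidate groups of outlying behaviour.
dist = squareform(pdist(km.cluster_centers_))
mst = minimum_spanning_tree(dist).toarray()
edges = [(i, j, mst[i, j]) for i in range(len(mst)) for j in range(len(mst)) if mst[i, j] > 0]
weights = np.array([e[2] for e in edges])
threshold = weights.mean() + weights.std()
suspect = {j for i, j, w in edges if w > threshold}
print("candidate outlier clusters:", suspect)
```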
Human Activity Recognition (HAR) has played a significant role in recent years due to its applications in various fields, including health care and well-being. Traditional centralized methods reach very high recognition rates, but they incur privacy and scalability issues. Federated learning (FL) is a leading distributed machine learning (ML) paradigm for training a global model collaboratively on distributed data in a privacy-preserving manner. However, for HAR scenarios, existing activity recognition systems mainly focus on a unified model, i.e. they do not provide users with personalized recognition of activities. Furthermore, the heterogeneity of data across user devices can lead to degraded performance of traditional FL models in smart applications such as personalized health care. To this end, we propose a novel federated learning model that tries to cope with a statistically heterogeneous federated learning environment by introducing a group-personalized FL (GP-FL) solution. The proposed GP-FL algorithm builds several global ML models, each one trained iteratively on a dynamic group of clients with homogeneous class probability estimations. The performance of the proposed FL scheme is studied and evaluated on real-world HAR data. The evaluation results demonstrate that our approach has advantages in terms of model performance and convergence speed with respect to two baseline FL algorithms used for comparison. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
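The grouping step assumed by such a group-personalised setup can be sketched as follows: each client summarises its labels as a class-probability estimate, and clients with similar distributions are placed in the same group, within which one global model would then be trained. The client data, group count and variable names below are synthetic assumptions, not the GP-FL implementation.

```python
# Hedged sketch: grouping clients by class-probability estimates (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n_clients, n_classes = 40, 6
probs = rng.dirichlet(np.ones(n_classes) * 0.5, size=n_clients)        # heterogeneous clients
client_label_counts = np.array([rng.multinomial(300, p) for p in probs])

# Class probability estimation per client (normalised label histogram).
class_probs = client_label_counts / client_label_counts.sum(axis=1, keepdims=True)

# Group clients with homogeneous class distributions.
groups = KMeans(n_clusters=4, n_init=10, random_state=2).fit_predict(class_probs)
for g in range(4):
    print(f"group {g}: clients {np.where(groups == g)[0].tolist()}")
# In a full round, federated averaging would then run separately inside each group.
```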
Federated learning (FL), a decentralized machine learning framework that allows edge devices (i.e., clients) to train a global model while preserving data/client privacy, has become increasingly popular recently. In FL, a shared global model is built by aggregating the updated parameters in a distributed manner. To incentivize data owners to participate in FL, it is essential for service providers to fairly evaluate the contribution of each data owner to the shared model during the learning process. To the best of our knowledge, most existing solutions are resource-demanding and usually run as an additional evaluation procedure. The latter produces an expensive computational cost for large data owners. In this paper, we present simple and effective FL solutions that show how the clients’ behavior can be evaluated during the training process with respect to reliability, and this is demonstrated for two existing FL models, Cluster Analysis-based Federated Learning (CA-FL) and Group-Personalized FL (GP-FL), respectively. In the former model, CA-FL, we assess how frequently each client is selected as a cluster representative and, in that way, involved in building the shared model. This frequency can eventually be considered a measure of the respective client’s data reliability. In the latter model, GP-FL, we calculate how many times each client changes the cluster it belongs to during FL training, which can be interpreted as a measure of the client’s unstable, i.e., less reliable, behavior. We validate our FL approaches on three LEAF datasets and benchmark their performance against two baseline contribution evaluation approaches. The experimental results demonstrate that by applying the two FL models we are able to get robust evaluations of clients’ behavior during the training process. These evaluations can be used for further studying, comparing, understanding, and eventually predicting clients’ contributions to the shared global model.
Federated Learning (FL) provides a promising solution for preserving privacy in learning shared models on distributed devices without sharing local data on a central server. However, most existing work shows that FL incurs high communication costs. To address this challenge, we propose a clustering-based federated solution, entitled Federated Learning via Clustering Optimization (FedCO), which optimizes model aggregation and reduces communication costs. In order to reduce the communication costs, we first divide the participating workers into groups based on the similarity of their model parameters and then select only one representative, the best performing worker, from each group to communicate with the central server. Then, in each successive round, we apply the Silhouette validation technique to check whether each representative is still tight with its current cluster. If not, the representative is either moved into a more appropriate cluster or forms a cluster singleton. Finally, we use split optimization to update and improve the whole clustering solution. The updated clustering is used to select new cluster representatives. In that way, the proposed FedCO approach updates clusters by repeatedly evaluating and splitting clusters if doing so is necessary to improve the workers’ partitioning. The potential of the proposed method is demonstrated on publicly available datasets and LEAF datasets under the IID and Non-IID data distribution settings. The experimental results indicate that our proposed FedCO approach is superior to the state-of-the-art FL approaches, i.e., FedAvg, FedProx, and CMFL, in reducing communication costs and achieving better accuracy in both the IID and Non-IID cases. © 2022 by the authors.
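The cluster-and-select idea behind such a round can be sketched as below: workers are grouped by their flattened model parameters, the best-performing worker per group communicates with the server, and per-worker silhouette values flag workers that are no longer tight with their cluster. The weights, accuracies and cluster count are synthetic assumptions; the split-optimization step is omitted.

```python
# Hedged sketch of a FedCO-style representative selection and silhouette check.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(3)
weights = rng.normal(size=(30, 50))          # flattened local model parameters (assumed)
accuracy = rng.uniform(0.6, 0.95, size=30)   # local validation accuracy (assumed)

labels = KMeans(n_clusters=5, n_init=10, random_state=3).fit_predict(weights)
representatives = {c: int(np.where(labels == c)[0][np.argmax(accuracy[labels == c])])
                   for c in range(5)}
print("representatives uploaded to the server:", representatives)

# Workers with a negative silhouette are no longer tight with their cluster and
# would be moved to another cluster or split off in the next round.
sil = silhouette_samples(weights, labels)
print("workers to re-assign:", np.where(sil < 0)[0].tolist())
```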
Training of machine learning models in a datacenter, with data originating from edge nodes, incurs high communication overheads and violates users' privacy. These challenges may be tackled by employing the Federated Learning (FL) technique to train a model across multiple decentralized edge devices (workers) using local data. In this paper, we explore an approach that identifies the most representative updates made by workers and uploads only those to the central server, reducing network communication costs. Based on this idea, we propose an FL model that can mitigate communication overheads via clustering analysis of the worker local updates. The Cluster Analysis-based Federated Learning (CA-FL) model is studied and evaluated on human activity recognition (HAR) datasets. Our evaluation results show the robustness of CA-FL in comparison with traditional FL in terms of accuracy and communication costs in both IID and non-IID cases.
Recent advances in sensor technology are expected to lead to a greater use of wireless sensor networks (WSNs) in industry, logistics, healthcare, etc. On the other hand, advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) are becoming dominant solutions for processing large amounts of data from edge-synthesized heterogeneous sensors and drawing accurate conclusions with better understanding of the situation. Integration of the two areas, WSN and AI, has resulted in more accurate measurements and context-aware analysis and prediction useful for smart sensing applications. In this paper, a comprehensive overview of the latest developments in context-aware intelligent systems using sensor technology is provided. In addition, the paper discusses the areas in which such systems are used, the related challenges, and the motivations for adopting AI solutions, focusing on edge computing, i.e., sensor and AI techniques, along with an analysis of existing research gaps. Another contribution of this study is the use of a semantic-aware approach to extract survey-relevant subjects. The latter specifically identifies eleven main research topics supported by the articles included in the work. These are analyzed from various angles to answer five main research questions. Finally, potential future research directions are also discussed.
The successful convergence of Internet of Things (IoT) technology and distributed machine learning has been leveraged to realise the concept of Federated Learning (FL) with the collaborative efforts of a large number of low-powered and small-sized edge nodes. In Wireless Networks (WN), energy-efficient transmission is a fundamental challenge since the energy resource of edge nodes is restricted. In this paper, we propose an Energy-aware Multi-Criteria Federated Learning (EaMC-FL) model for edge computing. The proposed model enables a shared global model to be trained collaboratively by aggregating locally trained models in selected representative edge nodes (workers). The involved workers are initially partitioned into a number of clusters with respect to the similarity of their local model parameters. At each training round, a small set of representative workers is selected on the basis of a multi-criteria evaluation that scores each node's representativeness (importance) by taking into account the trade-off among the node's local model performance, consumed energy and battery lifetime. We have demonstrated through experimental results that the proposed EaMC-FL model is capable of reducing the energy consumed by the edge nodes by lowering the transmitted data.
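A toy sketch of the kind of multi-criteria representativeness score described above is given below: local model performance, consumed energy and remaining battery are combined into a single score per node. The weights, value ranges and variable names are assumptions for illustration only.

```python
# Hedged sketch: weighted multi-criteria node scoring (synthetic per-node values).
import numpy as np

rng = np.random.default_rng(4)
accuracy = rng.uniform(0.5, 0.95, size=10)   # local model performance
energy   = rng.uniform(1.0, 5.0, size=10)    # energy consumed per round (J)
battery  = rng.uniform(0.1, 1.0, size=10)    # remaining battery fraction

def normalise(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

# Higher is better: good accuracy, low energy cost, healthy battery (assumed weights).
score = 0.5 * normalise(accuracy) + 0.3 * (1 - normalise(energy)) + 0.2 * normalise(battery)
representative = int(np.argmax(score))
print("selected representative node:", representative, "score:", round(score[representative], 3))
```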
In this paper, we propose a novel evolving clustering algorithm for streaming data entitled EdgeCluster. The proposed algorithm is resource efficient, making it suitable for use at edge devices with limited storage and computational capacity. The EdgeCluster is capable of modeling and monitoring a streaming data phenomenon and identifying outlying behavior. In parallel with the monitoring, the EdgeCluster algorithm dynamically maintains the set of clusters that models the phenomenon's normal behavioral scenarios by taking newly arrived data into account and updating the clustering model accordingly. The EdgeCluster algorithm is evaluated and benchmarked against another resource-aware stream clustering algorithm, EvolveCluster, in two experimental data scenarios using synthetic and real-world datasets. © 2024 IEEE.
Finding experts in academics is an important practical problem, e.g. recruiting reviewers for reviewing conference, journal or project submissions, partner matching for research proposals, finding relevant M.Sc. or Ph.D. supervisors, etc. In this work, we discuss an expertise recommender system that is built on data extracted from the Blekinge Institute of Technology (BTH) instance of the institutional repository system DiVA (Digital Scientific Archive). DiVA is a publication and archiving platform for research publications and student essays used by 46 publicly funded universities and authorities in Sweden and the rest of the Nordic countries (www.diva-portal.org). The DiVA classification system is based on the Swedish Higher Education Authority (UKÄ) and Statistics Sweden's (SCB) three-level classification system. Using the classification terms associated with student M.Sc. and B.Sc. theses published in the DiVA platform, we have developed a prototype system which can be used to identify and recommend subject thesis supervisors in academia.
We propose a split-merge framework for evolutionary clustering. The proposed clustering technique, entitled Split-Merge Evolutionary Clustering, is designed to be more robust to concept drift scenarios by providing the flexibility to consider at each step a portion of the data and derive clusters from it, which are subsequently used to update the existing clustering solution. The proposed framework is built around the idea of modelling two clustering solutions as a bipartite graph, which guides the update of the existing clustering solution: some clusters are merged with ones from the newly constructed clustering, while others are transformed by splitting their elements among several new clusters. We have evaluated and compared the discussed evolutionary clustering technique with two other state-of-the-art algorithms: a bipartite correlation clustering (PivotBiCluster) and an incremental evolving clustering (Dynamic split-and-merge). © Springer Nature Switzerland AG 2019.
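The bipartite-overlap idea can be illustrated in a much simplified form: the contingency matrix between an existing clustering and a clustering of newly arrived data plays the role of the bipartite graph, with dominant overlaps suggesting merges and spread-out rows suggesting splits. The two labelings and the 0.8 dominance threshold below are assumptions, not the Split-Merge Evolutionary Clustering algorithm itself.

```python
# Hedged sketch: merge/split decisions from a contingency (bipartite overlap) matrix.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

old_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])   # existing clustering (assumed)
new_labels = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0, 2])   # clustering of the new data portion

C = contingency_matrix(old_labels, new_labels)           # rows: old clusters, cols: new
for old_c, row in enumerate(C):
    share = row / row.sum()
    if share.max() >= 0.8:
        print(f"old cluster {old_c} is merged with new cluster {int(share.argmax())}")
    else:
        targets = np.where(share > 0)[0].tolist()
        print(f"old cluster {old_c} is split among new clusters {targets}")
```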
The problem addressed in this article concerns the development of evolutionary clustering techniques that can be applied to adapt the existing clustering solution to a clustering of newly collected data elements. We are interested in clustering approaches that are specially suited for adapting clustering solutions in the expertise retrieval domain. This interest is inspired by practical applications such as expertise retrieval systems where the information available in the system database is periodically updated by extracting new data. The experts available in the system database are usually partitioned into a number of disjoint subject categories. It is becoming impractical to re-cluster this large volume of available information. Therefore, the objective is to update the existing expert partitioning by the clustering produced on the newly extracted experts. Three different evolutionary clustering techniques are considered to be suitable for this scenario. The proposed techniques are initially evaluated by applying the algorithms on data extracted from the PubMed repository. Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
In this article we propose a bipartite correlation clustering technique that can be used to adapt an existing clustering solution to a clustering of newly collected data elements. The proposed technique provides the flexibility to compute clusters on a new portion of data collected over a defined time period and to update the existing clustering solution with the newly computed one. Such an updated clustering should better reflect the current characteristics of the data by being able to examine clusters occurring in the considered time period and eventually capture interesting trends in the area. For example, some clusters will be updated by merging with ones from the newly constructed clustering, while others will be transformed by splitting their elements among several new clusters. The proposed clustering algorithm, entitled Split-Merge Evolutionary Clustering, is evaluated and compared to another bipartite correlation clustering technique (PivotBiCluster) on two different case studies: expertise retrieval and patient profiling in healthcare. Copyright © 2019 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved.
Cluster validation measures are designed to find the partitioning that best fits the underlying data. In this paper, we show that these well-known and scientifically proven validation measures can also be used in a different context, i.e., for filtering mislabeled instances or class outliers prior to training in supervised learning problems. A technique, entitled CVI-based Outlier Filtering, is proposed in which mislabeled instances are identified and eliminated from the training set, and a classification hypothesis is then built from the set of remaining instances. The proposed approach assigns each instance several cluster validation scores representing its potential of being an outlier with respect to the clustering properties the used validation measures assess. We examine CVI-based Outlier Filtering and compare it against the LOF detection method on ten data sets from the UCI data repository using five well-known learning algorithms and three different cluster validation indices. In addition, we study two approaches for filtering mislabeled instances: local and global. Our results show that for most learning algorithms and data sets, the proposed CVI-based outlier filtering algorithm outperforms the baseline method (LOF). The greatest increase in classification accuracy has been achieved by combining at least two of the used cluster validation indices and global filtering of mislabeled instances. © 2018 IEEE.
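A hedged, single-index sketch of this filtering idea follows: each training instance is scored by a cluster validation quantity computed against its class label (here the per-instance silhouette), and the weakest instances are dropped before fitting a classifier. The 10% threshold, the injected label noise and the single index are simplifying assumptions, not the full multi-index method.

```python
# Hedged sketch: silhouette-based global filtering of suspected mislabeled instances.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(5)
noisy = rng.choice(len(y), size=15, replace=False)
y_noisy = y.copy()
y_noisy[noisy] = (y_noisy[noisy] + 1) % 3        # inject mislabeled instances

scores = silhouette_samples(X, y_noisy)          # class labels used as "clusters"
keep = scores > np.percentile(scores, 10)        # global filtering: drop lowest 10%
clf = DecisionTreeClassifier(random_state=0).fit(X[keep], y_noisy[keep])
print("filtered out:", int((~keep).sum()), "instances; injected noise caught:",
      int((~keep)[noisy].sum()))
```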
In this work, we report an ongoing study that aims to apply cluster validation measures for analyzing email communications at an organizational level of a company. This analysis can be used to evaluate the company structure and to produce further recommendations for structural improvements. Our initial evaluations, based on data in the forms of emails logs and organizational structure for a large European telecommunication company, show that cluster validation techniques can be useful tools for assessing the organizational structure using objective analysis of internal email communications, and for simulating and studying different reorganization scenarios.
In this work, we apply cluster validation measures for analyzing email communications at an organizational level of a company. This analysis can be used to evaluate the company structure and to produce further recommendations for structural improvements. Our evaluations, based on data in the forms of email logs and organizational structure for a large European telecommunication company, show that cluster validation techniques can be useful tools for assessing the organizational structure using objective analysis of internal email communications, and for simulating and studying different reorganization scenarios.
In this paper we address the problem of modeling the evolution of clusters over time by applying sequential clustering. We propose a sequential partitioning algorithm that can be applied for grouping distinct snapshots of streaming data so that a clustering model is built on each data snapshot. The algorithm is initialized by a clustering solution built on available historical data. Then a new clustering solution is generated on each data snapshot by applying a partitioning algorithm seeded with the centroids of the clustering model obtained at the previous time interval. At each step the algorithm also conducts model adapting operations in order to reflect the evolution in the clustering structure. In that way, it enables dealing with both incremental and dynamic aspects of modeling evolving behavior problems. In addition, the proposed approach is able to trace back evolution through the detection of clusters' transitions, such as splits and merges. We have illustrated and initially evaluated our ideas on household electricity consumption data. The results have shown that the proposed sequential clustering algorithm is robust in modeling evolving behavior, being able to mine changes and update the model accordingly.
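The seeding idea can be sketched minimally as below: each new data snapshot is clustered with k-means initialised at the centroids from the previous time interval, so the model evolves instead of being rebuilt from scratch. The snapshots are synthetic assumptions, and the split/merge adaptation operations are omitted.

```python
# Hedged sketch: sequential k-means seeded with the previous snapshot's centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
snapshots = [rng.normal(loc=shift, scale=1.0, size=(200, 2)) for shift in (0.0, 0.5, 1.5)]

km = KMeans(n_clusters=3, n_init=10, random_state=6).fit(snapshots[0])  # historical data
for t, chunk in enumerate(snapshots[1:], start=1):
    # Seed the new model with the centroids of the previous interval.
    km = KMeans(n_clusters=3, init=km.cluster_centers_, n_init=1).fit(chunk)
    print(f"snapshot {t}: centroids\n{np.round(km.cluster_centers_, 2)}")
```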
Law enforcement agencies strive to link crimes perpetrated by the same offenders into crime series in order to improve investigation efficiency. Such crime linkage can be done using either physical traces (e.g., DNA or fingerprints) or 'soft evidence' in the form of offenders' modus operandi (MO), i.e. their behaviors during crimes. However, physical traces are only present for a fraction of crimes, unlike behavioral evidence. This work-in-progress paper presents a method for aggregating multiple criminal profilers' ratings of offenders' behavioral characteristics based on feature-rich crime scene descriptions. The method calculates consensus ratings from individual experts' ratings, which are then used as a basis for classification algorithms. The classification algorithms can automatically generalize offenders' behavioral characteristics from cues in the crime scene data. Models trained on the consensus rating are evaluated against models trained on individual profilers' ratings, i.e., we assess whether the consensus model shows improved performance over the individual models. © 2018 IEEE.
Clustering algorithms have been used to divide genes into groups according to the degree of their expression similarity. Such a grouping may suggest that the respective genes are correlated and/or co-regulated, and subsequently indicates that the genes could possibly share a common biological role. In this paper, four clustering algorithms are investigated: k-means, cut-clustering, spectral and expectation-maximization. The algorithms are benchmarked against each other. The performance of the four clustering algorithms is studied on time series expression data using Dynamic Time Warping distance in order to measure similarity between gene expression profiles. Four different cluster validation measures are used to evaluate the clustering algorithms: Connectivity and Silhouette Index for estimating the quality of clusters, Jaccard Index for evaluating the stability of a cluster method and Rand Index for assessing the accuracy. The obtained results are analyzed by Friedman's test and the Nemenyi post-hoc test. K-means is demonstrated to be significantly better than the spectral clustering algorithm under the Silhouette and Rand validation indices.
Many machine learning models deployed on smart or edge devices experience a phase where there is a drop in their performance due to the arrival of data from new domains. This paper proposes a novel unsupervised domain adaptation algorithm called DIBCA++ to deal with such situations. The algorithm uses only the clusters’ mean, standard deviation, and size, which makes the proposed algorithm modest in terms of the required storage and computation. The study also presents the explainability aspect of the algorithm. DIBCA++ is compared with its predecessor, DIBCA, and its applicability and performance are studied and evaluated in two real-world scenarios. One is coping with the Global Navigation Satellite System activation problem from the smart logistics domain, while the other identifies different activities a person performs and deals with a human activity recognition task. Both scenarios involve time series data phenomena, i.e., DIBCA++ also contributes towards addressing the current gap regarding domain adaptation solutions for time series data. Based on the experimental results, DIBCA++ has improved performance compared to DIBCA. The DIBCA++ has performed better in all human activity recognition task experiments and 82.5% of experimental scenarios on the smart logistics use case. The results also showcase the need and benefit of personalizing the models using DIBCA++, along with the ability to transfer new knowledge between domains, leading to improved performance. The adapted source and target models have performed better in 70% and 80% of cases in an experimental scenario conducted on smart logistics.
Data available today in smart monitoring applications such as smart buildings, machine health monitoring, smart healthcare, etc., is not centralized and is usually supplied by a number of different devices (sensors, mobile devices and edge nodes), due to which the data has a heterogeneous nature and provides different perspectives (views) on the studied phenomenon. This makes the monitoring task very challenging, requiring machine learning and data mining models that are not only able to continuously integrate and analyze multi-view streaming data, but also are capable of adapting to concept drift scenarios of newly arriving data. This study presents a multi-view clustering approach that can be applied for monitoring and analysis of streaming data scenarios. The approach allows for parallel monitoring of the individual view clustering models and mining view correlations in the integrated (global) clustering models. The global model built at each data chunk is a formal concept lattice generated by a formal context consisting of closed patterns representing the most typical correlations among the views. The proposed approach is evaluated on two different data sets. The obtained results demonstrate that it is suitable for modelling and monitoring multi-view streaming phenomena by providing means for continuous analysis and pattern mining. © 2021, IFIP International Federation for Information Processing.
Domain shift is a common problem in many real-world applications using machine learning models. Most of the existing solutions are based on supervised and deep-learning models. This paper proposes a novel clustering algorithm capable of producing an adapted and/or integrated clustering model for the considered domains. Source and target domains are represented by clustering models such that each cluster of a domain models a specific scenario of the studied phenomenon by defining a range of allowable values for each attribute in a given data vector. The proposed domain integration algorithm works in two steps: (i) cross-labeling and (ii) integration. Initially, each clustering model is crossly applied to label the cluster representatives of the other model. These labels are used to determine the correlations between the two models to identify the common clusters for both domains, which must be integrated within the second step. Different features of the proposed algorithm are studied and evaluated on a publicly available human activity recognition (HAR) data set and real-world data from a smart logistics use case provided by an industrial partner. The experiments on the HAR data set showcase the algorithm's potential in automatic data labeling, while the experiments conducted on the smart logistics use case evaluate and compare the performance of the integrated and two adapted models in different domains. © 2022 IEEE.
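An illustrative sketch of the cross-labeling step (not the published algorithm) is given below: each cluster is summarised only by its mean, standard deviation and size, and a cluster from one domain is matched to the cluster of the other domain whose mean plus/minus two standard deviations contains its representative. The cluster summaries and the 2-sigma range are assumptions.

```python
# Hedged sketch: cross-labeling cluster representatives between two domains.
import numpy as np

source = [  # (mean, std, size) per source-domain cluster (assumed summaries)
    (np.array([0.0, 0.0]), np.array([0.5, 0.5]), 120),
    (np.array([5.0, 5.0]), np.array([0.4, 0.6]), 80),
]
target = [
    (np.array([0.2, -0.1]), np.array([0.6, 0.5]), 100),
    (np.array([9.0, 9.0]), np.array([0.5, 0.5]), 60),
]

def cross_label(rep, clusters):
    """Index of the cluster whose mean +/- 2*std box contains rep, else None."""
    for i, (mean, std, _) in enumerate(clusters):
        if np.all(np.abs(rep - mean) <= 2 * std):
            return i
    return None

common = [(i, cross_label(mean, target)) for i, (mean, _, _) in enumerate(source)]
print("source cluster -> matching target cluster:", common)
# Matched pairs would be integrated in step two; unmatched clusters stay domain-specific.
```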
In smart buildings, many different systems work in coordination to accomplish their tasks. In this process, the sensors associated with these systems collect large amounts of data generated in a streaming fashion, which is prone to concept drift. Such data are heterogeneous due to the wide range of sensors collecting information about different characteristics of the monitored systems. All these make the monitoring task very challenging. Traditional clustering algorithms are not well equipped to address the mentioned challenges. In this work, we study the use of MV Multi-Instance Clustering algorithm for multi-view analysis and mining of smart building systems’ sensor data. It is demonstrated how this algorithm can be used to perform contextual as well as integrated analysis of the systems. Various scenarios in which the algorithm can be used to analyze the data generated by the systems of a smart building are examined and discussed in this study. In addition, it is also shown how the extracted knowledge can be visualized to detect trends in the systems’ behavior and how it can aid domain experts in the systems’ maintenance. In the experiments conducted, the proposed approach was able to successfully detect the deviating behaviors known to have previously occurred and was also able to identify some new deviations during the monitored period. Based on the results obtained from the experiments, it can be concluded that the proposed algorithm has the ability to be used for monitoring, analysis, and detecting deviating behaviors of the systems in a smart building domain. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
In this study, we propose a new multi-view stream clustering approach, called MV Split-Merge Clustering. The proposed approach is an extension of an existing split-merge evolutionary clustering algorithm (entitled Split-Merge Clustering) to multi-view data applications. The extended version can be used to integrate data from multiple views in a streaming manner and discover cluster structure for each data chunk. The MV Split-Merge Clustering can be applied for grouping distinct chunks of multi-view streaming data so that a global integrated clustering model is built on each data chunk. At each time window, an updated clustering solution (local model) is initially produced on each view of the current data chunk by applying the Split-Merge Clustering algorithm. Formal Concept Analysis is then used in order to integrate information from the multiple views (local clustering models) and generate a global model (formal concept lattice) that reveals the correlations among the clusters of the local models. The proposed MV Split-Merge Clustering has been initially evaluated on a publicly available data set. Our results show that the approach is able to identify a clustering structure and relationships among the different views comparable to those produced in a batch scenario. © 2020 The Authors. Published by Elsevier B.V.
Many industrial scenarios are concerned with the exploration of high-dimensional heterogeneous data sets originating from diverse sources and often incomplete, i.e., containing a substantial amount of missing values. This paper proposes a novel unsupervised method that efficiently facilitates the exploration and analysis of such data sets. The methodology combines, in an exploratory workflow, multi-layer data analysis with shared nearest neighbor similarity and hypergraph clustering. It produces overlapping homogeneous clusters, i.e., it assumes that the assets within each cluster exhibit comparable behavior. The latter can be used for computing relevant KPIs per cluster for the purpose of performance analysis and comparison. More concretely, such KPIs have the potential to aid domain experts in monitoring and understanding asset performance and, subsequently, enable the identification of outliers and the timely detection of performance degradation.
Wind turbines are typically organised as a fleet in a wind park, subject to similar, but varying, environmental conditions. This makes it possible to assess and benchmark a turbine’s output performance by comparing it to the other assets in the fleet. However, such a comparison cannot be performed straightforwardly on time series production data since the performance of a wind turbine is affected by a diverse set of factors (e.g., weather conditions). All these factors also produce a continuous stream of data, which, if discretised in an appropriate fashion, might allow us to uncover relevant insights into the turbine’s operations and behaviour. In this paper, we exploit the outcome of two inherently different discretisation approaches by statistical and visual analytics. As the first discretisation method, a complex layered integration approach is used. The DNA-like outcome allows us to apply advanced visual analytics, facilitating insightful operating mode monitoring. The second discretisation approach is applying a novel circular binning approach, capitalising on the circular nature of the angular variables. The resulting bins are then used to construct circular power maps and extract prototypical profiles via non-negative matrix factorisation, enabling us to detect anomalies and perform production forecasts. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
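The second discretisation idea can be sketched under simplifying assumptions: angular wind-direction values are placed in circular sectors, combined with power bins into one histogram per day, and NMF extracts a few prototypical profiles. The turbine data, bin counts and component number below are synthetic assumptions, not the published circular power-map construction.

```python
# Hedged sketch: circular binning of angular data followed by NMF profile extraction.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(7)
n_days, samples_per_day = 60, 144
direction = rng.uniform(0, 360, size=(n_days, samples_per_day))   # degrees (assumed)
power = np.clip(rng.normal(0.5, 0.25, size=(n_days, samples_per_day)), 0, 1)

n_sectors, n_power_bins = 12, 5
sector = (direction // (360 / n_sectors)).astype(int) % n_sectors  # circular bins
power_bin = np.minimum((power * n_power_bins).astype(int), n_power_bins - 1)

# One histogram row per day over the (sector, power-bin) grid.
H = np.zeros((n_days, n_sectors * n_power_bins))
for d in range(n_days):
    np.add.at(H[d], sector[d] * n_power_bins + power_bin[d], 1)

model = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=7)
W = model.fit_transform(H)          # daily mixing weights
profiles = model.components_        # prototypical circular power-map profiles
print("profile matrix shape:", profiles.shape)
```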
In this study, we propose a novel data analysis approach that can be used for multi-view analysis and integration of heterogeneous temporal data originating from multiple sources. The proposed approach consists of several distinctive layers: (i) select a suitable set (view) of parameters in order to identify characteristic behaviour within each individual source; (ii) exploit an alternative set (view) of raw parameters (or high-level features) to derive complementary representations (e.g. related to source performance) of the results obtained in the first layer, with the aim of facilitating comparison and mediation across the different sources; (iii) integrate those representations in an appropriate way, allowing similar cross-source performance to be traced back to certain characteristic behaviour of the individual sources. The validity and the potential of the proposed approach have been demonstrated on a real-world dataset of a fleet of wind turbines. © Springer Nature Switzerland AG 2020.
In industrial settings, continuous monitoring of the operation of assets generates a vast amount of data originating from a multitude of very diverse sources. This data allows studying and understanding asset performance in real operating conditions, paving the way for failure prediction, machine setting optimisation and many other industrial applications. However, it is not always feasible, nor wise, to approach data analytics for such applications by merging all the available data into a single data set, which often leads to information loss. The literature lacks methods to inspect asset performance based on splitting the data into different views corresponding to different types of monitored parameters. The multi-view data analysis method proposed in this work allows extracting operating modes for an industrial asset and, subsequently, profiling their performance. In this two-step approach, the endogenous (internal working) data view is first exploited to detect and characterise distinct operating modes, while an exogenous (operating context) data representation (disjoint from the endogenous view) of these operating modes is subsequently used to derive prototypical performance profiles via non-negative matrix factorisation. The application potential and validity of the proposed method are illustrated based on real-world data from a wind turbine. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
In this study, we propose a multi-view data analysis approach that can be used for modelling and monitoring smart control valve system behaviour. The proposed approach consists of four distinctive steps: (i) multi-view interpretation of the available data attributes by separating them into several representations (views), e.g., operational parameters, contextual factors, and performance indicators; (ii) modelling different control valve system operating modes by clustering analyses of the operational data view; (iii) annotating each operating mode (cluster) by using the remaining views (i.e., contextual and system performance data); (iv) context-aware monitoring of the control valve system operating behaviour by applying the built model. In addition, the data points (daily profiles) observed during the monitoring can be annotated by comparing them with the known typical behavioural modes. This information can be further analysed and used for continuous updating and improvement of the model. The potential of the proposed approach has been evaluated and demonstrated on real-world sensor data originating from a company in the smart building domain. The obtained results show the robustness of the proposed approach in modelling, analysing, and monitoring the control valve system behaviour. © 2020 IEEE.
Recently, machine learning researchers have been designing algorithms that can run on embedded and mobile devices, which introduces additional constraints compared to traditional algorithm design approaches. One of these constraints is energy consumption, which directly translates to battery capacity for these devices. Streaming algorithms, such as the Very Fast Decision Tree (VFDT), are designed to run in such devices due to their high velocity and low memory requirements. However, they have not been designed with an energy efficiency focus. This paper addresses this challenge by presenting the nmin adaptation method, which reduces the energy consumption of the VFDT algorithm with only minor effects on accuracy. nmin adaptation allows the algorithm to grow faster in those branches where there is more confidence to create a split, and delays the split on the less confident branches. This removes unnecessary computations related to checking for splits but maintains similar levels of accuracy. We have conducted extensive experiments on 29 public datasets, showing that the VFDT with nmin adaptation consumes up to 31% less energy than the original VFDT, and up to 96% less energy than the CVFDT (VFDT adapted for concept drift scenarios), trading off up to 1.7 percent of accuracy.
Machine learning software accounts for a significant amount of the energy consumed in data centers. These algorithms are usually optimized towards predictive performance, i.e. accuracy, and scalability. This is the case for data stream mining algorithms. Although these algorithms are adaptive to the incoming data, they have fixed parameters from the beginning of the execution. We have observed that having fixed parameters leads to unnecessary computations, thus making the algorithm energy inefficient. In this paper we present the nmin adaptation method for Hoeffding trees. This method adapts the value of the nmin parameter, which significantly affects the energy consumption of the algorithm. The method reduces unnecessary computations and memory accesses, thus reducing the energy, while the accuracy is only marginally affected. We experimentally compared VFDT (Very Fast Decision Tree, the first Hoeffding tree algorithm) and CVFDT (Concept-adapting VFDT) with VFDT-nmin (VFDT with nmin adaptation). The results show that VFDT-nmin consumes up to 27% less energy than the standard VFDT, and up to 92% less energy than CVFDT, trading off a few percent of accuracy on a few datasets.
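The core adaptation idea can be sketched conceptually (this is not the authors' VFDT code): after an unsuccessful split check, estimate from the Hoeffding bound how many more examples are needed before the top two attributes can be separated, and only re-check then instead of every fixed nmin examples. The delta value, the information-gain numbers and the tie-case margin below are assumptions.

```python
# Hedged sketch: estimating when the next split check is worthwhile.
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon for n observed examples."""
    return math.sqrt((value_range ** 2) * math.log(1 / delta) / (2 * n))

def adapted_nmin(best_gain, second_gain, value_range=1.0, delta=1e-7):
    """Examples needed until the bound drops below the gain difference."""
    diff = best_gain - second_gain
    if diff <= 0:
        diff = 1e-3                          # tie case: fall back to a small margin
    return math.ceil((value_range ** 2) * math.log(1 / delta) / (2 * diff ** 2))

# With gains 0.12 and 0.10 the difference is only 0.02, so many more examples
# are required; checking for a split before then wastes computation and energy.
n = adapted_nmin(0.12, 0.10)
print("re-check the split after", n, "examples;",
      "bound at that point:", round(hoeffding_bound(1.0, 1e-7, n), 4))
```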
Machine learning algorithms are responsible for a significant amount of computations. These computations are increasing with the advancements in different machine learning fields. For example, fields such as deep learning require algorithms to run for weeks, consuming vast amounts of energy. While there is a trend towards optimizing machine learning algorithms for performance and energy consumption, there is still little knowledge on how to estimate an algorithm’s energy consumption. Currently, a straightforward cross-platform approach to estimate energy consumption for different types of algorithms does not exist. For that reason, well-known researchers in computer architecture have published extensive works on approaches to estimate the energy consumption. This study presents a survey of methods to estimate energy consumption, and maps them to specific machine learning scenarios. Finally, we illustrate our mapping suggestions with a case study, where we measure energy consumption in a big data stream mining scenario. Our ultimate goal is to bridge the current gap that exists to estimate energy consumption in machine learning scenarios.
This study is devoted to understanding traffic cruising causation through exploring and enhancing parking data. Five recent (2017-2020) studies modeling parking congestion relied on occupancy as their only parking lot feature, then compared modeling techniques using this feature to find the best performance. However, recently some computer scientists pointed out that it is more effective for the computer science community to focus more on data preparation for performance improvements, rather than exclusively on comparing modeling techniques. This inspired us to add more parking lot features and evaluate them, to investigate how they should be composed into a congestion score, acting as a more accurate picture of reality. The score is then compared to the performance of a version where occupancy is the only parking lot feature. An experimental case study is designed in three parts. The first measures how the features should be summed into a score according to drivers' expectations. The second analyzes how much data can be reused from the real data, and whether spatial or temporal comparisons are better for data synthesis of parking data. The third part compares the performance of the score against the occupancy-only version using the k-means clustering algorithm and dynamic time warping distance. The experimental results show performance improvements in all spatial and temporal categories, and increasing improvement as the sample sizes grow. © 2021 IEEE.
The success of Federated Learning (FL) hinges upon the active participation and contributions of edge devices as they collaboratively train a global model while preserving data privacy. Understanding the behavior of individual clients within the FL framework is essential for enhancing model performance, ensuring system reliability, and protecting data privacy. However, analyzing client behavior poses a significant challenge due to the decentralized nature of FL, the variety of participating devices, and the complex interplay between client models throughout the training process. This research proposes a novel approach based on eccentricity analysis to address the challenges associated with understanding the different clients' behavior in the federation. We study how the eccentricity analysis can be applied to monitor the clients' behaviors through the training process by assessing the eccentricity metrics of clients' local models and clients' data representation in the global model. The Kendall ranking method is used for evaluating the correlations between the defined eccentricity metrics and the clients' benefit from the federation and influence on the federation, respectively. Our initial experiments on a publicly available data set demonstrate that the defined eccentricity measures can provide valuable information for monitoring the clients' behavior and eventually identify clients with deviating behavioral patterns. © 2024 IEEE.
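As a hedged illustration of the kind of analysis described above, the sketch below applies a commonly used data-eccentricity formulation (1/n plus the squared distance to the mean, normalised by n times the average squared distance) to flattened client model updates, and relates the resulting values to a per-client benefit score with Kendall's tau. All data is synthetic, and the exact eccentricity metrics and benefit/influence definitions in the paper may differ.

```python
# Hedged sketch: eccentricity of client updates and Kendall correlation with benefit.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(8)
client_updates = rng.normal(size=(20, 30))   # flattened local models (assumed)
benefit = rng.uniform(size=20)               # e.g. accuracy gain from the federation (assumed)

mu = client_updates.mean(axis=0)
sq_dist = ((client_updates - mu) ** 2).sum(axis=1)
eccentricity = 1 / len(client_updates) + sq_dist / (len(client_updates) * sq_dist.mean())

tau, p_value = kendalltau(eccentricity, benefit)
print(f"Kendall tau between eccentricity and benefit: {tau:.3f} (p={p_value:.3f})")
print("most eccentric clients:", np.argsort(eccentricity)[-3:].tolist())
```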
Requirements engineering (RE) literature acknowledges the importance of stakeholder identification early in the software engineering activities. However, literature overlooks the challenge of identifying and selecting the right stakeholders and the potential of using other inanimate requirements sources for RE activities for market-driven products.
Market-driven products are influenced by a large number of stakeholders. Consulting all stakeholders directly is impractical, and companies utilize indirect data sources, e.g. documents and representatives of larger groups of stakeholders. However, without a systematic approach, companies often use easy-to-access or hard-to-ignore data sources for RE activities. As a consequence, companies waste resources on collecting irrelevant data or develop the product based on the input from a few sources, thus missing market opportunities.
We propose a collaborative and structured method to support analysts in the identification and selection of the most relevant data sources for market-driven product engineering. The method consists of four steps and aims to build consensus between different perspectives in an organization and facilitate the identification of the most relevant data sources. We demonstrate the use of the method with two industrial case studies.
Our results show that the method can support market-driven requirements engineering in two ways: (1) by providing systematic steps to identify and prioritize data sources for RE, and (2) by highlighting and resolving discrepancies between different perspectives in an organization.
Cluster validation measures are designed to find the partitioning that best fits the underlying data. In this study, we show that these measures can be used for identifying mislabeled instances or class outliers prior to training in supervised learning problems. We introduce an ensemble technique, entitled CVI-based Outlier Filtering, which identifies and eliminates mislabeled instances from the training set, and then builds a classification hypothesis from the set of remaining instances. Our approach assigns to each instance in the training set several cluster validation scores representing its potential of being a class outlier with respect to the clustering properties the used validation measures assess. In this respect, the proposed approach may be referred to as a multi-criteria outlier filtering measure. In this work, we specifically study and evaluate value-based ensembles of cluster validation indices. The added value of this approach in comparison to the logical and rank-based ensemble solutions is discussed and further demonstrated. © 2020, Springer Nature Switzerland AG.
The communication of sustainability values shared between product developers and customers is an important catalyst for effective collaboration that inspires sustainable consumption. Despite the many tools developed for assessing and communicating a product’s sustainability performance, customers face difficulties in understanding product sustainability information. Knowledge gaps remain about how product sustainability information is perceived and how this impacts customer purchasing behaviour. This paper outlines a new approach, driven by both backcasting and forecasting thinking, for understanding and modelling customer preferences for product sustainability information. We report findings from a case study of a large workplace furniture manufacturer. The study explored the potential of i) identifying prioritised sustainability attributes using Sustainability Design Space (SDS), and ii) applying machine learning to model customer preferences.
Individual purchasing behavior has substantial impact on the environment and our society. To encourage sustainable consumption, this paper explores the application of clustering analysis techniques for modelling customer preference for sustainability information. This study has analyzed sales data provided by a furniture company that covers a one-year period and 7602 customer accounts. The analysis focused on the purchases of office chairs. Clustering analysis was applied to build preference models of the customers. This study has identified 3 typical customer behavior signatures w.r.t. the sustainability categories used in a sustainability index. We have shown how these models can be used to predict new customers’ sustainability preferences. The stability of the proposed solutions has been studied by comparing the preference models generated on different product groups. The results can provide insights for designing sustainability communication strategies to attract potential customers.
We present a new method, called hyperplane folding, that increases the margin in Support Vector Machines (SVMs). Based on the location of the support vectors, the method splits the dataset into two parts, rotates one part of the dataset and then merges the two parts again. This procedure increases the margin as long as the margin is smaller than half of the shortest distance between any pair of data points from the two different classes. We provide an algorithm for the general case with n-dimensional data points. A small experiment with three folding iterations on 3-dimensional data points with non-linear relations shows that the margin does indeed increase and that the accuracy improves with a larger margin. The method can use any standard SVM implementation plus some basic manipulation of the data points, i.e., splitting, rotating and merging. Hyperplane folding also increases the interpretability of the data. © 2019 Association for Computing Machinery.
This paper proposes a novel multi-stream clustering algorithm, MultiStream EvolveCluster (MS-EC), that can be used for continuous and distributed monitoring and analysis of evolving time series phenomena. It can maintain evolving clustering solutions separately for each stream/view and consensus clustering solutions reflecting evolving interrelations among the streams. Each stream behavior can be analyzed by different clustering techniques using a distance measure and data granularity that is specially selected for it. The properties of the MultiStream EvolveCluster algorithm are studied and evaluated with respect to different consensus clustering techniques, distance measures, and cluster evaluation measures in synthetic and real-world smart building datasets. Our evaluation results show a stable algorithm performance in synthetic data scenarios. In the case of real-world data, the algorithm behavior demonstrates sensitivity to the individual streams’ data quality and the used consensus clustering technique.
Data has become an integral part of our society in the past years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements are not possible in the data stream clustering scenario, where data arrives and needs to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities on clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.
In this paper, we present an ongoing work on using a household electricity consumption behavior model for recognizing changes in sleep patterns. The work is inspired by recent studies in neuroscience revealing an association between dementia and sleep disorders and, more particularly, supporting the hypothesis that insomnia may be a predictor for dementia in older adults. Our approach initially creates a clustering model of normal electricity consumption behavior of the household by using historical data. Then we build a new clustering model on a new set of electricity consumption data collected over a predefined time period and compare the existing model with the newly built electricity consumption behavior model. If a discrepancy between the two clustering models is discovered, a further analysis of the current electricity consumption behavior is conducted in order to investigate whether this discrepancy is associated with alterations in the resident’s sleep patterns. The approach is studied and initially evaluated on electricity consumption data collected from a single randomly selected anonymous household. The obtained results show that our approach is capable of mining changes in the resident’s daily routines by monitoring and analyzing their electricity consumption behavior model.
In this study we apply clustering techniques for analyzing and understanding households’ electricity consumption data. The knowledge extracted by this analysis is used to create a model of normal electricity consumption behavior for each particular household. Initially, the household’s electricity consumption data are partitioned into a number of clusters with similar daily electricity consumption profiles. The centroids of the generated clusters can be considered as representative signatures of a household’s electricity consumption behavior. The proposed approach is evaluated by conducting a number of experiments on electricity consumption data of ten selected households. The obtained results show that the proposed approach is suitable for data organizing and understanding, and can be applied for modeling electricity consumption behavior on a household level. © Springer Nature Switzerland AG 2019.
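The household-model idea can be sketched minimally as follows: daily 24-hour consumption profiles are clustered and the centroids serve as representative behaviour signatures against which new days are matched. The consumption data, cluster count and variable names are synthetic assumptions, not the evaluated household data.

```python
# Hedged sketch: daily load profiles -> k-means signatures of a household.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(9)
base = np.sin(np.linspace(0, 2 * np.pi, 24)) + 1.5      # a generic daily shape (assumed)
daily_profiles = base + rng.normal(0, 0.3, size=(365, 24))

km = KMeans(n_clusters=4, n_init=10, random_state=9).fit(daily_profiles)
signatures = km.cluster_centers_                        # representative behaviour signatures
print("silhouette of the household model:",
      round(silhouette_score(daily_profiles, km.labels_), 3))

# A new day's profile is matched to its closest signature; a poor match could
# indicate a change in the household's routine.
new_day = base + rng.normal(0, 0.3, size=24)
print("closest signature:", int(km.predict(new_day.reshape(1, -1))[0]))
```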