Automated Malware Detection and Classification Using Supervised Learning
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
2024 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Malware has been a key concern for information technology security researchers for decades. Every year, anti-malware companies release alarming statistics suggesting a continuous increase in the number and types of malware. This is mainly due to the constant development of new and more sophisticated malicious functionalities, propagation vectors, and infection tactics. To combat this ever-evolving threat, anti-malware companies analyze thousands of malicious samples daily, either manually or through semi-automated means, to identify their type (whether a variant or zero-day) and family. After the analysis, the signature or rule databases of anti-malware products are updated so that known malware can be detected. However, due to the ever-growing capabilities of malware, the malware analysis process is challenging and requires significant human effort. As a result, researchers are focusing on data-driven approaches based on machine learning to develop intelligent malware detectors with high accuracy. Specifically, they have focused on extracting static features from malware in the form of n-grams for experimental purposes. However, previous research is inconclusive in terms of optimal feature representation and detection accuracy.

The primary objective of this thesis is to present state-of-the-art automated techniques for detecting and classifying malware using supervised learning algorithms. In particular, the focus is on two critical aspects of supervised learning-based malware detection: optimal feature representation and improved detection accuracy. Malware detection can be accomplished using two methods: static analysis, which extracts patterns without executing malware, and dynamic analysis, which captures behaviors through executing malware. This thesis focuses on static analysis instead of dynamic analysis because static analysis requires fewer computing resources. An additional benefit of static analysis is that present-day malware cannot evade it. To achieve the goals of this thesis, two new feature representations for static analysis are proposed. Furthermore, three customized ensembles are introduced to enhance malware detection accuracy, and their feasibility is experimentally demonstrated.  

The experiments incorporate customized malware data sets, including Spyware, Adware, Scareware, and Android malware samples, and public malware data sets from Microsoft with samples from nine distinct malware families. Artificially generated data sets are employed to mitigate class imbalance issues and to represent inter-family and intra-family examples. Reverse engineering is performed to transform the data sets into feature data sets using both byte code and assembly language instructions. Further, existing and new feature representations, along with various feature selection algorithms and feature fusion techniques, are explored. To enhance detection accuracy, different decision rules from social choice theory, such as veto and consensus, are integrated into customized ensembles. The experimental results indicate that the proposed methods are capable of detecting known and zero-day malware. The proposed ensembles are also tested on UCI public data sets, such as Forest CoverType, and the results demonstrate their effectiveness in classification. Further, these methods are designed to be portable and adaptable to different operating systems, and they can also be scaled for multi-class malware detection.

Place, publisher, year, edition, pages
Karlskrona: Blekinge Tekniska Högskola, 2024.
Series
Blekinge Institute of Technology Doctoral Dissertation Series, ISSN 1653-2090 ; 3
Keywords [en]
Malware Detection, Android Malware, Machine Learning, Static Malware Analysis, Cyber Security, Ensemble learning, Supervised Learning, Feature Selection
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:bth-25793; ISBN: 978-91-7295-475-5 (print); OAI: oai:DiVA.org:bth-25793; DiVA, id: diva2:1825596
Public defence
2024-01-31, J1630, Campus Karlskrona, 13:00 (English)
Available from: 2024-01-09 Created: 2024-01-09 Last updated: 2024-01-11 Bibliographically approved
List of papers
1. Detection of Spyware by Mining Executable Files
2010 (English) Conference paper, Published paper (Refereed) Published
Abstract [en]

Spyware represents a serious threat to confidentiality since it may result in loss of control over private data for computer users. This type of software might collect the data and send it to a third party without informed user consent. Traditionally, two approaches have been presented for the purpose of spyware detection: signature-based detection and heuristic-based detection. These approaches perform well against known spyware but have not been proven to be successful at detecting new spyware. This paper presents a spyware detection approach that uses Data Mining (DM) technologies. Our approach is inspired by DM-based malicious code detectors, which are known to work well for detecting viruses and similar software. However, this type of detector has not been investigated in terms of how well it is able to detect spyware. We extract binary features, called n-grams, from both spyware and legitimate software and apply five different supervised learning algorithms to train classifiers that are able to classify unknown binaries by analyzing extracted n-grams. The experimental results suggest that our method is successful even when the training data is scarce.
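
As an illustration of the n-gram idea described in this abstract, the following minimal sketch counts overlapping byte n-grams in a binary and turns them into a presence/absence feature vector. The n-gram size of 4, the hex-string rendering, and the function names `byte_ngrams` and `to_binary_vector` are assumptions made for this example; the abstract does not state the configuration used in the paper.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count overlapping byte n-grams, each rendered as a hex string.

    The window size n=4 is an assumption for illustration only.
    """
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))

def to_binary_vector(data: bytes, vocabulary: list, n: int = 4) -> list:
    """Mark each vocabulary n-gram as present (1) or absent (0) in the file."""
    present = byte_ngrams(data, n)
    return [1 if gram in present else 0 for gram in vocabulary]

# Six made-up bytes standing in for the start of an executable file.
sample = bytes.fromhex("4d5a90000300")
grams = byte_ngrams(sample, n=4)
vector = to_binary_vector(sample, ["4d5a9000", "deadbeef"])
```

Vectors of this kind, one per file and labeled as spyware or legitimate, could then be passed to any supervised learning algorithm.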

Place, publisher, year, edition, pages
Krakow: IEEE Computer Society, 2010
Keywords
Spyware Detection, Data Mining, Malicious Code, Feature Extraction
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-7837 (URN); 10.1109/ARES.2010.105 (DOI); 000278197800042 (); oai:bth.se:forskinfo22AC5EFE2DB008C0C12576EC0066347A (Local ID); oai:bth.se:forskinfo22AC5EFE2DB008C0C12576EC0066347A (Archive number); oai:bth.se:forskinfo22AC5EFE2DB008C0C12576EC0066347A (OAI)
Conference
The Fifth International Conference on Availability, Reliability and Security (ARES 2010)
Available from: 2012-09-18 Created: 2010-03-20 Last updated: 2024-01-09 Bibliographically approved
2. Accurate Adware Detection using Opcode Sequence Extraction
2011 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Adware represents a possible threat to the security and privacy of computer users. Traditional signature-based and heuristic-based methods have not been proven to be successful at detecting this type of software. This paper presents an adware detection approach based on the application of data mining on disassembled code. The main contributions of the paper are a large publicly available adware data set, an accurate adware detection algorithm, and an extensive empirical evaluation of several candidate machine learning techniques that can be used in conjunction with the algorithm. We have extracted sequences of opcodes from adware and benign software and then applied feature selection, using different configurations, to obtain 63 data sets. Six data mining algorithms have been evaluated on these data sets in order to find an efficient and accurate detector. Our experimental results show that the proposed approach can be used to accurately detect both novel and known adware instances even though the binary difference between adware and legitimate software is usually small.

Place, publisher, year, edition, pages
Vienna: IEEE Press, 2011
Keywords
Data Mining, Adware Detection, Binary Classification, Static Analysis, Disassembly, Instruction Sequences
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-7462 (URN); 10.1109/ARES.2011.35 (DOI); oai:bth.se:forskinfo596323F8D63E0D5DC12578FD004443B0 (Local ID); 978-0-7695-4485-4 (ISBN); oai:bth.se:forskinfo596323F8D63E0D5DC12578FD004443B0 (Archive number); oai:bth.se:forskinfo596323F8D63E0D5DC12578FD004443B0 (OAI)
Conference
Sixth International Conference on Availability, Reliability and Security
Available from: 2012-09-18 Created: 2011-08-31 Last updated: 2024-01-09 Bibliographically approved
3. Detecting Scareware by Mining Variable Length Instruction Sequences
2011 (English) Conference paper, Published paper (Refereed) Published
Abstract [en]

Scareware is a recent type of malicious software that may pose financial and privacy-related threats to novice users. Traditional countermeasures, such as anti-virus software, require regular updates and often lack the capability of detecting novel (unseen) instances. This paper presents a scareware detection method that is based on the application of machine learning algorithms to learn patterns in extracted variable-length opcode sequences derived from instruction sequences of binary files. The patterns are then used to classify software as legitimate or scareware, but they may also reveal interpretable behavior that is unique to either type of software. We have obtained a large number of real-world scareware applications and designed a data set with 550 scareware instances and 250 benign instances. The experimental results show that several common data mining algorithms are able to generate accurate models from the data set. The Random Forest algorithm is shown to outperform the other algorithms in the experiment. Essentially, our study shows that, even though the differences between scareware and legitimate software are subtler than between, say, viruses and legitimate software, the same type of machine learning approach can be used in both of these dissimilar cases.
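
To make the notion of variable-length opcode sequences concrete, here is a small sketch that cuts a disassembled opcode stream into sequences of differing lengths. The abstract does not say how sequence boundaries are chosen; ending a sequence at a control-transfer opcode, the `TERMINATORS` set, and the function name `variable_length_sequences` are all assumptions made for this example.

```python
# Hypothetical set of opcodes that close a sequence (an assumption,
# not taken from the paper).
TERMINATORS = {"jmp", "je", "jne", "call", "ret"}

def variable_length_sequences(opcodes):
    """Split an opcode stream into sequences that end at a terminator."""
    sequences, current = [], []
    for op in opcodes:
        current.append(op)
        if op in TERMINATORS:
            sequences.append(tuple(current))
            current = []
    if current:  # keep any trailing opcodes that never hit a terminator
        sequences.append(tuple(current))
    return sequences

# Made-up opcode listing, as a disassembler might produce it.
listing = ["push", "mov", "call", "mov", "cmp", "jne", "xor", "ret"]
seqs = variable_length_sequences(listing)
```

Each distinct sequence can then serve as one feature, with its presence or frequency per file forming the input to a classifier such as Random Forest.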

Place, publisher, year, edition, pages
Johannesburg: IEEE Press, 2011
Keywords
Scareware, Instruction Sequences, Classification
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-7464 (URN); oai:bth.se:forskinfo7F7D2C29F6FBC1E9C12578FC0035D21D (Local ID); 978-1-4577-1482-5 (ISBN); oai:bth.se:forskinfo7F7D2C29F6FBC1E9C12578FC0035D21D (Archive number); oai:bth.se:forskinfo7F7D2C29F6FBC1E9C12578FC0035D21D (OAI)
Conference
Information Security for South Africa
Available from: 2012-09-18 Created: 2011-08-30 Last updated: 2024-01-09 Bibliographically approved
4. Comparative Analysis of Voting Schemes for Ensemble-based Malware Detection
2013 (English) In: Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications, ISSN 2093-5374, E-ISSN 2093-5382, Vol. 4, no 1, p. 98-117. Article in journal (Refereed) Published
Abstract [en]

Malicious software (malware) represents a threat to the security and the privacy of computer users. Traditional signature-based and heuristic-based methods are inadequate for detecting some forms of malware. This paper presents a malware detection method based on supervised learning. The main contributions of the paper are two ensemble learning algorithms, two pre-processing techniques, and an empirical evaluation of the proposed algorithms. Sequences of operational codes are extracted as features from malware and benign files. These sequences are used to create three different data sets with different configurations. A set of learning algorithms is evaluated on the data sets. The predictions from the learning algorithms are combined by an ensemble algorithm, whose predicted outcome is decided on the basis of voting. The experimental results show that the veto approach can accurately detect both novel and known malware instances with higher recall than majority voting; however, the precision of veto voting is lower than that of majority voting. Veto voting is further extended to trust-based veto voting. A comparison of majority voting, veto voting, and trust-based veto voting is performed. The experimental results indicate the suitability of each voting scheme for detecting a particular class of software. The experimental results for the composite F1-measure indicate that majority voting is slightly better than trust-based veto voting, while trust-based veto voting is significantly better than plain veto voting.
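
The difference between the two basic voting schemes compared in this abstract can be sketched as follows. This is a simplified reading, assuming that a veto means any single base classifier predicting malware overrides the other votes; the paper's exact rule, and its trust-based extension, may differ. The label strings and function names are hypothetical.

```python
def majority_vote(predictions):
    """Label a file as malware only if more than half the classifiers agree."""
    malicious = sum(1 for p in predictions if p == "malware")
    return "malware" if malicious > len(predictions) / 2 else "benign"

def veto_vote(predictions):
    """Label a file as malware if any one classifier says so (assumed rule)."""
    return "malware" if "malware" in predictions else "benign"

# Made-up outputs of three base classifiers for one file.
predictions = ["benign", "benign", "malware"]
majority_result = majority_vote(predictions)
veto_result = veto_vote(predictions)
```

Under this reading, a single dissenting classifier is enough to flag the file, which is consistent with the reported pattern of higher recall and lower precision for veto voting.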

Place, publisher, year, edition, pages
Innovative Information Science & Technology Research Group, 2013
Keywords
Malware detection, scareware, veto voting, feature extraction, classification, majority voting, ensemble, trust, malicious software
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-7001 (URN); oai:bth.se:forskinfo026B75A577C2FBD6C1257B3400281F31 (Local ID); oai:bth.se:forskinfo026B75A577C2FBD6C1257B3400281F31 (Archive number); oai:bth.se:forskinfo026B75A577C2FBD6C1257B3400281F31 (OAI)
Note

Open Access Journal

Available from: 2013-03-20 Created: 2013-03-20 Last updated: 2024-01-09 Bibliographically approved
5. Consensus decision making in random forests
2015 (English) In: Revised Selected Papers of the First International Workshop on Machine Learning, Optimization, and Big Data, Springer, 2015, Vol. 9432, p. 347-358. Conference paper, Published paper (Refereed)
Abstract [en]

The applications of Random Forests, an ensemble learner, are investigated in different domains, including malware classification. Random Forests uses the majority rule to determine the outcome; however, a decision from the majority rule faces different challenges, for example that the decision may not be representative of, or supported by, all trees in the forest. To address such problems and increase decision accuracy, consensus decision making (CDM) is suggested. The decision mechanism of Random Forests is replaced with CDM. The updated Random Forests algorithm is evaluated mainly on malware data sets, and the results are compared with unmodified Random Forests. The empirical results suggest that the proposed Random Forests, i.e., with CDM, performs better than the original Random Forests.
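
The abstract does not specify how CDM is implemented. Purely to illustrate how a consensus-style rule can differ from the majority rule over the same tree votes, the sketch below accepts a label only when a supermajority of trees supports it. The 0.75 threshold, the `None` result for an unresolved case, and the function names are assumptions for this example and are not the paper's algorithm.

```python
from collections import Counter

def majority_rule(tree_votes):
    """Return the most frequent label among the trees' votes."""
    return Counter(tree_votes).most_common(1)[0][0]

def consensus_rule(tree_votes, threshold=0.75):
    """Return the top label only if enough trees support it, else None.

    The threshold is a hypothetical parameter chosen for illustration.
    """
    label, count = Counter(tree_votes).most_common(1)[0]
    return label if count / len(tree_votes) >= threshold else None

# Made-up votes from a ten-tree forest.
narrow_votes = ["malware"] * 6 + ["benign"] * 4
broad_votes = ["malware"] * 8 + ["benign"] * 2
```

With the narrow split, the majority rule returns "malware" while this consensus rule leaves the case unresolved, which shows the kind of weakly supported decision the abstract refers to.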

Place, publisher, year, edition, pages
Springer, 2015
Series
Machine Learning, Optimization, and Big Data, ISSN 0302-9743 ; 9432
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-12949 (URN); 10.1007/978-3-319-27926-8_31 (DOI)
Conference
International Workshop on Machine learning, Optimization and big Data, Taormina, Sicily
Available from: 2016-08-25 Created: 2016-08-25 Last updated: 2024-04-12 Bibliographically approved
6. A Hybrid Approach for Malware Classification Using Secondary Features Fusion
(English) Manuscript (preprint) (Other academic)
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-25881 (URN)
Available from: 2024-01-09 Created: 2024-01-09 Last updated: 2024-01-09 Bibliographically approved
7. Android malware detection using feature fusion and artificial data
2018 (English) In: 16th IEEE International Conference on Dependable, Autonomic and Secure Computing, IEEE 16th International Conference on Pervasive Intelligence and Computing, IEEE 4th International Conference on Big Data Intelligence and Computing and IEEE 3rd Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 702-709. Conference paper, Published paper (Refereed)
Abstract [en]

For Android malware detection and classification, the anti-malware community has relied on traditional malware detection methods as a countermeasure. However, traditional detection methods were developed for detecting computer malware, which differs from Android malware in structure and characteristics. Thus, they may not be useful for Android malware detection. Moreover, the majority of suggested detection approaches may not generalize and are incapable of detecting zero-day malware for various reasons, such as the available data set containing only a specific set of examples. Thus, their detection accuracy may be questionable. To address this problem, this paper presents a malware classification approach with reliable detection accuracy and evaluates the approach using artificially generated examples. The suggested approach generates the signature profiles and behavior profiles of each application in the data set, which are then used as input for the classification task. To improve detection accuracy, fusion of features selected by filter methods and a wrapper method, as well as algorithm fusion, is investigated. The optimal balance between real-world examples and synthetic examples that does not affect detection accuracy is also investigated. The experimental results suggest that both AUC and F1 values of up to 0.94 can be obtained for both known and unknown malware using original examples and synthetic examples.
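
One simple way to read the feature fusion step mentioned in this abstract is as a union of the feature subsets chosen by different selection methods. The sketch below assumes that reading; the feature names, the two example subsets, and the functions `fuse_features` and `project` are hypothetical and do not come from the paper.

```python
def fuse_features(*selected_sets):
    """Fuse feature subsets by set union and return them in a stable order."""
    return sorted(set().union(*selected_sets))

def project(sample, features):
    """Build a feature vector for one application, using 0 for missing features."""
    return [sample.get(name, 0) for name in features]

# Hypothetical subsets picked by a filter method and a wrapper method.
filter_selected = {"perm_send_sms", "perm_read_contacts"}
wrapper_selected = {"perm_read_contacts", "api_get_device_id"}

fused = fuse_features(filter_selected, wrapper_selected)
vector = project({"perm_send_sms": 1, "api_get_device_id": 1}, fused)
```

The fused list keeps every feature that at least one method considered useful, so the classifier sees a broader view than either selection alone would provide.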

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Android (operating system), Big data, Classification (of information), Computer crime, Feature extraction, Classification tasks, Computer malware, Detection accuracy, Detection approach, Detection methods, Malware classifications, Malware detection, Reliable detection
National Category
Computer Sciences
Identifiers
urn:nbn:se:bth-25882 (URN); 10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.00123 (DOI); 2-s2.0-85056882366 (Scopus ID); 9781538675182 (ISBN)
Conference
16th IEEE International Conference on Dependable, Autonomic and Secure Computing, IEEE 16th International Conference on Pervasive Intelligence and Computing, IEEE 4th International Conference on Big Data Intelligence and Computing and IEEE 3rd Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec, Athens 12-15 August 2018
Available from: 2024-01-09 Created: 2024-01-09 Last updated: 2024-01-09 Bibliographically approved

Open Access in DiVA

fulltext (2908 kB), 2241 downloads
File information
File name: FULLTEXT01.pdf; File size: 2908 kB; Checksum: SHA-512
a897ff202744b9dea9e666e645d1461b37a780ea537092a2020d67e20249ece2d2e84894f40d01f1e93b520f9529d86780effb65298b98295526b67b1833d9b6
Type: fulltext; Mimetype: application/pdf

Authority records

Shahzad, Raja Muhammad Khurram
