An Evaluation of Machine Learning Approaches for Hierarchical Malware Classification
2019 (English)Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits
Student thesis
Abstract [en]
With an evermore growing threat of new malware that keeps growing in both number and complexity, the necessity for improvement in automatic detection and classification of malware is increasing. The signature-based approaches used by several Anti-Virus companies struggle with the increasing amount of polymorphic malware. The polymorphic malware change some minor aspects of the code to be able to remain undetected. Malware classification using machine learning have been used to try to solve this issue in previous research. In the proposed work, different hierarchical machine learning approaches are implemented to conduct three experiments. The methods utilise a hierarchical structure in various ways to be able to get a better classification performance. A selection of hierarchical levels and machine learning models are used in the experiments to evaluate how the results are affected.
A data set is created, containing over 90000 different labelled malware samples. The proposed work also includes the creation of a labelling method that can be helpful for researchers in malware classification that needs labels for a created data set.The feature vector used contains 500 n-gram features and 3521 Import Address Table features. In the experiments for the proposed work, the thesis includes the testing of four machine learning models and three different amount of hierarchical levels. Stratified 5-fold cross validation is used in the proposed work to reduce bias and variance in the results.
The results from the classification approach shows it achieves the highest hF-score, using Random Forest (RF) as the machine learning model and having four hierarchical levels, which got an hF-score of 0.858228. To be able to compare the proposed work with other related work, pure-flat classification accuracy was generated. The highest generated accuracy score was 0.8512816, which was not the highest compared to other related work.
Place, publisher, year, edition, pages
2019. , p. 107
Keywords [en]
Machine Learning, Hierarchical Malware Classification, Static Malware Analysis, Mnemonic N-grams
National Category
Other Computer and Information Science
Identifiers
URN: urn:nbn:se:bth-18260OAI: oai:DiVA.org:bth-18260DiVA, id: diva2:1332898
External cooperation
Securelink
Subject / course
Degree Project in Master of Science in Engineering 30,0 hp
Educational program
DVACD Master of Science in Computer Security
Presentation
2019-06-05, J1280, Valhallavägen 1, Karlskrona, 09:00 (English)
Supervisors
Examiners
2019-07-012019-06-282022-05-12Bibliographically approved