Classification of Potentially Unwanted Programs Using Supervised Learning
Blekinge Institute of Technology, School of Computing.
2013 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Malicious software authors have shifted their focus from illegal and clearly malicious software to potentially unwanted programs (PUPs) to earn revenue. PUPs blur the border between legitimate and illegitimate programs and thus fall into a grey zone. Existing anti-virus and anti-spyware software is in many instances unable to detect previously unseen or zero-day attacks or to separate PUPs from legitimate software. Many tools also require frequent updates to be effective. By predicting the class of a particular piece of software, users can get support before taking the decision to install the software. This Licentiate thesis introduces approaches to distinguish PUPs from legitimate software based on the supervised learning of file features represented as n-grams. The overall research method applied in this thesis is experimental. For these experiments, malicious software applications were obtained from anti-malware industrial partners. The legitimate software applications were collected from various online repositories. The general steps of supervised learning, from data preparation (n-gram generation) to evaluation, were followed. Different data representations, such as byte codes and operation codes, with different configurations, such as fixed-size, variable-length, and overlap, were investigated to generate different n-gram sizes. The experimental variables were controlled to measure the correlation between n-gram size, the number of features required for optimal training, and classifier performance. The thesis results suggest that, despite the subtle difference between legitimate software and PUPs, this type of software can be classified accurately with low false positive and false negative rates. The thesis results further suggest an optimal size of operation code-based n-grams for data representation.
Finally, the results indicate that classification accuracy can be increased by using a customized ensemble learner that makes use of multiple representations of the data set. The investigated approaches can be implemented as a software tool that requires less frequent updates than existing commercial tools.
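
To make the data-preparation step in the abstract concrete, the following is a minimal sketch of fixed-size n-gram generation over an operation-code sequence, with and without overlap, followed by a simple majority vote across classifiers trained on different representations. It is an illustration only, not the thesis's implementation: the function names `ngrams` and `majority_vote`, the sample opcode sequence, and the use of plain majority voting in place of the thesis's customized ensemble learner are all assumptions. Variable-length n-grams are not covered.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int, overlap: bool = True) -> List[Tuple[str, ...]]:
    """Return fixed-size n-grams of `tokens`.

    With overlap the window advances one token at a time;
    without overlap it advances n tokens at a time.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    step = 1 if overlap else n
    return [tuple(tokens[i:i + n]) for i in range(0, len(tokens) - n + 1, step)]


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label (assumed stand-in for a custom ensemble)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical opcode sequence extracted from one program.
opcodes = ["push", "mov", "call", "mov", "call", "ret"]

overlapping = ngrams(opcodes, 2)                 # window slides by 1
disjoint = ngrams(opcodes, 2, overlap=False)     # window slides by 2

# n-gram frequencies can serve as features for a supervised classifier.
features = Counter(overlapping)

# Hypothetical labels predicted by classifiers trained on
# different representations of the same program.
label = majority_vote(["PUP", "legitimate", "PUP"])
```

In this sketch the overlapping configuration yields more n-grams per program than the non-overlapping one, which is one reason the number of features needed for training depends on the chosen configuration.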

Place, publisher, year, edition, pages
Karlskrona: Blekinge Institute of Technology, 2013. 154 p.
Series
Blekinge Institute of Technology Licentiate Dissertation Series, ISSN 1650-2140 ; 2
National Category
Computer Science
Identifiers
URN: urn:nbn:se:bth-00548
Local ID: oai:bth.se:forskinfo2408DD58FDB082BDC1257AFC00473485
ISBN: 978-91-7295-247-8 (print)
OAI: oai:DiVA.org:bth-00548
DiVA: diva2:834563
Available from: 2013-04-23 Created: 2013-01-23 Last updated: 2017-03-16 Bibliographically approved

Open Access in DiVA

fulltext (1687 kB), 178 downloads
File information
File name: FULLTEXT01.pdf
File size: 1687 kB
Checksum (SHA-512): d30445564f0fe7e489ad67def72d25911689535486429ec9cbf7446c3315ddcf5c8bf8e34496078ae70c8bb89a4b69cd31ecd26907d6ee697f78923a82ceacd4
Type: fulltext
Mimetype: application/pdf

Author
Shahzad, Raja Muhammad Khurram
Organisation
School of Computing
Computer Science