Similarity assessment for removal of noisy end user license agreements
Blekinge Institute of Technology, School of Computing.
2012 (English). In: Knowledge and Information Systems, ISSN 0219-1377, Vol. 32, no. 1, p. 167-189. Article in journal (Refereed), Published.
Abstract [en]

In previous work, we have shown that it is possible to automatically discriminate between legitimate software and spyware-associated software by performing supervised learning on end user license agreements (EULAs). However, the number of false positives (spyware classified as legitimate software) was too large for practical use. In this study, the false positives problem is addressed by removing noisy EULAs, which are identified by performing similarity analysis of the previously studied EULAs. Two candidate similarity analysis methods for this purpose are experimentally compared: cosine similarity assessment in conjunction with latent semantic analysis (LSA), and normalized compression distance (NCD). The results show that the number of false positives can be reduced significantly by removing noise identified by either method. However, the experimental results also indicate subtle performance differences between LSA and NCD. To improve the performance even further and to decrease the large number of attributes, the categorical proportional difference (CPD) feature selection algorithm was applied. CPD greatly reduced the number of attributes while at the same time increasing classification performance on the original data set, as well as on the LSA- and NCD-based data sets.
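
The two similarity measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of zlib as the compressor, the whitespace tokenisation, and the function names are assumptions made for illustration, and the cosine similarity here is computed on raw term counts, without the LSA projection used in the study.

```python
import math
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Values near 0 suggest similar texts, values near 1 dissimilar texts."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cosine_similarity(x: str, y: str) -> float:
    """Cosine of the angle between raw term-count vectors.
    Simplification: no LSA (truncated SVD) step is applied."""
    vx, vy = Counter(x.lower().split()), Counter(y.lower().split())
    dot = sum(vx[t] * vy[t] for t in vx.keys() & vy.keys())
    norm_x = math.sqrt(sum(c * c for c in vx.values()))
    norm_y = math.sqrt(sum(c * c for c in vy.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    # Hypothetical EULA-like snippets, not taken from the studied data set.
    a = "This software may collect usage data and display advertisements. " * 20
    b = "You may install one copy of the program on a single computer. " * 20
    print("NCD(a, a)    =", ncd(a, a))
    print("NCD(a, b)    =", ncd(a, b))
    print("cosine(a, b) =", cosine_similarity(a, b))
```

In a noise-removal setting, a score like these would be computed between pairs of EULAs, and documents that are unusually similar to documents of the opposite class could then be flagged as candidates for removal. The flagging rule and any threshold are left open here because the abstract does not specify them.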

Place, publisher, year, edition, pages
Springer, 2012. Vol. 32, no. 1, p. 167-189
Keywords [en]
End user license agreement, Latent semantic analysis, Normalized compression distance, Spyware
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-7176
DOI: 10.1007/s10115-011-0438-9
ISI: 000305692000007
Local ID: oai:bth.se:forskinfoFBDEF2128A7A7A8AC12578DE000AEEB5
OAI: oai:DiVA.org:bth-7176
DiVA, id: diva2:834758
Available from: 2012-11-27
Created: 2011-07-31
Last updated: 2018-01-11
Bibliographically approved

Authority records

Lavesson, Niklas; Axelsson, Stefan