Efficient Compression of XML and JSON Event Log Records
Blekinge Institute of Technology, Faculty of Computing, Department of Software Engineering.
2025 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The exponential growth of digital data presents significant storage and transmission challenges. General-purpose compressors often underperform on small inputs because they have too little context in which to find and exploit redundancy. This makes it inefficient to compress small inputs such as log entries individually, so entries are instead compressed in chunks, which in turn hinders random-access decompression.
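The small-input problem described above can be illustrated with a minimal sketch. It is not taken from the thesis: the log entries are synthetic, the field names are made up, and Python's standard `zlib` module stands in for a general-purpose compressor.

```python
import json
import zlib

def make_entries(n):
    """Build n small, structurally similar JSON log entries (hypothetical fields)."""
    return [
        json.dumps({"ts": 1700000000 + i, "level": "INFO", "component": "scheduler",
                    "msg": f"job {i} finished", "duration_ms": 10 + i % 7}).encode()
        for i in range(n)
    ]

def individual_size(entries):
    """Total bytes when every entry is compressed on its own."""
    return sum(len(zlib.compress(e)) for e in entries)

def chunked_size(entries):
    """Bytes when all entries are concatenated and compressed as one chunk."""
    return len(zlib.compress(b"\n".join(entries)))

entries = make_entries(200)
raw = sum(len(e) for e in entries)
print("raw:", raw)
print("compressed individually:", individual_size(entries))
print("compressed as one chunk:", chunked_size(entries))
```

The chunk compresses far better because the repeated keys and values are visible to the compressor, but reading a single entry back then requires decompressing the chunk up to that entry.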

Using a design science research methodology, we investigated the characteristics of logs at Ericsson and developed compression strategies for individual XML and JSON log entries that exploit their structural redundancies through shared context. These strategies were benchmarked against commonly used compressors (Gzip, bzip2, LZ4 and Zstandard) using metrics including compression ratio, compression and decompression speed, memory usage and training sensitivity. The results demonstrate that domain-specific compression strategies can significantly improve the compression ratio, but often at the cost of additional computation and therefore reduced throughput. Our strategies improved the compression ratio by 4.8x and 2.8x, respectively, compared to Gzip, with the latter matching Gzip's compression speed. However, the results are highly sensitive to training: compression ratios decrease when context drift occurs. In a user perception study, we found that industry professionals see potential in adopting a domain-specific compression strategy in certain key areas of their systems. Still, they raised several concerns regarding deployment and infrastructure management, and indicated that a more significant gain would likely be the deciding factor for adoption.
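The idea of a shared context can be sketched with a preset dictionary for DEFLATE. This is an illustration only and not the strategy developed in the thesis: it assumes the "training" step is simply concatenating a few earlier entries, uses the `zdict` parameter of Python's standard `zlib` module, and the entry fields are hypothetical.

```python
import json
import zlib

def make_entry(i: int) -> bytes:
    """One small JSON log entry; the field names are made up for illustration."""
    return json.dumps({"ts": 1700000000 + i, "level": "INFO",
                       "component": "scheduler", "msg": f"job {i} finished"}).encode()

# Assumed "training": the shared context is a handful of earlier entries.
shared_context = b"\n".join(make_entry(i) for i in range(20))

def compress_entry(entry: bytes, context: bytes = b"") -> bytes:
    """Compress one entry, optionally priming DEFLATE with a preset dictionary."""
    if context:
        comp = zlib.compressobj(level=9, zdict=context)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(entry) + comp.flush()

def decompress_entry(blob: bytes, context: bytes = b"") -> bytes:
    """Decompress one entry; the same context must be supplied as at compression."""
    decomp = zlib.decompressobj(zdict=context) if context else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

entry = make_entry(12345)
plain = compress_entry(entry)                    # no shared context
primed = compress_entry(entry, shared_context)   # with shared context
assert decompress_entry(primed, shared_context) == entry
print("raw:", len(entry), "plain:", len(plain), "with context:", len(primed))
```

Each entry remains independently decompressible, which preserves random access. The trade-off is that compressor and decompressor must hold the same context, and if the entry structure drifts away from what the context was built from, the gain shrinks.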

We conclude that domain-specific compression strategies exploiting structural redundancy and shared context can improve the compression ratio for individual XML and JSON log entries compared to general-purpose compressors, offering a viable alternative depending on whether compression ratio or speed is prioritized. Practical applicability hinges on managing context drift and balancing the gains against added maintenance complexity.

Abstract [sv]

The rapid growth of digital data creates major challenges for both the storage and transmission of logs. General-purpose compression algorithms often struggle with small amounts of data because they have limited opportunity to identify and exploit redundancy. This prevents efficient compression of individual small inputs, such as log files, so compression is instead performed in larger blocks, which makes it harder to access individual files on decompression.

Using design science research, we studied log files at Ericsson and developed strategies for compressing individual XML and JSON log files by exploiting their structural similarities. These strategies were compared with common compression tools such as Gzip, bzip2, LZ4 and Zstandard in terms of compression ratio, compression and decompression speed, memory usage and sensitivity to training. The results show that domain-specific compression algorithms can improve the compression ratio considerably, but often at the cost of reduced speed. Our strategies improved the compression ratio by 4.8 times and 2.8 times, respectively, compared to Gzip, with the latter also matching Gzip's compression speed. However, these strategies are highly sensitive to how they are trained, which can lead to a lower compression ratio when the data pattern changes (context drift).

Through a study, we found that industry professionals see potential in implementing a domain-specific compression algorithm, but they also expressed concern about deployment and infrastructure management. We conclude that domain-specific compression algorithms can improve the compression ratio for individual XML and JSON log files compared to general-purpose algorithms, and can be a viable alternative depending on whether compression ratio or speed is prioritized. Practical use depends on how well changes in data patterns (context drift) can be handled and on balancing the gains against increased maintenance costs.

Place, publisher, year, edition, pages
2025. p. 74
Keywords [en]
Compression Algorithms, Data Compression, Log Compression, Random Access Decompression
Keywords [sv]
Compression Algorithms, Data Compression, Log Compression
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:bth-27961
OAI: oai:DiVA.org:bth-27961
DiVA, id: diva2:1962630
External cooperation
Ericsson
Subject / course
Degree Project in Master of Science in Engineering 30,0 hp
Educational program
PAAMJ Master of Science in Engineering: Software Engineering 300,0 hp
Supervisors
Examiners
Available from: 2025-06-13. Created: 2025-06-01. Last updated: 2025-09-30. Bibliographically approved.

Open Access in DiVA

fulltext (3836 kB), 413 downloads
File information
File name: FULLTEXT01.pdf
File size: 3836 kB
Checksum (SHA-512): 54e3890c99591e48cfa59399489f0162163523b69bebd9f3018501eadff7781d8aa0f9280bfaa8f556c47fc1f6e2d23016c3c8934ba0c4911d95dd3479bf6dbf
Type: fulltext
Mimetype: application/pdf