Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
A performance study for autoscaling big data analytics containerized applications: Scalability of Apache Spark on Kubernetes
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
2022 (English)Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
Abstract [en]

Container technologies are rapidly changing how distributed applications are executed and managed on cloud computing resources. As containers can be deployed on a large scale, there is a tremendous need for Container Orchestration tools like Kubernetes that are highly automatic in deployment, scaling, and management. In recent times, the adoption of these container technologies like Docker has seen a rise in internal usage, commercial offering, and various application fields ranging from High-Performance Computing to Geo-distributed (Edge or IoT) applications.

Big Data analytics is another field where there is a trend to run applications (e.g., Apache Spark) as containers for elastic workloads and multi-tenant service models by leveraging various container orchestration tools like Kubernetes. Despite the abundant research on the performance impact of containerizing big data applications, to the best of our knowledge, the studies that focus on specific aspects like scalability and resource management are largely unexplored, which leaves a research gap to study upon.

This research studies the performance impact of autoscaling a big data analytics application on Kubernetes based on autoscaling mechanisms like Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA). These state-of-art autoscaling mechanisms available for scaling containerized applications on Kubernetes and the available big data benchmarking tools for generating workload on frameworks like Spark are identified through a literature review. Apache Spark is selected as a representative big data application due to its ecosystem and industry-wide adoption by enterprises. In particular, a series of experiments are conducted by adjusting resource parameters (such as CPU requests and limits) and autoscaling mechanisms to measure run-time metrics like execution time and CPU utilization. Our experiment results show that while Spark performs better execution time when configured to scale with VPA, it also exhibits overhead in CPU utilization. In contrast, the impact of autoscaling big data applications using HPA adds overhead in terms of both execution time and CPU utilization. The research from this thesis can be used by researchers and other cloud practitioners, using big data applications to evaluate autoscaling mechanisms and derive better performance and resource utilization.

Place, publisher, year, edition, pages
2022. , p. 50
Keywords [en]
Containers, Container Orchestration, Big data analytics, Autoscaling, Resource Management
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-22685OAI: oai:DiVA.org:bth-22685DiVA, id: diva2:1641422
Subject / course
DV2572 Master´s Thesis in Computer Science
Educational program
DVADA Master Qualification Plan in Computer Science
Supervisors
Examiners
Available from: 2022-03-02 Created: 2022-03-01 Last updated: 2022-03-02Bibliographically approved

Open Access in DiVA

A performance study for autoscaling big data analytics containerized applications(2592 kB)762 downloads
File information
File name FULLTEXT02.pdfFile size 2592 kBChecksum SHA-512
143ce8cff0ee3f0033c4b093cbd42c7255309bba34ab46179f5172eba87633463d3623d0aa2222cff10a664f6d4978a3cf46d5aafb65540af970ea4418917d0b
Type fulltextMimetype application/pdf

By organisation
Department of Computer Science
Computer Sciences

Search outside of DiVA

GoogleGoogle Scholar
Total: 762 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 1734 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf