Replacing batch-based data extraction withevent streaming with Apache Kafka: A comparative study
2022 (English)Independent thesis Basic level (university diploma), 10 credits / 15 HE credits
Student thesis
Abstract [en]
For growing organisations that have built their data flow around a monolithic database server, anever-increasing number of applications and an ever-increasing demand for data freshness willeventually push the existing system to its limits, prompting either hardware upgrades or anupdated data architecture. Switching from an approach of full extractions of data at regularintervals to an approach where only changes are extracted, resource consumption couldpotentially be decreased, while simultaneously increasing data freshness. The objective of this thesis is to provide insights into how implementing an event streamingsetup with Apache Kafka connected to SQL Server through the Debezium source connectoraffects resource consumption on the database server. Other studies in related work have oftenbeen focused on steps further downstream in the data pipeline. This thesis can thereforecontribute to an area where more knowledge is needed. Through an empirical study done using two different setups in the same system, traditional dataextraction in batches and extraction through event streaming is measured and compared. The point of measurement is the SQL Server database from which data is extracted. Both memoryutilisation and CPU utilisation is measured, using SQL Server Profiler. Different parameters fortable sizes, volumes of data and intervals between changes are used to simulate differentscenarios. One of the takeaways of the results is that, at the same number of total changes, the size of theindividual transactions has a large impact on the resource consumption caused by eventstreaming. The study shows that an overhead cost is involved with each transaction, and also thatthe regular polling that the source connector performs causes resource consumption even inidleness. The thesis concludes that event streaming can offer reduced resource consumption on thedatabase server. However, when the source table size is small, and the number of changes large,extraction in batches is less resource-intensive.
Place, publisher, year, edition, pages
2022. , p. 66
Keywords [en]
etl, kafka connect, resource consumption, sql server
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:bth-23671OAI: oai:DiVA.org:bth-23671DiVA, id: diva2:1696556
Subject / course
PA1438 Självständigt arbete Webbprogrammering
Educational program
PAGWG Webbprogrammering
Supervisors
Examiners
2022-10-132022-09-172022-10-13Bibliographically approved