A Tale of Many Streams: Characterizing a Hybrid Batch-Stream Production Workload in Digazu, a Data Lake supported by Apache Kafka and Flink

(2025) 19th ACM International Conference on Distributed and Event-based Systems. Location: Gothenburg, Sweden (10 June 2025)

Abstract
Many industries rely on analyzing large volumes of combined historical and live data. A data lake facilitates these operations by supporting an integrated data ingestion, storage, replay, and analysis workflow. A modern data lake is distributed and combines a processing engine such as Apache Flink, able to seamlessly process large volumes of existing data as well as continuous flows of new data, with a storage infrastructure such as Apache Kafka, able to ingest and replay this data. The use of Flink in this setting departs from the commonly agreed model of stream processing, in which queries operate over windows of events and maintain a bounded and relatively small state per operator. Instead, hybrid batch-stream queries typically process an existing data set in its entirety before updating results with incoming stream data, leading to a large accumulated state. Given the importance of such usages in industry, it is essential to understand their characteristics and how they differ from the assumptions commonly made when designing and evaluating stream processing systems. In this paper, we present an analysis of a large-scale hybrid batch-stream workload collected from a production deployment of Digazu, a modern data lake built upon Kafka and Flink. We characterize 142 different sources of data and 129 hybrid batch-stream queries. Our analysis offers valuable insights into the nature of data and queries in a typical data lake deployment, which will assist designers of such systems and of associated benchmarks.
Citations

Schmitz, D., Berrewaerts, L., Rosinosky, G., Skhiri, S., & Riviere, E. (2025). A Tale of Many Streams: Characterizing a Hybrid Batch-Stream Production Workload in Digazu, a Data Lake supported by Apache Kafka and Flink. In Proceedings of the 19th ACM International Conference on Distributed and Event-based Systems (pp. 188-198). Association for Computing Machinery. https://doi.org/10.1145/3701717.3734462