Active replication for latency-sensitive stream processing in Apache Flink

Rosinosky, Guillaume; Schmidt, Florian; Bodunov, Oleh; Fetzer, Christof; Martin, André; Riviere, Etienne

Active replication for latency-sensitive stream processing in Apache Flink

Rosinosky, Guillaume

;

Schmidt, Florian

;

Bodunov, Oleh

;

Fetzer, Christof

;

Riviere, Etienne

;et.al.

(2021) The 40th International Symposium on Reliable Distributed Systems (SRDS) (20.September.2021)

Files

paper.pdf

Open Access
Adobe PDF
1.57 MB

Download

Details

Authors

Rosinosky, GuillaumeUCLouvain
Author
Schmidt, FlorianTU Dresden
Author
Bodunov, OlehTU Dresden
Author
Fetzer, ChristofTU Dresden
Author
Riviere, EtienneUCLouvain
Author

Abstract

Stream processing frameworks allow processing massive amounts of data shortly after it is produced, and enable a fast reaction to events in scenarios such as data center monitoring, smart transportation, or telecommunication networks. Many scenarios depend on the fast and reliable processing of incoming data, requiring low end-to-end latencies from the ingest of a new event to the corresponding output. The occurrence of faults jeopardizes these guarantees: Currently- leading high-availability solutions for stream processing such as Spark Streaming or Apache Flink’s implement passive replication through snapshotting, requiring a stop-the-world operation to recover from a failure. Active replication, while incurring higher deployment costs, can overcome these limitations and allow to mask the impact of faults and match stringent end-to-end latency requirements. We present the design, implementation, and evaluation of active replication in the popular Apache Flink platform. Our study explores two alternative designs, a leader-based approach leveraging external services (Kafka and ZooKeeper) and a leaderless implementation leveraging a novel deterministic merging algorithm. Our evaluation using a series of microbenchmarks and a SaaS cloud monitoring scenario on a 37-server cluster show that the actively-replicated Flink can fully mask the impact of faults on end-to-end latency.

Affiliations

UCLouvainSST/ICTM/INGI - Pôle en ingénierie informatique

Citations

APA
Chicago
FWB

Rosinosky, G., Schmidt, F., Bodunov, O., Fetzer, C., Martin, A., & Riviere, E. (2021). Active replication for latency-sensitive stream processing in Apache Flink. Proceedings of the 40th International Symposium on Reliable Distributed Systems. Published. The 40th International Symposium on Reliable Distributed Systems (SRDS). https://hdl.handle.net/2078.5/256300