Stream processing frameworks allow processing massive amounts of data shortly after it is produced, and enable a fast reaction to events in scenarios such as data center monitoring, smart transportation, or telecommunication networks. Many scenarios depend on the fast and reliable processing of incoming data, requiring low end-to-end latencies from the ingest of a new event to the corresponding output. The occurrence of faults jeopardizes these guarantees: Currently- leading high-availability solutions for stream processing such as Spark Streaming or Apache Flink’s implement passive replication through snapshotting, requiring a stop-the-world operation to recover from a failure. Active replication, while incurring higher deployment costs, can overcome these limitations and allow to mask the impact of faults and match stringent end-to-end latency requirements. We present the design, implementation, and evaluation of active replication in the popular Apache Flink platform. Our study explores two alternative designs, a leader-based approach leveraging external services (Kafka and ZooKeeper) and a leaderless implementation leveraging a novel deterministic merging algorithm. Our evaluation using a series of microbenchmarks and a SaaS cloud monitoring scenario on a 37-server cluster show that the actively-replicated Flink can fully mask the impact of faults on end-to-end latency.
Rosinosky, G., Schmidt, F., Bodunov, O., Fetzer, C., Martin, A., & Riviere, E. (2021). Active replication for latency-sensitive stream processing in Apache Flink. Proceedings of the 40th International Symposium on Reliable Distributed Systems. Published. The 40th International Symposium on Reliable Distributed Systems (SRDS). https://hdl.handle.net/2078.5/256300