FORS: Fault-adaptive Optimized Routing and Scheduling for DAQ Networks

Eloise Stein; Quentin Bramas; Flavio Pisani; Tommaso Colombo; Cristel Pelsser
Computing and Software for Big Science, Vol. 9, No. 1 (2025)

Files

FORS_journal_springer.pdf
  • Open Access
  • Adobe PDF
  • 2.1 MB

Details

Authors
  • Eloise Stein, Université de Strasbourg, CERN
  • Quentin Bramas, Université de Strasbourg
  • Flavio Pisani, CERN
  • Tommaso Colombo, CERN
  • Cristel Pelsser
Abstract
Data acquisition (DAQ) networks, widely used in scientific research and industrial applications, consist of numerous interconnected servers that exchange the substantial data volumes produced by large scientific instruments. One traffic matrix commonly used in such networks is the all-to-all collective exchange, which demands substantial network resources and makes network failures particularly challenging to mitigate. Left unmitigated, network failures severely hamper the performance of the DAQ network and can lead to the loss of valuable experimental data. For DAQ networks that use a fat-tree topology, we propose FORS, a scheduling and associated routing solution that supports the all-to-all collective exchange under network failures. FORS optimizes bandwidth utilization under any failure scenario, providing more robust performance than existing approaches. We propose an algorithm to solve the scheduling problem. For routing, we design an algorithm for simple failure scenarios and a linear programming model for more complex ones. We validate the proposed solution using a real-world DAQ network as a case study. The results show significant performance degradation in existing approaches, whereas FORS consistently achieves higher throughput across various failure scenarios.
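For readers unfamiliar with the all-to-all collective exchange named in the abstract, the following minimal sketch shows one common way to schedule it in a failure-free network: the linear-shift pattern, in which server i sends to server (i + r) mod n in round r. This is a generic illustration, not the FORS scheduling algorithm. The function name linear_shift_schedule and the four-server example are assumptions made for this sketch, and it ignores topology, routing, and failures entirely.

```python
def linear_shift_schedule(n):
    """Build a linear-shift all-to-all schedule for n servers.

    Hypothetical helper for illustration only (not taken from the paper).
    Returns n - 1 rounds; each round is a list of (sender, receiver) pairs
    in which every server sends exactly once and receives exactly once.
    """
    return [
        [(src, (src + shift) % n) for src in range(n)]
        for shift in range(1, n)
    ]


if __name__ == "__main__":
    # Print the schedule for a small, assumed network of 4 servers.
    for round_index, pairs in enumerate(linear_shift_schedule(4), start=1):
        print(f"round {round_index}: {pairs}")
```

Each round is a permutation of the servers, so no server is asked to send or receive twice at once. Over n - 1 rounds, every ordered pair of distinct servers communicates exactly once.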
Citations

Stein, E., Bramas, Q., Pisani, F., Colombo, T., & Pelsser, C. (2025). FORS: Fault-adaptive Optimized Routing and Scheduling for DAQ Networks. Computing and Software for Big Science, 9(1). https://hdl.handle.net/2078.5/256412