On Fault-Tolerance and Tolerated Failures in Communication Networks
Abstract
The Internet has become an essential core technology in our modern society. Because it is a fundamentally heterogeneous collection of independently designed and managed systems, operating its networks reliably is challenging, even as users increasingly rely, or even depend, on their availability.
In response to these rising demands, this dissertation focuses on two aspects of increasing the reliability of today’s communication networks:
On the application layer, we propose to increase the resilience of fault-tolerant protocols in the face of faulty behavior. To that end, we present two novel Byzantine fault-tolerant (BFT) protocols. PermitBFT lowers the achievable commit latency despite tolerating faulty behavior. It relaxes the traditional consensus properties to order only committed transactions, potentially leaving conflicting transactions uncommitted. PermitBFT is the first totally-ordering BFT protocol that achieves a commit latency of only two message delays while tolerating malicious behavior by up to a third of the nodes. In the same setting, FnF-BFT constitutes the first BFT protocol with provable performance even under attack, guaranteeing a constant fraction of its best-case throughput as long as the network remains stable. To achieve that, FnF-BFT allows all nodes to act as leaders in parallel, ensuring that at least the correct nodes make continuous progress.
On the network layer, we present a framework that facilitates the analysis of network instabilities caused by routing events, both in lab environments and in live networks. To that end, we (i) develop a measurement framework for studying the effects of transient forwarding anomalies in a lab environment, (ii) show how to infer transient forwarding anomalies in live networks from either control-plane messages or router logs with our system Trix, and (iii) propose a design to explore networks’ various convergence behaviors through simulation, facilitating the systematic analysis and potential prevention of network instabilities in the future.
BibTeX
@phdthesis{schmid2025fault-tolerance,
author = {Schmid, Roland},
title = {{On Fault-Tolerance and Tolerated Failures in Communication Networks}},
year = 2025,
month = oct,
publisher = {ETH Zurich},
doi = {10.3929/ETHZ-C-000785024},
url = {https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/785024/schmid_thesis_online.pdf},
school = {ETH Zurich}
}
Research Collection: 20.500.11850/785024
Slide Sources: https://gitlab.ethz.ch/projects/52351
