Improving Network Failure Detection and Recovery with Programmable Data Planes

Doctoral Thesis

Abstract

Since its creation, the Internet has grown exponentially in size and use cases, becoming an integral part of our society. Its seamless operation is often taken for granted; we only recognize its importance when disruptions occur. The current Internet’s complexity and scale make it prone to all sorts of failures, with each minute of downtime costing companies millions of dollars and damaging their reputation.

In this thesis, we address the critical need for rapid detection and recovery mechanisms for network failures. We expand beyond conventional hard failures to explore and address the issue of gray failures in ISP networks, a subtle and poorly understood issue for which operators lack effective solutions. By leveraging advances in programmable data planes, we develop two systems to detect, localize, and recover from network failures.

First, we introduce FANcY, a novel system to detect and localize gray failures in ISP networks. FANcY utilizes programmable switches to implement a reliable synchronization and counting protocol, enabling precise packet loss detection. FANcY adapts to the limited memory capacity of modern switches with a hybrid approach: dedicated counters for high-priority traffic and a probabilistic data structure for best-effort traffic. This design ensures efficient monitoring under various conditions and future-proofs the system against constantly increasing traffic volumes. We demonstrate FANcY’s capability for sub-second gray failure detection and reaction through extensive simulations and a prototype running on Intel Tofino switches.

Second, we present our work on hardware-accelerated network control planes. This research extends beyond detection, demonstrating that programmable data planes can run critical control plane functions traditionally implemented in software. Our working prototype efficiently runs a diverse set of such tasks in the data plane, including detecting hard, gray, and remote failures, notifying other devices, executing distributed path-vector computations that adhere to shortest-path and BGP-like policies, and rapidly updating forwarding states to restore connectivity after failures. Finally, our work identifies challenges in expressiveness and scalability for programmable data planes, emphasizing that the careful selection of tasks for offloading remains a critical area for future research.

People

Dr. Edgar Costa Molero
PhD student
2017–2024

BibTeX

@PHDTHESIS{molero2024improving,
	copyright = {In Copyright - Non-Commercial Use Permitted},
	year = {2024},
	type = {Doctoral Thesis},
	author = {Costa Molero, Edgar},
	size = {168 p.},
	abstract = {Since its creation, the Internet has grown exponentially in size and use cases, becoming an integral part of our society. Its seamless operation is often taken for granted; we only recognize its importance when disruptions occur. The current Internet’s complexity and scale make it prone to all sorts of failures, with each minute of downtime costing companies millions of dollars and damaging their reputation. In this thesis, we address the critical need for rapid detection and recovery mechanisms for network failures. We expand beyond conventional hard failures to explore and address the issue of gray failures in ISP networks, a subtle and poorly understood issue for which operators lack effective solutions. By leveraging advances in programmable data planes, we develop two systems to detect, localize, and recover from network failures. First, we introduce FANcY, a novel system to detect and localize gray failures in ISP networks. FANcY utilizes programmable switches to implement a reliable synchronization and counting protocol, enabling precise packet loss detection. FANcY adapts to the limited memory capacity of modern switches with a hybrid approach: dedicated counters for high-priority traffic and a probabilistic data structure for best-effort traffic. This design ensures efficient monitoring under various conditions and future-proofs the system against constantly increasing traffic volumes. We demonstrate FANcY’s capability for sub-second gray failure detection and reaction through extensive simulations and a prototype running on Intel Tofino switches. Second, we present our work on hardware-accelerated network control planes. This research extends beyond detection, demonstrating that programmable data planes can run critical control plane functions traditionally implemented in software. Our working prototype efficiently runs a diverse set of such tasks in the data plane, including detecting hard, gray, and remote failures, notifying other devices, executing distributed path-vector computations that adhere to shortest-path and BGP-like policies, and rapidly updating forwarding states to restore connectivity after failures. Finally, our work identifies challenges in expressiveness and scalability for programmable data planes, emphasizing that the careful selection of tasks for offloading remains a critical area for future research.},
	keywords = {Computer networks; Failure Detection and Recovery; Programmable data planes; Hardware acceleration; Hardware offloading},
	language = {en},
	address = {Zurich},
	publisher = {ETH Zurich},
	DOI = {10.3929/ethz-b-000690095},
	title = {Improving Network Failure Detection and Recovery with Programmable Data Planes},
	school = {ETH Zurich}
}

Research Collection: 20.500.11850/690095