I am a postdoctoral researcher at the University of Strasbourg in the ICube laboratory. I work on BGP anomaly detection and mitigation with Prof. Cristel Pelsser. I am also interested in Internet measurements, software-defined networks and programmable data planes.
I did my PhD in the Networked Systems Group at ETH Zurich under the guidance of Prof. Laurent Vanbever. During my PhD, I focused on improving the routing convergence on the Internet upon outages.
I received both my Bachelor's and Master's degrees in Computer Science from the University of Strasbourg, France. Before joining ETH Zurich, I worked for six months at Internet Initiative Japan, where I was supervised by Cristel Pelsser and Randy Bush. In 2016, I worked for six months at CAIDA, where I was supervised by Alberto Dainotti.
Doctoral dissertation. ETH Zurich. August 2021.
Nowadays, so many services – including critical ones – rely on the Internet to work that even a few minutes of connectivity disruption make customers unhappy and cause sizeable financial losses for companies. Ensuring that customers are always connected to the Internet is thus a top priority for Internet service providers. However, this is harder than one may think because the Internet is often subject to network outages. Network outages are a headache for network operators because they are unpredictable, can occur in any of the 70,000 independently operated networks composing the Internet, and can affect users’ connectivity network-wide. Far too often, the only way to restore connectivity upon an outage is to wait until (i) BGP, the glue of the Internet, converges; and (ii) the routers update their forwarding decisions accordingly. Unfortunately, these two processes work on a per-destination basis and are thus inherently slow given the ever-increasing number of destinations in the Internet. It is therefore not a surprise that network operators still experience minutes of downtime upon outages. In this dissertation, we tackle the problem of fast connectivity recovery upon outages occurring in remote networks, without requiring network operators to change the standards, manufacture new devices, or cooperate with each other. The final result of our work is Snap, a framework that network operators can deploy on their routers and that allows them to quickly detect outages and reroute traffic to working alternative paths that comply with the configured routing policies. Snap’s design follows a two-step recipe. First, it uses an outage inference algorithm that is based on new fundamental results and that, instead of waiting for the slow control-plane (BGP) notifications, analyzes the fast data-plane signals. Second, it uses a rerouting scheme that allows routers to quickly reroute all the affected traffic to alternative paths circumventing the outage.
Snap’s design takes advantage of the recent advances in network programmability and relies on a hardware-software codesign. To be fast, Snap collects data-plane signals at line rate using programmable switches (e.g., Tofino). The switches then mirror the signals to a controller, which accurately infers remote outages and triggers traffic rerouting. We implemented Snap in P4_16 and Python and show its effectiveness in many real-world situations. Our results indicate that Snap can restore connectivity within only a few seconds, which is much faster than the few minutes often needed by traditional routers.
Thomas Holterbach, Tobias Bühler, Tino Rellstab, Laurent Vanbever
ACM SIGCOMM CCR 2020. Volume 50 Issue 2 (April 2020).
Each year at ETH Zurich, around 100 students collectively build and operate their very own Internet infrastructure composed of hundreds of routers and dozens of Autonomous Systems (ASes). Their goal? Enabling Internet-wide connectivity.
We find this class-wide project to be invaluable in teaching our students how the Internet infrastructure works in practice. Among other things, our students gain a much deeper understanding of Internet operations and their pitfalls. Besides, students tend to love the project: clearly, the fact that all of them need to cooperate for the entire Internet to work is empowering.
In this paper, we describe the overall design of our teaching platform, how we use it, and interesting lessons we have learnt over the years. We also make our platform openly available.
Roland Meier, Thomas Holterbach, Stephan Keck, Matthias Stähli, Vincent Lenders, Ankit Singla, Laurent Vanbever
ACM HotNets 2019. Princeton, NJ, USA (November 2019).
Traditional network control planes can be slow and require manual tinkering from operators to change their behavior. There is thus great interest in a faster, data-driven approach that uses signals from real-time traffic instead. However, the promise of fast and automatic reaction to data comes with new risks: malicious inputs designed towards negative outcomes for the network, service providers, users, and operators.
Adversarial inputs are a well-recognized problem in other areas; we show that networking applications are susceptible to them too. We characterize the attack surface of data-driven networks and examine how attackers with different privileges—from infected hosts to operator-level access—may target network infrastructure, applications, and protocols. To illustrate the problem, we present case studies with concrete attacks on recently proposed data-driven systems.
Our analysis urgently calls for a careful study of attacks and defenses in data-driven networking, with a view towards ensuring that their promise is not marred by oversights in robust design.
Thomas Holterbach, Edgar Costa Molero, Maria Apostolaki, Alberto Dainotti, Stefano Vissicchio, Laurent Vanbever
USENIX NSDI 2019. Boston, Massachusetts, USA (February 2019).
We present Blink, a data-driven system that leverages TCP-induced signals to detect failures directly in the data plane. The key intuition behind Blink is that a TCP flow exhibits a predictable behavior upon disruption: retransmitting the same packet over and over, at epochs exponentially spaced in time. When compounded over multiple flows, this behavior creates a strong and characteristic failure signal. Blink efficiently analyzes TCP flows to: (i) select which ones to track; (ii) reliably and quickly detect major traffic disruptions; and (iii) recover connectivity, all of this entirely in the data plane. We present an implementation of Blink in P4 together with an extensive evaluation on real and synthetic traffic traces. Our results indicate that Blink: (i) achieves sub-second rerouting for large fractions of Internet traffic; and (ii) prevents unnecessary traffic shifts even in the presence of noise. We further show the feasibility of Blink by running it on an actual Tofino switch.
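The intuition in the Blink abstract can be illustrated with a minimal Python sketch. It is not the P4 implementation from the paper: the function names, the sliding-window length, the flow threshold and the simulated RTO values are all assumptions chosen for illustration, and flow selection and rerouting are left out. The sketch generates exponentially spaced retransmissions for a set of flows and reports a failure once enough distinct flows have retransmitted within the window.

```python
from collections import deque


def retransmission_times(failure_time, rto, count):
    """Times at which one TCP flow retransmits after a failure.

    Assumes the retransmission timeout doubles after every attempt,
    so the gaps are rto, 2*rto, 4*rto, ...
    """
    times = []
    t = failure_time
    for attempt in range(count):
        t += rto * (2 ** attempt)
        times.append(t)
    return times


def detect_failure(events, window, threshold):
    """Return the first timestamp at which at least `threshold` distinct
    flows have retransmitted within the last `window` seconds, else None.

    `events` is an iterable of (timestamp, flow_id) sorted by timestamp.
    """
    recent = deque()   # retransmissions still inside the window
    last_seen = {}     # flow_id -> timestamp of its latest retransmission
    for ts, flow in events:
        recent.append((ts, flow))
        last_seen[flow] = ts
        # Evict retransmissions that fell out of the sliding window.
        while recent and recent[0][0] < ts - window:
            old_ts, old_flow = recent.popleft()
            if last_seen.get(old_flow) == old_ts:
                del last_seen[old_flow]
        if len(last_seen) >= threshold:
            return ts
    return None


# Hypothetical scenario: 10 flows with slightly different RTOs
# all lose connectivity at t = 5.0 seconds.
flows = {f"flow{i}": 0.2 + 0.01 * i for i in range(10)}
events = sorted(
    (ts, flow)
    for flow, rto in flows.items()
    for ts in retransmission_times(5.0, rto, 3)
)
print(detect_failure(events, window=0.8, threshold=5))
```

Counting distinct flows rather than raw retransmissions is what makes the compounded signal robust in this sketch: a single lossy flow retransmitting repeatedly never reaches the threshold on its own.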
Thomas Holterbach, Stefano Vissicchio, Alberto Dainotti, Laurent Vanbever
ACM SIGCOMM 2017. Los Angeles, California, USA (August 2017).
Network operators often face the problem of remote outages in transit networks leading to significant (sometimes on the order of minutes) downtimes. The issue is that BGP, the Internet routing protocol, often converges slowly upon such outages, as large bursts of messages have to be processed and propagated router by router. In this paper, we present SWIFT, a fast-reroute framework which enables routers to restore connectivity in a few seconds upon remote outages. SWIFT is based on two novel techniques. First, SWIFT deals with slow outage notification by predicting the overall extent of a remote failure from a few control-plane (BGP) messages. The key insight is that significant inference speed can be gained at the price of some accuracy. Second, SWIFT introduces a new data-plane encoding scheme, which enables quick and flexible updates of the affected forwarding entries. SWIFT is deployable on existing devices, without modifying BGP.
We present a complete implementation of SWIFT and demonstrate that it is both fast and accurate. In our experiments with real BGP traces, SWIFT predicts the extent of a remote outage in a few seconds with an accuracy of ~90% and can restore connectivity for 99% of the affected destinations.
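The first technique, predicting the extent of an outage from only a few withdrawals, can be sketched in Python as follows. This is a simplified illustration and not the inference algorithm from the paper: the scoring heuristic (share of withdrawals crossing a link multiplied by the share of that link's prefixes already withdrawn), the function names and the toy routing table are all assumptions. The sketch picks the AS link that best explains the withdrawals received so far and predicts that every prefix routed over that link is affected.

```python
def links(path):
    """AS-level links traversed by an AS path, e.g. (1, 2, 3) -> [(1, 2), (2, 3)]."""
    return list(zip(path, path[1:]))


def infer_failed_link(routes, withdrawn):
    """Guess the failed link from the withdrawals seen so far.

    routes:    dict mapping prefix -> AS path (tuple) used before the burst
    withdrawn: set of prefixes withdrawn so far
    Returns (link, predicted_prefixes); (None, set()) without withdrawals.
    """
    if not withdrawn:
        return None, set()

    # Index which prefixes are routed over each link.
    on_link = {}
    for prefix, path in routes.items():
        for link in links(path):
            on_link.setdefault(link, set()).add(prefix)

    best, best_score = None, 0.0
    for link, prefixes in on_link.items():
        hit = len(prefixes & withdrawn)
        withdrawal_share = hit / len(withdrawn)  # withdrawals explained by the link
        path_share = hit / len(prefixes)         # share of the link already withdrawn
        score = withdrawal_share * path_share    # illustrative heuristic (assumption)
        if score > best_score:
            best, best_score = link, score
    return best, on_link.get(best, set())


# Hypothetical routing table: five prefixes, three of them routed over link (2, 3).
routes = {
    "p1": (1, 2, 3, 4),
    "p2": (1, 2, 3, 5),
    "p3": (1, 2, 3, 6),
    "p4": (1, 2, 7),
    "p5": (1, 8, 9),
}
# Only two withdrawals have arrived so far.
link, affected = infer_failed_link(routes, {"p1", "p2"})
print(link, sorted(affected))
```

In this toy example the prefix p3 is predicted as affected before its withdrawal arrives, which is the speed-for-accuracy trade-off the abstract describes: the prediction can be wrong if the failure is actually further downstream.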
Thomas Holterbach, Cristel Pelsser, Randy Bush, Laurent Vanbever
ACM IMC 2015. Tokyo, Japan (October 2015).
Assistant
Spring 2016
Jakob Wöhler
Supervisors: Dr. Thomas Holterbach, Prof. Laurent Vanbever
Sandro Lutz
Supervisors: Dr. Thomas Holterbach, Tobias Bühler, Prof. Laurent Vanbever
Martin Vahlensieck
Supervisors: Dr. Thomas Holterbach, Prof. Laurent Vanbever
Alex Studer
Supervisors: Dr. Thomas Holterbach, Tobias Bühler, Prof. Laurent Vanbever
Denis Mikhaylov
Supervisors: Dr. Thomas Holterbach, Tobias Bühler, Prof. Laurent Vanbever
Manuel Pulfer
Supervisors: Dr. Thomas Holterbach, Tobias Bühler, Prof. Laurent Vanbever
Eric Marty
Supervisors: Dr. Thomas Holterbach, Tobias Bühler, Prof. Laurent Vanbever
Tino Rellstab
Supervisors: Dr. Thomas Holterbach, Prof. Laurent Vanbever
Stephan Keck
Supervisors: Dr. Thomas Holterbach, Prof. Laurent Vanbever
Tino Rellstab
Supervisors: Tobias Bühler, Dr. Thomas Holterbach, Prof. Laurent Vanbever
Stephan Keck
Supervisors: Dr. Thomas Holterbach, Prof. Laurent Vanbever
Fabian Schleiss
Supervisors: Dr. Thomas Holterbach, Edgar Costa Molero, Prof. Laurent Vanbever
Simon Miescher
Supervisors: Dr. Thomas Holterbach, Prof. Laurent Vanbever
Philipp Mao
Supervisors: Dr. Rüdiger Birkner, Dr. Thomas Holterbach, Prof. Laurent Vanbever
Roman May
Supervisors: Dr. Thomas Holterbach, Prof. Laurent Vanbever