Taming the transient while reconfiguring BGP
Research area: Verification and Synthesis
Published by Tibor Schneider on October 24, 2023
Originally written for the MANRS Blog
Network operators reconfigure their networks daily to adapt to new business relationships, optimize network performance, or perform regular maintenance. Applying such reconfigurations is nerve-racking as they often affect a significant portion of the network traffic.
Fortunately, the research community has proposed several systems (Batfish, Minesweeper) to verify new configurations before deploying them in production networks. So, once we prove the final configuration won’t break connectivity, does this guarantee that we won’t break reachability when reconfiguring the network?
Unfortunately, verifying the final configuration is insufficient to reconfigure the network safely. Distributed routing protocols that govern most networks on the Internet today must first converge to reach the final, steady state. Even if the initial and final states are anomaly-free, the network might still experience transient black holes and forwarding loops.
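To make the two anomalies concrete, here is a minimal Python sketch that follows next hops through a forwarding state and reports what happens to a packet. The router names (a, b, c), the `classify` function, and the dictionary representation of a forwarding state are illustrative assumptions, not part of any tool mentioned in this post.

```python
from typing import Dict, Optional, Set

def classify(next_hop: Dict[str, Optional[str]], src: str, egresses: Set[str]) -> str:
    """Follow next hops from src and report the fate of the packet."""
    seen = set()
    node = src
    while node not in egresses:
        if node in seen:
            return "forwarding loop"   # we came back to a router we already visited
        seen.add(node)
        nxt = next_hop.get(node)
        if nxt is None:
            return "black hole"        # the router has no route and drops the packet
        node = nxt
    return "delivered"

egresses = {"egress"}

# Steady state: a -> b -> c -> egress.
steady = {"a": "b", "b": "c", "c": "egress"}
# Transient state 1: c has lost its route and has not learned a new one yet.
dropping = {"a": "b", "b": "c", "c": None}
# Transient state 2: a and b disagree on the direction and point at each other.
looping = {"a": "b", "b": "a", "c": "egress"}

for name, state in [("steady", steady), ("dropping", dropping), ("looping", looping)]:
    print(name, "->", classify(state, "a", egresses))
```

A verifier that only inspects the first and last forwarding state would see `steady`-like states in both cases and never observe the two transient ones.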
Let us consider the Border Gateway Protocol (BGP) to demonstrate this reconfiguration problem. Assume you are an operator of a network and must tear down a BGP session with a peering network to perform maintenance. After removing the session, the network forwards traffic to a different egress network. However, as soon as you remove the session, the border router adjacent to that session drops packets as it does not yet know about the alternative route.
The visualization above simulates a small network with five internal routers and two external peering networks. The network initially forwards traffic to the left (e1) due to a higher preference. When you remove the BGP session between e1 and r1 (by deleting r1 from the list of BGP neighbors), r1 will propagate the withdrawal inside the network. At the same time, r1 drops packets because it doesn’t yet know about the alternative route from e2. By clicking on BGP messages, you can make the network converge to see how long it takes until it restores reachability.
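The following Python sketch replays this scenario message by message. It is a heavily simplified model, not the simulator behind the visualization, and it rests on several assumptions: an iBGP full mesh between r1 to r5, e1 attached to r1 and e2 attached to r5, a local preference of 200 for the route from e1 and 100 for the route from e2, and messages processed in first-in-first-out order (in a real network, the order and timing are arbitrary).

```python
from collections import deque

ROUTERS = ["r1", "r2", "r3", "r4", "r5"]

def best(rib):
    """Return the egress with the highest local preference, or None if there is no route."""
    return max(rib, key=rib.get) if rib else None

def without_route(rib):
    """Routers that currently have no BGP route at all."""
    return sorted(r for r in ROUTERS if best(rib[r]) is None)

def simulate_session_removal():
    # rib[router] maps an egress to the local preference of the route via that egress.
    rib = {r: {"e1": 200} for r in ROUTERS}
    # Assumption: r5 hears e2's route over eBGP but prefers e1, so it does not advertise it.
    rib["r5"]["e2"] = 100

    # The operator removes the session e1-r1: r1 loses its route and sends withdrawals.
    del rib["r1"]["e1"]
    queue = deque(("withdraw", "e1", None, peer) for peer in ROUTERS[1:])

    timeline = [without_route(rib)]
    while queue:
        kind, egress, pref, receiver = queue.popleft()
        old = best(rib[receiver])
        if kind == "withdraw":
            rib[receiver].pop(egress, None)
        else:
            rib[receiver][egress] = pref
        new = best(rib[receiver])
        # Once r5 selects its own eBGP route, it advertises that route to all iBGP peers.
        if receiver == "r5" and old != new and new == "e2":
            queue.extend(("update", "e2", 100, peer) for peer in ROUTERS[:-1])
        timeline.append(without_route(rib))
    return timeline

timeline = simulate_session_removal()
for step, routers in enumerate(timeline):
    print(f"after {step} messages, routers without a route: {routers}")
```

Under these assumptions, r1 stays without a route until the advertisement from r5 reaches it, which is exactly the window in which it drops packets.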
Of course, directly removing the session is not the best idea. A better reconfiguration strategy is to move the traffic to the right before removing the session. You can try to do so in the visualization by clicking on r1 and reconfiguring its incoming BGP route map to modify the local preference from 200 to 10. However, transient routing anomalies can still appear: you can trigger forwarding loops by processing packets towards r4, and black holes by delaying the message from
Even though transient anomalies are short-lived, they can significantly impact the network. For example, a router with a transient black hole will send a withdrawal message to its external neighbors, only to re-advertise the route a couple of milliseconds later. Consequently, transient routing anomalies can spread to neighboring networks. Further, a chaotic BGP convergence process can cause traffic to shift, potentially violating security-related properties.
Chameleon Provides a Solution
Chameleon is a system that performs any BGP reconfiguration by precisely controlling the convergence process. Based on the current configuration, the planned reconfiguration scenario, and a specification, it reconfigures the network while preserving the specification in every transient state throughout the reconfiguration. Chameleon can:
- perform large-scale reconfigurations within minutes without packet loss.
- give guarantees during the convergence by precisely controlling how BGP converges.
- perform any BGP reconfiguration safely.
It does this by generating one specific convergence trace for which the specification is satisfied in every transient state. Then, it enforces that schedule by applying temporary BGP configurations that influence the BGP decision process to stabilize intermediate convergence states.
Chameleon also generates synchronization barriers, that is, conditions that ensure the network has converged sufficiently to apply the next commands safely. All BGP commands change route maps to increase or decrease route preferences locally using only basic BGP functionality.
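The idea of a synchronization barrier can be sketched in a few lines of Python: each command carries a precondition, and the executor lets the network converge until that precondition holds before applying the command. This is only an illustration under stated assumptions. The `Command` class, `run_plan`, the toy state dictionary, and the one-message-at-a-time convergence model are hypothetical and do not reflect Chameleon's actual data structures or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

ROUTERS = ["r1", "r2", "r3", "r4", "r5"]

@dataclass
class Command:
    name: str
    precondition: Callable[[dict], bool]   # the synchronization barrier
    apply: Callable[[dict], None]          # the configuration change itself

def run_plan(plan: List[Command], state: dict, converge_step, max_waits: int = 100) -> List[str]:
    """Apply commands in order, waiting for each precondition to become true."""
    log = []
    for cmd in plan:
        waits = 0
        while not cmd.precondition(state):
            if waits == max_waits:
                raise RuntimeError(f"barrier for '{cmd.name}' was never satisfied")
            converge_step(state)           # let the network process one more message
            waits += 1
            log.append("wait")
        cmd.apply(state)
        log.append(cmd.name)
    return log

def prefer_e2(router: str) -> Command:
    """Hypothetical command: raise the preference of e2's route on one router."""
    def apply(state: dict) -> None:
        state["prefers"][router] = "e2"
        if router == "r5":
            # r5 now advertises e2's route; the messages are still in flight.
            state["in_flight"] = ["r4", "r3", "r2", "r1"]
    return Command(f"{router}: prefer e2", lambda s: router in s["knows_e2"], apply)

def deliver_one(state: dict) -> None:
    """Toy convergence model: deliver a single in-flight advertisement."""
    if state["in_flight"]:
        state["knows_e2"].add(state["in_flight"].pop(0))

state = {"knows_e2": {"r5"}, "prefers": {r: "e1" for r in ROUTERS}, "in_flight": []}
plan = [prefer_e2(r) for r in ["r5", "r4", "r3", "r2", "r1"]]
log = run_plan(plan, state, deliver_one)
print(log)
```

In this toy run, every command except the first has to wait for one message before its barrier is satisfied, so a router is never told to prefer a route it has not yet received.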
Let’s see how Chameleon reconfigures the example above, which involves moving traffic from e1 to e2. Initially, only r5 knows about the route from e2, so Chameleon starts by forcing r5 to prefer the route from e2. As a result, r5 will distribute the route from e2 to all other routers (while they still prefer the route from e1). Now, Chameleon can update all routers to prefer the route from e2 in an order that preserves reachability (i.e., from right to left). Try it out in the visualization above. You can show Chameleon’s reconfiguration plan by clicking on “Plan”. It shows all reconfiguration commands along with the conditions used for synchronization. You can perform the reconfiguration as Chameleon would by clicking on commands for which the precondition is satisfied (at some point, you must let the network converge enough to meet the preconditions).
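Why the update order matters can be checked with a small Python sketch that tests reachability after every single router update. The internal topology is an assumption made for this example (a chain r1, r2, r3, r4, r5 with e1 attached to r1 and e2 attached to r5), as are the `reachable` and `first_violation` helpers. The real network in the visualization, and Chameleon's own analysis, are more general.

```python
ROUTERS = ["r1", "r2", "r3", "r4", "r5"]

# Next hops while everyone prefers e1 (traffic flows to the left) ...
OLD = {"r1": "e1", "r2": "r1", "r3": "r2", "r4": "r3", "r5": "r4"}
# ... and once everyone prefers e2 (traffic flows to the right).
NEW = {"r1": "r2", "r2": "r3", "r3": "r4", "r4": "r5", "r5": "e2"}

def reachable(next_hop, src):
    """True if a packet from src leaves the network through e1 or e2."""
    seen = set()
    node = src
    while node in next_hop:          # stop as soon as we leave the internal routers
        if node in seen:
            return False             # forwarding loop
        seen.add(node)
        node = next_hop[node]
    return node in ("e1", "e2")

def first_violation(order):
    """Update routers one by one; return the first update that breaks reachability."""
    state = dict(OLD)
    for router in order:
        state[router] = NEW[router]
        broken = [r for r in ROUTERS if not reachable(state, r)]
        if broken:
            return router, broken
    return None

print("right to left:", first_violation(["r5", "r4", "r3", "r2", "r1"]))
print("left to right:", first_violation(["r1", "r2", "r3", "r4", "r5"]))
```

On this assumed chain, the right-to-left order keeps every intermediate state loop-free, while updating r1 first makes r1 and r2 point at each other and cuts off every router behind them.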
You can find more information on Chameleon in our paper, which explains how the approach generalizes to BGP features like route reflection and presents various case studies and a large-scale evaluation. The source code is available on GitHub, and you can find the interactive network simulator at bgpsim.github.io.