Network controllers are complex software systems that must simultaneously implement a wide range of policies and services on top of a highly asynchronous environment (the network). As any complex software systems, network controllers are likely to be plagued by bugs that must be discovered and fixed. While discovering these bugs is often hard, fixing them is even harder. As network controllers are critical entity, they cannot simply be "rebooted", they must run at all time to ensure smooth and continuous network operations. As such, network controllers need also to be upgraded on-the-fly.
Our group is involved in verifying that network controllers are correct through static and dynamic analysis. We are also studying how we can make network controller "hot-swappable", enabling controllers to be upgraded at runtime without creating any disruption.
Samuel Steffen, ACM SIGCOMM 2020
Rüdiger Birkner, USENIX NSDI 2020
ACM SIGCOMM 2021. Online (August 2021).
Large-scale reconfiguration campaigns tend to be nerve-racking for network operators as they can lead to significant network downtimes, decreased performance, and policy violations. Unfortunately, existing reconfiguration frameworks often fall short inpractice as they either only support a small set of reconfiguration scenarios or simply do not scale.
We address these problems with Snowcap, the first network reconfiguration framework which can synthesize configuration updates that comply with arbitrary hard and soft specifications, and involve arbitrary routing protocols. Our key contributionis an efficient search procedure which leverages counter-examples to efficiently navigate the space of configuration updates. Given a reconfiguration ordering which violates the desired specifications, our algorithm automatically identifies the problematic commands so that it can avoid this particular order in the next iteration.
We fully implemented Snowcap and extensively evaluated its scalability and effectiveness on real-world topologies and typical, large-scale reconfiguration scenarios. Even for large topologies, Snowcap finds a valid reconfiguration ordering with minimal side-effects (i.e., traffic shifts) within a few seconds at most.
USENIX NSDI 2021. Online (April 2021).
Network analysis and verification tools are often a godsend for network operators as they free them from the fear of introducing outages or security breaches. As with any complex software though, these tools can (and often do) have bugs. For the operators, these bugs are not necessarily problematic except if they affect the precision of the model as it applies to their specific network. In that case, the tool output might be wrong: it might fail to detect actual configuration errors and/or report non-existing ones.
In this paper, we present Metha, a framework that systematically tests network analysis and verification tools for bugs in their network models. Metha automatically generates syntactically- and semantically-valid configurations; compares the tool’s output to that of the actual router software; and detects any discrepancy as a bug in the tool’s model. The challenge in testing network analyzers this way is that a bug may occur very rarely and only when a specific set of configuration statements is present. We address this challenge by leveraging grammar-based fuzzing together with combinatorial testing to ensure thorough coverage of the search space and by identifying the minimal set of statements triggering the bug through delta debugging.
We implemented Metha and used it to test three well-known tools. In all of them, we found multiple (new) bugs in their models, most of which were confirmed by the developers.
Samuel Steffen, Timon Gehr, Petar Tsankov, Laurent Vanbever, Martin Vechev
ACM SIGCOMM 2020. New York, USA (August 2020).
Not all important network properties need to be enforced all the time. Often, what matters instead is the fraction of time / probability these properties hold. Computing the probability of a property in a network relying on complex inter-dependent routing protocols is challenging and requires determining all failure scenarios for which the property is violated. Doing so at scale and accurately goes beyond the capabilities of current network analyzers.
In this paper, we introduce NetDice, the first scalable and accurate probabilistic network configuration analyzer supporting BGP, OSPF, ECMP, and static routes. Our key contribution is an inference algorithm to efficiently explore the space of failure scenarios. More specifically, given a network configuration and a property phi, our algorithm automatically identifies a set of links whose failure is provably guaranteed not to change whether phi holds. By pruning these failure scenarios, NetDice manages to accurately approximate P(phi). NetDice supports practical properties and expressive failure models including correlated link failures.
We implement NetDice and evaluate it on realistic configurations. NetDice is practical: it can precisely verify probabilistic properties in few minutes, even in large networks.
USENIX NSDI 2020. Santa Clara, California, USA (February 2020).
Network verification and configuration synthesis are promising approaches to make networks more reliable and secure by enforcing a set of policies. However, these approaches require a formal and precise description of the intended network behavior, imposing a major barrier to their adoption: network operators are not only reluctant to write formal specifications, but often do not even know what these specifications are.
We present Config2Spec, a system that automatically synthesizes a formal specification (a set of policies) of a network given its configuration and a failure model (e.g., up to two link failures). A key technical challenge is to design a synthesis algorithm which can efficiently explore the large space of possible policies. To address this challenge, Config2Spec relies on a careful combination of two well-known methods: data plane analysis and control plane verification.
Experimental results show that Config2Spec scales to mining specifications of large networks (>150 routers).
Aaron Gember-Jacobson, Costin Raiciu, Laurent Vanbever
ACM HotNets 2017. Palo Alto, California, USA (November 2017).
Network verification has made great progress recently, yet existing solutions are limited in their ability to handle specific protocols or implementation quirks or to diagnose and repair the cause of policy violations. In this positioning paper, we examine whether we can achieve the best of both worlds: full coverage of control plane protocols and decision processes combined with the ability to diagnose and repair the cause of violations. To this end, we leverage the happens-before relationships that exist between control plane I/Os (e.g., route advertisements and forwarding updates). These relationships allow us to identify when it is safe to employ a data plane verifier and track the root-cause of problematic forwarding updates. We show how we can capture errors before they are installed, automatically trace down the source of the error and roll-back the updates whenever possible.
ACM SOSR 2017. Santa Clara, CA, USA (April 2017).
By operating in highly asynchronous environments, SDN controllers often suffer from bugs caused by concurrency violations. Unfortunately, state-of-the-art concurrency analyzers for SDNs often report thousands of true violations, limiting their effectiveness in practice.
This paper presents BigBug, an approach for automatically identifying the most representative concurrency violations: those that capture the cause of the violation. The two key insights behind BigBug are that: (i) many violations share the same root cause, and (ii) violations with the same cause share common characteristics. BigBug leverages these observations to cluster reported violations according to the similarity of events in them as well as SDN-specific features. BigBug then reports the most representative violation for each cluster using a ranking function.
We implemented BigBug and showed its practical effectiveness. In more than 100 experiments involving different controllers and applications, BigBug systematically produced 6 clusters or less, corresponding to a median decrease of 95% over state-of-the-art analyzers. The number of violations reported by BigBug also closely matched that of actual bugs, indicating that BigBug is effective at identifying root causes of SDN races.
ACM PLDI 2016. Santa Barbara, CA, USA (June 2016).
Karla Saur, Joseph Collard, Nate Foster, Arjun Guha, Laurent Vanbever, Michael Hicks
ACM SOSR 2016. Santa Clara, CA, USA (March 2016).
ACM SOSR 2015. Santa Clara, CA, USA (June 2015).