Training distributed Machine Learning models in adversarial networks

Due to the size of modern Machine Learning (ML) models and the amount of data needed to train them, training is often carried out in a distributed fashion across multiple workers. Nodes communicate to collect the outputs of individual layers or the gradients of model replicas, aggregate them, and send the results back so that training can proceed. A question naturally arises: what role does the network play in this?

It has been shown that distributed ML training can be bottlenecked by the network it runs on [1]. In this thesis, you will set up a system to train ML models across multiple workers and to manipulate the properties of the network they communicate over. In particular, we wish to explore the effects of insecure or unreliable networks on modern ML models [2]. How do congestion and packet loss affect convergence time and accuracy? From a security perspective, can we design adversarial attacks that actively degrade performance? And how can we protect ourselves against such issues?
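
As a concrete starting point for manipulating network properties, the following is a minimal, hypothetical sketch (assuming Linux workers, root privileges, and the iproute2 tc tool; the interface name eth0 and the delay/loss values are placeholders) that shells out to tc/netem to inject latency and random packet loss on a worker's network interface.

    import subprocess

    def degrade_link(interface="eth0", delay_ms=50, loss_pct=1.0):
        # Replace the root qdisc on the interface with a netem qdisc that
        # adds a fixed delay and random packet loss (emulated impairment).
        subprocess.run(
            ["tc", "qdisc", "replace", "dev", interface, "root", "netem",
             "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"],
            check=True,
        )

    def restore_link(interface="eth0"):
        # Remove the netem qdisc, restoring default network behaviour.
        subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)

The same mechanism extends to jitter, reordering, and rate limiting, which netem also supports.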

Milestones

  1. Set up the pipeline to train popular models (e.g. ResNet) over custom datasets in a distributed fashion, splitting the data and aggregating results across multiple workers (a minimal sketch follows this list).
  2. Add hooks to manipulate network parameters and traffic (see the gradient-hook sketch after this list).
  3. Analyse the impact of network conditions on model convergence.
  4. (Optional) Go wild! Given time, we can try to explore different parallelization schemes (e.g. Parameter Server vs. All-Reduce), proposed techniques (e.g. loss-tolerant transport protocols [3]), aggregation types, mitigation approaches, and so on.
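
For milestone 1, the following is a minimal sketch (not the final pipeline) of a data-parallel training loop in PyTorch. It assumes the script is launched with torchrun so that the rank and world size are provided via environment variables; FakeData, ResNet-18, the Gloo backend, and the hyperparameters are placeholders for the actual models and custom datasets of the project.

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler
    from torchvision import datasets, models, transforms

    def main():
        # One process per worker; torchrun provides rank/world size via env vars.
        dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes
        rank = dist.get_rank()

        # Each worker trains on a disjoint shard of the dataset.
        dataset = datasets.FakeData(transform=transforms.ToTensor())  # placeholder dataset
        sampler = DistributedSampler(dataset)
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        # DDP all-reduces gradients across workers during backward().
        model = DDP(models.resnet18(num_classes=10))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        criterion = torch.nn.CrossEntropyLoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle shards each epoch
            for images, labels in loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()       # gradient all-reduce happens here
                optimizer.step()
            if rank == 0:
                print(f"epoch {epoch}: last batch loss {loss.item():.4f}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

A two-worker test on a single machine could then be launched with, for example, torchrun --nproc_per_node=2 train_ddp.py (the file name is hypothetical).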
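
For milestone 2, besides OS-level tools such as tc/netem, one possible application-level hook (an assumption about how the hooks could be realised, not a fixed design) is a DDP communication hook that tampers with gradient traffic before the all-reduce. The sketch below, assuming a recent PyTorch and a backend that supports get_future() (Gloo or NCCL), randomly zeroes a worker's gradient bucket to emulate lost updates; the drop probability is an arbitrary placeholder.

    import random
    import torch.distributed as dist

    def lossy_allreduce_hook(state, bucket):
        # With some probability, this worker's gradient bucket is "lost"
        # and contributes nothing to the aggregated result.
        tensor = bucket.buffer()
        if random.random() < 0.05:  # placeholder drop probability
            tensor.zero_()
        tensor.div_(dist.get_world_size())
        # Asynchronous all-reduce; DDP expects a Future of the reduced tensor.
        fut = dist.all_reduce(tensor, async_op=True).get_future()
        return fut.then(lambda f: f.value()[0])

    # Registered on the DDP-wrapped model from the training pipeline, e.g.:
    #   model.register_comm_hook(state=None, hook=lossy_allreduce_hook)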

Requirements

  • Familiarity with a major DNN framework (PyTorch, TensorFlow).
  • Knowledge of network transport protocols and congestion control.

References

[1] Zhen Zhang, Chaokun Chang, Haibin Lin, Yida Wang, Raman Arora, and Xin Jin. 2020. Is Network the Bottleneck of Distributed Training? In Proceedings of the Workshop on Network Meets AI & ML (NetAI '20). Association for Computing Machinery, New York, NY, USA, 8–13.

[2] Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, and Ji Liu. 2019. Distributed Learning over Unreliable Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019).

[3] Zixuan Chen, Lei Shi, Xuandong Liu, Xin Ai, Sen Liu, and Yang Xu. 2023. Boosting Distributed Machine Learning Training Through Loss-Tolerant Transmission Protocol. In 2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS), 1–10.

Supervisors