Post-Mortem 2016-11-20 Network Config Changes

Authors: Steffen Gebert, Michael Stucki

Issue Summary

On two occasions in the past days, service availability was impaired after we changed /etc/network/interfaces to add our VPN interfaces.

  • Saturday, Nov 12th: backup server
  • Friday, Nov 16th: physical host server ms06

Both times, running service networking restart resulted in a permanent loss of connectivity. As we were locked out of the running server, an externally triggered reboot was required in each case.

Side note: We do not manually re-configure our servers, but use Chef instead. However, IP address configuration is not part of the Chef setup.

Root Cause

While a syntax error was at least partially the reason in one case, we nevertheless experienced the same connectivity issue when running service networking restart with a correct configuration.
The syntactically correct config file only took effect after the reboot.

Resolution and Recovery

In the first occurrence, we contacted the organization hosting our backup server and asked for a reboot. In the second case, we had the means to execute a remote reset.

Corrective and Preventative Measures

All changes to the network configuration should be backed by an automatic revert procedure that kicks in unless the operator, who is still connected, disables it.

According to this issue, service networking restart should not be used. Instead, use

( ifdown iface; ifup iface ) &

However, we are not certain whether this is sufficient in all cases.
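
As an illustration, the command above could be wrapped so that it survives a dropped SSH session. This is a minimal sketch, not something we have verified on the affected servers: bounce_iface, the DRY_RUN switch and the interface name tun0 are assumptions introduced here.

```shell
#!/usr/bin/env bash
# Sketch only: bounce one interface in the background.
# bounce_iface and DRY_RUN are hypothetical names, not part of our setup.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

bounce_iface() {
  local iface="${1:-}"
  if [ -z "$iface" ]; then
    echo "usage: bounce_iface <interface>" >&2
    return 2
  fi
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: ( ifdown $iface; ifup $iface ) &"
    return 0
  fi
  # nohup keeps the child shell alive if the SSH session drops between
  # ifdown and ifup, so the interface is still brought back up.
  nohup sh -c 'ifdown "$1"; ifup "$1"' sh "$iface" >/dev/null 2>&1 &
}
```

Calling bounce_iface tun0 with DRY_RUN=1 only prints the command; with DRY_RUN=0 it runs ifdown and ifup detached from the terminal.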

The following procedure should be triggered automatically to prevent further failures, regardless of how networking is restarted:

  • After 1 minute: Revert the configuration file change and restart networking
  • After 5 minutes: Reboot the server
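
The two stages above can be sketched as a small watchdog script. This is an illustration under stated assumptions and not the contents of the gist referenced in this document: the cancel marker /tmp/network-revert.cancel, the interface name eth0, the DRY_RUN switch and all function names are hypothetical. Only the two file paths come from this post-mortem.

```shell
#!/usr/bin/env bash
# Sketch of a revert watchdog. NOT the gist referenced in this post-mortem;
# everything except the two file paths below is an assumption.

CONFIG="${CONFIG:-/etc/network/interfaces}"
BACKUP="${BACKUP:-/etc/network/interfaces.bak}"
IFACE="${IFACE:-eth0}"                                    # placeholder interface
CANCEL_FILE="${CANCEL_FILE:-/tmp/network-revert.cancel}"  # hypothetical marker
REVERT_AFTER="${REVERT_AFTER:-60}"    # seconds until the config is reverted
REBOOT_AFTER="${REBOOT_AFTER:-300}"   # seconds (from start) until the reboot
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print what would be done

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Wait up to $1 seconds. Returns 0 if the operator created CANCEL_FILE,
# 1 if the time ran out without a cancel.
wait_or_cancel() {
  local remaining="$1"
  while [ "$remaining" -gt 0 ]; do
    [ -e "$CANCEL_FILE" ] && return 0
    sleep 1
    remaining=$((remaining - 1))
  done
  [ -e "$CANCEL_FILE" ]
}

watchdog() {
  rm -f "$CANCEL_FILE"   # ignore a stale marker from an earlier run

  if wait_or_cancel "$REVERT_AFTER"; then
    echo "cancelled by operator"
    return 0
  fi
  # Stage 1: restore the known-good config and bounce the interface.
  run cp "$BACKUP" "$CONFIG"
  run ifdown "$IFACE"
  run ifup "$IFACE"

  if wait_or_cancel "$((REBOOT_AFTER - REVERT_AFTER))"; then
    echo "cancelled by operator"
    return 0
  fi
  # Stage 2: last resort.
  run reboot
}

# Only start when asked to, so the file can be sourced safely.
if [ "${1:-}" = "start" ]; then
  watchdog
fi
```

Assuming the script is saved as revert-watchdog.sh (a made-up name), it would be started before applying the change with DRY_RUN=0 nohup bash revert-watchdog.sh start &, and disarmed with touch /tmp/network-revert.cancel once the operator has confirmed that the connection survived.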

The following gist can be used, assuming that a backup has been created in /etc/network/interfaces.bak.
Usage:

curl https://gist.githubusercontent.com/StephenKing/83fedc56137f5640de929b4430f1b653/raw/24a7536bc074b575af55e667ccde0a4f3668fd21/reset.sh > reset.sh
bash reset.sh
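
The usage above assumes that /etc/network/interfaces.bak already exists. A minimal sketch of creating and verifying that backup before editing the file follows; backup_config is a hypothetical helper introduced here and is not part of the gist.

```shell
#!/usr/bin/env bash
# backup_config is a hypothetical helper, not part of the gist.
# It copies the given config (default: /etc/network/interfaces) to <config>.bak.
backup_config() {
  local config="${1:-/etc/network/interfaces}"
  local backup="${config}.bak"
  if [ ! -r "$config" ]; then
    echo "cannot read $config" >&2
    return 1
  fi
  # -p preserves mode, ownership and timestamps of the original file.
  cp -p "$config" "$backup" || return 1
  # Compare byte for byte before trusting the copy as a revert source.
  cmp -s "$config" "$backup"
}
```

The backup has to be taken before the first edit; running backup_config after a faulty change would overwrite the known-good copy.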