Post-Mortem 2017-01-18: Multiple Services affected by private network flapping

Authors: Steffen Gebert, Andri Steiner

Issue Summary

All services running within the new KVM infrastructure were temporarily affected by issues caused by flapping of internal network connectivity.

Timeline

- 07:13: First monitoring mails indicating issues
- 07:25: Georg Ringer reports in the #typo3-server-team channel about connection issues to Gerrit SSH
- 08:07: Issues confirmed by Michael Stucki
- 08:10: Our team is gathering together in the hotel (as we are on a sprint right now)
- 08:30: Good indications for Layer 2 bridging / ARP irregularities on both of the new KVM servers (ms08 and ms09)
- 08:47: Tweet informing about service outage
- 09:29: Service availability restored (tweet)

Root Cause

- Not exactly known
- we saw incoming packets on the br-int interface of the physical server that were not forwarded to the vnetX interface of the VM
- we saw incomplete ARP tables on the VMs (arp -a)
- we saw tons (hundreds per second) of ARP packets, mostly for resolving the gateway address (tcpdump -i br-int arp on the host servers), which were all answered correctly. Many were duplicates.
- dmesg emitted many of the following messages:

br-int: received packet on t3o with own address as source address
net_ratelimit: 216 callbacks suppressed

- there was no duplicated MAC address involved
- tinc log files (/var/log/tinc.t3o.log) did not include anything helpful
- after shutting down the private VPN (systemctl stop tinc) on the physical host, these ARP storms stopped
- starting tinc again restored connectivity to other hosts, without any further ARP storms
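
The ARP volume observed with tcpdump can be summarised per requested address, which makes it easier to see whether the requests really concentrate on the gateway. The following is a minimal sketch, assuming tcpdump's default text output with -n; the function name count_arp_requests is hypothetical and was not part of the original investigation.

```shell
#!/bin/sh
# Count ARP requests per target IP from `tcpdump -n` text output on stdin.
# Only "who-has" (request) lines are counted; replies are ignored.
count_arp_requests() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "who-has") { count[$(i + 1)]++; break }
    }
    END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Example (as root on the host; br-int is the bridge named in this report,
# the packet count of 1000 is an arbitrary choice):
#   tcpdump -l -n -c 1000 -i br-int arp 2>/dev/null | count_arp_requests
```

The output lists one line per requested address, most frequent first, so a storm for a single address stands out at the top.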

The definite cause is unknown.

- One possibility is that tinc caused packets to loop or to be dropped
- Another is that the real issue is the local bridging and tinc was just a side effect

Resolution and Recovery

systemctl stop tinc; systemctl start tinc

resolved the problem.

If the issue occurs again, please try the following first:
- remove tinc from the bridge to find out whether it is a bridging or a tinc issue:

brctl delif br-int t3o

After checking whether this helped, use the commands above to restart tinc and restore a clean state.
- capture traffic on all interfaces with tcpdump:

tcpdump -i any -w dump.cap
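
The sequence above could be wrapped in a small script so that the order of steps is fixed before the next incident. This is only a sketch: the helper names (run, detach_tinc, restart_tinc) and the DRY_RUN switch are hypothetical, and the interface and service names are taken from this report. Whether restarting tinc re-attaches t3o to the bridge is assumed from the report, not verified here.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default in this sketch) commands are only printed,
# so the sequence can be reviewed without touching the network.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: detach the tinc interface from the bridge.
detach_tinc() {
    run brctl delif br-int t3o
}

# Step 2: after checking whether the ARP storm stopped, restart tinc.
restart_tinc() {
    run systemctl stop tinc
    run systemctl start tinc
}

# Example:
#   detach_tinc      # then check connectivity and ARP traffic by hand
#   restart_tinc
```

Set DRY_RUN=0 and run as root to actually execute the commands.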

Corrective and Preventative Measures

- we plan to get rid of the tinc VPN in the long term, once we have migrated services to the new infrastructure and ordered rack space for all servers, including a dedicated internal network
- if this happens again, we could send an INT signal to the tinc process to temporarily switch to debug logging (or USR1 / USR2 signals to get connectivity status).
- get a better understanding of how to debug such connectivity / bridging issues and of how / when the kernel drops packets etc.
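
The signals mentioned above could be sent as sketched below. This assumes tinc 1.0, where tincd reacts to INT, USR1 and USR2 and accepts a signal name via -k, and it assumes the netname is t3o (inferred from the log file name in this report). The function names and action keywords are hypothetical.

```shell
#!/bin/sh
# Map a diagnostic action to the signal tincd 1.0 handles for it.
tinc_signal_for() {
    case "$1" in
        debug)       echo INT ;;   # toggle temporary debug logging
        connections) echo USR1 ;;  # dump the connection list to the log
        nodes)       echo USR2 ;;  # dump device statistics, nodes, edges, subnets
        *)           echo "unknown action: $1" >&2; return 1 ;;
    esac
}

# Send the signal to the running daemon; the netname defaults to t3o (assumption).
tinc_send() {
    sig=$(tinc_signal_for "$1") || return 1
    tincd -n "${2:-t3o}" "-k$sig"
}

# Example (as root on the physical host):
#   tinc_send debug         # send again to return to the previous log level
#   tinc_send connections
```

The output appears wherever tincd logs, which on these hosts is presumably /var/log/tinc.t3o.log.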