Post-Mortem 2015-07-26-TravisCI-Downtime

Author: Steffen Gebert
Original URL: https://notes.typo3.org/p/post-mortem-travisci-2015-07

tl;dr: we enabled replication to GitHub and overloaded TravisCI

TravisCI status report: https://www.traviscistatus.com/incidents/4gbn8hv2sp4m

Setup

We use Gerrit as our code review system: https://review.typo3.org/
We replicate every (proposed) change to https://github.com/typo3-ci/TYPO3.CMS-pre-merge-tests in order to trigger a TravisCI build (see the changes/x/y/z branches).
This provides feedback on whether the tests succeed even before the commit is merged into the official code base (basically giving us the comfort that others have with GitHub pull requests).

Changes on 26.07.2015

  • As we replicated only refs/changes/* from Gerrit to refs/heads/changes/* on GH, the status there did not reflect the actual difference between the current state of the target branch and the proposed change (all release branches and master were, simply put, outdated since ~Jun 20th)
  • Gerrit was then told to also replicate refs/heads/* and refs/tags/* to GH (around 15:30 GMT+2)
  • This caused Gerrit to push numerous commits to GH (around 650 for the master branch + xyz for other branches)
  • TODO: Insert the missing piece here
  • Speculation by AndyG: Gerrit replicates commit by commit, instead of updating the remote branch to the tip of the local branch immediately. To be confirmed
  • Speculation by AndreasW: Each version of every patch added to Gerrit since Jun 20th was pushed (that should be more than 650)
    there are ~580 commits on master from Jun 20th to Jul 27th
  • I (Steffen) think this is not the case: refs/changes/* was replicated all the time. Only refs/{tags,heads}/* was missing, and due to the broken Chef setup we were not able to deploy that update
  • Log output: https://gist.github.com/StephenKing/247ad6e3704d41b97a84
  • This somehow filled up the build queue on TravisCI, which caused starvation for other projects' builds (and apparently also caused some database issues?)
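
The ~580 figure quoted above can be re-checked with git rev-list. The following is a minimal sketch, assuming a local clone of the repository; the function names and the clone path are hypothetical and not part of our tooling. Note that --since/--until filter on commit date, so the result may differ slightly from other ways of counting.

```python
import subprocess


def rev_list_count_cmd(branch, since, until):
    """Build the git command that counts commits on `branch` in a date range."""
    return [
        "git", "rev-list", "--count",
        f"--since={since}", f"--until={until}",
        branch,
    ]


def count_commits(repo_dir, branch, since, until):
    """Run the command inside the clone at `repo_dir` and return the count."""
    result = subprocess.run(
        rev_list_count_cmd(branch, since, until),
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return int(result.stdout.strip())
```

Usage would look like count_commits("/path/to/clone", "master", "2015-06-20", "2015-07-27"), where the path is a placeholder.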

Remedy

  • Travis refused builds from the typo3-ci/TYPO3.CMS-pre-merge-tests repo
  • Travis staff contacted us and joined the #typo3-cms channel in Slack

Open Questions

  • Dan Buch from TravisCI wrote about build requests that are "in the thousands" for that repo. We replicated maybe ~1000 commits to master + release branches.
  • However, some people also received status mails for commits that are ~4 years old. It is not yet clear why those were built
  • Does Gerrit really replicate to remote branches commit by commit?

Conclusions / Next Steps

  • We should be careful when replicating a large number of commits to a Travis-enabled repo
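
One way to make "careful" concrete is a pre-flight check before widening the replication: count the commits each ref is behind on the remote and stop if any backlog is large. The sketch below is only an illustration; the function name, the limit of 50, and the ref names in the usage example are assumptions. It treats every pending commit as a potential build, which is the pessimistic reading of the open question above.

```python
def backlog_too_large(pending_by_ref, limit=50):
    """Return the refs whose pending commit count exceeds `limit`.

    `pending_by_ref` maps a ref name to the number of commits that would be
    pushed for it (for example, the output of
    `git rev-list --count <remote-ref>..<local-ref>`).
    The result is sorted by ref name; an empty list means "safe to proceed".
    """
    return sorted(ref for ref, count in pending_by_ref.items() if count > limit)


if __name__ == "__main__":
    # Hypothetical numbers, loosely based on the ~650 commits seen on master.
    pending = {"refs/heads/master": 650, "refs/heads/some-release": 10}
    for ref in backlog_too_large(pending):
        print(f"backlog too large, sync manually first: {ref}")
```

If the check flags a ref, the backlog could be pushed by hand while Travis builds are disabled for the repo, and replication enabled afterwards.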