Bug #85910

CloudNS ignores our changes

Added by Michael Stucki about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Must have
Target version: -
Start date: 2018-08-20
Due date:
% Done: 100%

Description

Overview

When the nameservers at CloudNS hold a serial which is not a timestamp (for example one starting with 2, such as 2864778338), they no longer accept our updates. After a grace period of 48h, this leads to the whole zone becoming unavailable.
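A likely explanation, assuming CloudNS compares serials with the serial number arithmetic of RFC 1982 (we have no confirmation of how their software does it): a new serial only counts as newer if it lies less than 2^31 ahead of the current one, modulo 2^32. The sketch below applies that rule to the two serials from this ticket. The function name is made up for illustration.

```python
MOD = 1 << 32    # SOA serials are 32-bit unsigned values
HALF = 1 << 31   # a serial is "newer" only if it is less than 2**31 ahead


def serial_is_newer(candidate: int, current: int) -> bool:
    """Return True if `candidate` counts as newer than `current` (RFC 1982)."""
    diff = (candidate - current) % MOD
    return 0 < diff < HALF


stale = 2864778338   # serial the CloudNS servers were still holding
fresh = 1533148013   # timestamp serial published by our primary

print(serial_is_newer(fresh, stale))  # False: the update looks older
print(serial_is_newer(stale, fresh))  # True
```

Under this rule the timestamp serial is seen as older than the stale one, so a secondary would have no reason to fetch the zone.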

We've had this problem twice now, and it had a serious impact on the availability of our services.

Here is a short summary of our findings so far, workarounds for situations when this happens again, and a summary of the current status.

History

#1 Updated by Michael Stucki about 1 year ago

Findings so far

On August 1st, we noticed that the nameservers at CloudNS (ns.opsdns.ch & co.) were using a serial which was not propagated by us. This led to a downtime of the whole DNS after 2 days. Here is a chat log with notes by Steffen:

## 2018-08-01
Steffen Gebert [20:30]
@andri somehow CloudDNS is rejecting our zone transfers :-(
> Received unsuccessful notification report for 'typo3.org' from [2a06:fb00:1::1:77]:53, error: Query Refused
They still serve the old chef server host and also t3board19 isn't available

## 2018-08-02
Steffen Gebert [06:12]
yes, it's a 50:50 chance.. 2 nameservers are updated, two are not. Will contact snowflake support .. to.. bring us snow or so ;-)
Steffen Gebert [06:19]
okay.. when I'm composing the mail and writing down the serials, I notice probably the error: old: 2864778338 new 1533148013... :-(
Steffen Gebert [06:34]
so.. should we flip over the 32bit unsigned boundary once again via 4000000000 step in between? I'm really wondering, where this serial is coming from..
1533148013 is 29.07. at 13:48. This contains the new t3board19 record. But it was submitted to gerrit only on the 31st! How can that be..?
Steffen Gebert [06:38]
I've now forced a rewrite of the zone (new serial 1533184647)
Steffen Gebert [07:00]
oh.. this is so annoying.. I'm going through that process.. things are not broken, but it seems that clouddns has a pretty big delay and also doesn't eat everything..
punkt.de seems not to follow my updates at all and clouddns now picked a 28.. serial for which I have no clue, why it chose that..

{ dig SOA typo3.org @ns.typo3.org && dig SOA typo3.org @nameserver1.pluspunkthosting.de. && dig SOA typo3.org @ns.snowflakehosting.ch && dig SOA typo3.org @ns.snowflakehosting.net; } | grep admin
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 3000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1532908101 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 3000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 3000000000 86400 3600 604800 3600

typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 4000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1532908101 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 3000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 3000000000 86400 3600 604800 3600

typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 4000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1532908101 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 4000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 4000000000 86400 3600 604800 3600

typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 42 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1532908101 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 4000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 4000000000 86400 3600 604800 3600

typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 42 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1532908101 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 42 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 42 86400 3600 604800 3600

I really like it that you've got my back and watching over my shoulders while live-patching zone files :stuck_out_tongue:

typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1532908101 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 42 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 42 86400 3600 604800 3600

typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1532908101 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1000000000 86400 3600 604800 3600

now handing over control back to chef by adding some whitespace to the template `/var/chef/cache/powerdns-zones/typo3.org.zone.erb`

typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1533186686 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1532908101 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1000000000 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1000000000 86400 3600 604800 3600

I think the chance that one got an updated zone was 1/4th not 50:50..  it seems we're again sending semi-legal stuff that bind at punkt doesn't accept

typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1533186686 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1532908101 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1533186686 86400 3600 604800 3600
typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. 1533186686 86400 3600 604800 3600

mission accomplished.. we've now got a 3/4 likelihood. Fine, isn't it?
opened a ticket at the remaining 1/4th
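
Note on the serial steps in the dig output above (3000000000, 4000000000, 42, 1000000000, then back to a timestamp): each step has to count as "newer" than the previous one, otherwise a secondary ignores it. The sketch below checks that chain, assuming the secondaries started from the stale serial 2864778338 and compare serials per RFC 1982. The helper names are made up for illustration.

```python
MOD = 1 << 32    # serials wrap around at 2**32
HALF = 1 << 31   # maximum distance that still counts as an increase


def serial_is_newer(candidate, current):
    """RFC 1982 comparison: newer means less than 2**31 ahead, modulo 2**32."""
    return 0 < (candidate - current) % MOD < HALF


def first_bad_step(serials):
    """Return the index of the first serial that is not newer than its
    predecessor, or None if every step is a valid increase."""
    for i in range(1, len(serials)):
        if not serial_is_newer(serials[i], serials[i - 1]):
            return i
    return None


# Stale serial followed by the values seen on ns.typo3.org above.
walk = [2864778338, 3000000000, 4000000000, 42, 1000000000, 1533186686]
print(first_bad_step(walk))                      # None: every step is accepted
print(first_bad_step([2864778338, 1533186686]))  # 1: the direct jump is rejected
```

This is why the detour over the 32-bit boundary was needed instead of simply publishing a fresh timestamp.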

## 2018-08-03
Markus Klein [11:40]
Gerrit dead? forge dead?
status.typo3.org dead?
Mathias Schreiber [11:41]
works for me, maybe related to the DNS issue
Markus Klein [11:41]
>status.typo3.org.       41      IN      A       127.0.0.1
says DNS

#2 Updated by Michael Stucki about 1 year ago

We checked our logs and the Git history on our primary nameserver. This serial (2864778338) was not propagated by us.
We informed snowflake Ops to check with CloudNS what happened here. They answered today:

I have just looked into our logs, and it seems that for some reason the zone had this serial 2864778338 in June; the serial is also mentioned in the initial messages of this ticket. For some reason a cached copy was left in our storage, and this caused the problem.

It is true that we used modulo-based serials instead of timestamps in the past (around June), but this was already replaced on June 22.
So the wrong values are caused by old caches at CloudNS. snowflake Ops is in contact with them to get that fixed.

#3 Updated by Michael Stucki about 1 year ago

Workarounds for situations when this happens again

CloudNS is now aware of the problem, so it should not happen again. snowflake Ops is monitoring the serial numbers of these servers and will detect it if the problem returns. Should the problem still occur, we will have a 48h grace period to react.

In such situations, we need to remove the ns.opsdns.* nameservers until the problem is resolved. The server at nameserver1.pluspunkthosting.de remains available and can be used in the meantime. After the problem is resolved, wait 1 day to be sure it is working before adding the nameservers again.

Available nameservers

  • Master: ns.typo3.org
  • CloudNS: ns.opsdns.ch / ns.opsdns.net / ns.opsdns.li / ns.opsdns.org
  • Pluspunkthosting: nameserver1.pluspunkthosting.de
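
For a quick manual check of whether the secondaries are in sync, the SOA lines printed by the dig command from comment #1 can be compared with the master. This is only a sketch of the idea, not the monitoring snowflake Ops actually runs. The function names are made up, and the sample lines are copied from the dig output in comment #1.

```python
def soa_serial(answer_line: str) -> int:
    """Extract the serial from one SOA answer line as printed by dig."""
    fields = answer_line.split()
    # name, TTL, class, type, MNAME, RNAME, serial, refresh, retry, expire, minimum
    if len(fields) < 11 or fields[3] != "SOA":
        raise ValueError(f"not an SOA answer line: {answer_line!r}")
    return int(fields[6])


def out_of_sync(master_line: str, secondary_lines: dict) -> dict:
    """Return {server: serial} for every secondary whose serial differs
    from the serial the master is serving."""
    expected = soa_serial(master_line)
    return {
        server: soa_serial(line)
        for server, line in secondary_lines.items()
        if soa_serial(line) != expected
    }


soa = "typo3.org.        300    IN    SOA    ns.typo3.org. admin.typo3.org. {} 86400 3600 604800 3600"
master = soa.format(1533186686)
secondaries = {
    "nameserver1.pluspunkthosting.de": soa.format(1532908101),
    "ns.snowflakehosting.ch": soa.format(1000000000),
    "ns.snowflakehosting.net": soa.format(1000000000),
}
print(out_of_sync(master, secondaries))
```

A plain equality check is enough here, because any secondary that stays on a different serial for longer than the usual propagation delay needs a look.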

#4 Updated by Michael Stucki about 1 year ago

Current status

  • As of today, the nameservers of CloudNS are used again for typo3.org. The serial has been correct since August 2, so everything looks good for now.
  • snowflake Ops is monitoring these nameservers and will alert us if the serial is wrong again.
  • snowflake Ops is still in contact with CloudNS to get a sustainable solution for this problem (Web-based reset of the serial).

#5 Updated by Michael Stucki about 1 year ago

  • Status changed from New to Resolved
  • Assignee set to Michael Stucki
  • Priority changed from Should have to Must have
  • % Done changed from 0 to 100

Closing this task for now. The bug was only opened to document the current status...
