Description of my analysis of the too-many-KSK queries problem
TL;DR Background: After the revoke bit was set, all the roots suddenly started receiving significant increases in DNSKEY queries. No one knew why. TL;DR below: I was able to start a bind resolver with specific DNSSEC config settings that caused 6+ outgoing requests for root/DNSKEYs for every incoming request. Then it would go silent, and then it would start over-querying again... ---- In the end, shortly before the revoked key was removed from the published keys list on March 22, I took a quick look into possible causes. I discussed some of my findings, without conclusions, at the DNSSEC workshop [1], as did Duane. After the workshop concluded, I noticed that a bunch of addresses sending lots of queries were from the CS department Purdue. We (ISI) reached out to them asking for help in discovering what the problem was. They discovered that one of the CS labs, which made use of VMs, had left a bunch of VMs running that had DNS resolvers and authoritative servers on them. They were kind enough to send me a copy of the lab setup instructions (which included bind running on Lubuntu), which allowed me to reproduce the situation. The key to the problem turns out that they were specifically using these settings in the bind config file: //dnssec-validation auto; dnssec-enable no; The first link clearly sets the validation state to the default behavior (IE, it was explicitly commented out in the file, and not set to a particular state). I believe the default state is "auto", though I don't know for which versions of bind this is true. The second line, dnssec-enable, was deliberately set to "no". With these two settings adjusted in my default bind.conf file from Fedora (I didn't have Lubuntu to test against), and a bind "managed keys" file that contained only the KSK2010 key and the dlv.isc.org key, I started up bind (bind-9.11.5-4.P4 from Fedora) and tcpdump in parallel. The results were (frustratingly) intermittent. Sometimes there was an issue, and sometimes not. In the end the managed-keys file was updated to contain just the KSK2017 key. When there was an issue, every query I sent from dig@localhost to the resolver@localhost caused 6-7 outgoing queries for DNSKEY and the root servers (I think it required requests for a new TLD each request). After a number of queries, negative caching seemed to finally kick in as the resolver would stop sending DNSKEY queries for a while. Then, on the order of minutes, later it would starting sending more DNSKEY queries in bulk for every incoming request. And as I said, sometimes the resolver failed to even enter the "broken" state. Evan, at the IETF, reported in a few meetings and conversations that they had discovered a bug in bind previously that would exhibit this roll-over-and-die type behavior but that it was only present in out-of-date versions of bind (9.10 and below I believe he stated). I appreciate the bind team looking into the problem. I can't, unfortunately, reconcile why my bind 9.11 on Fedora seemed to have the same issue however. My colleague Robert Story reproduced the issue as well and had tried Fedora 27-29, though I don't remember which versions had the issue. [1]: https://icann.zoom.us/recording/play/OYJ4R7IQCxnF5Rw6Kk_nnJQqrp4037W7eS9eeJe... -- Wes Hardaker USC/ISI
Thank you for this analysis. On Wed, Apr 03, 2019 at 01:56:14PM -0700, Wes Hardaker wrote:
Evan, at the IETF, reported in a few meetings and conversations that they had discovered a bug in bind previously that would exhibit this roll-over-and-die type behavior but that it was only present in out-of-date versions of bind (9.10 and below I believe he stated).
I think we have a case of two different bugs with superficially similar effects. I haven't yet been able to reproduce yours (maybe it's specific to Fedora somehow, or maybe I just haven't hit the right combination yet). Mine causes named to go into a tight loop sending DNSKEY queries forever, starting immediately on startup. It doesn't ever quiet down, even temporarily, and it doesn't depend on incoming queries - it just spins. Once the revoked key is removed, it stops. Based on sheer volume, I would guess this was a bigger contributor to the observed increase in DNSKEY traffic than the bug you discovered, though yours is odd, and definitely warrants further investigation. The looping bug was fixed in 9.10.2 and (if I recall correctly) 9.9.7, and was never in the 9.11 branch. I saw a list of "version.bind" responses from servers that were sending the most DNSKEY queries, and the worst offenders were older than that. -- Evan Hunt -- each@isc.org Internet Systems Consortium, Inc.
Evan Hunt <each@isc.org> writes: Can you elaborate on this:
Once the revoked key is removed, it stops.
Removed from where? the root zone? the cache? The managed keys file?
Based on sheer volume, I would guess this was a bigger contributor to the observed increase in DNSKEY traffic than the bug you discovered, though yours is odd, and definitely warrants further investigation.
Reminder: I was in an airport and working quickly right before the flight and right before the 22nd, when the revoked key would be removed. I'm not *positive* there was a correlation between requests and outgoing DNSKEY queries since this is from memory and because I was working quickly I may not have hit the right conclusion. Wish I had saved pcaps... -- Wes Hardaker USC/ISI
On Thu, Apr 04, 2019 at 09:23:17AM -0700, Wes Hardaker wrote:
Once the revoked key is removed, it stops.
Removed from where? the root zone? the cache? The managed keys file?
Root zone. As far as I can tell it never reaches the cache at all.
Reminder: I was in an airport and working quickly right before the flight and right before the 22nd, when the revoked key would be removed. I'm not *positive* there was a correlation between requests and outgoing DNSKEY queries since this is from memory and because I was working quickly I may not have hit the right conclusion. Wish I had saved pcaps...
Perhaps, but you described a number of behaviors that were significantly different, including intermittency, burstiness, and the fact that the release you were testing should've had the fix for the looping problem... so there may well be two different bugs. After I'm done with jetlag recovery I'm planning to build a test environment and keep looking. -- Evan Hunt -- each@isc.org Internet Systems Consortium, Inc.
Hi Evan, can you elaborate on the looping bug? For example, what combination of configuration statements would cause this, and why was a revoked key special in this case. Thanks Roy
On 4 Apr 2019, at 12:50, Evan Hunt <each@isc.org> wrote:
On Thu, Apr 04, 2019 at 09:23:17AM -0700, Wes Hardaker wrote:
Once the revoked key is removed, it stops.
Removed from where? the root zone? the cache? The managed keys file?
Root zone. As far as I can tell it never reaches the cache at all.
Reminder: I was in an airport and working quickly right before the flight and right before the 22nd, when the revoked key would be removed. I'm not *positive* there was a correlation between requests and outgoing DNSKEY queries since this is from memory and because I was working quickly I may not have hit the right conclusion. Wish I had saved pcaps...
Perhaps, but you described a number of behaviors that were significantly different, including intermittency, burstiness, and the fact that the release you were testing should've had the fix for the looping problem... so there may well be two different bugs. After I'm done with jetlag recovery I'm planning to build a test environment and keep looking.
-- Evan Hunt -- each@isc.org Internet Systems Consortium, Inc. _______________________________________________ ksk-rollover mailing list ksk-rollover@icann.org https://mm.icann.org/mailman/listinfo/ksk-rollover
On Thu, Apr 04, 2019 at 12:54:37PM -0400, Roy Arends wrote:
Hi Evan, can you elaborate on the looping bug? For example, what combination of configuration statements would cause this, and why was a revoked key special in this case.
It was similar to the environment Wes set up, I found it while trying to reproduce his report. When starting named with either a managed-keys database containing only KSK-2010, or with no managed-keys database and a bind.keys file containing only KSK-2010, named sent a key refresh query for ./DNSKEY, got back a response containing KSK-2010 with the REVOKED bit set, validated it but failed to record the revocation, and immediately retried the query. I believe it was fixed in this commit, around line 8940 in lib/dns/zone.c: https://gitlab.isc.org/isc-projects/bind9/commit/f87d4ca08 Since the revoked key is no longer in the root zone, I'll need to set up a toy root server to confirm that that was indeed the relevant change; I haven't done so yet. -- Evan Hunt -- each@isc.org Internet Systems Consortium, Inc.
Evan Hunt <each@isc.org> writes:
Since the revoked key is no longer in the root zone, I'll need to set up a toy root server to confirm that that was indeed the relevant change; I haven't done so yet.
(do note that the report I posted yesterday shows that things are still happening with the current root zone too, with KSK2010 being completely missing... so you probably need to test against both a fake root with a revoked key *and* the current state, as there may be more than one bug) -- Wes Hardaker USC/ISI
participants (3)
-
Evan Hunt -
Roy Arends -
Wes Hardaker