Replies inline. Apologies for the wordiness, but there's information here I didn't want to push out to a pastebin.

On 10/2/19 10:36 AM, Paul Vixie wrote:
Michael Casadevall wrote on 2019-10-02 00:30:
Looking at the document, one thing that jumps out at me is both data integrity and route hijacking. I feel this document puts far too much faith in validating resolvers to catch compromised root servers and route hijacking.
When I did work on creating a root server emulation setup for testing root IDNs, generally speaking, unless I went out of my way to force strict compliance, unsigned or invalidly signed data would just be accepted if it came from the roots, even before I inserted my fake KSK into the mix.
The short version: even if I can pop a root server and inject malicious traffic, I can't tamper with the signed root zone data without causing signature validation failures. I *can*, however, freely change the unsigned delegation NS records to MITM any DNS lookup to a TLD, especially if I'm otherwise just echoing signed data. That's a pretty darn effective way to cast a dragnet over all lookups to a TLD.
Furthermore, I'm not convinced that the recursive resolver infrastructure as a whole will properly fail if someone injects a bad zone into the root. The defaults in far too many implementations are to soft-fail rather than hard-fail.
it's vital that we find out how common this behaviour is.
My original framework for testing IDNs exists as a (somewhat ugly) set of bash scripts and Docker instances. It's available here on GitHub: https://github.com/NCommander/idn-root-zone

Please note I'm going from memory here, so some of the details may be wrong. The original work didn't go anywhere, but I'm happy to revive it. The setup creates a fake root zone signed with its own KSK. Each TLD is delegated from the root zone to a separate name server, which in turn subdelegates to a second-level domain, more or less accurately capturing how TLDs are actually managed. Each zone is signed with its own set of keys. As a test, I defined .internic as a TLD to get everything working, so you ended up with the following:

1. A Docker container with the root zone signed with a fake KSK
2. A container running an authoritative BIND server for .internic, handling key signing via BIND's key management features
3. A second-level domain (nynex.internic), also authoritative
4. A recursive resolver (hermes.nynex.internic)

Each server is loaded with the proper root hints to point at the fake root, plus a matching bind.keys. I also had part of a .arpa zone set up, but looking at GitHub, I either never finished it or never committed it. This framework is pretty easy to extend to a more in-depth test once we figure out exactly what and how we want to test.
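For concreteness, the delegation chain in that setup looks roughly like the following fragment of the fake (pre-signing) root zone. This is a sketch from memory; the TLD name matches the setup above, but the server names, addresses, and TTLs here are placeholders:

```
;; Sketch of the emulator's fake root zone, before dnssec-signzone is run.
;; The .internic delegation matches the setup above; everything else is
;; illustrative (192.0.2.0/24 is a documentation prefix).
.              86400  IN  SOA  a.fake-root. admin.fake-root. 1 7200 3600 1209600 86400
.              86400  IN  NS   a.fake-root.
a.fake-root.   86400  IN  A    192.0.2.1
;; Delegation for the test TLD -- note the NS/glue records here are the
;; unsigned data an attacker on a popped root could freely rewrite:
internic.      86400  IN  NS   ns1.internic.
ns1.internic.  86400  IN  A    192.0.2.2
```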
data exfiltration by unsigned (delegation) NS RR is not concerning, since the data is public, and since in the earlier part of the story a root name server had to be compromised -- so exfiltration is already on the table, be it from one vantage point or another.
however, if soft fails are sometimes ignored, we need to study it closely.
Fair enough regarding exfiltration of data. As for soft fails, or root DNSSEC data just being MIA, I can cite specific examples.

The first one I can give is actually my mom's home router. I wrote incorrectly before that it returns AD=1; however, it still shows erroneous behavior with DNSSEC. The original writeup as part of my DNSCatcher work (https://github.com/NCommander/dnscatcher/blob/master/doc/technical_overview....) is correct, but I'll summarize it here.

In short, my personal domain (casadevall.pro) is DNSSEC signed. Asking my router for verification returns the following:

$ dig @10.0.0.1 casadevall.pro +dnssec

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7107
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1280

casadevall.pro. 3600 IN A 96.126.124.51
casadevall.pro. 3600 IN RRSIG A 8 2 3600 *snip*

Note that the AD flag is *not* set even though an RRSIG record came back. With CD=0, the router should have either set AD=1 on successful validation or returned SERVFAIL if it failed to validate; it did neither. (I did check with Wireshark: the router responds with DO=1 in the EDNS section of the query response, so this is clearly erroneous behavior; I can provide pcap files.)

Furthermore, this router's DNS client is broken and can't perform DNSSEC lookups properly. This can be seen if one performs a sigchase (pastebin: https://pastebin.com/A3mKa7rA). If the router did not support DNSSEC at all, it shouldn't be returning DO=1 in the OPT reply. The sigchase passes if I use 8.8.8.8 (pastebin: https://pastebin.com/DDdfiiY6).

The router's response to version.bind:

;; ANSWER SECTION:
version.bind. 0 CH TXT "dnsmasq-2.39"

Short version: the router says it supports DNSSEC, returns DO=1 when queried, fails to actually look up DNSSEC records, and never returns SERVFAIL regardless of whether a domain's signatures are valid or broken (sigfail.verteiltesysteme.net shows the same behavior).

---

Case 2: soylentnews.org

SoylentNews is a small business I own.
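As an aside, this kind of check is easy to mechanize. A rough sketch follows (the classify helper and the hard-coded sample header are mine, not part of DNSCatcher): with CD=0, a validating resolver should answer either SERVFAIL (bogus data) or NOERROR with AD set (validated data); NOERROR with RRSIGs present but no AD means the box isn't validating at all, which is exactly what the router above does.

```shell
#!/bin/sh
# Rough classifier for a resolver's DNSSEC posture, driven by the dig
# header lines. Assumes the query was sent with +dnssec and CD=0.
classify() {
  out="$1"
  # Pull the rcode out of the ->>HEADER<<- line.
  status=$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
  # Look for a lowercase "ad" token in the flags list (before the ';').
  if printf '%s\n' "$out" | grep -Eq '^;; flags:[^;]* ad[ ;]'; then
    echo "validating (AD set, status $status)"
  elif [ "$status" = "SERVFAIL" ]; then
    echo "rejected as bogus (SERVFAIL)"
  else
    echo "not validating (status $status, no AD)"
  fi
}

# The header lines the router returned for casadevall.pro above:
router=';; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7107
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1'
classify "$router"
```

In a survey tool, classify would be fed the header of `dig @<resolver> <name> +dnssec` for both a known-good and a known-bad domain; any resolver classified "not validating" on both is the soft-fail case I'm worried about.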
For long and irrelevant reasons, the network it runs on uses split-horizon DNS; externally, soylentnews.org and sylnt.us are DNSSEC signed (https://dnssec-analyzer.verisignlabs.com/soylentnews.org). During testing, I noticed an odd discrepancy.

Internally, all servers are networked via IPv6 (there are no A records for most machines) and reference each other via a fake TLD, li694-22. For example:

mcasadevall@boron:/etc/bind$ dig hydrogen.li694-22 AAAA +short
2600:3c00::f03c:91ff:fe6e:1ded

Directory information is handled by hesiod in the IN class:

mcasadevall@boron:~$ dig @helium.li694-22 mcasadevall.passwd.li694-22 TXT +short
"mcasadevall:*:2500:2500:Michael Casadevall,SN,+1-917-716-2585:/home/mcasadevall:/bin/bash"

Password authentication is handled through Kerberos. Helium is the authoritative server for the li694-22 zone; boron is one of two recursive resolvers and acts as the backup, replicating via AXFR from helium. The li694-22 zone is *not* signed in any way (no local trust anchor or DLV). For these tests, I'm accessing the recursive resolvers from another machine, specifically neon, one of our database nodes.

Helium's named options relating to DNSSEC are set only as follows:

named.conf.options: dnssec-validation auto;

When requesting hydrogen's AAAA record with DO=1:

mcasadevall@neon:~$ dig @helium.li694-22 hydrogen.li694-22 AAAA +dnssec

; <<>> DiG 9.9.5-3ubuntu0.15-Ubuntu <<>> @helium.li694-22 hydrogen.li694-22 AAAA +dnssec
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58352
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096

;; QUESTION SECTION:
;hydrogen.li694-22. IN AAAA

;; ANSWER SECTION:
hydrogen.li694-22. 3600 IN AAAA 2600:3c00::f03c:91ff:fe6e:1ded

;; AUTHORITY SECTION:
li694-22. 3600 IN NS helium.li694-22.
li694-22. 3600 IN NS boron.li694-22.

;; ADDITIONAL SECTION:
boron.li694-22. 3600 IN AAAA 2600:3c00::f03c:91ff:fe6e:c4bf
helium.li694-22. 3600 IN AAAA 2600:3c00::f03c:91ff:fe6e:1d88

;; Query time: 0 msec
;; SERVER: 2600:3c00::f03c:91ff:fe6e:1d88#53(2600:3c00::f03c:91ff:fe6e:1d88)
;; WHEN: Wed Oct 02 17:45:52 UTC 2019
;; MSG SIZE rcvd: 171

I get an answer with the AA flag set, but no signed data. This should be SERVFAIL, because there's no chain of trust from the root. Notably, running a test against a known-bad domain fails as expected:

mcasadevall@neon:~$ dig @helium.li694-22 sigfail.verteiltesysteme.net AAAA

; <<>> DiG 9.9.5-3ubuntu0.15-Ubuntu <<>> @helium.li694-22
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 35765

So BIND is properly validating recursive DNSSEC queries, given the SERVFAIL, but if it's authoritative for a zone and returns AA, it simply returns the record. Helium is running the latest LTS, Ubuntu 18.04:

mcasadevall@helium:/etc/bind$ dpkg -l bind9
ii bind9 1:9.11.3+dfsg-1ubun amd64

Short version: a commonly deployed version of BIND shows incorrect behavior and will return a record if it's authoritative, even if DNSSEC information is MIA or broken.

---

Case 2.5: Neon Oddity

When I wrote this up, I was originally testing on the backup recursive resolver and got very strange behavior. Boron is running Ubuntu 14.04 and is overdue for an upgrade. It's pointed at helium. Its DNSSEC options are set somewhat differently:

named.conf.local: auto-dnssec maintain;
named.conf.options: dnssec-enable yes;
named.conf.options: dnssec-validation yes;
named.conf.options: dnssec-lookaside auto;

Boron shows the same behavior resolving the li694-22 domain. It will also resolve the known-broken domain:

mcasadevall@boron:/etc/bind$ dig @127.0.0.1 sigfail.verteiltesysteme.net AAAA +short +dnssec
2001:638:501:8efc::139
AAAA 5 3 60 20191230020003 20190930020003 30665 verteiltesysteme.net.
//This+RRSIG+is+deliberately+broken///For+more+informati
on+please+go+to/http+//www+verteiltesysteme+net/////////
////////////////////////////////////////////////////////
//8=

mcasadevall@boron:/etc/bind$ dpkg -l bind9
ii bind9 1:9.9.5.dfsg amd64 Internet Domain Name Server

Boron is slated for upgrade this weekend (I didn't realize it was THAT out of date), but I wanted to highlight a case where DNSSEC validation is not working the way it's supposed to.
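The common thread in Cases 2 and 2.5 is the aa flag: an answer served from a zone the server is itself authoritative for never goes through validation, no matter what dnssec-validation is set to. A quick sketch of a check for that trap (the helper name and the hard-coded flags line are mine):

```shell
#!/bin/sh
# Succeeds if the dig flags line shows aa, i.e. the answer came from the
# server's own authoritative data, which BIND does not pass through
# DNSSEC validation even with dnssec-validation enabled.
answered_authoritatively() {
  printf '%s\n' "$1" | grep -Eq '^;; flags:[^;]* aa[ ;]'
}

# The flags line helium returned for hydrogen.li694-22 above:
helium=';; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3'
if answered_authoritatively "$helium"; then
  echo "authoritative answer: DNSSEC validation did not apply"
fi
```

Any survey of validating resolvers should filter out (or separately bucket) aa answers for exactly this reason; otherwise a combined authoritative/recursive box looks like a validation failure when it's actually documented BIND behavior.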
The canonical example I can bring out is .local/.lan. These TLDs are common on private networks, and if my understanding is correct, the roots get pinged for these TLDs all the time. The root servers send NXDOMAIN (with a signed NSEC if DO=1), but these hosts still (presumably) work at the locations that use them. If recursive resolvers were hard-failing for non-existent TLDs, we'd have seen a lot of pain when the root zone was originally signed.
generally speaking, local names like .lan or .local (or .tor) are handled as "hosts file fallback" or as a plugin, and those namespaces rely on an NXD response from DNS in order to open up the next namespace in their local logic path.
See counter-example above. I also know that in Microsoft Active Directory, .local (or any domain) is handled through DNS, in addition to LDAP and WINS, and not as a fallback. I'm unfamiliar with any setup that actually depends on an NXD from the root; in all cases and clients I've worked with, the setup breaks if their resolvers aren't first in resolv.conf (or unless hosts are loaded from NIS/LDAP via nsswitch; in MS environments, though, it's always been DNS in my experience).
Taking the case that I've popped an RSO and can control one root server's zone file freely, I can delete all DNSSEC data and just send whatever records I like. How much of the validating recursive infrastructure is going to properly hard-fail if it receives unsigned data, or a mix of signed and unsigned data?
all of it. the RDNS config includes a binding from a static key to some subtree of the DNS -- often called a "trust anchor". in bind9 as observed, if it gets unsigned data for a subtree (such as ".") that has a trust anchor defined for it, the stub resolver will get a SERVFAIL answer.
See above: this isn't the observed behavior in limited testing on a recent BIND9 instance, nor did I run into it with my IDN root zone emulator. I admit that in the case I gave there was no dedicated recursive resolver, but dnssec-validation yes should have caused SERVFAIL regardless of whether AA is set. Setting DO=1 didn't return AD=1 (no path to the root), but it didn't return SERVFAIL either.
if _any_ available validating RDNS does not behave that way, we need to set our hair on fire until we get it either fixed or shamed/decertified.
Speaking from my position on the other end of the stack, I have a bunch of stuff here that either falls over with DNSSEC or actively lies (AD=1) when it's impossible for that to be true.
we need details.
I misremembered on the AD=1 point. I can build more comprehensive tests here. As far as I understand the specs, in any case where DNSSEC validation is enabled, I should get SERVFAIL on essentially all of the above, not what I'm seeing. I'll admit these are scattershot data points, but between my own work with the root zone emulator and the data points I collected for this email, there are enough red flags to raise a real concern that recursive resolvers are either soft-failing or showing outright incorrect behavior.
DNSSEC can provide the security we want with regards to data integrity, but I think we're assuming too much of the DNS ecosystem to say that it's effective at preventing root zone tampering without actual data to support that.
I could probably repurpose some of my original root zone emulator work to test this behavior more in depth, but I think that's a separate discussion (esp. with regard to what and how to test).

Michael, please either continue, or share more detail about prior experiments.
What I'd like to do, with help from the caucus, is devise a set of scenarios and the expected DNSSEC behavior in each one, then run a full battery of tests to see if resolvers bend: testing the latest versions of popular DNS resolvers (dnsmasq, BIND, PowerDNS, Unbound, and Microsoft AD are probably good candidates; dnsmasq is extremely common in routers) and seeing whether they actually go SERVFAIL when they're supposed to. This should also extend to distribution-supplied versions of resolvers (i.e. BIND as shipped in Ubuntu 16.04/18.04/latest).

The framework I'd recommend is my fake Internet root zone with its own KSK, so I fully control the environment and can test various scenarios. The root server software can stay consistent, because we're looking at the behavior of the recursive resolvers. Outside of the control tests, I believe all of these cases SHOULD return SERVFAIL. If there are other cases, or cases that should pass or fail differently, please note them.

1. Control: DNSSEC validation pass of an existing domain in the root, from the client to a resolver
2. Control: DNSSEC-validated NSEC+NXDOMAIN of a non-existent domain in the root
3. Hijacked root pointing at an unsigned TLD
4. Hijacked root serving bad data with incorrect RRSIGs
5. Hijacked root not returning RRSIGs
6. TLD hijack serving bad signatures (represents Case 2 above)
7. TLD injection (Case 2 above), with no chain to the root
8. Domain hijack between root and recursive resolver (traffic injection, cache poisoning, etc.)
9. Expired/revoked KSK, or a similar scenario

I'm pretty close to the end of my day, but if this sounds like the correct direction to investigate, I'll flesh this out into an actual document, float it on the caucus, and start building a test suite for it.

Any corrections welcome,
Michael
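To make the expected outcomes machine-checkable from day one, the scenario list could be encoded as a table that a future harness iterates over. A minimal sketch (the table format and descriptions are mine; the numbering and expected rcodes follow the list above):

```shell
#!/bin/sh
# Expected rcode per test scenario. A harness would run each scenario's
# dig query against the resolver under test and compare the observed
# status line against this table; any mismatch on cases 3-9 is a
# soft-fail finding.
print_matrix() {
  while IFS='|' read -r num expect desc; do
    printf 'case %s: expect %s -- %s\n' "$num" "$expect" "$desc"
  done <<'EOF'
1|NOERROR|control: validation pass for an existing domain in the fake root
2|NXDOMAIN|control: signed NSEC denial of a non-existent domain
3|SERVFAIL|hijacked root pointing at an unsigned TLD
4|SERVFAIL|hijacked root serving data with incorrect RRSIGs
5|SERVFAIL|hijacked root not returning RRSIGs
6|SERVFAIL|TLD hijack serving bad signatures
7|SERVFAIL|TLD injection with no chain to the root
8|SERVFAIL|domain hijack between root and recursive resolver
9|SERVFAIL|expired or revoked KSK
EOF
}
print_matrix
```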