Replies inline. Apologies for the wordiness, but there's information here I didn't want to push out to a pastebin.

On 10/2/19 10:36 AM, Paul Vixie wrote:
Michael Casadevall wrote on 2019-10-02 00:30:
Looking at the document, one thing that jumps out at me is both data integrity and route hijacking. I feel this document puts far too much faith in validating resolvers to catch compromised root servers and route hijacking.
When I did work on creating a root server emulation setup for testing root IDNs, generally speaking, unless I went out of my way to force strict compliance, unsigned or invalidly signed data would just be accepted if it came from the roots, even before I inserted my fake KSK into the mix.
The short version: even if I can pop a root server and inject malicious traffic, I can't tamper with the signed root zone data without causing signature validation failures. I *can*, however, freely change the unsigned delegation NS records to MITM any DNS lookup to a TLD, especially if I'm otherwise just echoing signed data. That's a pretty darn effective way to cast a dragnet over all lookups to a TLD.
Furthermore, I'm not convinced that the recursive resolver infrastructure as a whole will properly fail if someone injects a bad zone into the root. The defaults in far too many implementations are to soft-fail rather than hard-fail.
it's vital that we find out how common this behaviour is.
My original framework for testing IDNs exists as a (somewhat ugly) set of bash scripts and Docker instances. It's available here on GitHub: https://github.com/NCommander/idn-root-zone

Please note I'm going from memory here, so some of the details may be wrong. The original work didn't go anywhere, but I'm happy to revive it. The setup creates a fake root zone signed with its own KSK. Each TLD is delegated from the root zone to a separate name server, which in turn subdelegates to a second-level domain, more or less accurately capturing how TLDs are actually managed. Each zone is signed with its own set of keys. As a test, I defined .internic as a TLD to get everything working, so you ended up with the following:

1. A Docker container with the root zone signed with a fake KSK
2. A container running an authoritative BIND server for .internic, handling key signing via BIND's key management features
3. A second-level domain (nynex.internic), also authoritative
4. A recursive resolver (hermes.nynex.internic)

Each server is loaded with the proper root hints to point at the fake root, plus a matching bind.keys. I also had part of a .arpa zone set up, but looking at GitHub, I either never finished it or never committed it. This framework is pretty easy to extend to a more in-depth test once we figure out exactly what and how we want to test.
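For concreteness, the delegation chain in that setup looks roughly like the following fragment of the fake (pre-signing) root zone. This is a sketch from memory; the TLD name matches the setup above, but the server names, addresses, and TTLs here are placeholders:

```
;; Sketch of the emulator's fake root zone, before dnssec-signzone is run.
;; The .internic delegation matches the setup above; everything else is
;; illustrative (192.0.2.0/24 is a documentation prefix).
.              86400  IN  SOA  a.fake-root. admin.fake-root. 1 7200 3600 1209600 86400
.              86400  IN  NS   a.fake-root.
a.fake-root.   86400  IN  A    192.0.2.1
;; Delegation for the test TLD -- note the NS/glue records here are the
;; unsigned data an attacker on a popped root could freely rewrite:
internic.      86400  IN  NS   ns1.internic.
ns1.internic.  86400  IN  A    192.0.2.2
```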
data exfiltration by unsigned (delegation) NS RR is not concerning, since the data is public, and since in the earlier part of the story a root name server had to be compromised -- so exfiltration is already on the table, be it from one vantage point or another.
however, if soft fails are sometimes ignored, we need to study it closely.
Fair enough regarding exfiltration of data. As for soft fails, or root DNSSEC data just being MIA, I can cite specific examples.

The first one I can give is actually my mom's home router. I wrote incorrectly before that it returns AD=1; however, it still shows erroneous behavior with DNSSEC. The original writeup as part of my DNSCatcher work (https://github.com/NCommander/dnscatcher/blob/master/doc/technical_overview....) is correct, but I'll summarize it here.

In short, my personal domain (casadevall.pro) is DNSSEC signed. Asking my router for verification returns the following:

$ dig @10.0.0.1 casadevall.pro +dnssec

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7107
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1280

casadevall.pro. 3600 IN A 96.126.124.51
casadevall.pro. 3600 IN RRSIG A 8 2 3600 *snip*

Note that the AD flag is *not* set even though an RRSIG record came back. With CD=0, the router should have either set AD=1 on successful validation or returned SERVFAIL if it failed to validate; it did neither. (I did check with Wireshark: the router responds with DO=1 in the EDNS section of the query response, so this is clearly erroneous behavior; I can provide pcap files.)

Furthermore, this router's DNS client is broken and can't perform DNSSEC lookups properly. This can be seen if one performs a sigchase (pastebin: https://pastebin.com/A3mKa7rA). If the router did not support DNSSEC at all, it shouldn't be returning DO=1 in the OPT reply. The sigchase passes if I use 8.8.8.8 (pastebin: https://pastebin.com/DDdfiiY6).

The router's response to version.bind:

;; ANSWER SECTION:
version.bind. 0 CH TXT "dnsmasq-2.39"

Short version: the router says it supports DNSSEC, returns DO=1 when queried, fails to actually look up DNSSEC records, and never returns SERVFAIL regardless of whether a domain's signatures are valid or broken (sigfail.verteiltesysteme.net shows the same behavior).

---

Case 2: soylentnews.org

SoylentNews is a small business I own.
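As an aside, this kind of check is easy to mechanize. A rough sketch follows (the classify helper and the hard-coded sample header are mine, not part of DNSCatcher): with CD=0, a validating resolver should answer either SERVFAIL (bogus data) or NOERROR with AD set (validated data); NOERROR with RRSIGs present but no AD means the box isn't validating at all, which is exactly what the router above does.

```shell
#!/bin/sh
# Rough classifier for a resolver's DNSSEC posture, driven by the dig
# header lines. Assumes the query was sent with +dnssec and CD=0.
classify() {
  out="$1"
  # Pull the rcode out of the ->>HEADER<<- line.
  status=$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
  # Look for a lowercase "ad" token in the flags list (before the ';').
  if printf '%s\n' "$out" | grep -Eq '^;; flags:[^;]* ad[ ;]'; then
    echo "validating (AD set, status $status)"
  elif [ "$status" = "SERVFAIL" ]; then
    echo "rejected as bogus (SERVFAIL)"
  else
    echo "not validating (status $status, no AD)"
  fi
}

# The header lines the router returned for casadevall.pro above:
router=';; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7107
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1'
classify "$router"
```

In a survey tool, classify would be fed the header of `dig @<resolver> <name> +dnssec` for both a known-good and a known-bad domain; any resolver classified "not validating" on both is the soft-fail case I'm worried about.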
For long and irrelevant reasons, the network it runs on uses split-horizon DNS; externally, soylentnews.org and sylnt.us are DNSSEC signed (https://dnssec-analyzer.verisignlabs.com/soylentnews.org). During testing, I noticed an odd discrepancy.

Internally, all servers are networked via IPv6 (there are no A records for most machines) and reference each other via a fake TLD, li694-22. For example:

mcasadevall@boron:/etc/bind$ dig hydrogen.li694-22 AAAA +short
2600:3c00::f03c:91ff:fe6e:1ded

Directory information is handled by hesiod in the IN class:

mcasadevall@boron:~$ dig @helium.li694-22 mcasadevall.passwd.li694-22 TXT +short
"mcasadevall:*:2500:2500:Michael Casadevall,SN,+1-917-716-2585:/home/mcasadevall:/bin/bash"

Password authentication is handled through Kerberos. Helium is the authoritative server for the li694-22 zone; boron is one of two recursive resolvers and acts as the backup, replicating via AXFR from helium. The li694-22 zone is *not* signed in any way (no local trust anchor or DLV). For these tests, I'm accessing the recursive resolvers from another machine, specifically neon, one of our database nodes.

Helium's named options relating to DNSSEC are set only as follows:

named.conf.options: dnssec-validation auto;

When requesting hydrogen's AAAA record with DO=1:

mcasadevall@neon:~$ dig @helium.li694-22 hydrogen.li694-22 AAAA +dnssec

; <<>> DiG 9.9.5-3ubuntu0.15-Ubuntu <<>> @helium.li694-22 hydrogen.li694-22 AAAA +dnssec
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58352
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096

;; QUESTION SECTION:
;hydrogen.li694-22. IN AAAA

;; ANSWER SECTION:
hydrogen.li694-22. 3600 IN AAAA 2600:3c00::f03c:91ff:fe6e:1ded

;; AUTHORITY SECTION:
li694-22. 3600 IN NS helium.li694-22.
li694-22. 3600 IN NS boron.li694-22.

;; ADDITIONAL SECTION:
boron.li694-22. 3600 IN AAAA 2600:3c00::f03c:91ff:fe6e:c4bf
helium.li694-22. 3600 IN AAAA 2600:3c00::f03c:91ff:fe6e:1d88

;; Query time: 0 msec
;; SERVER: 2600:3c00::f03c:91ff:fe6e:1d88#53(2600:3c00::f03c:91ff:fe6e:1d88)
;; WHEN: Wed Oct 02 17:45:52 UTC 2019
;; MSG SIZE rcvd: 171

I get an answer with the AA flag set, but no signed data. This should be SERVFAIL, because there's no chain of trust from the root. Notably, running a test against a known-bad domain fails as expected:

mcasadevall@neon:~$ dig @helium.li694-22 sigfail.verteiltesysteme.net AAAA

; <<>> DiG 9.9.5-3ubuntu0.15-Ubuntu <<>> @helium.li694-22
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 35765

So BIND is properly validating recursive DNSSEC queries, given the SERVFAIL, but if it's authoritative for a zone and returns AA, it simply returns the record. Helium is running the latest LTS, Ubuntu 18.04:

mcasadevall@helium:/etc/bind$ dpkg -l bind9
ii bind9 1:9.11.3+dfsg-1ubun amd64

Short version: a commonly deployed version of BIND shows incorrect behavior and will return a record if it's authoritative, even if DNSSEC information is MIA or broken.

---

Case 2.5: Neon Oddity

When I wrote this up, I was originally testing on the backup recursive resolver and got very strange behavior. Boron is running Ubuntu 14.04 and is overdue for an upgrade. It's pointed at helium. Its DNSSEC options are set somewhat differently:

named.conf.local: auto-dnssec maintain;
named.conf.options: dnssec-enable yes;
named.conf.options: dnssec-validation yes;
named.conf.options: dnssec-lookaside auto;

Boron shows the same behavior resolving the li694-22 domain. It will also resolve the known-broken domain:

mcasadevall@boron:/etc/bind$ dig @127.0.0.1 sigfail.verteiltesysteme.net AAAA +short +dnssec
2001:638:501:8efc::139
AAAA 5 3 60 20191230020003 20190930020003 30665 verteiltesysteme.net.
//This+RRSIG+is+deliberately+broken///For+more+informati
on+please+go+to/http+//www+verteiltesysteme+net/////////
////////////////////////////////////////////////////////
//8=

mcasadevall@boron:/etc/bind$ dpkg -l bind9
ii bind9 1:9.9.5.dfsg amd64 Internet Domain Name Server

Boron is slated for upgrade this weekend (I didn't realize it was THAT out of date), but I wanted to highlight a case where DNSSEC validation is not working the way it's supposed to.
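The common thread in Cases 2 and 2.5 is the aa flag: an answer served from a zone the server is itself authoritative for never goes through validation, no matter what dnssec-validation is set to. A quick sketch of a check for that trap (the helper name and the hard-coded flags line are mine):

```shell
#!/bin/sh
# Succeeds if the dig flags line shows aa, i.e. the answer came from the
# server's own authoritative data, which BIND does not pass through
# DNSSEC validation even with dnssec-validation enabled.
answered_authoritatively() {
  printf '%s\n' "$1" | grep -Eq '^;; flags:[^;]* aa[ ;]'
}

# The flags line helium returned for hydrogen.li694-22 above:
helium=';; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3'
if answered_authoritatively "$helium"; then
  echo "authoritative answer: DNSSEC validation did not apply"
fi
```

Any survey of validating resolvers should filter out (or separately bucket) aa answers for exactly this reason; otherwise a combined authoritative/recursive box looks like a validation failure when it's actually documented BIND behavior.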
The canonical example I can bring out is .local/.lan. These TLDs are common on private networks, and if my understanding is correct, the roots get pinged for these TLDs all the time. The root servers send NXDOMAIN (with a signed NSEC if DO=1), but these hosts still (presumably) work at the locations that use them. If recursive resolvers were hard-failing for non-existent TLDs, we'd have seen a lot of pain when the root zone was originally signed.
generally speaking, local names like .lan or .local (or .tor) are handled as "hosts file fallback" or as a plugin, and those namespaces rely on an NXD response from DNS in order to open up the next namespace in their local logic path.
See counter-example above. I also know that in Microsoft Active Directory, .local (or any domain) is handled through DNS, in addition to LDAP and WINS, and not as a fallback. I'm unfamiliar with any setup that actually depends on an NXD from the root; in all cases and clients I've worked with, the setup breaks if their resolvers aren't first in resolv.conf (or unless hosts are loaded from NIS/LDAP via nsswitch; in MS environments, though, it's always been DNS in my experience).
Taking the case that I've popped an RSO and can control one root server's zone file freely, I can delete all DNSSEC data and just send whatever records I like. How much of the validating recursive infrastructure is going to properly hard-fail if it receives unsigned data, or a mix of signed and unsigned data?
all of it. the RDNS config includes a binding from a static key to some subtree of the DNS -- often called a "trust anchor". in bind9 as observed, if it gets unsigned data for a subtree (such as ".") that has a trust anchor defined for it, the stub resolver will get a SERVFAIL answer.
See above: this isn't the observed behavior in limited testing on a recent BIND9 instance, nor did I run into it with my IDN root zone emulator. I admit that in the case I gave there was no dedicated recursive resolver, but dnssec-validation yes should have caused SERVFAIL regardless of whether AA is set. Setting DO=1 didn't return AD=1 (no path to the root), but it didn't return SERVFAIL either.
if _any_ available validating RDNS does not behave that way, we need to set our hair on fire until we get it either fixed or shamed/decertified.
Speaking from my position on the other end of the stack, I have a bunch of stuff here that either falls over with DNSSEC or actively lies (AD=1) when it's impossible for that to be true.
we need details.
I misremembered on the AD=1 point. I can build more comprehensive tests here. As far as I understand the specs, in any case where DNSSEC validation is enabled, I should get SERVFAIL on essentially all of the above, not what I'm seeing. I'll admit these are scattershot data points, but between my own work with the root zone emulator and the data points I collected for this email, there are enough red flags to raise a real concern that recursive resolvers are either soft-failing or showing outright incorrect behavior.
DNSSEC can provide the security we want with regards to data integrity, but I think we're assuming too much of the DNS ecosystem to say that it's effective at preventing root zone tampering without actual data to support that.
I could probably repurpose some of my original root zone emulator work to test this behavior more in depth, but I think that's a separate discussion (esp. with regard to what and how to test).

Michael, please either continue, or share more detail about prior experiments.
What I'd like to do, with help from the caucus, is devise a set of scenarios and the expected DNSSEC behavior in each one, then run a full battery of tests to see if resolvers bend: testing the latest versions of popular DNS resolvers (dnsmasq, BIND, PowerDNS, Unbound, and Microsoft AD are probably good candidates; dnsmasq is extremely common in routers) and seeing whether they actually go SERVFAIL when they're supposed to. This should also extend to distribution-supplied versions of resolvers (i.e. BIND as shipped in Ubuntu 16.04/18.04/latest).

The framework I'd recommend is my fake Internet root zone with its own KSK, so I fully control the environment and can test various scenarios. The root server software can stay consistent, because we're looking at the behavior of the recursive resolvers. Outside of the control tests, I believe all of these cases SHOULD return SERVFAIL. If there are other cases, or cases that should pass or fail differently, please note them.

1. Control: DNSSEC validation pass of an existing domain in the root, from the client to a resolver
2. Control: DNSSEC-validated NSEC+NXDOMAIN of a non-existent domain in the root
3. Hijacked root pointing at an unsigned TLD
4. Hijacked root serving bad data with incorrect RRSIGs
5. Hijacked root not returning RRSIGs
6. TLD hijack serving bad signatures (represents Case 2 above)
7. TLD injection (Case 2 above), with no chain to the root
8. Domain hijack between root and recursive resolver (traffic injection, cache poisoning, etc.)
9. Expired/revoked KSK, or a similar scenario

I'm pretty close to the end of my day, but if this sounds like the correct direction to investigate, I'll flesh this out into an actual document, float it on the caucus, and start building a test suite for it.

Any corrections welcome,
Michael
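To make the expected outcomes machine-checkable from day one, the scenario list could be encoded as a table that a future harness iterates over. A minimal sketch (the table format and descriptions are mine; the numbering and expected rcodes follow the list above):

```shell
#!/bin/sh
# Expected rcode per test scenario. A harness would run each scenario's
# dig query against the resolver under test and compare the observed
# status line against this table; any mismatch on cases 3-9 is a
# soft-fail finding.
print_matrix() {
  while IFS='|' read -r num expect desc; do
    printf 'case %s: expect %s -- %s\n' "$num" "$expect" "$desc"
  done <<'EOF'
1|NOERROR|control: validation pass for an existing domain in the fake root
2|NXDOMAIN|control: signed NSEC denial of a non-existent domain
3|SERVFAIL|hijacked root pointing at an unsigned TLD
4|SERVFAIL|hijacked root serving data with incorrect RRSIGs
5|SERVFAIL|hijacked root not returning RRSIGs
6|SERVFAIL|TLD hijack serving bad signatures
7|SERVFAIL|TLD injection with no chain to the root
8|SERVFAIL|domain hijack between root and recursive resolver
9|SERVFAIL|expired or revoked KSK
EOF
}
print_matrix
```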