Starting discussion on acceptable criteria for proceeding with the root KSK roll
Greetings in the new year. As announced on this list (and in many other places) a few weeks ago, the ICANN org wants to use this list to get input from the community on acceptable criteria for proceeding with the root KSK roll. When we made that announcement, we saw a good number of new subscriptions to the list, but the discussion didn't start on its own, so we want to get that going. For reference, please see <https://www.icann.org/news/blog/update-on-the-root-ksk-rollover-project>. The relevant timing part from that article is:
The ICANN org will monitor this mailing list and beginning on 15 January 2018, we will develop a draft plan for proceeding with the root KSK roll based on the input received and discussion on the mailing list. The plan will be published by 31 January 2018 and undergo a formal ICANN public comment process to gather further input.
We would really like to hear from you about the criteria you think would be relevant for us to observe/measure, if such criteria exist. --Paul Hoffman
I'll go first. We need to take human behaviour into account. I'm not an expert in human behavioural analysis, but I know that people fix things when they are broken, not when they are working, so getting 100% of people's attention to fix something that is not broken is almost impossible. When we talk to ISPs about this issue, the smaller ones just turn DNSSEC validation off because it's easier. It's impossible to have 100% readiness. The majority of DNSSEC validation today is via Google DNS. I think we need to go ahead with the rollover, have the humans fix the problems as they arise, and start rebuilding the trust in DNSSEC globally (before it's too late)! My 2 cents! Jacques
-----Original Message----- From: ksk-rollover [mailto:ksk-rollover-bounces@icann.org] On Behalf Of Paul Hoffman Sent: January 2, 2018 12:07 PM To: ksk-rollover@icann.org Subject: [ksk-rollover] Starting discussion on acceptable criteria for proceeding with the root KSK roll
--Paul Hoffman _______________________________________________ ksk-rollover mailing list ksk-rollover@icann.org https://mm.icann.org/mailman/listinfo/ksk-rollover
Hi, this subject is very hard to discuss. On the one hand, we have a process that we must follow until the end, as we have already engaged in it. On the other hand, ICANN fears isolating many end users because resolvers have not been updated with KSK-2017. As Paul mentioned, getting 100% readiness is hard and even seems impossible to achieve.

What I suggest is that we estimate how many end users would be isolated from the Internet if we went ahead with the rollover now, and how many resolvers still have only KSK-2010 as of today. Those data should be a good reference point. Then ICANN can continue contacting administrators, holding awareness sessions and trainings, and providing technical support, on a permanent and more intensive basis. At the end of each month, ICANN can publish data for the above two KPIs to show how we are improving globally and reducing the severity of the impact.

Meanwhile, ICANN will put in place emergency response pools of technical staff and choose a date for the rollover. Those who are still not ready on the date of the rollover will be isolated from the Internet, and their administrators will have to contact one of the ICANN emergency response pools to get support.

-- Take care of yourselves, for each day is a life. Yazid M. AKANHO Mobile: (+229)97979910 LinkedIn: www.linkedin.com/in/yakanho blog: http://yakanho.beninois.net

On Thursday 4 January 2018 at 17:39:34 UTC+1, Jacques Latour <Jacques.Latour@cira.ca> wrote:

> I'll go first, we need to take into account the human behaviour, and not being an expert human behavioral analyst, I know that people fix things when broken and not when it's working. So getting 100% of people's attention to fix something not broken is almost impossible. When we talk to ISPs about this issue, the smaller ones just turn DNSSEC validation off because it's easier. It's impossible to have 100% readiness. The majority of DNSSEC validation today is via Google DNS.
> I think we need to go ahead with the roll over, have the humans fix the problems as they arise, and start re-building the trust in DNSSEC globally! (before it's too late!) My 2 cents! Jacques
Yazid,

On January 4, 2018 at 1:24:23 PM, Yazid AKANHO via ksk-rollover (ksk-rollover@icann.org) wrote:

> What I suggest is that we must have an estimation of how many end users will be isolated from Internet if we go for the rollover now and how many resolvers still have KSK-2010 only at the current date. Knowing those data should be a good reference point.

Through RFC 8145, we have a whiff of a hint about the second, but it is highly questionable, as it only shows up with folks who keep their name servers up to date, and those are probably not the folks we have to worry about having misconfigurations (the fact that a surprising percentage of these folks _were_ misconfigured is interesting). We don't have good data for the first and I'm not sure how we can get it.

> Then, ICANN can continue contacting the administrators,

This has not been particularly successful, at least with the 500 IP addresses we identified from 8145 reports back in Sept.

> having their awareness sessions/trainings,

We now typically get rejected when we ask to speak at conferences on the KSK rollover.

> provide technical supports, ... in permanent and more intensive basis.

I'm not sure what this means.

> Each end of month, ICANN can publish data for the above two KPI to see how we are globally improving to reduce the severity of impact. Meanwhile, ICANN will put in place various emergency response technical pool of staff and choose a date for the rollover. Those who are still not ready at the date of the rollover will be isolated from Internet and administrators must have to contact one pool of ICANN emergency response staff to get support.

To be honest, I'm not sure how this will help. Realistically (although there are people who will and have argued "people will die!" if applications get answers they don't expect), I suspect the problem isn't recovery from KSK rollover-induced failures: that's easy, you just turn off DNSSEC validation in your resolver, or, if you're clueful, update your trust anchor. The problem is the longer term impact on trust in DNSSEC: not only do authoritative server operators cause outages when they've enabled DNSSEC, but now resolver operators cause outages.

Regards, -drc
Jacques,

Thanks very much for kicking this discussion off.

On January 4, 2018 at 8:39:35 AM, Jacques Latour (jacques.latour@cira.ca) wrote:

> I'll go first, we need to take into account the human behaviour, and not being an expert human behavioral analyst, I know that people fix things when broken and not when it's working. So getting 100% of people's attention to fix something not broken is almost impossible.

Agreed.

> When we talk to ISPs about this issue, the smaller ones just turn DNSSEC validation off because it's easier.

Not just the smaller ones. I've been told one of the largest ISPs in the world turned off DNSSEC until after the KSK rollover.

> It's impossible to have 100% readiness.

Agreed. The question we are trying to answer is: given this, and given that we now have objective data that strongly suggests there are folks who are NOT ready, what are the criteria by which we move ahead?

> The majority of DNSSEC validation today is via Google DNS.

I suspect this is the case, but I think Comcast also provides a good chunk of validation. It'd be interesting to find out exactly what the reality is, and that's part of the problem: the criterion defined by the KSK Rollover Design Team measures something for which we don't (currently) have objective data (namely, 0.5% of end users negatively impacted).

> I think we need to go ahead with the roll over, have the humans fix the problems as they arise, and start re-building the trust in DNSSEC globally! (before it's too late!)

There is a bit to unpack here and I'm left with more questions:

1) "we need to go ahead with the roll over"

Just to level set and argue the extreme: if we had data that suggested that 100% of validating resolvers would fail, would you personally pull the trigger that causes the KSK rollover? The implication of your statement is that regardless of the state of knowledge about whether the rollover would succeed or the damage that it would cause, we should roll the key. This would appear to contradict SAC063 recommendation #3. As you are on SSAC, perhaps you could encourage them to weigh in on this topic?

2) "have the humans fix the problems as they arise"

This (presumably) assumes humans will fix the problems in a positive way. I'll admit I suspect the more likely way of fixing DNSSEC rollover-caused validation failures will be to simply disable DNSSEC validation (after all, the folks who would fix things the right way are unlikely to be bitten by the rollover). Is that an acceptable outcome?

3) "start re-building the trust in DNSSEC globally!"

I am personally unaware of any noticeable change in the trust associated with DNSSEC as a result of the (lack of) KSK rollover. Among security-knowledgeable folks, I do know that trust in DNSSEC has been _increased_ a bit by the move by Verisign from a 1024 bit ZSK to a 2048 bit ZSK, but that's obviously unrelated to the KSK rollover. What data do you have that trust has decreased due to the lack of KSK rollover?

4) "(before it's too late!)"

My impression has been that the percentage of responses being validated has been increasing over time, and not just because more and more folks have been using Google Public DNS; I've heard anecdotally that more folks are turning on validation (perhaps as a side effect of the KSK rollover communications plan). You're suggesting that the lack of KSK rollover will result in a crisis of trust in DNSSEC. How long do you think it will be until this occurs? Has it already happened? Days? Weeks? Months? Years?

Thanks, -drc
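To make the criterion mentioned above concrete (0.5% of end users negatively impacted), here is a minimal sketch of how such a figure could be computed if per-resolver data were available. Everything in it is an assumption for illustration: the field names `user_share`, `validates` and `has_new_key` are hypothetical, the sample numbers are invented, and treating the threshold as "at most 0.5%" is one possible reading. As the thread notes, the real difficulty is that no objective data of this kind currently exists.

```python
# Illustrative only: field names and sample data are hypothetical.
THRESHOLD = 0.005  # 0.5% of end users, the figure cited in the thread


def impacted_fraction(resolvers):
    """Sum the user share behind resolvers that validate but lack the new key.

    Each resolver is a dict with:
      user_share  - fraction of all end users served by this resolver (0..1)
      validates   - True if DNSSEC validation is enabled
      has_new_key - True if KSK-2017 is configured as a trust anchor
    """
    return sum(
        r["user_share"]
        for r in resolvers
        if r["validates"] and not r["has_new_key"]
    )


def meets_criterion(resolvers, threshold=THRESHOLD):
    """Return True if the estimated impact is at or below the threshold."""
    return impacted_fraction(resolvers) <= threshold


# Invented sample population, not real measurements.
sample = [
    {"user_share": 0.400, "validates": True, "has_new_key": True},
    {"user_share": 0.002, "validates": True, "has_new_key": False},
    {"user_share": 0.001, "validates": True, "has_new_key": False},
    {"user_share": 0.597, "validates": False, "has_new_key": False},
]

print(f"estimated impact: {impacted_fraction(sample):.3%}")
print("criterion met:", meets_criterion(sample))
```

Non-validating resolvers are excluded from the sum because a trust anchor change does not affect them. The sketch also ignores users who have a working fallback resolver, which would lower the real impact.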
David,

On 4 Jan 2018, at 23.01, David Conrad <david.conrad@icann.org> wrote:

> 2) "have the humans fix the problems as they arise"
> This (presumably) assumes humans will fix the problems in a positive way. I'll admit I suspect the more likely way of fixing DNSSEC rollover-caused validation failures will be to simply disable DNSSEC validation (after all, the folks who would fix things the right way are unlikely to be bit by the rollover). Is that an acceptable outcome?

I'll add corporate culture to that: when DNS fails, the message from higher up will be "Make it work", not "Make it work The Right Way".

On a side note, my biggest fear has been those people/SMBs that buy a black box for some reason, which also does DNS and has DNSSEC turned on by default, plug it into their network and forget about it, rather than, say, people that know DNS but not really DNSSEC. In all your outreach, have you noticed that to be the case, or is it more, e.g., service providers where someone flipped the switch and promptly forgot about it, or something else?

> 3) "start re-building the trust in DNSSEC globally!"
> I am personally unaware that of any noticeable change in the trust associated with DNSSEC as a result of the (lack of) KSK rollover. Within security knowledgeable folks, I do know that trust in DNSSEC has been _increased_ a bit by the move by Verisign from a 1024 bit ZSK to a 2048 bit ZSK, but that's obviously unrelated to the KSK rollover. What data do you have that trust has decreased due to the lack of KSK rollover?
> 4) "(before it's too late!)"
> My impression has been that the percentage of responses being validated has been increasing over time, and not just because more and more folks have been using Google Public DNS; I've heard anecdotally that more folks on turning on validation (perhaps as a side effect of the KSK rollover communications plan). You're suggesting that the lack of KSK rollover will result in a crisis of trust in DNSSEC. How long do you think it will be until this occurs? It has already happened? Days? Weeks? Months? Years?

I guess these questions boil down to how you define trust: trust in the additional layer of security DNSSEC provides, or trust in DNSSEC as a usable protocol. The resistance to DNSSEC I'm hearing is that it is too cumbersome, too fragile, and requires too many resources to implement compared to the benefit it provides. I have not heard anyone question the security DNSSEC provides, or claim that postponing a rollover is reducing its security. Both postponing the rollover and doing a rollover with significant fallout will add more fuel to the fire for those who feel DNSSEC is not a viable solution from an operational viewpoint.

All that to say, and I'm playing devil's advocate here, at some point we do need to bite the bullet and do the rollover, because to keep postponing it is yet another signal that DNSSEC is not production ready.

Erwin
Could not have said it better. In my opinion, the uncertainty is what is eroding the trust in DNSSEC as a technology. The sooner we get over this hurdle, the better off we are. I'm in the camp that DNSSEC is a viable technology, that we made an error in not rolling the key sooner, and that we need to acknowledge in advance there will be some collateral damage "for people not running up to date stuff", but it's inevitable if we are to bring this technology to maturity. If we don't take this risk, we might as well turn it off, because people won't trust it. Going from pulse to dial tone telephone, going from analog to digital TV, all had their share of collateral damage. Perfection is not an option. I think we're ready, let's do it!

> All that to say, and I'm playing devils advocate here, at some point we do need to bite the bullet and do the rollover, because to keep postponing it is yet another signal that DNSSEC is not production ready. Erwin
I fully, fully agree with Jacques and others. This needs to be over soon.

Cheers! Carlos

On Thu, Jan 4, 2018 at 8:41 PM, Jacques Latour <Jacques.Latour@cira.ca> wrote:

> Could not of said it better. In my opinion, the uncertainty is what is eroding the trust in DNSSEC as a technology. The sooner we get over this hurdle the better we are. I'm on the camp that DNSSEC is a viable technology, and we made an error in not rolling the key sooner and we need to acknowledge in advance there will be some collateral damage 'for people not running up to date stuff' but it's inevitable to bring this technology to maturity. If we don't take this risk, might as well turn it off because people won't trust it. Going from pulse to dial tone telephone, going from analog to digital TV, all had its share of collateral damage. Perfection is not an option, I think we're ready, let's do it!
>> All that to say, and I'm playing devils advocate here, at some point we do need to bite the bullet and do the rollover, because to keep postponing it is yet another signal that DNSSEC is not production ready. Erwin
On 5 Jan 2018, at 10:45 am, Carlos Marcelo Martinez Cagnazzo <carlosm3011@gmail.com> wrote:

> I fully, fully agree with Jacques and others. This needs to be over soon.

When we roll the clock back to September 2017, the cited reason for the deferral of the root roll was the existence of data from resolvers that supported RFC 8145 signalling, showing that a pool of these resolvers had not loaded KSK-2017 into their local trusted key store.

Carlos, (I'm asking because you posted a "me too") what is the data set you are using to justify this call to be "over soon"? It seems to me that in the absence of new data, the only changed factor is your own appetite for risk. Without additional data, your tolerance for risk appears to increase over time (*). But is this altered personal perception of the risk sufficient motivation to proceed?

Objectively, if the numbers in September 2017 gave sufficient grounds to pause, and the numbers haven't changed (**), then surely the grounds for pausing the operation are as strong now as they were in September (***).

So if you want to show that at its heart the Internet is still the wild west, and that we work on gut instincts and disregard data that appears to contradict such instincts, then by all means please roll the key. Personally, I think such calls to roll the key are more emotive than reasoned, and we need to fill in the gaps with real data rather than rhetoric.

Geoff

* Humans are really, really bad at risk assessment: they amplify inferred risk from recent events and discount the risk associated with old events. (The fact that many millions of folk live in areas of highly active geology just because the last major catastrophic event happened decades ago is a really good illustration of this human risk perception problem.)

** I have not seen a recent update to the RFC 8145 numbers, but I assume that not much has changed.

*** The problem was not that the RFC 8145 numbers gave strong evidence that the root key was not being loaded. The issue was that the RFC 8145 signal is as noisy as you can get. Because it's noisy, it's a challenge to either prove or disprove any theory about the true KSK-2017 state and the likely impact on users. And the task, to allow the process to proceed, is to provide a new data set that can clear up the interpretation problems with the RFC 8145 data.
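For readers unfamiliar with the RFC 8145 signal discussed in this thread: a resolver that supports it periodically sends a query whose name encodes the key tags of its configured trust anchors, as `_ta-` followed by four-digit hexadecimal key tags separated by hyphens. The key tag of KSK-2010 is 19036 (`4a5c`) and that of KSK-2017 is 20326 (`4f66`). The sketch below shows one way such query names could be tallied. The function names and the `(source_ip, qname)` input shape are assumptions made for illustration, not the method ICANN used, and counting by source address inherits all the noise described above (forwarders, NAT, test installations).

```python
import re
from collections import Counter

# Well-known key tags of the two root KSKs.
KSK_2010 = 19036  # 0x4a5c
KSK_2017 = 20326  # 0x4f66

# First label of an RFC 8145 key tag query: "_ta-" plus hyphen-separated hex tags.
_TA_LABEL = re.compile(r"^_ta-([0-9a-f]{4}(?:-[0-9a-f]{4})*)$", re.IGNORECASE)


def parse_key_tags(qname):
    """Return the set of key tags signalled in a query name, or None if it is not a signal."""
    first_label = qname.rstrip(".").split(".")[0]
    match = _TA_LABEL.match(first_label)
    if match is None:
        return None
    return {int(tag, 16) for tag in match.group(1).split("-")}


def classify(tags):
    """Map a set of key tags to a coarse trust anchor state."""
    if KSK_2017 in tags:
        return "has KSK-2017"
    if KSK_2010 in tags:
        return "KSK-2010 only"
    return "other"


def summarize(observations):
    """Count sources by their most recently signalled state.

    observations: iterable of (source_ip, qname) pairs in time order (hypothetical input shape).
    """
    latest = {}
    for source, qname in observations:
        tags = parse_key_tags(qname)
        if tags is not None:
            latest[source] = classify(tags)  # later signals overwrite earlier ones
    return Counter(latest.values())
```

Keeping only the latest signal per source is a deliberate simplification: a resolver that picks up the new key via RFC 5011 partway through the observation window is then counted as ready, not as both states.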
Geoff,

The problem, and Jacques said it much better than me, is the uncertainty. "Over soon" could be "we will not roll until 2020", or "we will roll once the numbers for KSK-2010 fall below X%". This uncertainty every quarter will, IMO, seriously erode trust in DNSSEC.

Again, just echoing something already mentioned in this thread, I also believe some amount of collateral damage seems unavoidable if we're ever to roll the key. The question is how much.

And no, I'm not a big risk taker, actually shame on me for that.

/Carlos

On Thu, Jan 4, 2018 at 9:19 PM, Geoff Huston <gih@apnic.net> wrote:
Carlos,

On January 4, 2018 at 4:31:18 PM, Carlos Marcelo Martinez Cagnazzo (carlosm3011@gmail.com) wrote:

> The problem is, and Jacques said it much better than me, it's the uncertainty. "Over soon" could be "we will not roll until 2020", or "we will roll once the numbers for ksk2010 fall below X%"

Sorry, I'm a bit confused: are you saying the uncertainty is due to the lack of definitive criteria to move forward, or that people are not certain that we will move forward?

> This uncertainty every quarter will, IMO, seriously erode trust in DNSSEC.

TBH, I'm struggling a bit with this assertion. Whose trust is being eroded?

> Again, just echoing here something already mentioned in this thread, I also believe some amount of collateral damage seems unavoidable if we're to ever roll the key. The question is how much.

And this is precisely the question that we're trying to get answered. We absolutely know we're going to break some resolvers' ability to validate; the 8145 reports provide concrete proof. We have no idea what the collateral damage will be; we don't even know how many people behind those resolvers will be impacted.

You and others appear to be saying that we should ignore that question (which calls into question the point of SAC-063 recommendation 3). I'm unsure this would be viewed by the wider community as prudent, and I suspect it has the potential to erode the trust in DNSSEC more seriously than postponing the key roll has.

> And no, I'm not a big risk taker, actually shame on me for that.

Err, but you're saying roll sooner even without data; doesn't that imply taking an unmeasured risk?

Regards, -drc
Hi David,

On Thu, Jan 4, 2018 at 10:32 PM, David Conrad <david.conrad@icann.org> wrote:

> Carlos,
> On January 4, 2018 at 4:31:18 PM, Carlos Marcelo Martinez Cagnazzo (carlosm3011@gmail.com) wrote:
>> The problem is, and Jacques said it much better than me, it's the uncertainty. "Over soon" could be "we will not roll until 2020", or "we will roll once the numbers for ksk2010 fall below X%"
> Sorry, I'm a bit confused: are you saying the uncertainty is due to the lack of definitive criteria to move forward or that people are not certain that we will move forward?

IMO, it has more to do with your second point: not knowing what next quarter's decision will be in terms of the roll. I fear that this sends a confusing signal to the community. Outside quite a closed circle, people will not know or will not understand what is going on, which feeds the skeptical camp and fails to send a strong enough signal for others to update their configurations. In an ideal world people would take advantage of these hiatuses and update their systems but, well, in reality they won't.

>> This uncertainty every quarter will, IMO, seriously erode trust in DNSSEC.
> TBH, I'm struggling a bit with this assertion. Whose trust is being eroded?

What will happen after this first roll? Will there be another? If so, when? What would happen if an emergency roll were needed? Are we prepared enough to do one? These are the kinds of questions that pop into people's minds. I know, and most people know, that you have them all well covered. But community trust is a subtler value, more affected by feeling than by reason.

>> Again, just echoing here something already mentioned in this thread, I also believe some amount of collateral damage seems unavoidable if we're to ever roll the key. The question is how much.
> And this is precisely the question that we're trying to get answered. We absolutely know we're going to break some resolver's ability to validate; the 8145 reports provide concrete proof. We have no idea what the collateral damage will be; we don't even know how many people behind those resolvers will be impacted.
> You and others appear to be saying that we should ignore that question (which calls into question the point of SAC-063 recommendation 3). I'm unsure this would be viewed by the wider community as prudent and I suspect it has the potential to more seriously erode the trust in DNSSEC than postponing the key roll has.

I won't speak for others on this particular point. That was not what I was saying. My central issue is with uncertainty, not so much with rolling the key today or next quarter. I know you had to make a decision based on limited data and in a very short time frame. I trust your instinct, and not rolling last October was probably the wisest call to make. My point is that I don't feel that scanning ICANN's mailing lists or press releases each quarter to see what will be happening is the best way forward. Quantitative criteria would definitely help, including perhaps publishing up-to-date data and indicators.

>> And no, I'm not a big risk taker, actually shame on me for that.
> Err, but you're saying roll sooner even without data; doesn't that imply taking an unmeasured risk?

Well, again, I wasn't saying that. I hope I have clarified it enough. However, this last comment brings something else to my mind. We as a community seemed to be perfectly fine with rolling the key with no data whatsoever until around July-August 2017 or so. Did we feel at the time that we were taking an unmeasured risk?

> Regards,
> -drc

Best regards,

/Carlos

-- ========================= Carlos M. Martinez-Cagnazzo http://cagnazzo.me =========================
On 01/04/2018 04:19 PM, Geoff Huston wrote:

> When we roll the clock back to September 2017, the cited reason for the deferral of the root roll was the existence of data from resolvers that supported RFC 8145 signalling that a pool of these resolvers had not loaded KSK-2017 into their local trusted key store.
> Carlos, (I'm asking because you posted a "me too") what is the data set you are using to justify this call to be "over soon"? It seems to me that in the absence of new data, the only changed factor is your own appetite for risk. Without additional data, your tolerance for risk appears to increase over time (*). But is this altered personal perception of the risk sufficient motivation to proceed? Objectively, if the numbers in September 2017 gave sufficient grounds to pause, and the numbers haven't changed (**) then surely the grounds for pausing the operation are as strong now as they were in September (***).

The grounds for pausing have the same strength that they did in September, yes. Which is to say, very limited compared to the overall risk of not doing the roll.

Since a little before September, when the 8145 data started rolling in, all I've heard discussed is the risk to the deployed base if we do the roll and their stuff breaks. But there is another, arguably greater risk that is not being discussed: what happens if we get ourselves into a position where we are forced to do an emergency roll? (The common scenarios for that are key compromise, which is very unlikely but not impossible, and algorithm failure.)

We aren't planning to do the roll for the fun of it. We are planning to do the roll because at SOME point in the future, it will be necessary to do one. Every day that we don't roll the key adds to that risk, since we already know, for sure, that rolling the key NOW will break stuff.

There are only two conditions that can be true at this point:

1. DNSSEC, while deployed to a non-trivial degree, has little actual utility at this time. That is, nothing mission-critical depends on it now, or in the near future.

OR

2. DNSSEC is an essential service that many organizations depend on.

If #1 is true, we should do the roll ASAP because any fallout from breakage will be minimal, and will hopefully have a net positive benefit when people update their broken stuff.

If #2 is true, we should do the roll ASAP because we need to demonstrate that we can, and so that any breakage can be dealt with in a somewhat controlled environment with lots of eyeballs and resources dedicated to it. This ultimately will make the system more robust, because people will fix their broken stuff and develop confidence in the system regarding any possible rolls in the future.

Either way, Jacques is right, we need to make like Nike and "Just Do It."

Doug
Doug,

On January 4, 2018 at 11:50:02 PM, Doug Barton (dougb@dougbarton.email) wrote:

> Since a little before September when the 8145 data started rolling in all I've heard discussed is the risk to the deployed base if we do the roll and their stuff breaks. But there is another, arguably greater risk that is not being discussed, what happens if we get ourselves into a position where we are forced to do an emergency roll? (The common scenarios for that are key compromise, which is very unlikely but not impossible, and alg failure.)

If the key gets lost or compromised, my understanding is that we cannot use RFC 5011 to do the roll and must fall back to doing an out-of-band key rollover. We aren't really exercising this under this iteration of the community-defined KSK rollover plan.

> There are only two conditions that can be true at this point: [...] If #1 is true we should do the roll ASAP [...] If #2 is true we should do the roll ASAP [...]

As I've noted previously, this would appear to argue that SAC-063 rec#3 should not have been made and that the amount of "breakage" is irrelevant. It would be nice if SSAC were to weigh in on this.

Regards, -drc
David, thanks for the reply. Some responses below. On 01/05/2018 10:42 AM, David Conrad wrote:
Doug,
On January 4, 2018 at 11:50:02 PM, Doug Barton (dougb@dougbarton.email) wrote:
Since a little before September when the 8145 data started rolling in all I've heard discussed is the risk to the deployed base if we do the roll and their stuff breaks. But there is another, arguably greater risk that is not being discussed, what happens if we get ourselves into a position where we are forced to do an emergency roll? (The common scenarios for that are key compromise, which is very unlikely but not impossible, and alg failure.)
If the key gets lost or compromised, my understanding is that we cannot use RFC 5011 to do the roll and must fall back to doing an out-of-band key rollover. We aren’t really exercising this under this iteration of the community defined KSK rollover plan.
Correct, but we're well past the point where we can run an emergency roll as an exercise. When I was leading IANA and we were in the early stages of planning for signing the root, several of us (and I include myself in that group) wanted to hold the first 90 days or so after the signing went public as a sort of "production beta" period, and perform an emergency roll during that period for precisely this reason. But that ship sailed long ago, so at this point we are limited in what we can do. While the current exercise won't directly address any concerns related to an /emergency/ roll, ANY roll at this point is better than no roll, because the risk of not knowing what will happen during/after a roll is great now, and increases daily as DNSSEC grows in importance.
There are only two conditions that can be true at this point: […] If #1 is true we should do the roll ASAP […] If #2 is true we should do the roll ASAP […]
As I’ve noted previously, this would appear to argue that SAC-063 rec#3 should not have been made and that the amount of “breakage” is irrelevant. It would be nice if SSAC were to weigh in on this.
I was a founding member of the SSAC before I joined the ICANN staff. While it's been some time since I participated with them, I feel I understand their remit pretty well. The recommendation you refer to reads (sorry for any copy/paste issues):
It is expected that there will be some issues during at least the first KSK rollover, and probably the next few. It will not be possible to anticipate all the problems that may occur but an agreed understanding of when the rollover has affected operational stability beyond a reasonable boundary is essential so the decision to rollback the rollover can be made quickly and efficiently.
To me, that recommendation seems to provide a solid balance between acknowledging that there will be problems, and also taking the importance of stability into account by asking for both a rollback plan and criteria for the rollback decision. That all seems perfectly reasonable and appropriate.
The SSAC is asked to provide advice on /both/ Security and Stability. In this case, adding security (by showing that a 5011 roll can be performed with a minimum of disruption) requires a small but necessary sacrifice in stability. That makes this issue no different than other, similar issues, like IPv6, IDNs, the new gTLD program, or even the introduction of DNSSEC itself.
At this point we have gathered as much data as we can realistically obtain (as you, Matt, and others have pointed out). That the data is both incomplete and imperfect is unfortunate, but inevitable given the state of the technology. If the reason for failing to proceed with the plan at this point is the data, I fear that we will never proceed, as there is no reasonable expectation that the data will improve in quality, no matter how long we wait.
It's also worth pointing out in regards to your concern about the SSAC recommendation that it does not call for perfect data. In fact, it specifically acknowledges that, "It will not be possible to anticipate all the problems that may occur." That point seems to bear repeating since I have seen several references to SAC-063 as a reason not to proceed, while my reading of that document does not support that conclusion in any way. Like you, I think it would be useful if the members of the AC that worked on that document would weigh in.
At the end of the day, I sympathize with your position, David. (Perhaps more so than most others would be able to.) But waiting makes the problem worse, not better. At this point it's entirely reasonable to conclude that ICANN has gone above and beyond in their efforts to make the transition process as smooth as possible, and now it's time to move forward.
Warmest regards, Doug
On Fri, Jan 5, 2018 at 13:42 David Conrad <david.conrad@icann.org> wrote:
Doug,
On January 4, 2018 at 11:50:02 PM, Doug Barton (dougb@dougbarton.email) wrote:
Since a little before September when the 8145 data started rolling in all I've heard discussed is the risk to the deployed base if we do the roll and their stuff breaks. But there is another, arguably greater risk that is not being discussed, what happens if we get ourselves into a position where we are forced to do an emergency roll? (The common scenarios for that are key compromise, which is very unlikely but not impossible, and alg failure.)
If the key gets lost or compromised, my understanding is that we cannot use RFC 5011 to do the roll and must fall back to doing an out-of-band key rollover. We aren’t really exercising this under this iteration of the community defined KSK rollover plan.
Um. No. In fact, at this point you’re closer to being able to use 5011 as I designed it than ever before. I.e., you have two trust anchors. If you want to be able to support key compromise and emergency replacement, the next step is to add anchor C. The step after that is to revoke the current (old/original) trust anchor A. Keep C’s private key offline and in threshold pieces. Sign the DNSKEY RRSet with B.
Later, Mike
There are only two conditions that can be true at this point:
[…] If #1 is true we should do the roll ASAP […] If #2 is true we should do the roll ASAP […]
As I’ve noted previously, this would appear to argue that SAC-063 rec#3 should not have been made and that the amount of “breakage” is irrelevant. It would be nice if SSAC were to weigh in on this.
Regards,
-drc
_______________________________________________ ksk-rollover mailing list ksk-rollover@icann.org https://mm.icann.org/mailman/listinfo/ksk-rollover
Mike,
On January 7, 2018 at 12:53:15 PM, Michael StJohns (msj@nthpermutation.com) wrote:
If the key gets lost or compromised, my understanding is that we cannot use RFC 5011 to do the roll and must fall back to doing an out-of-band key rollover. We aren’t really exercising this under this iteration of the community defined KSK rollover plan.
Um. No.
As currently operationally practiced, I believe my statement is correct.
In fact, at this point you’re closer to being able to use 5011 as I designed it than ever before. I.e., you have two trust anchors.
And to state the obvious, the reason we’ve postponed the KSK rollover is indications that some resolvers are only configured for one trust anchor.
If you want to be able to support key compromise and emergency replacement, the next step is to add anchor C. The step after that is to revoke the current (old/original) trust anchor A. Keep C’s private key offline and in threshold pieces. Sign the DNSKEY RRSet with B.
This may be an opportunity to revise operational practice. Providing this as input may be worthwhile.
Regards, -drc
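To make the sequence under discussion concrete (add a standby anchor C, revoke A, keep signing with B), here is a minimal Python sketch of how an RFC 5011-style resolver might track its trust anchor set. It is a simplification and not any resolver's actual implementation: the `Resolver` class and its `observe` method are hypothetical names, only the 30-day add hold-down is modelled, and TTLs, the remove hold-down, and real signature checks are left out.

```python
from dataclasses import dataclass, field

ADD_HOLD_DOWN_DAYS = 30  # RFC 5011 add hold-down time (simplified to whole days)


@dataclass
class Resolver:
    """Hypothetical, simplified model of RFC 5011 trust anchor tracking."""
    trusted: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)  # key name -> day first seen

    def observe(self, day, published, signed_by, revoked=()):
        """Process one DNSKEY RRSet observation; return True if it validated."""
        published, signed_by, revoked = set(published), set(signed_by), set(revoked)
        # The RRSet only counts if a currently trusted key signed it.
        if not (signed_by & self.trusted):
            return False
        # A revocation is honoured only when the revoked key signed the RRSet itself.
        for key in revoked & signed_by:
            self.trusted.discard(key)
            self.pending.pop(key, None)
        # A pending key that disappears from the RRSet loses its hold-down timer.
        for key in list(self.pending):
            if key not in published:
                del self.pending[key]
        # Start the timer for new keys; promote keys whose hold-down has elapsed.
        for key in published - self.trusted - revoked:
            first_seen = self.pending.setdefault(key, day)
            if day - first_seen >= ADD_HOLD_DOWN_DAYS:
                self.trusted.add(key)
                del self.pending[key]
        return True


r = Resolver(trusted={"A"})
r.observe(0, {"A", "B"}, signed_by={"A"})        # B enters hold-down
r.observe(30, {"A", "B"}, signed_by={"A"})       # B becomes trusted
r.observe(40, {"A", "B", "C"}, signed_by={"B"})  # standby key C announced
r.observe(70, {"A", "B", "C"}, signed_by={"B"})  # C becomes trusted
r.observe(80, {"A", "B", "C"}, signed_by={"A", "B"}, revoked={"A"})  # A revoked
# Emergency: B is compromised, so it is revoked and C takes over signing.
ok = r.observe(90, {"B", "C"}, signed_by={"B", "C"}, revoked={"B"})
print(ok, sorted(r.trusted))  # prints: True ['C']
```

In this toy model a resolver that followed the updates survives the emergency revocation of B, while one still holding only A fails to validate the day-90 RRSet. That second population is the one the 8145 data is trying to measure.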
On 2018-01-09 at 01:33, David Conrad wrote:
Mike,
On January 7, 2018 at 12:53:15 PM, Michael StJohns (msj@nthpermutation.com) wrote:
If the key gets lost or compromised, my understanding is that we cannot use RFC 5011 to do the roll and must fall back to doing an out-of-band key rollover. We aren’t really exercising this under this iteration of the community defined KSK rollover plan.
Um. No.
As currently operationally practiced, I believe my statement is correct.
Your statement is correct. Adding an emergency rollover key (as described by Mike) has been considered several times over the years, but has been rejected every time due to how the primary key is protected and maintained. No failure scenario has been identified where it wouldn't be possible to recover from a failure and still maintain public transparency. An emergency rollover key does not help us in the current design nor does it make the current key rollover easier.
jakob
On 01/08/2018 11:35 PM, Jakob Schlyter wrote:
Adding an emergency rollover key (as described by Mike) has been considered several times over the years, but has been rejected every time due to how the primary key is protected and maintained. No failure scenario has been identified where it wouldn't be possible to recover from a failure and still maintain public transparency.
Does that include a scenario where the algorithm used by the current key is unexpectedly broken? (Commonly referred to as alg failure) Doug
On 1/9/2018 7:21 PM, Doug Barton wrote:
On 01/08/2018 11:35 PM, Jakob Schlyter wrote:
Adding an emergency rollover key (as described by Mike) has been considered several times over the years, but has been rejected every time due to how the primary key is protected and maintained. No failure scenario has been identified where it wouldn't be possible to recover from a failure and still maintain public transparency.
Does that include a scenario where the algorithm used by the current key is unexpectedly broken? (Commonly referred to as alg failure)
Doug
I'd say maybe, based on the fact that alg failures tend to occur over time rather than at a discrete point in time. The current example is NSA/NIST's guidance to transition from Suite B to CNSA (Suite B at the 192 bit strength level basically) pending guidance to transition to a Quantum Resistant National Cryptographic Algorithm Suite at some later time. Given that guidance we should already be phasing out RSA2048/SHA256... but AFAICT we aren't there yet. There's also the point that algorithm swaps tend to be messy on the client side (because they generally need additional crypto implementations) - if you thought the arguments about which clients supported 5011 were painful, wait until we get to the "which clients/resolvers are 192bit ready?" or "which clients/resolvers are EC ready?" discussions.
All that being said, we can add a set of stand by keys (up to 4 based on the 5011 requirements) to cover various algorithms.
To be even more blunt than I normally am, DNSSEC (and DNS) does not lend itself to good knowledge about what the consumers of DNS data are doing from the view point of the providers of that data. All the root folks can do is a) make a schedule, b) warn everyone a lot, c) keep to the schedule, d) repeat a-c annually or on some regular period so we don't forget how to do it.
We (this list collectively mostly) believe DNSSEC is a benefit to the end-user. The root should provide its service in a professional and regular manner. If the resolvers break, they break, and the people running the resolvers can decide if DNSSEC is of benefit to their users or they want to turn it off because it's too much pain for too little gain.
If the Root has followed its schedule and publicity program, and has waited now some odd months for some of the resolvers to fix themselves, then while I can see complaints from the resolver folk, I can see no liability. Having the root rollover held hostage to a group of resolvers that don't see enough benefit to DNSSEC to pay attention to DNSSEC operations seems to me to be a losing strategy.
Later, Mike
On 1/9/2018 2:35 AM, Jakob Schlyter wrote:
On 2018-01-09 at 01:33, David Conrad wrote:
Mike,
On January 7, 2018 at 12:53:15 PM, Michael StJohns (msj@nthpermutation.com) wrote:
If the key gets lost or compromised, my understanding is that we cannot use RFC 5011 to do the roll and must fall back to doing an out-of-band key rollover. We aren’t really exercising this under this iteration of the community defined KSK rollover plan.
Um. No.
As currently operationally practiced, I believe my statement is correct.
That's a different statement than "we cannot use RFC5011..."
Your statement is correct.
Adding an emergency rollover key (as described by Mike) has been considered several times over the years, but has been rejected every time due to how the primary key is protected and maintained. No failure scenario has been identified where it wouldn't be possible to recover from a failure and still maintain public transparency. An emergency rollover key does not help us in the current design nor does it make the current key rollover easier.
Forgive me for saying this, but the above is either a failure of will or a failure of imagination. Try this on for size:
1) Have cheap HSMs (e.g. smart cards) that can generate EC or RSA key pairs. One for each shard holder.
2) Write an applet that can decrypt something encrypted under a public key, can generate a public key, and can certify that the public key is associated with the specific smart card and that there's a PIN known only to the card holder.
3) Send all of the smart card public keys to the central location.
4) Send all of the smart cards to a central location wrapped in tamper bags.
5) Generate an AES key (for key wrapping).
6) Generate a new KSK key pair and export the public key, and export the private key under the key wrapping key. Delete it from the HSM at this point.
7) Split (5) into N of K shares using Shamir. Encrypt each share using the public key from a different smart card. Print out the encrypted shares and distribute them widely. Delete the AES key from the HSM. Make N greater than the normal KSK signing threshold, ideally .5 * K + 1.
8) Pseudo-randomly send the smart cards to each of the share holders, mainly ensuring that none of them get their own cards. Each entity is responsible for protecting the card using two-man controls - e.g. dual signatures on a safe deposit box, dual locked safe...
9) Verify that all the cards are in place and that the verification seals are intact.
10) Add the new KSK to the DNSKEY RRSet.
The key idea here is that you don't have to protect each shard to the level you have to protect the combined key, and you use the difficulty of stealing and using the key cards (note that the people in (8) don't have the PINs for the cards they hold) to decrypt the shares as a security enhancement. You have the card holders check the tamper indications once a month and sign off or some other form. An attacker has to get surreptitious access to N cards to be able to re-form the wrapping key and decrypt the KSK. And the price for breaking into most of the smart cards is pretty hefty.
The above is off the top of my head, but something like the above is a good start.
Another possibility for RSA based signatures is https://link.springer.com/chapter/10.1007/3-540-57220-1_47 - I built a DNSKEY signer using this scheme - the private key is never combined, and the signing shards may be publicly combined to get a valid signature. The nice thing about this approach is that once the key is generated, there's no need for anyone to show up at a central location for signing ceremonies.
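Step 7 above relies on Shamir secret sharing. A minimal Python sketch of the split and recombine steps follows, assuming a 256-bit AES wrapping key and illustrative values N = 4 (threshold) and K = 7 (total shares). The function names and the choice of the prime 2**521 - 1 are assumptions made for this example; a real ceremony would do this inside an HSM rather than with Python integers, and would then encrypt each share to a smart card key as described.

```python
import secrets

# Field modulus: a Mersenne prime comfortably larger than a 256-bit key
# (an illustrative choice, not something specified in the scheme above).
PRIME = 2**521 - 1


def split(secret, n, k):
    """Split `secret` into k shares, any n of which recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= n <= k:
        raise ValueError("need 1 <= n <= k")
    # Random polynomial of degree n-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(n - 1)]

    def evaluate(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, k + 1)]


def combine(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


wrapping_key = secrets.randbits(256)       # stands in for the AES key from step 5
shares = split(wrapping_key, n=4, k=7)     # N = 4 of K = 7
recovered = combine([shares[0], shares[2], shares[4], shares[6]])
```

Any four of the seven shares reproduce the wrapping key, while three or fewer reveal nothing about it. That property is what lets each shard be protected less heavily than the combined key.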
No failure scenario has been identified where it wouldn't be possible to recover from a failure and still maintain public transparency.
I can't parse this. Removing the double negative it reads - I think - "For all failure scenarios that have been identified it is possible to recover from a failure and still maintain public transparency"? If you mean that all failure scenarios you've considered involve a manual restart to the process with new roots being manually placed in 100s of 1000s of devices, you're probably right - that's pretty transparent. On the other hand, revoking an existing key without warning based on a compromise and moving over to another KSK signing key that's already in the trust anchor set is also pretty transparent - 100s of 1000s of devices will see the revocation.
Later, Mike
jakob
On Fri, Jan 05, 2018 at 11:19:17AM +1100, Geoff Huston <gih@apnic.net> wrote a message of 44 lines which said:
So if you want to show that at its heart the Internet is still the wild west and we work on gut instincts and disregard data that appears to contradict such instincts, then by all means please roll the key.
Yeah, I like that! Let's press the Button! But, seriously, your framing of the issues is not complete. We should not look just at the risk of rolling the key, but also at the risk of not rolling it: demonstration of failure (we cannot follow "our" own DPS), erosion of trust, risk if we are forced one day to roll the key. The divide is not between wise and experienced people on one side, and irresponsible trigger-happy mavericks on the other. There are risks on both sides.
On Thu, Jan 04, 2018 at 08:45:31PM -0300, Carlos Marcelo Martinez Cagnazzo <carlosm3011@gmail.com> wrote a message of 1108 lines which said:
I fully, fully agree with Jacques and others.
I also agree, by the way. On the other hand, I just realised that the people who decided to subscribe to this list are probably interested in DNSSEC and want it to move forward, and are not the people who will have problems. So, we probably have some bias :-) This being said,
This needs to be over soon.
I agree.
Erwin,
On January 4, 2018 at 2:33:17 PM, Erwin Lansing (erwin@dk-hostmaster.dk) wrote:
I’ll add to that corporate culture: when DNS fails, the message from higher up will be “Make it work”, not “Make it work The Right Way”.
Agreed.
On a side note, my biggest fear has been those people/SMBs that buy a black box for some reason, which also does DNS, and has DNSSEC turned on by default, plug it into their network and forget about it, rather than, say, people that know DNS, but not really about DNSSEC. In all your outreach, have you noticed that to be the case, or is it more e.g. service providers, where someone flipped the switch and promptly forgot about it, or something else?
I share this concern, but TBH, from my experience in the outreach I was involved with personally, the response was bimodal, either:
A) boredom, having to listen to yet another talk on stuff they’d already dealt with (e.g., NANOGs, RIPE meetings, etc)
- or -
B) incomprehension, not even knowing what the letters DNS stand for. (e.g., CIO/CTO forums, non-technical venues)
The reality is that finding the right people to speak to to ensure resolvers are properly configured for the KSK rollover turns out to be quite hard.
I have not heard anyone question the security DNSSEC provides or that postponing a rollover is reducing its security.
I’ve heard the former. I’ve yet to hear the latter except from a handful of people directly involved in DNSSEC deployment efforts.
Both postponing the rollover and doing a rollover with significant fallout will add more fuel to the fire for those who feel DNSSEC is not a viable solution from an operational viewpoint. All that to say, and I’m playing devil's advocate here, at some point we do need to bite the bullet and do the rollover, because to keep postponing it is yet another signal that DNSSEC is not production ready.
To be very clear, we don’t want to continue postponing. What we’re looking for is for the community to tell us in the ICANN Org how to move forward. We were surprised by the 8145 data (i.e., that we were actually getting data and that the number of misconfigurations we were seeing was as high as it was). We’ve done a bit of analysis and from what little we’ve been able to ascertain, there doesn’t appear to be anything fundamentally broken with the architecture or implementations; rather, misconfiguration happens. This isn’t surprising. However, now that we know concretely there will be brokenness, how much is the community willing to tolerate (and what metrics can we use to ensure we’re below that threshold)?
Regards, -drc
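For readers unfamiliar with the 8145 data: RFC 8145 lets a validating resolver signal the key tags of its configured trust anchors, one form being a query whose name starts with `_ta-` followed by hyphen-separated four-digit hex key tags. The sketch below shows how such query names might be sorted into "ready" and "not ready" buckets. The function names and category labels are hypothetical; the key tags 19036 (KSK-2010) and 20326 (KSK-2017) are the published root KSK tags.

```python
import re

KSK_2010_TAG = 19036  # 0x4a5c
KSK_2017_TAG = 20326  # 0x4f66

# "_ta-" followed by one or more 4-hex-digit key tags joined by hyphens.
_TA_LABEL = re.compile(r"_ta-[0-9a-f]{4}(?:-[0-9a-f]{4})*")


def parse_key_tag_qname(qname):
    """Return the set of key tags signalled in qname, or None if it is not a key tag query."""
    label = qname.rstrip(".").split(".")[0].lower()
    if not _TA_LABEL.fullmatch(label):
        return None
    return {int(part, 16) for part in label[len("_ta-"):].split("-")}


def classify(tags):
    """Hypothetical readiness label for one signalled trust anchor set."""
    if KSK_2017_TAG in tags:
        return "ready"
    if KSK_2010_TAG in tags:
        return "ksk-2010-only"
    return "unknown"


for name in ("_ta-4a5c-4f66.", "_ta-4a5c.", "www.example.com."):
    tags = parse_key_tag_qname(name)
    print(name, "->", classify(tags) if tags is not None else "not a key tag query")
```

A count of "ksk-2010-only" signals per source address over time is roughly the misconfiguration figure discussed in this thread, with the caveat that only resolvers that implement RFC 8145 show up at all.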
David, On 5 Jan 2018, at 02.29, David Conrad <david.conrad@icann.org<mailto:david.conrad@icann.org>> wrote: I share this concern, but TBH, from my experience in the outreach I was involved with personally, the response was bimodal, either: A) boredom, having to listen to yet another talk on stuff they’d already dealt with (e.g., NANOGs, RIPE meetings, etc) - or - B) incomprehension, not even knowing what the letters DNS stand for. (e.g., CIO/CTO forums, non-technical venues) The reality is that finding the right people to speak to to ensure resolvers are properly configured for the KSK rollover turns out to be quite hard. I, and most people on this list, are definitely in group A. Those talks are good breaks to check email during conferences :-) But seriously, that goes to the heart of the problem. The people trying to fix the issue (A) are not the people actually using the service (B). That’s both a problem to reach those people that may need to act in some way, but also might lead to misunderstandings about how the world looks from the viewpoint of the other group. To be very clear, we don’t want to continue postponing. What we’re looking for is for the community to tell us in the ICANN Org how to move forward. We were surprised with the 8145 data (i.e., that we were actually getting data and the number of misconfigurations we were seeing were as high as they were). We’ve done a bit of analysis and from what little we’ve been able to ascertain, there doesn’t appear to be anything fundamentally broken with the architecture or implementations, rather misconfiguration happens. This isn’t surprising. However, now that we know concretely there will be brokenness, how much is the community willing to tolerate (and what metrics can we use to ensure we’re below that threshold). So we don’t want to not do the rollover, we know our data is incomplete, and we know there will be an unknown amount of fallout. 
From the data that we do have through 8145, is there any indication that the amount of known brokenness is decreasing? Could that be used as an indicator that, despite all the tremendous effort from ICANN and others over the last months, we have no way to decrease the known fallout further, thereby assuming there’s nothing more we can do to prevent the unknown fallout either? Erwin
On 5.1.2018 09:30, Erwin Lansing via ksk-rollover wrote:
David,
On 5 Jan 2018, at 02.29, David Conrad <david.conrad@icann.org> wrote:
I share this concern, but TBH, from my experience in the outreach I was involved with personally, the response was bimodal, either:
A) boredom, having to listen to yet another talk on stuff they’d already dealt with (e.g., NANOGs, RIPE meetings, etc)
- or -
B) incomprehension, not even knowing what the letters DNS stand for. (e.g., CIO/CTO forums, non-technical venues)
The reality is that finding the right people to speak to to ensure resolvers are properly configured for the KSK rollover turns out to be quite hard.
I, and most people on this list, are definitely in group A. Those talks are good breaks to check email during conferences :-)
But seriously, that goes to the heart of the problem. The people trying to fix the issue (A) are not the people actually using the service (B). That’s both a problem to reach those people that may need to act in some way, but also might lead to misunderstandings about how the world looks from the viewpoint of the other group.
To be very clear, we don’t want to continue postponing. What we’re looking for is for the community to tell us in the ICANN Org how to move forward. We were surprised with the 8145 data (i.e., that we were actually getting data and the number of misconfigurations we were seeing were as high as they were). We’ve done a bit of analysis and from what little we’ve been able to ascertain, there doesn’t appear to be anything fundamentally broken with the architecture or implementations, rather misconfiguration happens. This isn’t surprising. However, now that we know concretely there will be brokenness, how much is the community willing to tolerate (and what metrics can we use to ensure we’re below that threshold).
So we don’t want to not do the rollover, we know our data is incomplete, and we know there will be an unknown amount of fallout. From the data that we do have through 8145, is there any indication that the amount of known brokenness is decreasing? Could that be used as an indicator that, despite all the tremendous effort from ICANN and others over the last months, we have no way to decrease the known fallout further, thereby assuming there’s nothing more we can do to prevent the unknown fallout either?
Erwin
In my opinion, the important metric is the *derivative* of:
# of users behind KSK-2017 capable resolver vs. # of users behind KSK-2010 only resolver
- If the portion of users who are ready for the roll is increasing, it might make sense to wait a little bit longer.
- If the number is steady or even getting worse, we need to roll ASAP.
This approach has its own problems that I can see:
0. This metric requires mapping resolver => number of its users. Geoff Huston's method can theoretically do that for big resolvers, but his method (web ads) has inherent selection bias. Is it good enough? I do not dare to guess.
1. Both RFC 8145 *and* IETF draft-huston-kskroll-sentinel suffer from unquantified selection bias at the moment. RFC 8145 data show the 'best resolvers', i.e. those managed by people who updated their software, but this tells us *nothing* about the real state of the Internet at large. If there were new versions of software implementing draft-huston-kskroll-sentinel we could attempt to quantify its deployment, but I'm personally not convinced that it will get real deployment anytime soon, so its data will again have such strong selection bias that they will be useless.
As other people mentioned already in this discussion, we need to roll to
a) firedrill for future emergencies
b) improve trust in the DNSSEC technology (most likely by going to stronger algs/keys in future rolls)
This leads me to the conclusion that we do not have and most likely will not have relevant data anyway, so it is pointless to postpone the roll any further. People will fix their stuff when it breaks.
(Geoff and others, I will be more than happy if you prove that I'm wrong and that you have drawers full of data!)
-- Petr Špaček @ CZ.NIC
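As an illustration of the "derivative" metric proposed above, here is a small Python sketch that turns periodic user counts into a ready fraction and estimates its trend with a least-squares slope. The sample numbers are invented for the example, and the function names and the `min_daily_gain` threshold are hypothetical; nothing here addresses the selection bias problems listed in the message.

```python
def ready_fraction(users_ksk2017_capable, users_ksk2010_only):
    """Share of measured users whose resolver already trusts KSK-2017."""
    total = users_ksk2017_capable + users_ksk2010_only
    if total == 0:
        raise ValueError("no users observed")
    return users_ksk2017_capable / total


def slope_per_day(samples):
    """Least-squares slope of (day, value) pairs: the 'derivative' of the metric."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def advice(slope, min_daily_gain=1e-4):
    """Hypothetical decision rule: wait only while readiness is still improving."""
    return "wait a little longer" if slope > min_daily_gain else "roll ASAP"


# Invented data: (day, users behind KSK-2017 capable, users behind KSK-2010 only)
observations = [(0, 900, 100), (30, 920, 80), (60, 940, 60)]
series = [(day, ready_fraction(new, old)) for day, new, old in observations]
trend = slope_per_day(series)
print(f"trend per day: {trend:.6f} ->", advice(trend))
```

With a flat or falling series the same rule returns "roll ASAP", which matches the second bullet of the proposal.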
On Fri, Jan 05, 2018 at 10:40:36AM +0100, Petr Špaček <petr.spacek@nic.cz> wrote a message of 78 lines which said:
In my opinion, the important metric is *derivative* of: # of users behind KSK-2017 capable resolver vs. # of users behind KSK-2010 only resolver
Even if it were the right metric (I disagree, see later), what is the point of a metric we cannot measure? Let's face it: with the root key rollover problems, like almost all the real and important problems of mankind, we won't have perfect data. Waiting for data is wise, expecting to have perfect data is just procrastination.
This leads me to the conclusion that we do not have and most likely will not have relevant data anyway, so it is pointless to postpone the roll any further. People will fix their stuff when it breaks.
So we agree. Let me add that the # of users behind a broken resolver is not the best metric: some amount of breakage is unavoidable, the real problem is how long will it take to fix it. If the guy in charge of the resolver screams "[Expletive deleted], I forgot to change the key, let's change the resolver to a non-DNSSEC one, go to IANA Web site, download the key, install it and switch the resolver again", this is not a big deal: there are always micro-breakages somewhere on the Internet. So, the important metric to me is # of users behind a resolver which is both broken and unmanaged (or managed by clueless people). Both metrics being non-measurable, I think they only have a theoretical interest.
Stephane, On January 6, 2018 at 12:48:15 PM, Stephane Bortzmeyer (bortzmeyer@nic.fr<mailto:bortzmeyer@nic.fr>) wrote: Waiting for data is wise, expecting to have perfect data is just procrastination. Strawman: no one is suggesting we are expecting perfect data. Let me add that the # of users behind a broken resolver is not the best metric: some amount of breakage is unavoidable, the real problem is how long will it take to fix it. While I’d agree time to repair is important, I personally think the worrisome part is what happens when previously working resolvers spontaneously stop working until there is manual intervention. I assume the MTTR will be quite low: it isn’t that hard to turn off DNSSEC validation (as a first step to stop the screaming). The unpleasant scenario is where someone has gone to great lengths to do the right thing and have (someone else set up) a resilient, redundant validating resolver infrastructure supporting some mission critical application and, because they didn’t know to update the KSK, have that resolver infrastructure simply fall over. And yes, this is also unmeasurable. Regards, -drc
On Fri, Jan 05, 2018 at 01:29:17AM +0000, David Conrad <david.conrad@icann.org> wrote a message of 176 lines which said:
What we’re looking for is for the community to tell us in the ICANN Org how to move forward.
There is no such thing as "the community". There are various people, with different experiences and opinions. I, like everyone else here, can only speak for myself (and, as I said, I believe we should do the rollover now).
how much is the community willing to tolerate
Again, there is no "community", with a common opinion. People will never agree and yet a decision has to be made.
Stephane, On January 6, 2018 at 12:38:13 PM, Stephane Bortzmeyer (bortzmeyer@nic.fr<mailto:bortzmeyer@nic.fr>) wrote: On Fri, Jan 05, 2018 at 01:29:17AM +0000,
What we’re looking for is for the community to tell us in the ICANN Org how to move forward.
There is no such thing as "the community". Of course. Forgive my shorthand. We’re looking for input from those who have interest in the matter, particularly those knowledgeable on the topic, and who are willing to contribute their opinions. It would, of course, be helpful if those that are providing their opinions represented some subset of those who would be impacted in one way or another, e.g., ICANN's SOs and ACs, but that doesn’t mean that the input has to come from those folks. Again, there is no "community", with a common opinion. People will never agree and yet a decision has to be made. Just as people never agreed to expanding the root zone, yet a decision was made. Regards, -drc
On Thu, 4 Jan 2018, David Conrad wrote:
On January 4, 2018 at 8:39:35 AM, Jacques Latour (jacques.latour@cira.ca) wrote: I'll go first, we need to take in account the human behaviour, and not being an expert human behavioral analyst, I know that people fix things when broken and not when it's working. So getting a 100% of people's attention to fix something not broken is almost impossible.
Agreed.
When we talk to ISP about this issue, the smaller ones just turn DNSSEC validation off because it's easier.
Not just the smaller ones. I’ve been told one of the largest ISPs in the world turned off DNSSEC until after the KSK rollover.
That's quite the disservice to their customers.
Agreed. The question we are trying to answer is that given this, and given we now have objective data that strongly suggests there are folks who are NOT ready, what are the criteria by which we move ahead?
I think the ISP nameserver case is the easy part. They either upgrade, automatically or manually, or turn it off. Any outage caused by their lack of DNS server management skills will get fixed quickly when customer service starts receiving support calls. Trickier are the enterprise deployments that are not aware of DNS at all. They have something that works now, and it will just break. The problem here is that we cannot do more than warn and watch. How does the ksk-sentinel draft change this? Would we get more useful data? If so, how long do we need to wait to get that additional new data?
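On the sentinel question: the idea in draft-huston-kskroll-sentinel is that a client fetches a few specially named test records and infers its resolver's trust anchor state from which ones resolve. A minimal Python sketch of that inference follows. It uses the classification table from RFC 8509, the published successor to that draft, so details may differ from the draft version discussed here; the function name and the category strings are illustrative.

```python
def classify_resolver(is_ta_resolves, not_ta_resolves, invalid_resolves):
    """Infer resolver state from three sentinel lookups.

    Each argument is True if the lookup returned an answer and False if it
    returned SERVFAIL:
      is_ta_resolves   - name that resolves only if the new key IS a trust anchor
      not_ta_resolves  - name that resolves only if the new key is NOT a trust anchor
      invalid_resolves - name with a deliberately broken signature
    """
    if invalid_resolves:
        # A validating resolver would have rejected the broken signature.
        if is_ta_resolves and not_ta_resolves:
            return "non-validating"
        return "other"
    if is_ta_resolves and not not_ta_resolves:
        return "validating, new key trusted"
    if not_ta_resolves and not is_ta_resolves:
        return "validating, new key NOT trusted"
    if is_ta_resolves and not_ta_resolves:
        return "validating, no sentinel support (indeterminate)"
    return "other"


for outcome in [(True, False, False), (False, True, False),
                (True, True, False), (True, True, True)]:
    print(outcome, "->", classify_resolver(*outcome))
```

The "indeterminate" bucket is the catch: until resolvers ship the sentinel logic, most validating resolvers land there, so the data only becomes useful as new software is deployed.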
1) “we need to go ahead with the roll over”
We did measurements that caused us to hit pause. I'd like to know more about these measurements now. Is the situation getting better or is it unchanged? If it is unchanged, what efforts have we made to pinpoint the issue? If this is some common framework (an OS vendor, a docker image, a commonly used AMI image) can we get them to fix it? Once we know this, we have a better idea on whether waiting is going to be helpful or not.
I am personally unaware of any noticeable change in the trust associated with DNSSEC as a result of the (lack of) KSK rollover. Within security knowledgeable folks, I do know that trust in DNSSEC has been _increased_ a bit by the move by Verisign from a 1024 bit ZSK to a 2048 bit ZSK, but that’s obviously unrelated to the KSK rollover. What data do you have that trust has decreased due to the lack of KSK rollover?
Fully agree. So to summarize, I'd like to see the delta in knowledge between when we hit the pause button and now, to see if we are getting better informed or not. Paul
On Thu, Jan 04, 2018 at 10:01:04PM +0000, David Conrad <david.conrad@icann.org> wrote a message of 222 lines which said:
Just to level set and argue the extreme, if we had data that suggested that 100% of validating resolvers would fail, would you personally pull the trigger that causes the KSK rollover?
If there were this data, no, because it would mean there is a general problem, maybe a broken protocol that the IETF would need to fix. But we are not at 100 %, we know that key rollover can work, just not for everyone. 100 % failure is an easy case to handle: it means there is clearly a problem, and one which does not lie with the ordinary sysadmin. But we are not in the easy case.
This (presumably) assumes humans will fix the problems in a positive way. I’ll admit I suspect the more likely way of fixing DNSSEC rollover-caused validation failures will be to simply disable DNSSEC validation
Yes, this is a serious risk. On the other hand, people who still use the old key, and did not do anything to fix the problem, will have big trouble with DNSSEC sooner or later. So, it may be a not-so-bad thing if they disable it.
I am personally unaware of any noticeable change in the trust associated with DNSSEC as a result of the (lack of) KSK rollover.
Several people told me "so, you are still unable to replace the [profanity deleted] key?", laughing hard. Yes, this is anecdotal evidence, I don't have a better one to offer.
Stephane, On January 6, 2018 at 12:28:31 PM, Stephane Bortzmeyer (bortzmeyer@nic.fr) wrote:
Just to level set and argue the extreme, if we had data that suggested that 100% of validating resolvers would fail, would you personally pull the trigger that causes the KSK rollover?
If there were this data, no, because it would mean there is a general problem, maybe a broken protocol that the IETF would need to fix. But we are not at 100 %, we know that key rollover can work, just not for everyone.
Right. What if that number was 75%? Or 50%? The data we have says that the part of everyone for which the roll isn’t going to work is the number of users sitting behind around 8% of RFC 8145-reporting resolvers. Is that low enough? Is it even a useful measure?
100 % failure is an easy case to handle: it means there is clearly a problem, and one which does not lie with the ordinary sysadmin. But we are not in the easy case.
Yes. Which is why we’re looking for input on how to move forward. Regards, -drc
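For readers who want to see how a figure like "around 8% of RFC 8145-reporting resolvers" could be derived, here is a minimal sketch. It assumes the query-name form of RFC 8145 signaling (a first label of `_ta-` followed by hyphen-separated four-digit hex key tags) and the root KSK key tags 19036 (KSK-2010) and 20326 (KSK-2017). The `reports` structure and the function names are hypothetical, invented for illustration; this is not the analysis ICANN actually ran.

```python
KSK_2010_TAG = 19036  # key tag of the old root KSK (hex 4a5c)
KSK_2017_TAG = 20326  # key tag of the new root KSK (hex 4f66)


def parse_key_tags(qname):
    """Return the key tags signaled in a query name such as '_ta-4a5c-4f66.'.

    Names that are not key tag queries, or that are malformed, give an empty set.
    """
    label = qname.rstrip(".").split(".")[0].lower()
    if not label.startswith("_ta-"):
        return set()
    try:
        return {int(h, 16) for h in label[len("_ta-"):].split("-")}
    except ValueError:
        return set()


def share_old_only(reports):
    """Fraction of reporting resolvers that signal KSK-2010 but not KSK-2017.

    `reports` (hypothetical input) maps a resolver identifier to the query
    names observed from it. Resolvers that never send a key tag query are
    not counted as reporting.
    """
    reporting = 0
    old_only = 0
    for qnames in reports.values():
        tags = set()
        for qname in qnames:
            tags |= parse_key_tags(qname)
        if not tags:
            continue
        reporting += 1
        if KSK_2010_TAG in tags and KSK_2017_TAG not in tags:
            old_only += 1
    return old_only / reporting if reporting else 0.0


if __name__ == "__main__":
    sample = {
        "r1": ["_ta-4a5c."],
        "r2": ["_ta-4a5c-4f66."],
        "r3": ["_ta-4a5c-4f66."],
        "r4": ["_ta-4a5c-4f66."],
        "r5": ["example.com."],  # never signals, so not a reporting resolver
    }
    print(share_old_only(sample))
```

Note that this counts resolvers, not the users sitting behind them, which is exactly the limitation raised above when asking whether it is even a useful measure.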
On Tue, Jan 2, 2018 at 12:06 PM, Paul Hoffman <paul.hoffman@icann.org> wrote:
Greetings in the new year. As announced on this list (and in many other places) a few weeks ago, the ICANN org wants to use this list to get input from the community on acceptable criteria for proceeding with the root KSK roll. When we made that announcement, we saw a good number of new subscriptions to the list, but the discussion didn't start on its own, so we want to get that going.
For reference, please see <https://www.icann.org/news/blog/update-on-the-root-ksk-rollover-project>. The relevant timing part from that article is:
The ICANN org will monitor this mailing list and beginning on 15 January 2018, we will develop a draft plan for proceeding with the root KSK roll based on the input received and discussion on the mailing list. The plan will be published by 31 January 2018 and undergo a formal ICANN public comment process to gather further input.
We would really like to hear from you about the criteria you think would be relevant for us to observe/measure, if such criteria exist.
--Paul Hoffman
This has been an interesting discussion. My $.02... I don't think the delay causes any significant risk or loss of trust. I prefer the cautious approach. If we have data, or can reasonably get data, and it seems to show improvement, then we should delay and continue to collect data as long as there is 'significant' improvement, or until the number of affected clients meets SAC063 recommendation #3. I would prefer to limit the delay to one year. If there is no data, or if the data does not show improvement, we should set a new date for the roll and warn people that there is likely to be some breakage, and explain how to figure out who to contact if your resolver breaks. We need to give people some test (draft-huston-kskroll-sentinel ?) that they can run to see if the resolvers they use are likely to break. Tell them "run this test in the places where you use the internet, and if it fails, let us know and also contact the internet provider at that location to get it fixed." If the test passes, that could be a great relief to people who are worried about their internet provider. -- Bob Harold
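The rule proposed above can be written down as a small decision function. This is only a sketch of one reading of it: the numeric thresholds are parameters because neither what counts as 'significant' improvement nor the SAC063 recommendation #3 target is given as a number in this thread, and the function and argument names are hypothetical.

```python
def keep_delaying(samples, min_improvement, target, months_delayed,
                  max_delay_months=12):
    """Return True to keep delaying and measuring, False to set a new roll date.

    samples          -- chronological measurements of the fraction of affected
                        clients (hypothetical input; may be empty)
    min_improvement  -- assumed meaning of 'significant': the minimum drop
                        between the last two samples
    target           -- assumed stand-in for the SAC063 recommendation #3 level
    months_delayed   -- how long the roll has already been postponed
    """
    if len(samples) < 2:
        return False  # no usable data: set a date and warn people
    if months_delayed >= max_delay_months:
        return False  # the delay is capped (one year in the proposal)
    latest = samples[-1]
    if latest <= target:
        return False  # target reached: nothing more to wait for
    improvement = samples[-2] - latest
    return improvement >= min_improvement


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(keep_delaying([0.08, 0.06], min_improvement=0.01,
                        target=0.005, months_delayed=4))
```

Whenever the function returns False, the proposal is the same: announce a new date, warn that some breakage is likely, and tell people how to test and whom to contact.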
On 17:06 02/01, Paul Hoffman wrote:
Greetings in the new year. As announced on this list (and in many other places) a few weeks ago, the ICANN org wants to use this list to get input from the community on acceptable criteria for proceeding with the root KSK roll. When we made that announcement, we saw a good number of new subscriptions to the list, but the discussion didn't start on its own, so we want to get that going.
For reference, please see <https://www.icann.org/news/blog/update-on-the-root-ksk-rollover-project>. The relevant timing part from that article is:
The ICANN org will monitor this mailing list and beginning on 15 January 2018, we will develop a draft plan for proceeding with the root KSK roll based on the input received and discussion on the mailing list. The plan will be published by 31 January 2018 and undergo a formal ICANN public comment process to gather further input.
We would really like to hear from you about the criteria you think would be relevant for us to observe/measure, if such criteria exist.
I agreed with the suspension of the rollover last September, though not so much because of the telemetry data as because of this bug in Unbound: http://unbound.net/pipermail/unbound-users/2017-August/004883.html (For those who do not know, any new installation of Unbound before version 1.6.5 (published August 21 2017, 40 days after the add-hold was initiated for the new key) came with the new key in ADDPEND status, with 30 days to go to VALID. I personally installed an Unbound resolver on CentOS 6.9 on September 20th, which had the bug. That service would have become bogus on October 11th). After the patch was released, how long does it take to pass downstream to common OS distros? IMHO, any software that has such critical errors in the rollover process deserves a reconsideration of deadlines, and a wait of "as long as possible". At some point we have to say that we allowed a cautious amount of time for distros to ship patches, or for operators to manually upgrade the keys ... but clearly October 11 was too early. At this point, 4 months later, can we assume that a competent operator, with current OS with updated patches, is "safe from the rollover"? I wonder if ICANN in their research and direct contact with operators have found evidence of any bug, outdated distros, incorrect manuals, bad practices, etc., that demonstrate a "structural" problem with rollover procedures. Regards, Hugo
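The date arithmetic behind the example above can be checked with a few lines. This is a sketch under the assumptions stated in the message itself: a fresh install starts the new key in ADDPEND with 30 days to go to VALID, and the roll was scheduled for October 11, 2017. The function names are made up for illustration.

```python
from datetime import date, timedelta

ADD_HOLD_DOWN = timedelta(days=30)  # hold-down before a pending key becomes VALID
ROLL_DATE = date(2017, 10, 11)      # originally planned date of the root KSK roll


def valid_from(install_date):
    """Date on which a key that starts in ADDPEND at install time becomes VALID."""
    return install_date + ADD_HOLD_DOWN


def ready_for_roll(install_date, roll_date=ROLL_DATE):
    """True if the new key is already trusted when the roll happens."""
    return valid_from(install_date) <= roll_date


if __name__ == "__main__":
    install = date(2017, 9, 20)      # the CentOS 6.9 install mentioned above
    print(valid_from(install))       # 2017-10-20
    print(ready_for_roll(install))   # False
```

Under these assumptions, any affected install made after September 11, 2017 would not have trusted the new key in time for an October 11 roll.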
On Mon, 8 Jan 2018, Hugo Salgado-Hernández wrote:
After the patch was released, how long does it take to pass downstream to common OS distros?
It depends. For instance for RHEL, it will be fixed in 7.5. But had we actually not aborted the roll, Red Hat would have done an accelerated update to fix this issue.
At this point, 4 months later, can we assume that a competent operator, with current OS with updated patches, is "safe from the rollover"?
Yes, and not only that, for this issue we could have rolled on the original date as well.
I wonder if ICANN in their research and direct contact with operators have found evidence of any bug, outdated distros, incorrect manuals, bad practices, etc., that demonstrate a "structural" problem with rollover procedures.
That is what I asked about as well. What have they learned, and how did they try to learn this? If it was very region-centric, did they reach out to that region further? What do modern deployments that include a DNS server look like? Has anyone checked popular AMIs? Checked with openshift, openstack, docker? Asked companies that deploy many containers how they do DNS? Or asked the bleeding edge web front/backend people what they do and how? The only reason for waiting is to await more data. If we are not getting new data, then based on what we know, the faulty deployments won't vanish over time, so there is no point in waiting. Although if we do see a decrease over time, then where is it decreasing, and can we link some staggered decrease to the release of something opensource? Paul
On 08/01/2018 19:52, Paul Wouters wrote:
On Mon, 8 Jan 2018, Hugo Salgado-Hernández wrote:
After the patch was released, how long does it take to pass downstream to common OS distros?
It depends. For instance for RHEL, it will be fixed in 7.5. But had we actually not aborted the roll, Red Hat would have done an accelerated update to fix this issue.
At this point, 4 months later, can we assume that a competent operator, with current OS with updated patches, is "safe from the rollover"?
Yes, and not only that, for this issue we could have rolled on the original date as well.
To add on this, we were also in contact with Ubuntu, Debian and FreeBSD. The distributions moved this issue with priority through their process and patches were backported to stable distributions as well as the new Unbound release ended up in new/upcoming/experimental distributions. (OpenBSD 6.2 incorporated Unbound 1.6.6.) So I would like to acknowledge the distribution maintainers for their swift actions to push the patches (backported) or the new Unbound release. Best, -- Benno -- Benno J. Overeinder NLnet Labs https://www.nlnetlabs.nl/
On Tue, Jan 2, 2018 at 12:06 PM Paul Hoffman <paul.hoffman@icann.org> wrote:
Greetings in the new year. As announced on this list (and in many other places) a few weeks ago, the ICANN org wants to use this list to get input from the community on acceptable criteria for proceeding with the root KSK roll. When we made that announcement, we saw a good number of new subscriptions to the list, but the discussion didn't start on its own, so we want to get that going.
For reference, please see <https://www.icann.org/news/blog/update-on-the-root-ksk-rollover-project>. The relevant timing part from that article is:
The ICANN org will monitor this mailing list and beginning on 15 January 2018, we will develop a draft plan for proceeding with the root KSK roll based on the input received and discussion on the mailing list. The plan will be published by 31 January 2018 and undergo a formal ICANN public comment process to gather further input.
We would really like to hear from you about the criteria you think would be relevant for us to observe/measure, if such criteria exist.
I have been scared to open this thread, and so it's been lurking at me, unread in my inbox...
I've had a large number of discussions on this topic over the past few years, wearing many different hats, and over this time, the main thing I've learnt is that this is hard :-)
I want to raise a few concerns: 1: there is a significant bias in these responses -- the discussion is happening on the "ksk-rollover mailing list". I just went and checked, and I know between 2/3rd and 3/4 of the members of this list (or they are @verisign or @icann). We definitely fall into the "weird DNSSEC crowd", and are not representative of the average resolver operator/user. I have no idea which way the bias leans, but I have a hard time believing that people on a list devoted to key rolling have none :-)
2: The instances that concern me are not the large ISPs or public open resolvers - instead it is the people who listened to all of our proselytization, turned on DNSSEC at their employer... and then left... and now no-one there knows what this DNS thingie is. My wife handles IT for a number of small companies, and this sort of thing is sadly common - there will be a box in a corner which no-one knows what it does, but 'if the Internets stops, you turn it off and on again, and then the Internets works again'. This is in the same bucket as "We installed a DNS appliance. Look at the flashy lights!" -- sadly this scene from the IT Crowd describes many small companies: https://www.youtube.com/watch?v=12LLJFSBnS4
3: I'm only slightly worried about the actual breakage; I'm much more concerned about the PR fallout from the breakage -- when ICANN looked at the data, and made the (responsible) decision to pause and reconsider the roll timing, the press went somewhat nuts. When the roll happens, and someone's auntie is not able to reach www.cnn.com, this won't be "Auntie Mae updates key", it will instead be "Critical Internet Security Flaw Breaks Internet. ICANN, the organization responsible for running the Internet, today broke the Internet for an unknown number of users. When interviewed, Mae West said ...". I'm glad to see ICANN asking for the community's feedback on this, and that there will be a public comment period; this obviously won't stop the press from turning any issues into a story, but at least we can say "ICANN asked the community, and then asked again, and then published the plan. It wasn't in a disused lavatory with a sign on the door saying 'Beware of the Leopard', it was here <link>. We did what the community asked....".
As for the actual question you asked -- I personally think that the data which draft-ietf-dnsop-kskroll-sentinel will eventually produce will be useful, but I'm a: biased and b: don't know when this would be available, nor how much of it you'd need before making a decision. I'm punting on the actual criteria; you're welcome!
W
--Paul Hoffman _______________________________________________ ksk-rollover mailing list ksk-rollover@icann.org https://mm.icann.org/mailman/listinfo/ksk-rollover
On Mon, Jan 15, 2018 at 3:03 PM, Warren Kumari <warren@kumari.net> wrote:
On Tue, Jan 2, 2018 at 12:06 PM Paul Hoffman <paul.hoffman@icann.org> wrote:
Greetings in the new year. As announced on this list (and in many other places) a few weeks ago, the ICANN org wants to use this list to get input from the community on acceptable criteria for proceeding with the root KSK roll. When we made that announcement, we saw a good number of new subscriptions to the list, but the discussion didn't start on its own, so we want to get that going.
For reference, please see <https://www.icann.org/news/blog/update-on-the-root-ksk-rollover-project>. The relevant timing part from that article is:
The ICANN org will monitor this mailing list and beginning on 15 January 2018, we will develop a draft plan for proceeding with the root KSK roll based on the input received and discussion on the mailing list. The plan will be published by 31 January 2018 and undergo a formal ICANN public comment process to gather further input.
We would really like to hear from you about the criteria you think would be relevant for us to observe/measure, if such criteria exist.
I have been scared to open this thread, and so it's been lurking at me, unread in my inbox...
I've had a large number of discussions on this topic over the past few years, wearing many different hats, and over this time, the main thing I've learnt is that this is hard :-)
I want to raise a few concerns: 1: there is a significant bias in these responses -- the discussion is happening on the "ksk-rollover mailing list". I just went and checked, and I know between 2/3rd and 3/4 of the members of this list (or they are @verisign or @icann). We definitely fall into the "weird DNSSEC crowd", and are not representative of the average resolver operator/user. I have no idea which way the bias leans, but I have a hard time believing that people on a list devoted to key rolling have none :-)
2: The instances that concern me are not the large ISPs or public open resolvers - instead it is the people who listened to all of our proselytization, turned on DNSSEC at their employer... and then left... and now no-one there knows what this DNS thingie is. My wife handles IT for a number of small companies, and this sort of thing is sadly common - there will be a box in a corner which no-one knows what it does, but 'if the Internets stops, you turn it off and on again, and then the Internets works again'. This is in the same bucket as "We installed a DNS appliance. Look at the flashy lights!" -- sadly this scene from the IT Crowd describes many small companies: https://www.youtube.com/watch?v=12LLJFSBnS4
3: I'm only slightly worried about the actual breakage; I'm much more concerned about the PR fallout from the breakage -- when ICANN looked at the data, and made the (responsible) decision to pause and reconsider the roll timing, the press went somewhat nuts. When the roll happens, and someone's auntie is not able to reach www.cnn.com, this won't be "Auntie Mae updates key", it will instead be "Critical Internet Security Flaw Breaks Internet. ICANN, the organization responsible for running the Internet, today broke the Internet for an unknown number of users. When interviewed, Mae West said ...". I'm glad to see ICANN asking for the community's feedback on this, and that there will be a public comment period; this obviously won't stop the press from turning any issues into a story, but at least we can say "ICANN asked the community, and then asked again, and then published the plan. It wasn't in a disused lavatory with a sign on the door saying 'Beware of the Leopard', it was here <link>. We did what the community asked....".
As for the actual question you asked -- I personally think that the data which draft-ietf-dnsop-kskroll-sentinel will eventually produce will be useful, but I'm a: biased and b: don't know when this would be available, nor how much of it you'd need before making a decision. I'm punting on the actual criteria; you're welcome!
W
--Paul Hoffman
As I understand it, draft-huston-kskroll-sentinel could be set up by one person. And then anyone could use it for testing. So I think it should be set up (by someone better at DNSSEC than I) and publicized as part of the announcement of a new roll date - "here is how you test if it will affect you, and here is how to find your resolver operator if it fails" If it could be tested broadly with an online browser ad campaign, that would be great. Also, the data gathered was used to try to contact resolver operators to find out why they were not updated, but the finding was that they were difficult to contact, and there was no apparent common cause. So the data was less helpful than they had hoped. Waiting for more data seems unlikely to help. -- Bob Harold
On Jan 16, 2018, at 12:48 PM, Bob Harold <rharolde@umich.edu> wrote:
As I understand it, draft-huston-kskroll-sentinel could be set up by one person.
That doesn't match my understanding from the draft or the clarification that Warren sent to the DNSOP WG yesterday. It has to be installed and configured in resolvers first, and then the test can be run by one person who can get folks to hit a web page or download some JavaScript. Warren, do I have that correctly? --Paul Hoffman
On 17.1.2018 02:19, Paul Hoffman wrote:
On Jan 16, 2018, at 12:48 PM, Bob Harold <rharolde@umich.edu> wrote:
As I understand it, draft-huston-kskroll-sentinel could be set up by one person.
That doesn't match my understanding from the draft or the clarification that Warren sent to the DNSOP WG yesterday. It has to be installed and configured in resolvers first, and then the test can be run by one person who can get folks to hit a web page or download some JavaScript.
Warren, do I have that correctly?
I will reply even though I'm not Warren: Yes, this is correct, it needs support in every validating resolver. In other words, this mechanism suffers from the very same upgrade problem as RFC 8145. I've implemented a prototype of draft-huston-kskroll-sentinel for Knot Resolver, but later I realized that whatever we do is largely irrelevant when it comes to collecting reliable data for *this* KSK roll. We should go ahead and implement draft-huston-kskroll-sentinel but I do not see it giving us data for the KSK-2017 roll. This is how I arrived at the conclusion that KSK-2017 will inevitably involve some out-of-band fixes and press coverage, similarly to any other security issue these days. -- Petr Špaček @ CZ.NIC
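To illustrate why the sentinel test depends on resolver support, here is a rough sketch of how a client-side test could interpret its results. It assumes the test fetches three names: one that resolves only if the resolver trusts the new key, one that resolves only if it does not, and one that is deliberately bogus. The category names follow the sentinel mechanism as it was later published (RFC 8509), as far as I understand it; the draft discussed in this thread may differ in detail, and the function and its boolean inputs are hypothetical.

```python
def classify(is_ta_ok, not_ta_ok, bogus_ok):
    """Interpret the outcome of three sentinel lookups.

    Each argument is True if the lookup returned an answer and False if it
    failed (for example with SERVFAIL):
      is_ta_ok  -- name that a supporting resolver answers only if it trusts the new key
      not_ta_ok -- name that a supporting resolver answers only if it does NOT trust it
      bogus_ok  -- name with a deliberately invalid signature
    """
    if bogus_ok:
        # The bogus name resolved, so no validation is being done for this user.
        return "nonV" if (is_ta_ok and not_ta_ok) else "other"
    if is_ta_ok and not not_ta_ok:
        return "Vnew"   # validating, and the new key is trusted
    if not_ta_ok and not is_ta_ok:
        return "Vold"   # validating, but only the old key is trusted
    if is_ta_ok and not_ta_ok:
        return "Vind"   # validating, but the resolver ignores the sentinel labels
    return "other"


if __name__ == "__main__":
    # A validating resolver without sentinel support answers both test names.
    print(classify(is_ta_ok=True, not_ta_ok=True, bogus_ok=False))
```

The "Vind" outcome is the upgrade problem in miniature: until resolvers implement the mechanism, a validating resolver tells the test nothing about which key it trusts.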
On Wed, Jan 17, 2018 at 2:39 AM Petr Špaček <petr.spacek@nic.cz> wrote:
On 17.1.2018 02:19, Paul Hoffman wrote:
On Jan 16, 2018, at 12:48 PM, Bob Harold <rharolde@umich.edu> wrote:
As I understand it, draft-huston-kskroll-sentinel could be set up by one person.
That doesn't match my understanding from the draft or the clarification that Warren sent to the DNSOP WG yesterday. It has to be installed and configured in resolvers first, and then the test can be run by one person who can get folks to hit a web page or download some JavaScript.
Warren, do I have that correctly?
I will reply even though I'm not Warren: Yes, this is correct, it needs support in every validating resolver.
In other words, this mechanism suffers from the very same upgrade problem as RFC 8145.
Yup, what y'all said -- anyone can set up the test, but it won't generate useful data until implemented in resolvers. Sentinel will generate much more useful data (it's measuring what users will experience, not what resolvers will experience), but still needs to be deployed -- I was somewhat surprised by how quickly RFC8145 was deployed - I guess we need a: this to be implemented, and then b: some security events to cause upgrades :-) I meant to include the below in my original bloviation: I think it would be really useful to reach out to the press who published articles on the keyroll pause (e.g: BleepingComputer, Bloomberg, Modern Ghana, The Register, ITWorld, etc) - having them be told ahead of time that ICANN stopped things, got community feedback and is proceeding cautiously (potentially) changes the narrative completely - and, at least, helps prevent the bad PR hit to ICANN (this is an ICANN list, after all) and them feeling blindsided. Converting the potential PR ding into a win would be nice - and may also reach more people. W
I've implemented a prototype of draft-huston-kskroll-sentinel for Knot Resolver, but later I've realized that whatever we do is largely irrelevant when it comes to collecting reliable data for *this* KSK roll.
We should go ahead and implement draft-huston-kskroll-sentinel but I do not see it giving us data for the KSK-2017 roll.
This is how I arrived at the conclusion that KSK-2017 will inevitably involve some out-of-band fixes and press coverage, similarly to any other security issue these days.
-- Petr Špaček @ CZ.NIC
On Jan 17, 2018, at 1:19 PM, Warren Kumari <warren@kumari.net> wrote:
I meant to include the below in my original bloviation: I think it would be really useful to reach out to the press who published articles on the keyroll pause (e.g: BleepingComputer, Bloomberg, Modern Ghana, The Register, ITWorld, etc) - having them be told ahead of time that ICANN stopped things, got community feedback and is proceeding cautiously (potentially) changes the narrative completely - and, at least, helps prevent the bad PR hit to ICANN (this is an ICANN list, after all) and them feeling blindsided. Converting the potential PR ding into a win would be nice - and may also reach more people.
This is a good suggestion and I will add it to the hopper of PR ideas. Please recall that in late December (https://www.icann.org/news/blog/update-on-the-root-ksk-rollover-project), we wrote:
The ICANN org will monitor this mailing list and beginning on 15 January 2018, we will develop a draft plan for proceeding with the root KSK roll based on the input received and discussion on the mailing list. The plan will be published by 31 January 2018 and undergo a formal ICANN public comment process to gather further input.
We are indeed planning to publish a draft plan for moving forward at the end of the month based on this discussion, and we're also planning PR activities to publicize the plan and the formal public comment, including outreach to publications that have previously covered the root KSK roll or that we suspect would be willing to cover it. Matt
participants (19)
- Benno Overeinder
- Bob Harold
- Carlos Marcelo Martinez Cagnazzo
- Carlos Martinez-Cagnazzo
- David Conrad
- Doug Barton
- Erwin Lansing
- Geoff Huston
- Hugo Salgado-Hernández
- Jacques Latour
- Jakob Schlyter
- Matt Larson
- Michael StJohns
- Paul Hoffman
- Paul Wouters
- Petr Špaček
- Stephane Bortzmeyer
- Warren Kumari
- Yazid AKANHO