[RSS Metrics WP] INPUT REQUESTED: Performance Thresholds for Root Servers and Root Server System
Dear RSSAC Metrics WP [and RSSAC Caucus Members],
At the next call of the metrics WP, the metrics WP will discuss performance thresholds for individual root servers and the root server system as a whole. To prepare for those discussions, the co-chairs have requested that you fill out the following spreadsheet ahead of the meeting and provide your input.
https://docs.google.com/spreadsheets/d/1ZM-eI3SLsG8QVic_bN8W15R52o3ZWXTu7OPv...
Kindly fill out the spreadsheet by COB Wed 18 September 2019.
Best, Steve
On Sep 11, 2019, at 8:44 AM, Steve Sheng <steve.sheng@icann.org> wrote:
At the next call of the metrics WP, the metrics WP will discuss performance thresholds for individual root servers and the root server system as a whole. To prepare for those discussions, the co-chairs have requested that you fill out the following spreadsheet ahead of the meeting and provide your input.
At earlier meetings, we discussed having a set of rationales for our threshold numbers. I put together the following as my own rationale, which people might or might not like.
--Paul Hoffman
The thresholds define minimum acceptable metrics for each RSO. In order to determine the numeric thresholds, an overall view of the importance of each metric is needed. The rationale here is based on an overall idea of what is expected of an RSO by the caching recursive resolvers that use that RSO as part of the RSS.
In summary, the importance of the RSO metrics is: correctness > publication latency >> availability > response latency
Of the four metrics, RSO correctness is by far the most important to recursive resolvers because incorrect answers can have long-lasting negative effects on the caches. It is well known that fewer than half of all resolvers use DNSSEC validation; for those resolvers, incorrect information from an RSO will poison future queries until the TTL associated with the incorrect answer expires. Incorrect information from an RSO about glue records for unsigned TLDs can cause long-term harm even to validating resolvers.
RSO publication latency is important both when NS and DS records in the root zone are changing and when RRSIG records are expiring. If an RSO has consistently long publication latency, some recursive resolvers can be affected by getting wrong answers to queries. However, historically, TLD NS records change very infrequently, and DS records are usually changed in a fashion where there is a large overlap between old and new records. Emergency updates to the root zone data happen only a few times a year and, even when they do happen, the old records are cached for up to two days. Thus, there is no expectation that an RSO will update its copy of the root zone “instantaneously”, but it should update within a reasonable fraction of the TTLs in the root zone.
Although RSO availability and RSO response latency negatively affect recursive resolvers, they do so in a way that is easily and automatically fixed, namely by moving to a different RSO. That is, if many instances of an RSO are flaky due to either computer or routing instability, a recursive resolver will see this and switch to a different RSO for some extended period of time. Similarly, if many recursive resolvers see high latency in their responses from one RSO, they will switch to other RSOs before trying that RSO again in the future.
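The fallback behaviour described above can be illustrated with a minimal sketch of how a recursive resolver might rank RSOs by smoothed round-trip time. This is not taken from any particular resolver implementation: the names `update_srtt` and `pick_rso`, the smoothing weight, and the timeout penalty are all assumptions made for illustration.

```python
# Hypothetical sketch: rank RSOs by smoothed round-trip time (SRTT).
# ALPHA and TIMEOUT_PENALTY_MS are assumed values, not from any resolver.
ALPHA = 0.3                  # weight given to the newest sample
TIMEOUT_PENALTY_MS = 1000.0  # latency charged when a query times out


def update_srtt(srtt, sample_ms):
    """Return the new smoothed RTT; sample_ms is None when the query timed out."""
    observed = TIMEOUT_PENALTY_MS if sample_ms is None else sample_ms
    if srtt is None:  # first observation for this RSO
        return observed
    return (1 - ALPHA) * srtt + ALPHA * observed


def pick_rso(srtts):
    """Pick an untried RSO first (SRTT is None), otherwise the lowest SRTT."""
    return min(srtts, key=lambda name: (srtts[name] is not None, srtts[name] or 0.0))


if __name__ == "__main__":
    srtts = {"a": 20.0, "b": 35.0}
    print(pick_rso(srtts))                     # "a" has the lower SRTT
    srtts["a"] = update_srtt(srtts["a"], None)  # one timeout from "a"
    print(pick_rso(srtts))                     # the resolver now prefers "b"
```

Under these assumed parameters a single timeout is enough to move traffic to another RSO, and the penalised RSO is retried only once its smoothed value decays or the others degrade, which is the sense in which availability and latency problems are self-correcting from the resolver's point of view.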
Hello Paul-san,
The rationale for each measurement has been discussed in the WP, but I don't believe that the importance of each metric, or the relation between the metrics, was discussed in that context. If we need to take such importance into consideration in working out the thresholds, shouldn't we have additional discussion on this before entering suggested thresholds in the spreadsheet?
In addition, the importance has only been suggested for the RSO metrics. The importance for RSS metrics would need another discussion as well.
If these were already discussed before, I'd like to see the minutes, transcripts or whatever, as I did miss some meetings.
Regards,
Shinta Sato <shinta@jprs.co.jp> Japan Registry Services Co., Ltd.
_______________________________________________ rssac-caucus mailing list rssac-caucus@icann.org https://mm.icann.org/mailman/listinfo/rssac-caucus
_______________________________________________ By submitting your personal data, you consent to the processing of your personal data for purposes of subscribing to this mailing list accordance with the ICANN Privacy Policy (https://www.icann.org/privacy/policy) and the website Terms of Service (https://www.icann.org/privacy/tos). You can visit the Mailman link above to change your membership status or configuration, including unsubscribing, setting digest-style delivery or disabling delivery altogether (e.g., for a vacation), and so on.
On Sep 13, 2019, at 7:58 AM, Shinta Sato <shinta@jprs.co.jp> wrote:
The rationale for each measurement has been discussed in the WP, but I don't believe that the importance of each metric, or the relation between the metrics, was discussed in that context.
This is correct. In some of the informal conversations, some people have given some rationale for individual metrics, but the WP has not yet discussed pulling the rationale together or coming to consensus on the rationale. That's why I did my own brain dump, to see if it matches other people's thoughts.
If we need to take such importance into consideration in working out the thresholds, shouldn't we have additional discussion on this before entering suggested thresholds in the spreadsheet?
Maybe the discussion can happen at the same time. The WP leaders want to have some ideas of what people want in the spreadsheet in order to start good discussion on the next call, but those ideas don't need to be at all final.
In addition, the importance has only been suggested for the RSO metrics. The importance for RSS metrics would need another discussion as well.
So far, I am unable to come up with any technical or procedural justification for RSS thresholds. It is useful to collect the metrics, but I have tried to come up with some threshold ideas and just failed.
If these were already discussed before, I'd like to see the minutes, transcripts or whatever, as I did miss some meetings.
The rationale ideas I posted are new. I'm hoping others will do similar postings, or at least poke holes in the one I did. --Paul Hoffman
Hello Metrics WP members,
As I looked into each definition of the measurement and aggregation of the RSS-related metrics in order to work out the thresholds, I came to the strong view that these definitions need to be corrected or redefined.
The current definitions of the RSS-related metrics are just a gathering of the outcomes of the RSO metrics. However, what we want to describe is how the RSS as a whole looks in the measurement.
For example, for RSS Availability, once any of the responses times out, the availability of the RSS as a whole will not be 100%. This is not right, since the RSS as a whole is available at the time of the measurement if any of the 13 RSOs responds to the queries. Gathering the results of this determination over one day would work as the availability of the RSS for that day.
For RSS Response Latency, each RSO may have a different strategy for deploying its anycast locations, and together those will cover the whole world. The response latency of the RSS as a whole, as seen from a certain vantage point at a certain time, cannot be described by the median of the responses. It is much better to select the minimum response time, or perhaps the 10th percentile or similar.
I believe there was this kind of discussion before, but the current document does not reflect it. Ozan's discussion slides have just been posted, and I found that the RSS Availability part is considered there, but the RSS Response Latency part is still not.
Regards,
Shinta Sato <shinta@jprs.co.jp> Japan Registry Services Co., Ltd.
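Sato-san's point about RSS Availability can be made concrete with a small sketch that compares two ways of aggregating the same probe results. It assumes that the current definition pools every per-RSO query result into one ratio, which is a reading of the message above and not a quotation of the draft; the function names and the sample data are hypothetical.

```python
# Hypothetical sketch: two views of "RSS availability" over the same probes.
# Each probe round maps an RSO letter to True (answered) or False (timed out).
RSOS = list("abcdefghijklm")  # the 13 RSOs


def pooled_availability(rounds):
    """Fraction of all individual RSO queries that were answered."""
    answered = sum(sum(probe.values()) for probe in rounds)
    total = sum(len(probe) for probe in rounds)
    return answered / total


def any_of_availability(rounds):
    """Fraction of probe rounds in which at least one RSO answered."""
    return sum(any(probe.values()) for probe in rounds) / len(rounds)


if __name__ == "__main__":
    rounds = [{rso: True for rso in RSOS} for _ in range(4)]
    rounds[1]["c"] = False  # a single timeout from one RSO in one round
    print(f"pooled: {pooled_availability(rounds):.4f}")
    print(f"any-of: {any_of_availability(rounds):.4f}")
```

With this made-up data the pooled figure drops below 100% after a single timeout, while the any-of figure stays at 100% because some RSO answered in every round.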
Hi, I share Shinta's concerns, at least as of the time I filled in the document. Yours, Daniel
-- Daniel Migault Ericsson 8400 boulevard Decarie Montreal, QC H4P 2N2 Canada Phone: +1 514-452-2160
On 9/27/19 5:04 AM, Shinta Sato wrote:
As I looked into each definition of the measurement and aggregation of the RSS-related metrics in order to work out the thresholds, I came to the strong view that these definitions need to be corrected or redefined.
The current definitions of the RSS-related metrics are just a gathering of the outcomes of the RSO metrics. However, what we want to describe is how the RSS as a whole looks in the measurement.
For example, for RSS Availability, once any of the responses times out, the availability of the RSS as a whole will not be 100%. This is not right, since the RSS as a whole is available at the time of the measurement if any of the 13 RSOs responds to the queries. Gathering the results of this determination over one day would work as the availability of the RSS for that day.
For RSS Response Latency, each RSO may have a different strategy for deploying its anycast locations, and together those will cover the whole world. The response latency of the RSS as a whole, as seen from a certain vantage point at a certain time, cannot be described by the median of the responses. It is much better to select the minimum response time, or perhaps the 10th percentile or similar.
These are very germane issues with the RSS Metrics. My attempt at a summary (that could be wrong) is that there are two views of what the RSS metrics are summarizing: the average for the RSS, or the minimum for the RSS. The document currently is aimed at the average, but that may not make sense for the reasons that Sato-san raises above.
Saying that for a particular day, the RSS Availability was 99.97% sounds like the RSS was down for 0.03% of the time, which is clearly wrong. Similarly, saying that for a particular day, the RSS Response Latency was 50 milliseconds sounds like that was what was typically seen, but that too is clearly wrong because nearly all resolvers will have homed in on the RSO with the best connectivity for each resolver, and the latency they see is likely to be much below the average.
I'm not sure of the best way forward from here. We could put more explicit wording in the document about the use of averages for the RSS metrics, and the downside of using such averages, but those will probably be missed by most people seeing the published numbers. On the other hand, I don't think we can extrapolate more realistic RSS metrics from the RSO data we are collecting because we don't have vantage points at every (or even a statistically relevant) resolver on the Internet.
--Paul Hoffman
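The gap between the two views of RSS Response Latency can be shown with a small worked example. This is only a sketch under stated assumptions: the latencies are invented, the three vantage points and three RSOs are placeholders, and the function names are hypothetical. It does not address the caveat above that the measurement vantage points are not located at real resolvers.

```python
# Hypothetical sketch: "average" view vs "best RSO" view of RSS latency.
# The latencies (milliseconds) are invented purely for illustration.
from statistics import median

measurements = {
    "vp1": {"a": 12, "b": 80, "c": 150},
    "vp2": {"a": 95, "b": 9, "c": 140},
    "vp3": {"a": 110, "b": 70, "c": 15},
}


def pooled_median(data):
    """Median over every (vantage point, RSO) latency sample."""
    return median(ms for per_rso in data.values() for ms in per_rso.values())


def best_rso_median(data):
    """Median over vantage points of the lowest latency each one observed."""
    return median(min(per_rso.values()) for per_rso in data.values())


if __name__ == "__main__":
    print("pooled median:", pooled_median(measurements))
    print("best-RSO median:", best_rso_median(measurements))
```

In this made-up data every vantage point has one nearby RSO, so the best-RSO figure is far lower than the pooled median, which is the effect described for resolvers that settle on their best-connected RSO.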
From: Steve Sheng <steve.sheng@icann.org> At the next call of the metrics WP,
I didn't see any announcements of the metrics WP calls after August 8. When will the next call of the metrics WP be held?
# Sorry, I could not attend the August 8 call.
# (I was on vacation that day and was on the move at the meeting time.)
# After that, I could not find any information about calls and minutes.
When were the Metrics WP calls held after August 8?
the metrics WP will discuss performance thresholds for individual root servers and the root server system as a whole. To prepare for those discussions, the co-chairs have requested that you fill out the following spreadsheet ahead of the meeting and provide your input.
https://docs.google.com/spreadsheets/d/1ZM-eI3SLsG8QVic_bN8W15R52o3ZWXTu7OPv...
Kindly fill out the spreadsheet by COB Wed 18 September 2019.
I could not fill in the sheet because the threshold is a severe value and I did not know about the discussions on the August 8 call and the following calls. -- Kazunori Fujiwara, JPRS <fujiwara@jprs.co.jp>
Dear Kazunori,
Back in July 2019, the RSS Metrics WP Co-Chairs wanted to continue with standing bi-weekly teleconferences. I circulated a Doodle poll on 18 July to find the most popular call time, and this appeared to be Thursday, 17:00 UTC.
So I sent out a calendar invitation for bi-weekly standing calls on 2 August 2019. We had the following RSS Metrics WP calls thereafter:
Thursday, 08 August 2019 – 17:00 UTC
Thursday, 22 August 2019 – 17:00 UTC
Thursday, 05 September 2019 – 17:00 UTC
We have not shared minutes after these calls, but you can access the call transcripts through this link: https://community.icann.org/x/YIk2Bg. I had a problem with receiving the transcript from the 22 August teleconference. I am still working with the vendor to receive and post it.
Our next call is today (Thursday, 19 September 2019) at 17:00 UTC. For your convenience I have just forwarded the calendar invitation to you.
Please let me know if you have any other questions.
Best,
Ozan Sahin Policy Specialist Internet Corporation for Assigned Names and Numbers (ICANN)
Telephone: +90 212 999 6218 Mobile: +90 533 641 0007 Skype: ozan.sahin.icann www.icann.org
Dear Ozan Sahin, Thanks very much. I will prepare. Regards, -- Kazunori Fujiwara, JPRS <fujiwara@jprs.co.jp>
participants (6): Daniel Migault, fujiwara@jprs.co.jp, Ozan Sahin, Paul Hoffman, Shinta Sato, Steve Sheng