Local Perspective Work Party: "Underserved" use case text
Ken,

Thanks for this. After reading the document and your email, I have the following comments and questions.

1. I believe I understand the general idea of measuring both availability and latency to each of the various root server identities (RSIs) multiple times over a 30-minute period. I had a bit of trouble understanding the precise specifications. I am completely comfortable with math and math notation, so my trouble is due to some missing pieces in the description. I suspect this is easily remedied with a few more words.

2. What is the justification for the numbers, e.g. 20 times, p = 0.65, etc.?

3. What conclusion would you draw if the local measurements show there is extremely high availability but latency varies between very low and not so low? For example, there might be one RSI that is topologically close but suffers intermittent availability and other RSIs that are not as close but extremely reliable.

4. Adding one or more additional root servers is one way to improve both reliability and availability. Installing local root service is another way to accomplish the same goals. How do you expect the measurement process described in this document will affect discussions regarding these two approaches?

Thanks, Steve

On Sun, Nov 22, 2020 at 10:16 AM Ken Renard <kdrenard2@gmail.com> wrote:
All:
I have brought over the user narrative concerning RSOs wanting to determine underserved areas to the current working document. Along the way, I overhauled the description, hopefully making it more concise and readable. I think I have addressed all comments from the Narratives document. The new text is included below, but the Google document will contain the latest edits. I would appreciate any comments or feedback on this topic and look forward to discussing it with the group at the work party meeting tomorrow (Monday).
https://docs.google.com/document/d/11slZDTqrcwTwywpbi3JwHuU_FoaoN54u0f3B2UFj...
-Ken Renard
While the RSS has relatively good global coverage, RSOs are interested in deploying additional instances, particularly in areas that are considered “underserved”. An underserved area has a reasonably sized user base that perceives poor performance of the RSS because its closest root server instances are topologically distant, resulting in high latency or low availability of the service. This is a subjective designation, and it is not reasonable to expect that a tool could directly measure it. The metric described in this section serves only to inform a decision on placing new root server instances in an underserved area. An RSO would require multiple measurements run in diverse geographical locations.
The goal of this metric is to assess the performance of the RSS at a measurement point compared to other measurement points. A local metric that is much worse than others would be an indicator of an underserved area. Availability and latency are the primary measurements, and the impact of slow links should be factored out. Since the metric is trying to assess the performance of the RSS rather than that of individual RSIs (Root Server Identities), only the best NRSI measurements should be used. The measurements are direct queries to the individual root servers (rather than via a recursive resolver).
A set of measurements should be taken at random intervals during a T=30 minute time period. All RSIs will be directly queried and their latencies recorded. Additionally, a normalization factor σ will be measured as the average latency of direct DNS queries to the first NTLD=20 somehow-chosen TLDs. If fewer than NTLD normalization measurements are available, the entire set of measurements is thrown away. Availability of an RSI is the success rate of queries to that RSI over the last 20 measurement periods that were not thrown away.
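To make this concrete, here is a minimal sketch of one measurement round in Python using the dnspython library. The query type (SOA), the timeout, the example TLD list, and the function names are my own illustrative choices, not part of the proposal, and the random-interval scheduling over the T=30 minute period is left out:

    import socket
    import time

    import dns.exception
    import dns.message
    import dns.query
    import dns.rdatatype
    import dns.resolver

    TIMEOUT = 4.0                  # seconds per query; assumed value
    TLDS = ["com", "net", "org"]   # placeholder for the NTLD=20 "somehow-chosen" TLDs
    N_TLD = len(TLDS)
    RSI_HOSTS = ["%s.root-servers.net" % c for c in "abcdefghijklm"]

    def timed_query(server_ip, qname):
        """Send one direct UDP query; return latency in seconds, or None on timeout."""
        query = dns.message.make_query(qname, dns.rdatatype.SOA)
        start = time.monotonic()
        try:
            dns.query.udp(query, server_ip, timeout=TIMEOUT)
            return time.monotonic() - start
        except dns.exception.Timeout:
            return None

    def measurement_round():
        """Query every RSI directly, then take the normalization measurements."""
        rsi_latency = {}
        for host in RSI_HOSTS:
            rsi_latency[host] = timed_query(socket.gethostbyname(host), ".")
        tld_latencies = []
        for tld in TLDS:
            # Direct query to one authoritative server for the TLD (not via a recursive).
            ns_host = str(dns.resolver.resolve(tld + ".", "NS")[0].target).rstrip(".")
            latency = timed_query(socket.gethostbyname(ns_host), tld + ".")
            if latency is not None:
                tld_latencies.append(latency)
        if len(tld_latencies) < N_TLD:
            return None   # per the text: discard the entire set of measurements
        sigma = sum(tld_latencies) / len(tld_latencies)
        return rsi_latency, sigma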
The performance metric is the inverse of the normalized latency: the average of the best NRSI=3 RSI latencies (using only RSIs whose availability is > p=0.65), divided by the normalization factor σ.
Availability of an RSI: fraction of successful responses over the past 20 retained measurement periods
Normalization factor: σ = mean latency of the NTLD queries
Normalized Latency = mean(best NRSI latencies where availability > p) / σ
Performance Metric = 1 / Normalized Latency
If fewer than NRSI latency measurements are available (due to timeouts), the performance metric should be multiplied by Navail / NRSI, where Navail is the number of RSIs that have availability > p.
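Read literally, one period's metric computation might look like the following sketch; the function and variable names are mine, and p and NRSI are the values proposed above:

    def performance_metric(rsi_latency, availability, sigma, n_rsi=3, p=0.65):
        """One period's metric from direct-query latencies and availability history.

        rsi_latency:  dict RSI -> latency in seconds, or None on timeout
        availability: dict RSI -> success rate over the last 20 retained periods
        sigma:        normalization factor from the TLD queries
        """
        # Only RSIs whose availability exceeds p are considered.
        eligible = [k for k in rsi_latency if availability.get(k, 0.0) > p]
        # Best (lowest) latencies among eligible RSIs that answered this period.
        answered = sorted(rsi_latency[k] for k in eligible if rsi_latency[k] is not None)
        best = answered[:n_rsi]
        if not best:
            return None
        normalized_latency = (sum(best) / len(best)) / sigma
        metric = 1.0 / normalized_latency
        # Fewer than NRSI usable latencies: scale by Navail / NRSI.
        if len(best) < n_rsi:
            metric *= len(eligible) / n_rsi
        return metric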
When comparing performance metrics among locations, the last 20 successful performance metrics should be averaged before comparison. The bottom P percentile among a large set of local performance metrics, or metrics below a certain threshold (Θ), may be used to inform a determination of an underserved area. Determining P or Θ will require further analysis and is not defined in this document.
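A corresponding sketch of the comparison step; since P and Θ are explicitly left open by the text, the values below are purely illustrative:

    from statistics import mean

    def site_score(metric_history):
        """Average the last 20 successful (non-None) performance metrics."""
        successes = [m for m in metric_history if m is not None][-20:]
        return mean(successes) if successes else None

    def underserved_candidates(score_by_site, percentile=10, theta=None):
        """Flag sites in the bottom `percentile` of scores, or below theta."""
        scored = {s: v for s, v in score_by_site.items() if v is not None}
        ranked = sorted(scored, key=scored.get)       # worst (lowest) first
        cutoff = max(1, len(ranked) * percentile // 100)
        flagged = set(ranked[:cutoff])
        if theta is not None:
            flagged |= {s for s, v in scored.items() if v < theta}
        return flagged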
On Nov 22, 2020, at 2:31 PM, Steve Crocker <steve@shinkuro.com> wrote:
> Ken,
> Thanks for this. After reading the document and your email, I have the following comments and questions.
>
> 1. I believe I understand the general idea of measuring both availability and latency to each of the various root server identities (RSIs) multiple times over a 30-minute period. I had a bit of trouble understanding the precise specifications. I am completely comfortable with math and math notation, so my trouble is due to some missing pieces in the description. I suspect this is easily remedied with a few more words.

Agree that it could be expressed better. Adding formulas with more squiggly lines and Greek letters would be good. I’d like to talk through the metric on the call tomorrow and get some consensus. I may then ask for help with the notation.
> 2. What is the justification for the numbers, e.g. 20 times, p = 0.65, etc.?

The justification for p = 0.65 is in the user narrative document, but pasted here:
    desired probability: a 0.95 success rate at a measurement point among N = 3 RSIs
    (1 - p)^N <= 1 - 0.95 = 0.05
    p >= 1 - 0.05^(1/3) ≈ 0.63 (for N = 3); round to p = 0.65
    note that the desired rate of 0.95 is a starting point for discussion

As for the “20" values, those are somewhat out of the blue. I was just waiting for someone to ask why and start the discussion. For availability, we need a moderate number of samples; 20 seems reasonable to me right now, but I am open to other thoughts. As for averaging 20 metrics before comparing to other sites, I just wanted something that can wash away some anomalies and give a better idea of what is happening at a location. I feel the same about the T=30 minute interval; open to ideas.
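For what it's worth, that derivation can be checked numerically in a couple of lines of Python (assuming failures at the different RSIs are independent):

    N = 3                              # RSIs consulted per measurement
    p_min = 1 - (1 - 0.95) ** (1 / N)  # smallest p with (1 - p)^N <= 0.05
    print(round(p_min, 4))             # 0.6316, rounded up to 0.65 above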
> 3. What conclusion would you draw if the local measurements show there is extremely high availability but latency varies between very low and not so low? For example, there might be one RSI that is topologically close but suffers intermittent availability and other RSIs that are not as close but extremely reliable.

These scenarios will hopefully average out over enough samples (do we need more than 20?). We are not treating latency and availability as equal factors, but the two scenarios you describe above might end up having similar metrics.
> 4. Adding one or more additional root servers is one way to improve both reliability and availability. Installing local root service is another way to accomplish the same goals. How do you expect the measurement process described in this document will affect discussions regarding these two approaches?

This measurement process should just provide data points for each case. A recursive operator might use a bad result as justification to implement local root for its user base (although this measurement would not reflect the improvements of local root). An RSO would probably want a lot of data points to inform their subjective “underserved” decision. The bottom line is that this metric will not be a definitive answer or a direct reason for any particular action. You still need to do your homework.
Thanks for the comments! -Ken
On Sun, Nov 22, 2020 at 5:50 PM Ken Renard <kdrenard2@gmail.com> wrote:
> On Nov 22, 2020, at 2:31 PM, Steve Crocker <steve@shinkuro.com> wrote:
>> Ken,
>> What is the justification for the numbers, e.g. 20 times, p = 0.65, etc.?
> The justification for p = 0.65 is in the user narrative document, but pasted here:
> desired probability: a 0.95 success rate at a measurement point among N = 3 RSIs
> (1 - p)^N <= 1 - 0.95 = 0.05
> p >= 1 - 0.05^(1/3) ≈ 0.63 (for N = 3); round to p = 0.65
> note that the desired rate of 0.95 is a starting point for discussion
Ok, so this looks like you're trying to achieve a 95% probability of reaching an RSI using three attempts. Implicit in your model is that the failures will be independent, an assumption worth discussing. And your math says that if the probability of reaching an RSI is 65% or better, three (independent) attempts give you a 95% probability of success, because the probability of three failures is 0.35^3 ≈ 0.043, which is less than 0.05. What do you mean by "normalization?"
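A one-line check of that arithmetic, under the same independence assumption:

    print(0.35 ** 3)   # 0.042875 < 0.05, consistent with the 0.95 success target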
Thanks, Steve