[RSSAC Caucus] Work Party: Statistical Prediction of RSS Failure

Nov. 10, 2020

Brad and I are looking at the question implied by the public call for comments at "Recommendations for Early Warning for Root Zone Scaling" <https://www.icann.org/public-comments/recommendations-early-warning-root-sca...>. My sense is that we don't know how to detect the onset of a potential problem. We wonder whether the Caucus might help us out by looking at extant statistics and other data to see if there is something that might be used as a triggering condition. RSSAC raised this question in RSSAC031, and SSAC raised it in SSAC100. We have a number of years' worth of RSSAC002 data (and, if it's useful, there is also DITL data) to review in https://github.com/rssac-caucus/rssac002-data, or reachable from https://root-servers.org.

A proposed approach, at least to investigating the question, would be to:
 - start from https://github.com/rssac-caucus/rssac002-data
 - download available RSSAC002 data (there should be data for most RSOs for several years)
 - observe statistics around past burbles
 - if something jumps out, investigate it further and document it

I could imagine this being a post-doc's paper, published somewhere, but I do want the caucus to be able to see it pre-publication, with any necessary confidentiality provisions (as in, if you send a paper to this list and you need to keep it confidential, please say so).

We might learn that RSSAC002 data doesn't address the issue as it stands, but that it would if some new statistic were added to it. Or we might find the answer by looking at RSSAC047 data, possibly also with some new statistic added. The obvious next step in either case would be to describe, prototype, and characterize the indicated data.

For the record, I have done things like this myself in the past. Cloudflare started providing an anycast service to ISC in April of a few years ago, and the following August took a ten-day outage for reasons I don't recall. I downloaded the indicated statistics and stuffed them into an Excel spreadsheet, from which I derived a graphic. In the graphic, I was able to observe:
- a stable period before the outage
- a transition period when the outage started, during which request/response traffic moved to other servers
- a stable period during the outage
- a transition period when the outage ended, during which request/response traffic moved back
- a stable period after the outage.
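
To make the five phases above concrete, here is a minimal sketch of how one might label them automatically instead of reading them off a chart. It assumes a plain list of daily request counts for one server; the counts, the function names, and the 20% threshold are all made up for illustration and are not taken from RSSAC002 data or from the spreadsheet described above.

```python
def label_periods(daily_counts, threshold=0.2):
    """Label each day 'stable' or 'transition' by relative day-over-day change.

    threshold is a hypothetical cutoff: a change of more than 20% from the
    previous day is treated as a transition.
    """
    labels = ["stable"]  # the first day has no predecessor to compare with
    for prev, cur in zip(daily_counts, daily_counts[1:]):
        if prev == 0:
            # Avoid dividing by zero: any move away from zero is a transition.
            change = 0.0 if cur == 0 else float("inf")
        else:
            change = abs(cur - prev) / prev
        labels.append("transition" if change > threshold else "stable")
    return labels


def collapse(labels):
    """Collapse consecutive identical labels into (label, run_length) pairs."""
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1] + 1)
        else:
            runs.append((label, 1))
    return runs


# Synthetic daily request counts (invented numbers, not real measurements).
counts = [100, 102, 99, 101, 60, 20, 21, 19, 20, 60, 100, 101, 99]
phases = collapse(label_periods(counts))
print(phases)
```

On this synthetic series the runs come out as stable, transition, stable, transition, stable, which is the shape described in the list. Real data would likely need smoothing and a threshold tuned per server.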

The question would be whether we could look at several events and see if there is some identifiable statistical behavior that consistently predicts an outage.
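
One simple starting point, offered only as a sketch, is to compare each day's value against a trailing baseline and flag days that deviate strongly from it. The function name, the 7-day window, and the z-score threshold of 3 are assumptions chosen for illustration, and the series is synthetic. Note that this flags the onset of a change rather than predicting it; whether any statistic moves before an outage is exactly the open question.

```python
from statistics import mean, stdev


def flag_deviations(series, window=7, z_threshold=3.0):
    """Return indices whose value deviates strongly from the trailing window.

    window and z_threshold are hypothetical tuning parameters.
    """
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any change at all is a deviation.
            if series[i] != mu:
                flagged.append(i)
            continue
        if abs(series[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged


# Synthetic daily request counts (invented numbers, not real measurements).
series = [100, 102, 99, 101, 100, 98, 101, 100, 60]
flagged = flag_deviations(series)
print(flagged)
```

Running the same check over several known events, and over several statistics, would show whether any of them is flagged consistently ahead of the outage rather than at or after its start.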

Another discussion, a little more in the direction of pins with angels partying on them, would be to brainstorm what a failure of the RSS might look like and investigate that - what does a systemic failure look like? How would one detect such a failure?

Let's make this a work party, in the sense that I'm the work party shepherd and interested people can also take part. If you want to be an investigator in the project, please reply to this email.

Fred Baker