I saw a tweet from Gavin Brown (@GavinBrown) describing how one particular search engine has indexed the RDAP server of a gTLD registry operator:
https://twitter.com/GavinBrown/status/692718904058191872
This is all the more reason to work on a client authentication specification that includes support for varying responses based on client identity and authorization. I've been working on such a specification and welcome feedback on the approach:
I don't see what the problem is. If you set up an HTTP server that returns interlinked data, search engines will find it and index it. All the information RDAP returns to an unauthenticated query is presumably public, so what's the harm in making it easier to find?

But anyway, if you don't want them to do that, there are plenty of ways to keep them out. The easiest is to publish a /robots.txt file; all legitimate search engines will stay out if that's what it says (a minimal example is below).

Another is to look at the User-Agent string the client sends. Google's is Googlebot, Bing's is Bingbot, Yahoo's is Slurp. It's easy enough to find a list of common spider names. If the agent is a spider, tell it to go away or redirect it to a help page.

Another is to look at the Accept: header. An RDAP client should ask for a JSON media type. For a client that asks for HTML or anything else, return an HTML version with a robots meta tag in the head saying NOINDEX, NOFOLLOW. (A sketch of both header checks follows this message.)

The big search engines spider at low speed from hosts all over the world to avoid overloading the sites they index. You're not going to keep them out via authentication without also keeping out everyone else who doesn't have a password. I don't think that's a good idea.

R's,
John
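For reference, a robots.txt that tells all well-behaved crawlers to stay out of the entire server is just:

    User-agent: *
    Disallow: /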
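And here is a minimal sketch, using only the Python standard library, of the two header checks described above: turn known spiders away by User-Agent, and hand a NOINDEX/NOFOLLOW HTML page to anything that doesn't ask for a JSON media type. The bot list, the /help redirect target, and the placeholder RDAP body are illustrative assumptions, not a complete or recommended configuration.

    import json
    from wsgiref.simple_server import make_server

    # Illustrative list of common crawler User-Agent substrings; extend as needed.
    KNOWN_BOTS = ("googlebot", "bingbot", "slurp")

    # HTML page marked so that crawlers neither index it nor follow its links.
    NOINDEX_HTML = (b"<html><head>"
                    b'<meta name="robots" content="noindex, nofollow">'
                    b"</head><body>Please use an RDAP client.</body></html>")

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        accept = environ.get("HTTP_ACCEPT", "")

        # Known spiders: redirect them to a help page instead of the data.
        if any(bot in ua for bot in KNOWN_BOTS):
            start_response("302 Found", [("Location", "/help")])
            return [b""]

        # Clients that don't ask for JSON get the NOINDEX/NOFOLLOW HTML page.
        if "json" not in accept:
            start_response("200 OK", [("Content-Type", "text/html")])
            return [NOINDEX_HTML]

        # Otherwise serve the RDAP response (placeholder body here).
        body = json.dumps({"rdapConformance": ["rdap_level_0"]}).encode()
        start_response("200 OK", [("Content-Type", "application/rdap+json")])
        return [body]

    if __name__ == "__main__":
        make_server("", 8080, app).serve_forever()

In practice these checks would sit in front of (or inside) the real RDAP server, but they show the idea: the decision is made on request headers alone, without requiring any client credentials.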