I saw a tweet from Gavin Brown (@GavinBrown) describing how one particular search engine has indexed the RDAP server of a gTLD registry operator:
https://twitter.com/GavinBrown/status/692718904058191872
This is all the more reason to work on a client authentication specification that includes support for varying responses based on client identity and authorization. I've been working on such a specification and welcome feedback on the approach:
I don't see what the problem is. If you set up an HTTP server that returns interlinked data, search engines will find it and index it. All the information RDAP returns to an unauthenticated query is presumably public, so what's the harm in making it easier to find?

But anyway, if you don't want them to do that, there are plenty of ways to keep them out. The easiest is to publish a /robots.txt file; all legitimate search engines will stay out if that's what it says (a minimal example is below).

Another is to look at the User-Agent string the client sends. Google's is Googlebot, Bing's is Bingbot, Yahoo's is Slurp. It's easy enough to find a list of common spider names. If the agent is a spider, tell it to go away or redirect it to a help page.

Another is to look at the Accept: header. An RDAP client should ask for a JSON media type. For a client that asks for HTML or anything else, return an HTML version with a robots meta tag in the head saying NOINDEX, NOFOLLOW. (A sketch of both header checks follows this message.)

The big search engines spider at low speed from hosts all over the world to avoid overloading the sites they index. You're not going to keep them out via authentication without also keeping out everyone else who doesn't have a password. I don't think that's a good idea.

R's,
John
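For reference, a robots.txt that tells all well-behaved crawlers to stay out of the entire server is just:

    User-agent: *
    Disallow: /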
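And here is a minimal sketch, using only the Python standard library, of the two header checks described above: turn known spiders away by User-Agent, and hand a NOINDEX/NOFOLLOW HTML page to anything that doesn't ask for a JSON media type. The bot list, the /help redirect target, and the placeholder RDAP body are illustrative assumptions, not a complete or recommended configuration.

    import json
    from wsgiref.simple_server import make_server

    # Illustrative list of common crawler User-Agent substrings; extend as needed.
    KNOWN_BOTS = ("googlebot", "bingbot", "slurp")

    # HTML page marked so that crawlers neither index it nor follow its links.
    NOINDEX_HTML = (b"<html><head>"
                    b'<meta name="robots" content="noindex, nofollow">'
                    b"</head><body>Please use an RDAP client.</body></html>")

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        accept = environ.get("HTTP_ACCEPT", "")

        # Known spiders: redirect them to a help page instead of the data.
        if any(bot in ua for bot in KNOWN_BOTS):
            start_response("302 Found", [("Location", "/help")])
            return [b""]

        # Clients that don't ask for JSON get the NOINDEX/NOFOLLOW HTML page.
        if "json" not in accept:
            start_response("200 OK", [("Content-Type", "text/html")])
            return [NOINDEX_HTML]

        # Otherwise serve the RDAP response (placeholder body here).
        body = json.dumps({"rdapConformance": ["rdap_level_0"]}).encode()
        start_response("200 OK", [("Content-Type", "application/rdap+json")])
        return [body]

    if __name__ == "__main__":
        make_server("", 8080, app).serve_forever()

In practice these checks would sit in front of (or inside) the real RDAP server, but they show the idea: the decision is made on request headers alone, without requiring any client credentials.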