Search Engines Indexing RDAP Server Content
(cross-posting to multiple lists - sorry if that's inconvenient)

So I saw a tweet from Gavin Brown (@GavinBrown) that describes how one particular search engine has indexed the RDAP server of a gTLD registry operator:

https://twitter.com/GavinBrown/status/692718904058191872

This is all the more reason to work on a client authentication specification that includes support for varying responses based on client identity and authorization. I've been working on such a specification and welcome feedback on the approach:

https://datatracker.ietf.org/doc/draft-hollenbeck-weirds-rdap-openid/

It also begs the question of the need for a BCP describing operational practices for server operators. There are ways for web servers to influence or restrict crawler behavior, but what's appropriate in this context?

Scott
On Fri, Jan 29, 2016 at 8:34 AM, Hollenbeck, Scott <shollenbeck@verisign.com> wrote:
(cross-posting to multiple lists - sorry if that's inconvenient)
So I saw a tweet from Gavin Brown (@GavinBrown) that describes how one particular search engine has indexed the RDAP server of a gTLD registry operator:
Interesting... especially that most of the results are entities, though there don't seem to be that many.
This is all the more reason to work on a client authentication specification that includes support for varying responses based on client identity and authorization. I've been working on such a specification and welcome feedback on the approach:
https://datatracker.ietf.org/doc/draft-hollenbeck-weirds-rdap-openid/
+1 -andy
So I saw a tweet from Gavin Brown (@GavinBrown) that describes how one particular search engine has indexed the RDAP server of a gTLD registry operator:
https://twitter.com/GavinBrown/status/692718904058191872
This is all the more reason to work on a client authentication specification that includes support for varying responses based on client identity and authorization. I've been working on such a specification and welcome feedback on the approach:
I don't see what the problem is. If you set up an HTTP server that returns interlinked data, search engines will find it and index it. All the information RDAP returns to an unauthenticated query is presumably public, so what's the harm in making it easier to find?

But anyway, if you don't want them to do that, there are plenty of ways to keep them out. The easiest is to publish a /robots.txt file. All legitimate search engines will stay out if that's what it says.

Another is to look at the agent string the client sends. Google's is googlebot, Bing's is bingbot, Yahoo's is Slurp. It's easy enough to find a list of common spider names. If the agent is a spider, tell it to go away or redirect it to a help page.

Another is to look at the Accept: header. An RDAP client should ask for a JSON media type. For a client that asks for HTML or anything else, return an HTML version with meta fields in the header saying NOINDEX and NOFOLLOW.

The big search engines spider at low speed from hosts all over the world to avoid overloading the sites they index. You're not going to keep them out via authentication without also keeping out everyone else who doesn't have a password. I don't think that's a good idea.

R's,
John
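The user-agent and Accept-header techniques described above can be sketched as a small request classifier. This is an illustrative sketch, not code from any real RDAP server; the spider list, function name, and returned labels are assumptions:

```python
# Known crawler user-agent substrings (illustrative, not exhaustive).
KNOWN_SPIDERS = ("googlebot", "bingbot", "slurp")

def classify_request(user_agent: str, accept: str) -> str:
    """Decide how to handle a request to an RDAP endpoint.

    Returns one of:
      'redirect'      - a known spider; send it to a help page instead
      'json'          - a real RDAP client that asked for a JSON media type
      'noindex-html'  - anything else; serve HTML tagged NOINDEX, NOFOLLOW
    """
    ua = user_agent.lower()
    if any(bot in ua for bot in KNOWN_SPIDERS):
        return "redirect"
    if "application/rdap+json" in accept or "application/json" in accept:
        return "json"
    return "noindex-html"

# The HTML fallback would carry robots meta tags in its <head>:
NOINDEX_HTML = (
    "<html><head>"
    '<meta name="robots" content="NOINDEX, NOFOLLOW">'
    "</head><body>This is an RDAP endpoint; responses are JSON.</body></html>"
)
```

A server would call `classify_request` on each incoming request before dispatching to its normal RDAP handler; the point is only that the dispatch keys on headers the client already sends.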
On 29 Jan 2016, at 11:30, John R. Levine wrote:
I don't see what the problem is.
Agree. On a more meta level, I am nervous that the swap from whois to rdap more and more ends up being "publish openly in rdap what is available in whois", which I think misses one of the whole ideas of rdap: to have profiles, where different profiles have different access. So sensitive data (like the name and address of private persons who have registered domain names -- at least in Sweden) should simply not be available to all via rdap unless access is from an authenticated and authorized account. All to live up to the data protection laws here and there.

I.e. without the use of profiles in rdap, and different access rights, we will end up in this mess over and over again.

Patrik
On 29 Jan 2016, at 5:34, Hollenbeck, Scott wrote:
It also begs the question of the need for a BCP describing operational practices for server operators. There are ways for web servers to influence or restrict crawler behavior, but what's appropriate in this context?
It would be good to keep in mind that respecting mechanisms such as robots.txt is entirely voluntary on the crawler's side.

Luis Muñoz
Director, Registry Operations
http://www.uniregistry.link/
2161 San Joaquin Hills Road
Newport Beach, CA 92660
Office +1 949 706 2300 x 4242
lem@uniregistry.link
It also begs the question of the need for a BCP describing operational practices for server operators. There are ways for web servers to influence or restrict crawler behavior, but what's appropriate in this context?
It would be good to keep in mind that respecting mechanisms such as robots.txt is entirely voluntary on the crawler’s side.
Right -- the only reasonable assumption is that if casual users can find it with HTTP, evil search engines (of which there are plenty) will too. The search engines that most people use, such as Google and Bing, all obey robots.txt, so it will keep info out of casual searches. But I'd want to understand what the threat model is before inventing solutions.

Regards,
John Levine, johnl@taugh.com, Taughannock Networks, Trumansburg NY
Please consider the environment before reading this e-mail.
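For reference, a robots.txt that asks well-behaved crawlers to stay out of an RDAP tree entirely can be as simple as the fragment below. The file lives at the root of the server; the blanket Disallow is an illustrative choice, and an operator who wanted the help page indexed but not query responses would use narrower Disallow lines instead:

```text
# Served at http://<rdap-server>/robots.txt
# Applies to all crawlers that honor the Robots Exclusion Protocol.
User-agent: *
Disallow: /
```

As noted above, honoring this file is voluntary, so it only filters out the well-behaved crawlers.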
This is not a very good example of the need for operational practices. It represents Google UK's view of what you can find when you start at "http://rdg.afilias.info/rdap/help", which includes explicit examples of how to use the RDAP server.

Jim

On 29 Jan 2016, at 8:34, Hollenbeck, Scott wrote:
(cross-posting to multiple lists - sorry if that's inconvenient)
So I saw a tweet from Gavin Brown (@GavinBrown) that describes how one particular search engine has indexed the RDAP server of a gTLD registry operator:
https://twitter.com/GavinBrown/status/692718904058191872
This is all the more reason to work on a client authentication specification that includes support for varying responses based on client identity and authorization. I've been working on such a specification and welcome feedback on the approach:
https://datatracker.ietf.org/doc/draft-hollenbeck-weirds-rdap-openid/
It also begs the question of the need for a BCP describing operational practices for server operators. There are ways for web servers to influence or restrict crawler behavior, but what's appropriate in this context?
Scott
_______________________________________________
weirds mailing list
weirds@ietf.org
https://www.ietf.org/mailman/listinfo/weirds
-----Original Message-----
From: James Galvin [mailto:galvin@elistx.com]
Sent: Wednesday, February 03, 2016 8:01 AM
To: Hollenbeck, Scott
Cc: weirds@ietf.org; regops@nlnetlabs.nl; gtld-tech@icann.org
Subject: Re: [weirds] Search Engines Indexing RDAP Server Content
This is not a very good example of the need for operational practices.
It represents Google UK’s view of what you can find when you start at “http://rdg.afilias.info/rdap/help”, which includes explicit examples of how to use the RDAP server.
Jim, you're probably right about it not being a good example. I'm more concerned with the information that could be returned in response to an entity query being archived.

Side note: the response received for http://rdg.afilias.info/rdap/help is encoded using HTML. Sections 7 and 4.3 of RFC 7483 say it should be JSON-encoded: "The appropriate response to /help queries as defined by [RFC7482] is to use the notices structure as defined in Section 4.3." Would it have been indexed at all if it wasn't HTML? There's an operational practice thing in here somewhere... ;)

Scott
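For comparison, a minimal JSON-encoded /help response using the notices structure of RFC 7483 Section 4.3 might look like the sketch below. The rdap.example.net URL and the description text are illustrative, not taken from any deployed server:

```json
{
  "rdapConformance": [ "rdap_level_0" ],
  "notices": [
    {
      "title": "RDAP Help",
      "description": [
        "Query syntax: /domain/<name>, /entity/<handle>, /nameserver/<name>"
      ],
      "links": [
        {
          "href": "https://rdap.example.net/help",
          "rel": "self",
          "type": "application/rdap+json"
        }
      ]
    }
  ]
}
```

Served with Content-Type application/rdap+json rather than text/html, a response like this gives crawlers no links to follow and no document they would normally render.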
participants (7)

- Andrew Newton
- Hollenbeck, Scott
- James Galvin
- John R Levine
- John R. Levine
- Luis E. Muñoz
- Patrik Fältström