Dear colleagues,

On Mon, Jul 25, 2011 at 06:06:45PM +0100, Vladimir Shadrunov wrote:
> However, I can't help noticing that this document is centred around the concept that variant labels are the ones that are generated by means of character-by-character substitution via Language Variant Tables (I'll refer to them as character-based variants).
That is entirely correct. I'm a little concerned, however, that the following message hasn't come through clearly: there is no requirement that these definitions form the basis of individual teams' reports. The request is that, if you want to use one of these terms, you use it in exactly the sense that it is defined in the document. If a given definition does not work for your case, it is entirely appropriate to define some new term in order to capture that meaning.

To emphasise, the most recent posting of the definitions includes this in the note to readers:

    The purpose of the definitions document is to define and reserve a number of unambiguous terms. The definitions are not required to be used by the case study teams; but if the teams wish to use these terms, they are asked to use them as defined in this document and not in any other way. If the case study teams use these terms, they will be interpreted as they are defined here.

(see https://community.icann.org/download/attachments/16842765/initial_definition...)
> It has been already suggested by several members that variant strings can also be based on other types of similarity/equivalence, such as:
>
> - *visual* (strings that have different code points, but are visually identical)
> - *phonetic* (strings that are written differently, but sound the same)
> - *semantic* (strings that "mean" the same, with full understanding that not every DNS label has a meaning. Some labels, however, do have meanings so "semantically equivalent TLD variant strings" doesn't at all sound outrageous to me).
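The *visual* category in the list above (different code points, identical appearance) can be illustrated with a minimal sketch. It uses Python's standard `unicodedata` module. The sample strings and the helper names `code_points` and `same_after_nfc` are hypothetical, chosen only for illustration, and are not taken from the definitions document or from any team's report.

```python
import unicodedata

# Hypothetical sample strings, chosen only for illustration.
precomposed = "caf\u00e9"          # "e with acute" as one code point, U+00E9
decomposed = "cafe\u0301"          # "e" followed by combining acute, U+0301

latin = "pay"                      # Latin p, a, y
cyrillic = "\u0440\u0430\u0443"    # Cyrillic er, a, u: look-alikes in many fonts


def code_points(label):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in label]


def same_after_nfc(a, b):
    """True if the two strings are identical after Unicode NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    for label in (precomposed, decomposed, latin, cyrillic):
        print(label, code_points(label))
    # Two encodings of the same abstract character: unified by normalization.
    print(same_after_nfc(precomposed, decomposed))
    # Cross-script look-alikes: normalization does not relate them.
    print(same_after_nfc(latin, cyrillic))
```

The first pair is two encodings of one abstract character, so normalization makes them compare equal. The second pair only looks alike: the strings remain distinct after normalization, so treating them as related would need some separately defined mapping, which this sketch does not attempt.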
I agree very strongly that each of these is an important topic, and that each one may need some thought and consideration.

Speaking only personally, one of the good things I think is in the definitions provided to the teams is that they refer to external, well-reviewed documents. Even if one thinks that a given definition is wrong, each one is by and large comprehensive. (So far, the exceptions are all ones that were identified by the Cyrillic team; these were addressed in the version posted at the URL above, and I was delighted that the review turned up these omissions.)

The first class you identify above seems to be the case covered by the definitions. For that case, we can produce an operational definition of the kinds of sameness for which we are trying to test. Moreover, that case is historically well-explained: we know, for instance, that the same Abstract Character is sometimes encoded in more than one way in Unicode. So there is a simple, practical, technical reason to try to cause two different DNS labels to resolve such that there is minimal difference between them. This class, however, contains a smaller number of cases than some people would like to address.

In the case of the second class -- phonetic similarity -- I am not sure where we might find a document that provides the operational definition of this case. "Sound the same" is at least a tricky problem, particularly if one wants to specify this algorithmically (it's not impossible, of course, at least in some scripts). Moreover, I also don't know what the basis for treating this case the same as the first class would be. Strings that sound the same (when spoken) need not even be related to one another. To use a simple example in English, the string "your" and the string "you're" are pronounced similarly, and in some regional accents indistinguishably -- so much so that these are often mistakenly substituted for one another in running text. But they have quite different meanings, since the former means "belonging to you" and the latter means "you are".

A rigorous definition in this area would be most helpful, and some additional work to explain why sound-alike labels ought properly to be treated as "the same", for some value of "same". For traditionally, sound-alike labels are just different labels. Why shouldn't that continue to be the case?

In the case of the third class -- semantic similarity -- I first of all deny the premise. I claim that _no_ DNS label has a meaning. It is possible (indeed likely) that some people use labels as though they had meaning, but that is quite a different matter from the labels actually having such meaning. That you might attribute a meaning to a given label could well mean that you understand it to have the same meaning as some other label. But that is not the same as deciding that the label has that meaning in itself -- for instance, the same pair might not be related for me.

How would one determine this? What authority would one depend on? And how is this to be determined algorithmically? Moreover, given that the tradition in the DNS runs exactly counter to this -- traditionally, the labels "color" and "colour" are simply different labels, period -- what justification is there for including this type of similarity in the work of the teams?

Nevertheless, if you and your colleagues need to define these terms for your purpose, then I encourage you to do so, always remembering that any definition other than an operational one will be almost impossible to use in any report that is intended to inform a final policy.

Best regards,

Andrew

--
Andrew Sullivan
ajs@anvilwalrusden.com