Re: [vip] Types of variants: do we have consensus?

July 25, 2011

Dear colleagues,

On Mon, Jul 25, 2011 at 06:06:45PM +0100, Vladimir Shadrunov wrote:
...
> However, I can't help noticing that this document is centred around the
> concept that variant labels are the ones that are generated by means of
> character-by-character substitution via Language Variant Tables (I'll refer
> to them as character-based variants).

That is entirely correct.  I'm a little concerned, however, that the
following message hasn't come through clearly: there is no requirement
that these definitions form the basis of individual teams' reports.
The request is that, if you want to use one of these terms, you use it
in exactly the sense that it is defined in the document.  If a given
definition does not work for your case, it is entirely appropriate to
define some new term in order to capture that meaning.  To emphasise,
the most recent posting of the definitions includes this in the note
to readers:

    The purpose of the definitions document is to define and reserve a
    number of unambiguous terms. The definitions are not required to
    be used by the case study teams; but if the teams wish to use
    these terms, they are asked to use them as defined in this
    document and not in any other way. If the case study teams use
    these terms, they will be interpreted as they are defined here.

(see
https://community.icann.org/download/attachments/16842765/initial_definition...)
...
> It has been already suggested by several members that variant strings can
> also be based on other types of similarity/equivalence, such as:
>    - *visual* (strings that have different code points, but are visually
>    identical)
>    - *phonetic* (strings that are written differently, but sound the same)
>    - *semantic* (strings that "mean" the same, with full understanding that
>    not every DNS label has a meaning. Some labels, however, do have meanings so
>    "semantically equivalent TLD variant strings" doesn't at all sound
>    outrageous to me).

I agree very strongly that each of these is an important topic, and
that each one may need some thought and consideration.

Speaking only personally, I think one of the good things about the
definitions provided to the teams is that they refer to external,
well-reviewed documents.  Even if one thinks that a given definition
is wrong, each one is by and large comprehensive.  (So far, the
exceptions are all ones that were identified by the Cyrillic team;
these were addressed in the version posted at the URL above, and I was
delighted that the review turned up these omissions.)

The first class you identify above seems to be the case covered by the
definitions.  For that case, we can produce an operational definition
of the kinds of sameness for which we are trying to test.  Moreover,
that case is historically well-explained: we know, for instance, that
the same Abstract Character is sometimes encoded in more than one way
in Unicode.  So there is a simple, practical, technical reason to try
to cause two different DNS labels to resolve such that there is
minimal difference between them.  This class, however, contains a
smaller number of cases than some people would like to address.
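
To make the point about multiple encodings concrete, here is a minimal
sketch in Python.  The letter "é" and the strings built from it are
chosen purely for illustration and are not taken from the definitions
document.  Unicode normalization (NFC) stands in here as one example of
an operational test of sameness; it is not the variant mechanism
itself, and U-labels are in any case required to be in NFC already.

```python
import unicodedata

# Two encodings of one abstract character (illustrative choice).
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def same_after_nfc(a: str, b: str) -> bool:
    """Operational test: do the strings match once both are in NFC?"""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

label_a = "caf" + precomposed
label_b = "caf" + decomposed

# The code point sequences differ, so a naive comparison fails.
print(label_a == label_b)
# After normalization the two strings are identical.
print(same_after_nfc(label_a, label_b))
```

The test is mechanical: it needs no judgment about what a reader sees,
hears, or understands, which is what makes this class tractable.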

In the case of the second class -- phonetic similarity -- I am not
sure where we might find a document that provides the operational
definition of this case.  "Sound the same" is at least a tricky
problem, particularly if one wants to specify this algorithmically
(it's not impossible, of course, at least in some scripts).  Moreover,
I also don't know what the basis for treating this case the same as
the first class would be.  Strings that sound the same (when spoken)
need not even be related to one another.  To use a simple example in
English, the string "your" and the string "you're" are pronounced
similarly, and in some regional accents indistinguishably -- so much
so that these are often mistakenly substituted for one another in
running text.  But they have quite different meanings, since the
former means "belonging to you" and the latter means "you are".  A
rigorous definition in this area would be most helpful, as would some
additional work to explain why sound-alike labels ought properly to be
treated as "the same", for some value of "same".  Traditionally,
sound-alike labels are just different labels.  Why shouldn't that
continue to be the case?
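
As an illustration of what an algorithmic "sounds the same" test can
look like, here is a sketch of classic American Soundex in Python.
Soundex is swapped in only as a well-known example; nobody in this
discussion has proposed it, it is tuned to English names, and the
function and table names below are hypothetical.

```python
# Soundex digit for each coded consonant; vowels, h, w, y have no code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word: str) -> str:
    """Return the four-character Soundex key for an ASCII word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue          # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code           # a vowel resets prev, allowing a repeat
    return (key + "000")[:4]  # pad or truncate to four characters

print(soundex("your"), soundex("you're"))
```

Both words receive the same key, so the algorithm can report that two
strings sound alike.  It says nothing about whether they are related,
nor about why they ought to be treated as the same label.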

In the case of the third class -- semantic similarity -- I first of
all deny the premise.  I claim that _no_ DNS label has a meaning.  It
is possible (indeed likely) that some people use labels as though they
had meaning, but that is quite a different matter from the labels
actually having such meaning.  That you might attribute a meaning to a
given label could well mean that you understand it to have the same
meaning as some other label.  But that is not the same as deciding
that the label has that meaning in itself -- for instance, the same
pair might not be related for me.  How would one determine this?  What
authority would one depend on?  And how is this to be determined
algorithmically?  Moreover, given that the tradition in the DNS runs
exactly counter to this -- traditionally, the labels "color" and
"colour" are simply different labels, period -- what justification is
there for including this type of similarity in the work of the teams?
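
The traditional comparison can be stated operationally.  A minimal
sketch in Python, assuming plain ASCII labels and a hypothetical helper
name: DNS matching ignores ASCII case and nothing else.

```python
def dns_labels_equal(a: str, b: str) -> bool:
    """Compare two ASCII labels the traditional DNS way:
    byte for byte, ignoring ASCII case only."""
    # bytes.lower() touches only A-Z, which is the intended behaviour.
    return a.encode("ascii").lower() == b.encode("ascii").lower()

print(dns_labels_equal("COLOR", "color"))
print(dns_labels_equal("color", "colour"))
```

The first pair matches and the second does not: the comparison has no
notion of spelling variants or shared meaning.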

Nevertheless, if you and your colleagues need to define these terms
for your purpose, then I encourage you to do so, always remembering
that any definition other than an operational one will be almost
impossible to use in any report that is intended to inform a final
policy.

Best regards,

Andrew

-- 
Andrew Sullivan
ajs@anvilwalrusden.com