After some of the discussion
in the chat during this morning's meeting, I think a somewhat
more extended explanation of variants might be helpful.
The repertoire for the Latin
script consists of "codepoints" -- some are letters and some
are letters plus diacritics. "Variants" are pairs of
codepoints which are indistinguishable. That is, in the
process that the Panel used, 5 of the 7 experts on the Panel
couldn't see a difference. The Latin GP did not look at
diacritics per se, just at codepoints which might involve
diacritics.
Thus, a codepoint consisting
of a letter with a caron diacritic ( ̌ ) and a
codepoint with the same letter combined with a breve diacritic
( ̆ ) may turn out to be a variant pair every time, but
only because the Panel's comparison worked out that way. For
example, a G with caron (ǧ) and a G with breve (ğ) are variants.
On the other hand, a caron and a macron ( ¯ )
never result in a variant pair.
However, some cases with
diacritics are mixed. For example, a codepoint consisting of
a letter with a dot above ( ˙ ) and a codepoint
consisting of the same letter with an acute accent form a
variant pair for the letters C (ċ vs ć),
N (ṅ vs ń), and Z (ż
vs ź). But, in the Panel's original finding, not for the letters
E (ė vs é) and I (i vs í).
(Note that a majority of the
Panel found the vowels to produce variants as well, just not
a supermajority, as required by the process the Panel had
adopted. As a result, the Panel's official position is that,
in various cases and not just this one, even though a majority
of the experts, looking side by side, could not see a difference,
the average "reasonably careful user" will somehow magically
notice the difference when looking at a domain name.)
Then we have cross-script
variants, including those identified by other Panels. For
example, the Greek Panel found that the Greek letter Iota was
a variant both of the Latin letter I and of the Latin letter I
with acute. As a result, since both are variants of Iota, I
and I with acute became variants of each other.
But there is no Greek letter
which is a variant of the Latin letter E. So we are left with
a situation where the dot above diacritic and the acute
produce variants for all letters EXCEPT for the letter E.
(When I suggested that, for consistency, we should make the
letter E case a variant as well, the response was "It is more
important that we follow our process than that we have
consistency.")
TLDs
consist of a series of codepoints. Proposed TLDs which
differ from another TLD only by one or more variants
will automatically be rejected by the software.
For example, .çom would be allowed,
despite its similarity to .com, because C with Cedilla
is not a variant of C. Also .сом (using Cyrillic
letters) would be allowed because, while C and the
Cyrillic letter Es are variants, and O and the
Cyrillic letter O are variants, the letter M and the
Cyrillic letter Em are not variants (their Upper Case
forms look alike, but the Panel was directed to ignore
Upper Case when deciding what might confuse users).
But .cóm would be rejected, because O and O with
acute are variants.
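To make the rejection rule concrete, here is a minimal sketch
in Python of how such a check might work. This is not the
actual software. The variant table holds only the pairs
mentioned in this message, the names (VARIANT_PAIRS,
build_classes, is_variant_label) are made up for illustration,
and it assumes that variants chain together the way the Iota
example above suggests.

```python
import unicodedata

# Hypothetical table: only the variant pairs named in this message.
# The real repertoire is far larger.
VARIANT_PAIRS = [
    ("c", "\u0441"),  # Latin c / Cyrillic es
    ("o", "\u043e"),  # Latin o / Cyrillic o
    ("o", "\u00f3"),  # Latin o / Latin o with acute
]


def build_classes(pairs):
    """Group codepoints into variant classes with a small union-find.

    Assumption: variants chain (if A~B and B~C then A~C).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    return find


find = build_classes(VARIANT_PAIRS)


def is_variant_label(candidate, existing):
    """True if every position holds the same codepoint or a variant of it.

    Note: identical labels also return True; a real check would treat
    an exact duplicate as its own case.
    """
    a = unicodedata.normalize("NFC", candidate)
    b = unicodedata.normalize("NFC", existing)
    if len(a) != len(b):
        return False
    return all(find(x) == find(y) for x, y in zip(a, b))


print(is_variant_label("\u00e7om", "com"))            # .çom vs .com
print(is_variant_label("\u0441\u043e\u043c", "com"))  # Cyrillic .сом vs .com
print(is_variant_label("c\u00f3m", "com"))            # .cóm vs .com
```

With this table, only the last comparison comes out as a
variant of .com, which matches the three examples above.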
"Confusables" are pairs
of codepoints which some of the experts could not
distinguish, just not enough of them for the pair to be
designated as variants. Confusables are intended as
suggestions for the panel which will manually review the
proposed TLDs.
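The difference between a variant and a confusable comes down
to a vote count, so a small sketch in Python may help. The
5-of-7 figure is from the Panel's process as described above.
Treating it as "5 or more", and treating a single expert as
enough to make a pair a confusable, are my assumptions for
illustration, as is the function name classify_pair.

```python
PANEL_SIZE = 7
VARIANT_THRESHOLD = 5  # supermajority; assumed to mean "5 or more"


def classify_pair(experts_unable_to_distinguish):
    """Classify a codepoint pair from the number of experts who saw no difference."""
    n = experts_unable_to_distinguish
    if not 0 <= n <= PANEL_SIZE:
        raise ValueError("vote count must be between 0 and 7")
    if n >= VARIANT_THRESHOLD:
        return "variant"
    if n >= 1:  # assumption: "some of the experts" means at least one
        return "confusable"
    return "distinct"


for votes in (5, 4, 0):
    print(votes, classify_pair(votes))
```

Note that 4 of 7, a simple majority, still lands in the
confusable bucket. That is the situation with the vowels
described earlier.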
I hope this all will help
everyone understand what we are looking at here.
Regards,
Bill Jouris