Re: [CPWG] Variants and Process

Oct. 23, 2021

      Dear Dr. Gopal, 
There may be such software out there.  I can only say that, if there is, I am not familiar with it.  I rather suspect that, if there is, one of the first parameters required is "How similar, or how different, do you want?  Set a threshold."  Followed by "Do you require consistency?  That is, if two diacritics produce variants in some cases, must they do so in all cases?"  Which, as you say, leaves the most critical question still to be answered. 
In the end, I think we are stuck with some variation of a "consensus of experts" judgement.  The more cogent question is: what kinds of experts?  Linguists?  Experts in human perception (specifically visual perception)?  Or experts in the behavior of end users?  The IDN project has opted, essentially, for linguists -- whether by default or actual preference I do not know.
Regards, 
Bill Jouris   

    On Friday, October 22, 2021, 07:32:40 PM PDT, <gopal@annauniv.edu> wrote:  

 Dear Bill Jouris,

Many thanks again for your presentation to the CPWG on 6 October 2021.

It has been a fantastic effort by your seven-member team from six
different countries.

Ref Slide #12: UNICODE 00FE and 01A5

The quantification for decision making was based on a 5-point linear
scale, with the seven experts using only the "2-4" range. Also, this
was for three popular typefaces.

I know this is just one sample, and your question in the next slide,
"How Much is Enough?", is vital.

Is there a tool / simulator that makes it all more generic for larger
samples, different languages and different quantification scales such
as the Likert Scale?

We can then anticipate the code generator within an acceptable
confidence interval.

Once again, a big thank you from me for such nice work and
presentation.

Please advise.

Sincerely,

Gopal T V
0 9840121302
https://vidwan.inflibnet.ac.in/profile/57545
https://www.facebook.com/gopal.tadepalli
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dr. T V Gopal
Professor
Department of Computer Science and Engineering
College of Engineering
Anna University
Chennai - 600 025, INDIA
Ph : (Off) 22351723 Extn. 3340
      (Res) 24454753
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 2021-10-23 03:28, Bill Jouris via CPWG wrote:
...
Dear Roberto,
Not all that off-topic.  In general, you are correct that combinations
of letters got ignored.  For example, a Latin letter R, followed by a
Latin Letter N is, to my mind, hard to distinguish from a Latin letter
M.  If you saw .corn, would you realize it was about maize, rather
than being a normal .com?  But it didn't get considered in identifying
variants.
The Sharp S is the exception.  The panel concluded that the Sharp S
(ß) and a double S (ss) are variants.  Most variants are
bidirectional -- that is, it doesn't matter which one was registered
first, the other is blocked.  But this case is different.  If the name
with a double S is registered first, then the Sharp S is indeed
blocked.  However, if the name with Sharp S is registered first, then
the variant is considered "allocatable."  That is, the same name with a
double S rather than Sharp S _can_ be registered, provided:
1) ALL of the instances of Sharp S in the name (if there is more than
one) are changed to double S, and
2) the name is registered to the same registrant.
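As a rough illustration of the asymmetric Sharp S rule described above, here is a minimal Python sketch. The function names, the dictionary used as a registry, and the sample labels are all hypothetical, made up for this example; it is not the actual registry software. It assumes that two labels count as Sharp S variants when they become identical once every ß is replaced by ss.

```python
SHARP_S = "\u00df"  # ß


def to_double_s(label: str) -> str:
    """Replace every Sharp S in the label with 'ss'."""
    return label.replace(SHARP_S, "ss")


def may_register(requested: str, requester: str, registry: dict) -> bool:
    """Decide whether `requested` may be registered by `requester`.

    `registry` maps already-registered labels to their registrants
    (a hypothetical stand-in for real registry data).
    """
    if requested in registry:
        return False  # the exact label is already taken
    for existing, owner in registry.items():
        if to_double_s(existing) != to_double_s(requested):
            continue  # not related through ß / ss
        # The two labels are variants. The only allocatable case is the
        # all-double-S form of a name registered with Sharp S,
        # requested by the same registrant.
        fully_converted = requested == to_double_s(existing)
        if not (fully_converted and owner == requester):
            return False
    return True


registry = {"stra\u00dfe": "alice"}  # "straße" was registered first
print(may_register("strasse", "alice", registry))  # same registrant
print(may_register("strasse", "bob", registry))    # different registrant
```

The first call is allowed and the second is blocked. With the order reversed (double S registered first), the Sharp S form is blocked for everyone, including the original registrant.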
On the other hand, the possibility of substituting a vowel with
diaeresis for the same vowel followed by E did not come up.  That is
the way I learned German (as an American) long ago.  But the native
German speakers on the Panel did not consider it worth worrying about.
Sorry if that doesn't totally clarify things.  But that's all I've got
on the subject.
Bill Jouris
  On Thursday, October 21, 2021, 11:47:20 PM PDT, Roberto Gaetano
<roberto_gaetano@hotmail.com> wrote:
  Dear Bill,
I wonder whether I am off-topic with this question, but here it is
anyway.
Has the Latin GP considered an additional potential confusion coming
from cases like the German equivalence between “ae” and “ä”
or “ss” and “ß”? Just to give an example, the Austrian
Touring Club (ÖAMTC) has the site oeamtc.at [2], as it is customary
in German-speaking countries to get around the problem in this way.
This is most probably out of scope, because the work is likely to be
limited to single characters and not combinations of characters, but
from the user’s point of view it could be a source of confusion
anyway.
Thanks,
Roberto
...
On 21.10.2021, at 19:47, Bill Jouris via CPWG <cpwg@icann.org>
wrote:
Dear Olivier,
That is the problem I see as well.  My sense is that both the Latin
GP and the Integration Panel (which is the next level higher)
desire primarily to minimize the number of variants.  Two codepoints
which are identical, such as the Latin schwa and the Latin turned E,
obviously cannot be distinguished by anyone, and so are necessarily
variants.  (Although one of my fellow Panel members argued against
variant status even for that specific case.)  But how strict the
constraints were on making two codepoints variants reflects that
desire for minimization.  Also, in at least one case, the Integration Panel
requested the Latin GP review (and modify) some variant findings
because one set of codepoints which were variants of each other was
"too large."  ("Too large" wasn't defined.  Nor was there indication
of why one would care. Certainly it wouldn't impact the performance
of the software doing the automatic filtering of proposed TLDs.)
Given that
a) the Panel members are experts,
b) we were doing side-by-side comparisons, and
c) we knew that we were looking at two different codepoints
it seemed to me that if any of us couldn't tell the difference, then
neither could the average user looking at a domain name in
isolation.  Setting a higher threshold seems to me like phishing,
and especially pharming, enablement.
It also might appear that having a group of codepoints which are not
variants, but which users cannot really distinguish, provides a
marketing opportunity.  Not to sell to bad actors, who are typically
one-off buyers and so not worth pursuing.  But to sell defensive
registrations to legitimate registrants, who merely want to make
sure that their customers find them.  Such defensive registrations
would be likely to be renewed indefinitely, making them worthwhile
even in a low margin business.**
Bill
** 5 of the 7 members of the Latin Panel are employees of one or
another of the contracted parties.  I believe most of them were
sincerely making a good faith effort to do the right thing.  But
their experience there may nevertheless have colored their
perceptions.
On Thu, Oct 21, 2021 at 2:13 AM, Olivier MJ Crépin-Leblond
<ocl@gih.com> wrote:
Dear Bill,
thank you for explaining this in further detail. The problem I see
with the process here, is that *experts* have been used to notice a
difference. Because they are experts, they might be able to see
differences which the average Internet end user will not. And this
is the concern I have: is the panel of experts being conservative
enough in making their decisions? If there is any suspicion about
two characters being variants, would a conservative approach treat
them as variants?
What is the end goal of identifying variants? If it is to avoid the
use of IDNs for phishing, then the only approach possible should be
a conservative approach.
Kindest regards,
Olivier
On 21/10/2021 05:17, Bill Jouris via CPWG wrote:
After some of the discussion in the chat in this morning's meeting,
I feel like a little more extended discussion about variants might
be helpful.
The repertoire for the Latin script consists of "codepoints" -- some
are letters and some are letters plus diacritics.  "Variants" are
pairs of codepoints which are indistinguishable.  That is, in the
process that the Panel used, 5 of the 7 experts on the panel
couldn't see a difference.  The Latin GP did not look at diacritics
per se.  Just at codepoints which might involve diacritics.
Thus, a codepoint consisting of a letter with a caron diacritic (ˇ)
and a codepoint with the same letter combined with a breve
diacritic (˘) may always result in a variant pair, but only
because the Panel's comparison worked out that way.  For example, a
G with caron (ǧ) and a G with breve (ğ) are variants.  On the
other hand, a caron and a macron (¯) never result in a variant
pair.
However, some cases with diacritics are mixed.  For example, a
codepoint consisting of a letter with a dot above (˙) and a
codepoint consisting of the same letter with an acute accent result
in a variant pair for letters C (ċ vs ć), N (ṅ vs ń), and Z (ż vs
ź).  But, in the Panel's original finding, not for letters E (ė vs
é) and I (i vs í).
(Note that a majority of the Panel found the vowels to produce
variants as well.  Just not a supermajority, as required by the
process the Panel had adopted.  As a result, the Panel's official
position is that, in various cases not just this one, even though a
majority of the experts, looking side by side, could not see a
difference, the average "reasonably careful user" will somehow
magically notice the difference when looking at a domain name.)
Then we have cross-script variants, including those identified by
other Panels.  For example, the Greek Panel found that the Greek
letter Iota was a variant both of the Latin letter I and the Latin
letter I with acute.  As a result I and I with acute became
variants.
But there is no Greek letter which is a variant of the Latin letter
E.  So we are left with a situation where the dot above diacritic
and the acute produce variants for all letters EXCEPT for the letter
E.  (When I suggested that, for consistency, we should make the
letter E case a variant as well, the response was "It is more
important that we follow our process than that we have
consistency.")
TLDs consist of a series of codepoints.  Proposed TLDs which differ
_only_ by one or more variants from another TLD will
automatically be rejected by the software.  For example, .çom
would be allowed, despite its similarity to .com, because C with
Cedilla is not a variant of C.  Also .сом (using Cyrillic
letters) would be allowed because, while C and the Cyrillic letter
Es are variants, and O and the Cyrillic letter O are variants, the
letter M and the Cyrillic letter Em are not variants (the Panel was
directed to ignore Upper Case when deciding what might confuse
users).  But .cóm would be rejected, because O and O with acute are
variants.
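To make the matching rule above concrete, here is a minimal Python sketch of a position-by-position variant check. The table holds only the three pairs named in the examples above; the names VARIANT_PAIRS, are_variants and conflicts_with are hypothetical, and the real filtering software may well work differently. It assumes labels are lowercase and that each accented letter is a single precomposed codepoint.

```python
# Variant pairs taken from the examples above (not a complete table).
VARIANT_PAIRS = {
    frozenset(("c", "\u0441")),  # Latin c / Cyrillic es
    frozenset(("o", "\u043e")),  # Latin o / Cyrillic o
    frozenset(("o", "\u00f3")),  # Latin o / o with acute
}


def are_variants(a: str, b: str) -> bool:
    """True if two codepoints are identical or a listed variant pair."""
    return a == b or frozenset((a, b)) in VARIANT_PAIRS


def conflicts_with(candidate: str, existing: str) -> bool:
    """True if the labels are identical or differ only by variants."""
    if len(candidate) != len(existing):
        return False
    return all(are_variants(a, b) for a, b in zip(candidate, existing))


for label in ("\u00e7om", "\u0441\u043e\u043c", "c\u00f3m"):
    verdict = "rejected" if conflicts_with(label, "com") else "allowed"
    print(f".{label}: {verdict}")
```

Under these assumptions the first two labels (.çom and the Cyrillic .сом) come out as allowed, because one position in each is not a variant pair, while .cóm comes out as rejected.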
"Confusables" are pairs of codepoints which some of the experts
could not distinguish, just not enough to be designated as variants.
Confusables are intended as suggestions for the panel which will
manually review the proposed TLDs.
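The split between variants and confusables can be sketched as a simple threshold on expert votes. This Python fragment is only an illustration: the 5-of-7 supermajority comes from the process described above, while the function name and the reading of "some of the experts" as "at least one" are assumptions.

```python
def classify_pair(cannot_distinguish: int,
                  panel_size: int = 7,
                  variant_threshold: int = 5) -> str:
    """Classify a codepoint pair by how many experts saw no difference."""
    if not 0 <= cannot_distinguish <= panel_size:
        raise ValueError("vote count must be between 0 and panel_size")
    if cannot_distinguish >= variant_threshold:
        return "variant"      # supermajority reached
    if cannot_distinguish > 0:
        return "confusable"   # assumed: at least one expert, below threshold
    return "distinct"


print(classify_pair(5))  # supermajority
print(classify_pair(4))  # simple majority only
```

A pair that 4 of the 7 experts cannot tell apart, a simple majority, still falls short of variant status under this rule and is only a confusable.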
I hope this all will help everyone understand what we are looking at
here.
Regards,
Bill Jouris
_______________________________________________
CPWG mailing list
CPWG@icann.org
https://mm.icann.org/mailman/listinfo/cpwg
_______________________________________________
By submitting your personal data, you consent to the processing of
your personal data for purposes of subscribing to this mailing list
accordance with the ICANN Privacy Policy
(https://www.icann.org/privacy/policy) and the website Terms of
Service (https://www.icann.org/privacy/tos). You can visit the
Mailman link above to change your membership status or
configuration, including unsubscribing, setting digest-style
delivery or disabling delivery altogether (e.g., for a vacation),
and so on.
--
Olivier MJ Crépin-Leblond, PhD
http://www.gih.com/ocl.html
Links:
------
[1]
https://go.onelink.me/107872968?pid=InProduct&c=Global_Internal_YGrowth_Andr...
[2] http://oeamtc.at