Dear Latin GP members,


Please find below the message from the IP.

This is a follow-up to the phone call with the Latin GP, suggesting guidelines for Latin script variant analysis.


Regards,

Pitinan


From: Sarmad Hussain <sarmad.hussain@icann.org>
Date: Wednesday, February 5, 2020 at 16:43
To: Pitinan Kooarmornpatana <pitinan.koo@icann.org>
Subject: Fwd: Follow-up on Phone Call with Latin GP


Dear Pitinan,

Please share the message below from the IP with the Latin GP.

Regards,
Sarmad


Considerations in designing variant sets

First, we generally say that more blocked variants are "good"; reducing the namespace a bit can make it safer, because it removes instances that can be exploited for malicious registrations.

However, as we've recognized with cross-script variants, there's also a cost.

There are two costs. One is that large variant sets tend to link, via transitivity, things that do not pass the "smell test" for users. They appear unmotivated, and when they lead to a desirable label being blocked by some other label that, on the face of it, isn't a true look-alike, the LGR comes out looking bad.

There was the case of the Cyrillic country code that people claimed looked too much like "br". The similarity was not that close, and the code was allowed on appeal. An LGR variant, by design, does not allow appeal, so the goal has to be to limit such variants to "clear-cut" cases (where one would expect an appeal to fail, if it had been possible to lodge one).

The other cost comes from the tendency to impose in-language variants: except for well-defined exceptions, we can assume that any two characters used in contrast in a language can (and will) be distinguished by that language's users (even if they look confusingly similar to people unfamiliar with them). We have made similar arguments when it comes to in-script and cross-script variants.

Preventing simultaneous registrations of words that are seen as clearly different by native users is not ideal; we need to treat that on the "cost" side of the ledger, even in cases where the distinctions are subtle or where two spellings lead to the same result. For example, for ASCII TLDs, both .unicorn and .unicom are available (subject to further manual review), as are .color and .colour.

When it comes to diacritical marks, these precedents need to be borne in mind.

However, diacritical marks themselves are not always easy to tell apart at small sizes (and/or when an underline drawn without a "skip-ink" feature interferes). That also needs to be recognized, because they may well provide an especially convenient target for spoofing.

When it comes to possible confusions, the variant process is intended to deal with cases where one character, if placed inside a label, makes the label look like it is spelled with another character, usually by substituting something that is "out of context".

It is often said that labels do not have a linguistic context, but if a label matches the spelling of a word in the user's language (or a wordlike entity like a brand name that the user thinks they recognize) then we know that users do not necessarily read such labels letter by letter, but recognize them as a whole. If the altered "whole" looks acceptable enough, it will be treated as equivalent - even if we assume our "reasonably observant user".

A user of a language that has only a macron above as a diacritic may be unlikely to accept most other diacritics as substitutes, but in the right context, a g with dot above, perhaps, might not be noticed when standing in for a g with cedilla (a small comma-like shape above).

If, however, the language context in which a user thinks they are recognizing a label includes both a dot above and some other mark, these two are less likely to be conflated, even if they do not occur on the same base character.

So, what we have is a number of principles: some define benefits, some define costs, for adding or not adding variants.

As with so many other principles, any two individual principles may be in tension with each other. The goal is therefore to judge how much weight to give each one, so that they are collectively best satisfied, even if that means some won't be.

This also goes with the principle of language neutrality: ideally, we want neither to burden nor to treat preferentially any given language to the exclusion of others. However, sometimes taking other principles to their logical extreme may impose too much of a cost. Or, in reverse, some balancing may lead to such uneven treatment inside a single language that we are better off with more consistency. (This situation isn't new to the Latin script; it has been a constant across the Root Zone LGR, and, again, it is a matter of judgment to get it right.)

For the Latin script:

Hopefully, this summary is a useful guideline to help reason about these variant sets. As always, it's not a recipe or algorithm, but it might be an aid in making judgments.