Dear Latin GP members,


Please find below the message from the IP.

This is a follow-up to the phone call with the Latin GP, suggesting guidelines for Latin script variant analysis.


Regards,

Pitinan


From: Sarmad Hussain <sarmad.hussain@icann.org>
Date: Wednesday, February 5, 2020 at 16:43
To: Pitinan Kooarmornpatana <pitinan.koo@icann.org>
Subject: Fwd: Follow-up on Phone Call with Latin GP


Dear Pitinan,

Please share the message below from the IP with the Latin GP.

Regards,
Sarmad


Considerations in designing variant sets

First, we generally say that more blocked variants are "good"; reducing the namespace a bit can make it safer, because it removes instances that can be exploited for malicious registrations.

However, as we've recognized with cross-script variants, there's also a cost.

There are two costs. One is that large variant sets tend to link, via transitivity, things that do not pass the "smell test" for users. They appear unmotivated, and when they lead to a desirable label being blocked by some other label that, on the face of it, isn't a true look-alike, the LGR comes out looking bad.

There was the case of the Cyrillic country code that people claimed looked too much like "br". The similarity was not that close, and the code was allowed on appeal. An LGR variant, by design, does not allow appeal, so the goal has to be to limit such variants to "clear-cut" cases (where one would expect an appeal to fail, if it had been possible to lodge one).

The other cost comes from the tendency to impose in-language variants: except for well-defined exceptions, we can assume that any two characters used in contrast in a language can (and will) be distinguished by that language's users (even if they look confusingly similar to people unfamiliar with them). We have made similar arguments when it comes to in-script and cross-script variants.

Preventing simultaneous registrations of words that are seen as clearly different by native users is not ideal; we need to treat that on the "cost" side of the ledger, even in cases where the distinctions are subtle or where two spellings lead to the same result. For example, for ASCII TLDs, both .unicorn and .unicom are available (subject to further manual review), as are .color and .colour.

When it comes to diacritical marks, these precedents need to be borne in mind.

However, diacritical marks themselves are not always easy to tell apart at small sizes (and/or when an underline drawn without a "skip-ink" feature interferes). That also needs to be recognized, because they may well provide an especially convenient target for spoofing.

When it comes to possible confusions, the variant process is intended to deal with cases where one character, if placed inside a label, makes the label look like it is spelled with another character, usually by substituting something that is "out of context".

It is often said that labels do not have a linguistic context, but if a label matches the spelling of a word in the user's language (or a wordlike entity like a brand name that the user thinks they recognize) then we know that users do not necessarily read such labels letter by letter, but recognize them as a whole. If the altered "whole" looks acceptable enough, it will be treated as equivalent - even if we assume our "reasonably observant user".

A user of a language that has only a macron above as a diacritic may be unlikely to accept most other diacritics as substitutes, but in the right context, a g with dot above, perhaps, might not be noticed when standing in for a g with cedilla (a small comma-like shape above).

If, however, the language context in which a user thinks they are recognizing a label includes both a dot above and some other mark, these two are less likely to be conflated, even if they do not occur on the same base character.

So, what we have is a number of principles: some define benefits, some define costs, for adding or not adding variants.

As with so many other principles, any two individual principles may be in tension with each other. The goal is therefore to judge how much weight to give each one, so that they are collectively best satisfied, even if that means some won't be.

This also goes with the principle of language neutrality: ideally, we want neither to burden nor to treat preferentially any given language to the exclusion of others. However, sometimes taking other principles to their logical extreme may impose too much of a cost. Or, in reverse, some balancing may lead to such uneven treatment inside a single language that we are better off with more consistency. (This situation isn't new to the Latin script; it has been a constant across the Root Zone LGR, and, again, it is a matter of judgment to get it right.)

For the Latin script:

Hopefully, this summary is a useful guideline to help reason about these variant sets. As always, it's not a recipe or algorithm, but it might be an aid in making judgments.