So, while we cannot handle all languages in our small group, we can
still come up with a rule that takes all languages into account.

Simple: require that whenever a new gTLD is being proposed, its
possible significance in all languages must be considered.

It's not that hard: just look for it with your favourite search
engine, use Google Translate with the "Detect language" option, or ask
ChatGPT or whatever. Then ask speakers of those languages their view.

Maybe one thing to take into consideration here is that we are far from having all languages properly codified, never mind in a state advanced enough for widespread Internet usage. This is the issue we are facing at the moment in helping codify some indigenous languages of the Americas: some characters can't be mapped properly even if we wanted to, as in the many cases where something similar to an apostrophe (') appears in the transliteration. So it's unfortunately not as simple as that. We could theoretically backwards-proof such a rule, but never future-proof it, which raises the question of whether doing it successfully is even possible.

But again, the discussions in this group are super interesting, and it will be a pleasure working through these questions with you all.

On 26 Mar 2025 06:48, Tarvainen Tapani via Gnso-latin-diacritics wrote:

Comments inline.

On Tue, Mar 25, 2025 at 04:17:01PM +0000, Bill Jouris (b_jouris@yahoo.com) wrote:

So let's look at the type case which motivated our PDP: Quebec vs Québec.

That is really not a representative example, as "quebec" is not a word.

Clearly e and é are not visually confusable

I tend to think the very notion of "visual" confusability is a misguided
attempt to find a language-independent way of measuring similarity.
It won't work.

For example, I can tell from experience that many if not most Finns
*will* confuse é and è, even though a Frenchman would not. Just like
pretty much all Chinese characters look confusingly similar to me.

Similarity is indeed in the eyes of the beholder.

They [Quebec and Québec] are not visually confusable, but rather
functionally interchangeable/confusable.

Agreed. But that's not true with diacritics in general.

Now there are doubtless some cases, in some languages, where the
addition of a diacritic does result in a different word.

Indeed there are. Lots of them in lots of languages.

To pester you with Finnish again, ä and ö are common letters, whose
similarity to a and o is merely coincidental, and removing the dots
will frequently produce a completely different word, even if it
sometimes produces a non-word whose intent can be guessed. The same
can occur with š and ž, although they're much less common.

To pick a random example, a Finn who doesn't speak English would
find it hard to understand they'd have to write "saari" (island)
when they mean "sääri" (leg). I can come up with lots of examples
like that from Finnish alone. Others have already given some from
other languages.
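
To make the saari/sääri point concrete, here is a minimal Python sketch
of what ASCIIfying does to such word pairs. It assumes that "removing
the dots" can be modelled as Unicode NFD decomposition followed by
dropping the combining marks; the function names strip_diacritics and
find_collisions are purely illustrative and do not come from any
standard or ICANN tool.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after NFD decomposition (assumed model of ASCIIfying)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(words):
    """Group words that become identical once diacritics are stripped."""
    groups = {}
    for word in words:
        groups.setdefault(strip_diacritics(word), set()).add(word)
    # Keep only the folded forms shared by more than one distinct word.
    return {folded: sorted(originals)
            for folded, originals in groups.items()
            if len(originals) > 1}


if __name__ == "__main__":
    print(find_collisions(["saari", "sääri", "Quebec", "Québec"]))
```

Note that the code treats both pairs alike. It cannot tell that Quebec
and Québec are the same name while saari and sääri are different words;
that distinction still needs a speaker of the language.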

There are also cases where the ASCII version is a word with, er,
negative connotations.

Not to mention proper names. A multilingual example: Sjoberg, Sjöberg
and Sjøberg are all extant family names, and domains sjoberg.us,
sjöberg.fi and sjøberg.no exist, along with sjöberg.se, sjøberg.dk,
sjoberg.com &c.

As an aside, if you want to get an idea how ASCIIfying such languages
feels, you can try writing English without any two common vowels,
replacing them with whatever.

But the question before us is, I think, do we try to go thru all
the hundreds of languages which use the Latin script, trying to
identify those cases?

We obviously can't do that.

Or do we make a rule which can be applied across all cases involving
a diacritic, whether they are real words or not?

We clearly should come up with a general rule. But it really cannot be
done without linguistic considerations. If we try to do that, we'll
just end up reinforcing the special status of English.

For well-known historical reasons English is the dominant language on
the Internet. Users of other languages have indeed gotten used to
accommodating it by abusing their language, shoehorning its words to
fit into the Procrustean bed of the English alphabet.

But that is not a solution. It is a problem, indeed *the* problem IDNs
attempt to solve. And it is only going to get worse as more and more
non-English speakers get online.

So, while we cannot handle all languages in our small group, we can
still come up with a rule that takes all languages into account.

Simple: require that whenever a new gTLD is being proposed, its
possible significance in all languages must be considered.

It's not that hard: just look for it with your favourite search
engine, use Google Translate with the "Detect language" option, or ask
ChatGPT or whatever. Then ask speakers of those languages their view.

--
Mark W. Datysgeld
Director at Governance Primer [governanceprimer.com]
Project Lead Developer at ICANNWiki [icannwiki.org]