Maybe one thing to take into consideration here is that we are far from having all languages properly codified, never mind in a state advanced enough for widespread Internet usage. It's the issue we are facing right now in helping codify some indigenous languages of the Americas, in which some characters can't be mapped properly even if we wanted to, like the many cases where something similar to a ' (apostrophe) is present in the transliteration. So it's unfortunately not as simple as that. We could theoretically backwards-proof this, but never future-proof it, which raises the question of whether doing it successfully is even possible.

> So, while we cannot handle all languages in our small group, we can still come up with a rule that takes all languages into account. Simple: require that whenever a new gTLD is being proposed, its possible significance in all languages must be considered. It's not that hard: just look for it with your favourite search engine, use Google Translate with "Detect language" option, or ask ChatGPT or whatever. Then ask speakers of those languages their view.

But again, the discussions in this group are super interesting,
and it will be a pleasure working through these questions with you
all.
Comments inline.

On Tue, Mar 25, 2025 at 04:17:01PM +0000, Bill Jouris (b_jouris@yahoo.com) wrote:
> So let's look at the type case which motivated our PDP: Quebec vs Québec.

That is really not a representative example, as "quebec" is not a word.

> Clearly e and é are not visually confusable

I tend to think the very notion of "visual" confusability is a misguided attempt to find a language-independent way of measuring similarity. It won't work. For example, I can tell from experience that many if not most Finns *will* confuse é and è, even though a Frenchman would not. Just like pretty much all Chinese characters look confusingly similar to me. Similarity is indeed in the eyes of the beholder.

> They [Quebec and Québec] are not visually confusable, but rather functionally interchangeable/confusable.

Agreed. But that's not true with diacritics in general.

> Now there are doubtless some cases, in some languages, where the addition of a diacritic does result in a different word.

Indeed there are. Lots of them in lots of languages. To pester you with Finnish again, ä and ö are common letters, whose similarity to a and o is merely coincidental, and removing the dots will frequently produce a completely different word, even if it sometimes produces a non-word whose intent can be guessed. The same can occur with š and ž, although they're much less common. To pick a random example, a Finn who doesn't speak English would find it hard to understand they'd have to write "saari" (island) when they mean "sääri" (leg). I can come up with lots of examples like that from Finnish alone. Others have already given some from other languages. There are also cases where the ASCII version is a word with, er, negative connotations.

Not to mention proper names. A multilingual example: Sjoberg, Sjöberg and Sjøberg are all extant family names, and domains sjoberg.us, sjöberg.fi and sjøberg.no exist, along with sjöberg.se, sjøberg.dk, sjoberg.com &c.

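To make the effect of ASCIIfication concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `ascii_fold` is hypothetical, and dropping combining marks after NFD decomposition is only one naive way to strip diacritics, assumed here for illustration; it is not a rule anyone in this thread has proposed.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Naively strip diacritics: decompose to NFD, then drop combining marks.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Words and names taken from the examples in this message.
for word in ["sääri", "Québec", "Sjöberg", "Sjøberg"]:
    print(word, "->", ascii_fold(word))
```

Under this sketch "sääri" (leg) folds to "saari" (island), so two distinct Finnish words collapse into one, while "Sjøberg" comes out unchanged because ø has no canonical decomposition in Unicode. Even a purely mechanical rule therefore treats ö and ø differently, which is one more reason a language-blind approach gives inconsistent results.
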
As an aside, if you want to get an idea how ASCIIfying such languages feels, you can try writing English without two common vowels of your choice, replacing them with whatever.

> But the question before us is, I think, do we try to go thru all the hundreds of languages which use the Latin script, trying to identify those cases?

We obviously can't do that.

> Or do we make a rule which can be applied across all cases involving a diacritic, whether they are real words or not?

We clearly should come up with a general rule. But it really cannot be done without linguistic considerations. If we try to do that, we'll just end up reinforcing the special status of English. For well-known historical reasons English is the dominant language on the Internet. Users of other languages have indeed gotten used to accommodating it by abusing their language, shoehorning its words to fit into the Procrustean bed of the English alphabet. But that is not a solution. It is a problem, indeed *the* problem IDNs attempt to solve. And it is only going to get worse as more and more non-English speakers get online.

So, while we cannot handle all languages in our small group, we can still come up with a rule that takes all languages into account. Simple: require that whenever a new gTLD is being proposed, its possible significance in all languages must be considered. It's not that hard: just look for it with your favourite search engine, use Google Translate with the "Detect language" option, or ask ChatGPT or whatever. Then ask speakers of those languages their view.