Mark, 

You make an excellent point about the need to consider non-ASCII characters which might be combined with diacritics.  I had totally missed that. (I don't recall any of those being included in the Latin script repertoire, but it's been a few years since I looked at it.) 

But as for the details of Unicode, combining diacritics, etc., I think that's a level of technical detail we don't need to get into with this WG.  We need to look at human perceptions.  The technical details of how to implement whatever recommendation we make?  Not sure that's part of our remit -- unless, I suppose, someone can show that one of our recommendations is impossible to implement with existing technology. 

Regards, 
Bill 



On Tue, Mar 25, 2025 at 10:32 AM, Mark W. Datysgeld
<mark@governanceprimer.com> wrote:

Bill,

My understanding of the issue is informed by thoughts shared by Sarmad, but as I do not have his exact words at hand at the moment, I will explain them in the way that I perceive the matter. I do not claim to be quoting him accurately, this is merely my way of interpreting the question.

In a broad sense, a letter with a diacritic can be treated as the combination of two Unicode characters, with one of them being a base letter and the other being a combining mark. In composed form, we have something like á (U+00E1, Latin small letter a with acute), which in turn decomposes to "a" (U+0061, Latin small letter a) + ◌́ (U+0301, Combining acute accent).
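
For anyone who wants to verify the composed and decomposed forms, here is a minimal sketch using Python's standard unicodedata module. The variable names are only illustrative; NFD and NFC are the standard Unicode normalization forms for canonical decomposition and composition.

```python
import unicodedata

composed = "\u00e1"  # á, LATIN SMALL LETTER A WITH ACUTE

# NFD splits the precomposed letter into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", composed)
decomposed_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(decomposed_points)       # ['U+0061', 'U+0301']

# NFC recombines the pair into the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)  # True
```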

The funny thing is that while "a" is indeed U+0061, it is also ASCII code point 97 (01100001).

So, in the ASCII form that the DNS runs on, we have a situation in which there might be an (imaginary) TLD named "arvore" (PT-BR for "tree"), which, were it to be expressed following grammatical rules, should be named "árvore". If this TLD had both a pure ASCII form and a diacritics one, these are some of the ways we could express it in:

  • arvore (7-bit): 97 114 118 111 114 101
  • árvore (UTF-8 bytes): 195 161 114 118 111 114 101
  • árvore (Unicode code points): 225 114 118 111 114 101

Here á is a Unicode code point whose base letter is itself an ASCII code point, in such a way that, were we to decompose it, it could be expressed as that ASCII letter plus a combining mark.
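
The three representations listed above can be reproduced with a short Python sketch. The variable names are illustrative, and the accented label is assumed to use the precomposed á (U+00E1).

```python
ascii_label = "arvore"
accented_label = "\u00e1rvore"  # árvore, with precomposed á (U+00E1)

# 7-bit ASCII values of the unaccented label.
ascii_bytes = list(ascii_label.encode("ascii"))
# UTF-8 bytes of the accented label: á takes two bytes (195 161).
utf8_bytes = list(accented_label.encode("utf-8"))
# Unicode code points of the accented label: á is a single code point (225).
code_points = [ord(ch) for ch in accented_label]

print(ascii_bytes)   # [97, 114, 118, 111, 114, 101]
print(utf8_bytes)    # [195, 161, 114, 118, 111, 114, 101]
print(code_points)   # [225, 114, 118, 111, 114, 101]
```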

Meanwhile, when we look into the examples that Tapani has helpfully been pointing out, while we can express þ (U+00FE) in an ASCII-like manner, and it is part of the Latin repertory, it does not have a direct ASCII correlation or any decomposable relation in Unicode.
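
The distinction can be checked mechanically. The sketch below uses a hypothetical helper, ascii_base, which returns the ASCII base letter only when a character canonically decomposes into an ASCII letter plus combining marks. It relies solely on Unicode canonical decomposition, which is an assumption about how one might draw the line, not a definition this group has adopted.

```python
import unicodedata
from typing import Optional

def ascii_base(ch: str) -> Optional[str]:
    """Return the ASCII base letter of ch if ch decomposes (NFD) into an
    ASCII letter followed only by combining marks; otherwise None."""
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    if base.isascii() and marks and all(unicodedata.combining(m) for m in marks):
        return base
    return None

print(ascii_base("\u00e1"))  # á -> 'a'
print(ascii_base("\u0161"))  # š -> 's'
print(ascii_base("\u00fe"))  # þ -> None (no decomposition)
print(ascii_base("\u00f8"))  # ø -> None (no decomposition)
```

Under this purely technical test, á, ö and š have an ASCII base letter, while þ, ø, æ and ß do not.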

This might be the case with other letters that may accept a combining mark but do not have an ASCII correlation. We need to explore that, and it is possible that these distinctions are at the core of our discussion. That remains to be seen.

Best,

On 25 Mar 2025 13:17, Bill Jouris via Gnso-latin-diacritics wrote:
First, I’d like to thank Michael for his patience, at our meeting at ICANN82, as I tried to put my concern about the phrasing of our into words.  Unsuccessfully.  I hope I can do better here. 

When the Latin Generation Panel was creating the RZ-LGR, with just a couple of exceptions we were looking at visual confusable similarity. That is, we compared a code point (ASCII letter or that letter plus one or more diacritics) to another.  If, in a side by side comparison, we found it very difficult to distinguish between them, they were designated variants.  If we could distinguish them, they were not variants.  Note that, when one code point was just an ASCII letter, and the other was that letter plus a diacritic, there were no cases where they were variants.  None.  Zero.

So let's look at the type case which motivated our PDP: Quebec vs Québec.  Clearly e and é are not visually confusable; the Latin GP already ruled on that.  So what are we even looking at?  Why are we here? 
  
If a French-speaking resident of Québec is presented with Quebec (that is, without the diacritic), will it occur to him, even for an instant, that he is looking at a different word?  Never.  He will see it as an error, due to the lack of the diacritic – whether due to ignorance on the part of one of the English speakers that surround Québec in North America, or due to even a French speaker using a keyboard which doesn’t provide for diacritics.  Wrong, but NOT different. They are not visually confusable, but rather functionally interchangeable/confusable. 

The various other examples given in our direction from the GNSO Council all have the same characteristic.  There’s no problem seeing the difference; no visual confusion.  It’s just that the difference, in those cases, is not considered significant by the user.

Now there are doubtless some cases, in some languages, where the addition of a diacritic does result in a different word.  But the question before us is, I think, do we try to go thru all the hundreds of languages which use the Latin script, trying to identify those cases?  Or do we make a rule which can be applied across all cases involving a diacritic, whether they are real words or not?  I submit that the latter is the only feasible course.

Bill Jouris

On Monday, March 24, 2025 at 10:16:37 AM PDT, Amadeu Abril i Abril (CORE) via Gnso-latin-diacritics <gnso-latin-diacritics@icann.org> wrote:


Hello all.

I would say “yes” to both your questions.

As I explained in previous mail exchanges, we should not base our decisions and distinctions on linguistic, semantic, morphological or orthographic criteria. TLDs may be grammatically incorrect, or mean just nothing in any language.

And indeed, what is a diacritic varies from context to context. In linguistic terms, it also varies historically: w was, and still is in some languages, a variant, a diacritic version, of v. And u and v were just variants of the same character, as were G and C…

The point is not whether they are considered proper characters in a given alphabet or a modification/diacritic of another character. The point, for us, is that a set of the Latin character repertoire known as ASCII does NOT require special encoding in the DNS, while all the rest does require the IDN/punycode encoding. So when we say ASCII and diacritics we mean ASCII and non-ASCII (if they are similar enough to raise issues under the String Evaluation test which is a requirement for new gTLDs).
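
To illustrate the difference in DNS encoding, here is a minimal sketch using Python's built-in "idna" codec. One caveat: that codec implements the older IDNA 2003 rules, so it only approximates what current registries apply. The label "árvore" is a made-up example, not a real TLD.

```python
ascii_label = "arvore"
idn_label = "\u00e1rvore"  # árvore

# An all-ASCII label goes on the wire unchanged.
ascii_wire = ascii_label.encode("idna")
# A non-ASCII label is converted to an ASCII-compatible "xn--" form (punycode).
idn_wire = idn_label.encode("idna")

print(ascii_wire)                            # b'arvore'
print(idn_wire.startswith(b"xn--"))          # True
print(idn_wire.decode("idna") == idn_label)  # True
```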

And indeed there may be more than one “similar” IDN/non-ASCII/string with diacritics which falls into the “too similar to be allowed” scenario, so the solution(s) we are to discuss may apply to more than 2 TLD strings at once. Even if unlikely for practical (including financial) reasons.

Regards.

AAA


> On 24 March 2025, at 15:14, Tapani Tarvainen via Gnso-latin-diacritics <gnso-latin-diacritics@icann.org> wrote:
>
> Dear all,
>
> In preparation for our next call, I wrote down a few issues
> I'd like to clarify about our scope:
>
> (1) We use the term "diacritics" differently than linguists do.
> If I understand correctly, we're including all non-ASCII characters
> that various Latin-based alphabets use, such as æ, ŋ, ß, ð, þ etc.
>
> (2) The base ASCII gTLD version of a diacritic is for us by definition
> a single ASCII letter with the sole exception of German ß. Other cases
> where the conventional ASCII conversion is two or more letters are not
> treated similarly (as per RZ-LGR). Presumably we could, however, take
> the existence of such as a factor in our recommendations.
>
> E.g., to pick an example from Finnish, consider ".šakki" vs ".sakki",
> should the existence of ".shakki" matter? (Both "šakki" and "sakki"
> are Finnish words with completely unrelated meanings, while "shakki" is
> an alternative spelling of "šakki" - not a computer-era fallback, but
> actually older and still more common spelling even though deprecated
> by modern orthography.)
>
> Or consider "æ", which also looks obviously similar to "ae".
>
> (3) Our first question talks about "a base ASCII gTLD and *the* Latin
> script diacritic version", in singular. It seems pretty clear to me,
> however, that we should also consider cases with multiple diacritic
> versions of the same ASCII letter, like the "sjö" / "sjø" case I
> brought up earlier.
>
> --
> Tapani Tarvainen
> _______________________________________________
> Gnso-latin-diacritics mailing list -- gnso-latin-diacritics@icann.org
> To unsubscribe send an email to gnso-latin-diacritics-leave@icann.org

--
Mark W. Datysgeld
Director at Governance Primer [governanceprimer.com]
Project Lead Developer at ICANNWiki [icannwiki.org]