Scope issues


Tapani Tarvainen

March 24, 2025

2:14 p.m.

Dear all,

In preparation for our next call, I wrote down a few issues I'd like to clarify about our scope:

(1) We use the term "diacritics" differently than linguists do. If I understand correctly, we're including all non-ASCII characters that various Latin-based alphabets use, such as æ, ŋ, ß, ð, þ etc.

(2) The base ASCII gTLD version of a diacritic is for us by definition a single ASCII letter, with the sole exception of German ß. Other cases where the conventional ASCII conversion is two or more letters are not treated similarly (as per RZ-LGR). Presumably we could, however, take the existence of such conversions as a factor in our recommendations.

E.g., to pick an example from Finnish, consider ".šakki" vs ".sakki": should the existence of ".shakki" matter? (Both "šakki" and "sakki" are Finnish words with completely unrelated meanings, while "shakki" is an alternative spelling of "šakki" - not a computer-era fallback, but actually the older and still more common spelling, even though deprecated by modern orthography.)

Or consider "æ", which also looks obviously similar to "ae".

(3) Our first question talks about "a base ASCII gTLD and *the* Latin script diacritic version", in singular. It seems pretty clear to me, however, that we should also consider cases with multiple diacritic versions of the same ASCII letter, like the "sjö" / "sjø" case I brought up earlier.

-- Tapani Tarvainen
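To make points (1) to (3) concrete, here is a minimal Python sketch of one possible way to derive a "base ASCII" form: decompose each character canonically (NFD) and drop the combining marks. This is only an illustration, not how the RZ-LGR or the working group defines the base form; the helper name fold_to_base is hypothetical. It shows that "šakki" folds to "sakki" and "sjö" to "sjo", while letters such as ø, æ and ß have no canonical decomposition and are left untouched, so "sjø" does not reduce to a single-letter ASCII base this way.

```python
import unicodedata


def fold_to_base(label: str) -> str:
    """Drop combining marks after canonical decomposition (NFD).

    Hypothetical helper for illustration only; it is not the RZ-LGR rule.
    """
    decomposed = unicodedata.normalize("NFD", label)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Labels from the message above, written with escapes so the precomposed
# code points are unambiguous: sakki with s-caron, sjo with o-diaeresis,
# sjo with o-stroke, ae ligature, sharp s.
for label in ["\u0161akki", "sj\u00f6", "sj\u00f8", "\u00e6", "\u00df"]:
    folded = fold_to_base(label)
    status = "pure ASCII" if folded.isascii() else "still non-ASCII"
    print(f"{label} -> {folded} ({status})")
```

Under this assumption, the letters that survive folding unchanged are exactly the ones for which a separate rule (or an explicit table, as with ß) would be needed.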


Amadeu Abril i Abril (CORE)

March 2025

5:15 p.m.

Hello all.

I would say “yes” to both your questions.

As I explained in previous mail exchanges, we should not base our decisions and distinctions on linguistic, semantic, morphological or orthographic criteria. TLDs may be grammatically incorrect, or mean nothing in any language.

And indeed, what is a diacritic varies from context to context. In linguistic terms, it also varies historically: w was, and still is in some languages, a variant, a diacritic version, of v. And u and v were just variants of the same character, as were G and C.

The point is not whether they are considered proper characters in a given alphabet or a modification/diacritic of another character. The point, for us, is that a set of the Latin character repertoire known as ASCII does NOT require special encoding in the DNS, while all the rest does require the IDN/Punycode encoding. So when we say ASCII and diacritics we mean ASCII and non-ASCII (if they are similar enough to raise issues under the String Evaluation test which is a requirement for new gTLDs).

And indeed there may be more than one “similar” IDN/non-ASCII/string with diacritics which falls into the “too similar to be allowed” scenario, so the solution(s) we are to discuss may apply to more than two TLD strings at once, even if that is unlikely for practical (including financial) reasons.

Regards.

AAA
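The ASCII versus non-ASCII split described above can be seen in a short Python sketch. It assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules rather than the IDNA 2008 rules used for the root zone, so treat it as an approximation; the helper name to_dns_label is hypothetical. A pure ASCII label passes through unchanged, while a label containing any non-ASCII character is stored as an "xn--" Punycode label.

```python
def to_dns_label(label: str) -> str:
    """Return the ASCII form a label would take in the DNS.

    Hypothetical helper built on Python's "idna" codec (IDNA 2003),
    used here as an assumption for illustration only.
    """
    return label.encode("idna").decode("ascii")


# "sakki" is pure ASCII; the second label starts with s-caron (U+0161).
for label in ["sakki", "\u0161akki"]:
    encoded = to_dns_label(label)
    kind = "unchanged" if encoded == label else "Punycode-encoded"
    print(f"{label} -> {encoded} ({kind})")
```

The two results are unrelated strings at the DNS level, which is why any link between an ASCII TLD and its non-ASCII counterpart has to come from policy rather than from the encoding itself.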

_______________________________________________
Gnso-latin-diacritics mailing list -- gnso-latin-diacritics@icann.org
To unsubscribe send an email to gnso-latin-diacritics-leave@icann.org

Bill Jouris

4:17 p.m.

First, I’d like to thank Michael for his patience, at our meeting at ICANN82, as I tried to put my concern about the phrasing of our into words. Unsuccessfully. I hope I can do better here.

When the Latin Generation Panel was creating the RZ-LGR, with just a couple of exceptions we were looking at *visual* confusable similarity. That is, we compared a code point (ASCII letter or that letter plus one or more diacritics) to another. If, in a side by side comparison, we found it very difficult to distinguish between them, they were designated variants. If we could distinguish them, they were not variants. Note that, when one code point was just an ASCII letter, and the other was that letter plus a diacritic, there were no cases where they were variants. None. Zero.

So let's look at the type case which motivated our PDP: Quebec vs Québec. Clearly e and é are not *visually* confusable; the Latin GP already ruled on that. So what are we even looking at? Why are we here?

If a French-speaking resident of Québec is presented with Quebec (that is, without the diacritic), will it occur to him, even for an instant, that he is looking at a *different* word? Never. He will see it as an error, due to the lack of the diacritic – whether due to ignorance on the part of one of the English speakers that surround Québec in North America, or due to even a French speaker using a keyboard which doesn’t provide for diacritics. Wrong, but NOT different. They are not *visually* confusable, but rather *functionally* interchangeable/confusable.

The various other examples given in our direction from the GNSO Council all have the same characteristic. There’s no problem seeing the difference; no *visual* confusion. It’s just that the difference, in those cases, is not considered significant by the user.

Now there are doubtless some cases, in some languages, where the addition of a diacritic does result in a different word. But the question before us is, I think, do we try to go thru all the hundreds of languages which use the Latin script, trying to identify those cases? Or do we make a rule which can be applied across all cases involving a diacritic, whether they are real words or not? I submit that the latter is the only feasible course.

Bill Jouris


Sergio Salinas Porto

5:15 p.m.

Dear all,

Since there will be no Spanish interpretation on our call today (which I consider a truly serious error, highlighting a complex bias and inequity in the debates), I will make my contribution to the discussion of our document here.

Undoubtedly, it may not be of much relevance to those who do not use diacritical characters in their native language, but for those of us who embrace Latin languages (Spanish, French, Portuguese, etc.), the diacritical accent is of paramount importance.

As Bill rightly points out, in some words a missing diacritic may simply look like a spelling mistake, as in his Québec example, but in others it radically changes the word and its meaning. To give one example (there are hundreds), one can write "revólver" or "revolver" in Spanish. "Revolver" is a verb that means to move something from side to side, around, or up and down; it can also mean to mix or stir, as in stirring coffee with a spoon. A "revólver" is a short firearm whose ammunition is housed in a rotating drum. As you can see, we're not talking about the same thing.

Our challenge is for the internet to evolve toward a true internationalization of the DNS and incorporate these diacritical characters, also considering Ñ, Ç, ü, etc., and considering cases with multiple diacritics. This implies moving toward an 8-bit (1-byte) character code that can represent up to 256 different characters, instead of the 7-bit code we currently use for 128 characters. We could take ISO/IEC 8859-1, which contains the Spanish alphabet, as an example. The key to this is the addition of that check bit called the Parity Bit, which allows the character size to be changed from 128 to 256.
My two cents

Sergio Salinas Porto
President, Internauta Argentina - LACRALO/ICANN <https://atlarge.icann.org/ralos/lacralo>
Asociación Argentina de Usuarios de Internet <http://www.internauta.org.ar/> / FeTIA <http://www.fetia.org.ar/>
FUILAC - Federación de Usuarios de Internet de LAC <https://fuilac.org>
facebook: salinasporto <http://www.facebook.com/salinasporto>
twitter: sergiosalinas <http://twitter.com/sergiosalinas>
Mobi: +54 9 223 5 215819
"May we be disobedient every time we receive orders that humiliate our conscience or violate our common sense" (Eduardo Galeano)


Mark W. Datysgeld

5:32 p.m.

Bill,

My understanding of the issue is informed by thoughts shared by Sarmad, but as I do not have his exact words at hand at the moment, I will explain them in the way that I perceive the matter. I do not claim to be quoting him accurately; this is merely my way of interpreting the question.

In a broad sense, a letter with a diacritic is the amalgamation of two Unicode characters, with one of them being a letter and the other being a combining mark. In composed form, we have something like á (U+00E1, Latin small letter a with acute), which in turn decomposes to "a" (U+0061, Latin small letter a) + ◌́ (U+0301, Combining acute accent). The funny thing is that while "a" is indeed U+0061, it is also ASCII code point 97 (01100001).

So, in the ASCII form that the DNS runs on, we have a situation in which there might be an (imaginary) TLD named "arvore" (PT-BR for "tree"), which, were it to be expressed following grammatical rules, should be named "árvore". If this TLD had both a pure ASCII form and a diacritics one, these are some of the ways we could express it:

* arvore (7-bit): 97 114 118 111 114 101
* árvore (UTF-8 bytes): 195 161 114 118 111 114 101
* árvore (Unicode code points): 225 114 118 111 114 101

Here á is a Unicode code point with a direct ASCII counterpart: were we to decompose it, it could be expressed as one ASCII letter plus a combining mark.

Meanwhile, when we look into the examples that Tapani has helpfully been pointing out, while we can express þ (U+00FE) in an ASCII-like manner, and it is part of the Latin repertory, it does not have a *direct* ASCII correlation or any decomposable relation in Unicode. This might be the case with other letters that may accept a combining mark but do not have an ASCII correlation. We need to explore that, and it is possible that these distinctions are at the core of our discussion. Remains to be seen.

Best,

-- Mark W. Datysgeld
Director at Governance Primer [governanceprimer.com <https://governanceprimer.com>]
Project Lead Developer at ICANNWiki [icannwiki.org <https://icannwiki.org/>]
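The numbers in Mark's "arvore" / "árvore" example, and the contrast he draws between á and þ, can be reproduced with Python's standard library. This is a small sketch for checking the figures, not part of the original message; the variable names are chosen here for illustration. It prints the three byte and code point sequences, then looks up the canonical decomposition that the Unicode Character Database records for a few of the letters mentioned in this thread.

```python
import unicodedata

ascii_form = "arvore"
idn_form = "\u00e1rvore"  # "arvore" with precomposed a-acute (U+00E1)

print("7-bit ASCII:   ", list(ascii_form.encode("ascii")))
print("UTF-8 bytes:   ", list(idn_form.encode("utf-8")))
print("code points:   ", [ord(ch) for ch in idn_form])

# unicodedata.decomposition() returns the decomposition mapping as
# space-separated hex code points, or an empty string if there is none.
for ch in ["\u00e1", "\u00fe", "\u00e6", "\u00f8"]:
    mapping = unicodedata.decomposition(ch) or "no decomposition"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping}")
```

In this check á maps to an ASCII letter plus a combining mark, while þ, æ and ø come back with no decomposition, which matches the distinction drawn in the message.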

Bill Jouris

4:11 a.m.

Mark,

You make an excellent point about the need to consider non-ASCII characters which might be combined with diacritics. I had totally missed that. (I don't recall any of those being included in the Latin script repertoire, but it's been a few years since I looked at it.)

But as for the details of Unicode, combining diacritics, etc., I think that's a level of technical detail we don't need to get into with this WG. We need to look at human perceptions. The technical details of how to implement whatever recommendation we make? Not sure that's part of our remit -- unless, I suppose, someone can show that one of our recommendations is impossible to implement with existing technology.

Regards,
Bill

...

On 24 March 2025, at 15:14, Tapani Tarvainen via Gnso-latin-diacritics <gnso-latin-diacritics@icann.org> wrote:

Dear all,

In preparation for our next call, I wrote down a few issues I'd like to clarify about our scope:

(1) We use the term "diacritics" differently than linguists do. If I understand correctly, we're including all non-ASCII characters that various Latin-based alphabets use, such as æ, ŋ, ß, ð, þ etc.

(2) The base ASCII gTLD version of a diacritic is for us by definition a single ASCII letter with the sole exception of German ß. Other cases where the conventional ASCII conversion is two or more letters are not treated similarly (as per RZ-LGR). Presumably we could, however, take the existence of such as a factor in our recommendations.

E.g., to pick an example from Finnish, consider ".šakki" vs ".sakki": should the existence of ".shakki" matter? (Both "šakki" and "sakki" are Finnish words with completely unrelated meanings, while "shakki" is an alternative spelling of "šakki" - not a computer-era fallback, but actually the older and still more common spelling, even though deprecated by modern orthography.)

Or consider "æ", which also looks obviously similar to "ae".

(3) Our first question talks about "a base ASCII gTLD and *the* Latin script diacritic version", in singular. It seems pretty clear to me, however, that we should also consider cases with multiple diacritic versions of the same ASCII letter, like the "sjö" / "sjø" case I brought up earlier.

-- Tapani Tarvainen _______________________________________________ Gnso-latin-diacritics mailing list -- gnso-latin-diacritics@icann.org To unsubscribe send an email to gnso-latin-diacritics-leave@icann.org

-- Mark W. Datysgeld Director at Governance Primer [governanceprimer.com] Project Lead Developer at ICANNWiki [icannwiki.org]
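The three representations of arvore / árvore listed in Mark's message, and the difference he points to between á and þ, can be reproduced with a short Python sketch. It relies only on the standard library's `unicodedata` module; the helper names `describe` and `has_decomposition` are invented here for illustration and are not part of any ICANN tooling.

```python
import unicodedata

def describe(label: str) -> dict:
    """Return several numeric views of a label."""
    return {
        "code_points": [ord(ch) for ch in label],
        "utf8_bytes": list(label.encode("utf-8")),
        # NFD splits a precomposed letter into base letter + combining mark(s)
        "nfd_code_points": [ord(ch) for ch in unicodedata.normalize("NFD", label)],
    }

def has_decomposition(ch: str) -> bool:
    """True if Unicode defines a decomposition mapping for this character."""
    return unicodedata.decomposition(ch) != ""

# "\u00e1" is the precomposed letter a-acute used in the thread's example.
for label in ("arvore", "\u00e1rvore"):
    print(label, describe(label))

# a-acute decomposes into a + U+0301; thorn (U+00FE) has no decomposition.
for ch in ("\u00e1", "\u00fe"):
    print(f"U+{ord(ch):04X}", "decomposable:", has_decomposition(ch))
```

Under NFD the á becomes the two code points 97 and 769 (a plus combining acute accent), whereas þ stays a single code point, which is the distinction the message raises.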

Amadeu Abril i Abril (CORE)

6:15 p.m.

Bill Jouris

4:19 a.m.

Amadeu,

I definitely need to work on my communication skills. Because, if I'm understanding your third and fourth paragraphs correctly, we are precisely in agreement -- that, for users of languages with diacritics, TLDs with and without the diacritics are, however regrettably, seen as the same thing.

Although, obviously, I may be wrong in what I think you are saying.

Bill

On Tue, Mar 25, 2025 at 11:15 AM, Amadeu Abril i Abril (CORE) <amadeu.abril@corenic.org> wrote:

Bill,

I have to completely disagree with your statements below.

It is a fact that the LGRs are what they are, and that for that set of rules, a letter and the same letter with a diacritic are not variants. Nothing else to be discussed here.

But the String Similarity rules in the Applicant Guidebook have an autonomous set of criteria to assess whether a string, a label, a proposed TLD is confusingly similar to another one or not. In 2012 .unicorn and .unicom were found confusingly similar and could not both proceed. Same for .hotels and .hoteis. I am sure that you don’t consider r+n and m, or i and l, as variants. And I am sure you think you can perfectly distinguish the characters in both cases. But this was the result of applying the rules back then.

Additionally, I submit that your assumption that any French-speaking user would never be confused by .quebec and .québec, and would consider .quebec a misspelling, is most likely factually wrong. Most of us users of languages with diacritics have been sooooo overwhelmingly used to NOT using them in domains, be that first level or second level, that this is the DEFAULT behaviour for the vast, overwhelmingly vast, majority among us. For us, for most of us, in most cases, .quebec and .québec, or policia.info, polícia.info and policía.info, are not different, not even confusingly similar: they are the SAME thing.

At the very least, the ASCII version is perfectly identical to the correct grammatical version with diacritics… simply because it is the default version, the one we see used most often, all the time, everywhere: in browser bars, in URLs, in domains, in emails….. In fact a person who does not speak a language with those diacritics is more likely to notice the diacritic; it catches her attention as an odd thing.

All I say above is not a universal situation: we all would reason as you propose in a text. Everywhere… except in DNS.

But the main point is the first one: rules and criteria for what is a “real variant” and what is “confusingly similar” as a TLD are different.

Amadeu

On 25 March 2025, at 17:17, Bill Jouris <b_jouris@yahoo.com> wrote:

...


Michael Bauland

8:46 a.m.

Hi all,

instead of writing multiple mails, I'll try to answer the most important points from the recent e-mails in a single e-mail.

@Bill

You say that the Latin GP was looking for "visual confusable similarity". That's misleading, as they were not merely looking for similarity, but for "same" and "nearly identical" cases. This is far stricter than what is going to happen in the String Similarity Review of the application process. There, cases that are "just" similar will also be caught. The String Similarity Review Guidelines explain in more detail what is going to happen. They can be found at: https://itp.cdn.icann.org/en/files/internationalized-domain-names-idn/string...

In particular, Section 11 page 37 shows a list of classifications:

1. Identical or Near Identical Labels (homograph)
2. Highly Confusable Labels (strongly similar, or near homograph)
3. Similar Labels
4. Distantly Similar Labels (weakly similar)
5. Distinct, not similar

The String Similarity Process will (mostly) look up to Level 3 (Similar Labels). So, what the Latin GP considered to be similar or not may be very different from what the String Similarity Review Panel will consider to be similar.

@Sergio

What is happening at the DNS level is out of scope for us. We certainly won't be changing the DNS to allow more characters than those currently allowed. With the punycode algorithm, there is already the possibility to express even more than your suggested 256 characters. We can express all characters from the RZ-LGR in the DNS using punycode. Also, whether labels are directly available in DNS or via some encoding does not change what people see in browsers and elsewhere.

@All

* The RZ-LGR and its variant definitions are a fact. We cannot change anything there as part of this PDP. There are other processes for this.
* Also, the way the String Similarity Review is going to work in the next round is not something we are going to change in this PDP. We also have to take this as a fact (even if it is not pre-determined which exact cases will be considered similar and which won't; that is also out of scope for us).

Our goal is very limited. It's an exceptions process. These exceptions are very narrowly defined to find a solution for cases in which an *ASCII* label is going to be considered confusingly similar with a *Latin Diacritic* version of that label (and vice versa) by the String Similarity Review Panel.

We will talk more about what exactly a diacritic is in today's meeting.

I'm sure there are many more unfair and incorrect things in the world, many of which I would also like to see solved, but this is the wrong place for it. ;-)

Cheers, Michael

--
Knipp Medien und Kommunikation GmbH
Technologiepark, Martin-Schmeisser-Weg 9, 44227 Dortmund, Germany
Fon: +49 231 9703-0, Fax: +49 231 9703-200
Dipl.-Informatiker Dr. Michael Bauland, Software Development
SIP: Michael.Bauland@knipp.de
E-mail: Michael.Bauland@knipp.de
Register Court: Amtsgericht Dortmund, HRB 13728
Chief Executive Officers: Dietmar Knipp, Elmar Knipp
Certified according DIN ISO/IEC 27001:2017
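Michael's remark that non-ASCII labels already reach the DNS through punycode can be illustrated with a minimal Python sketch. It uses only the standard library's `punycode` codec and adds the `xn--` prefix by hand; it performs no IDNA2008 or RZ-LGR validation, which real registry tooling would, and the helper names `to_a_label` and `to_u_label` are made up for this example.

```python
import unicodedata

ACE_PREFIX = "xn--"  # marks an ASCII-compatible encoded label

def to_a_label(label: str) -> str:
    """Encode one label to an ASCII form (illustration only, no validation)."""
    label = unicodedata.normalize("NFC", label.lower())
    if label.isascii():
        return label  # plain ASCII labels go into the DNS unchanged
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def to_u_label(a_label: str) -> str:
    """Decode a label produced by to_a_label back to Unicode."""
    if not a_label.startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

# Labels taken from the thread: one ASCII, two with diacritics.
for label in ("quebec", "qu\u00e9bec", "s\u00e4\u00e4ri"):
    a_label = to_a_label(label)
    print(label, "->", a_label, "->", to_u_label(a_label))
```

The point of the sketch is only that the encoded form is pure ASCII and round-trips; what users see in browsers is still the Unicode form, as the message notes.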

Tarvainen Tapani

9:56 a.m.

On Wed, Mar 26, 2025 at 04:19:43AM +0000, Bill Jouris (b_jouris@yahoo.com) wrote:

...

for users of languages with diacritics, TLDs with and without the diacritics are, however regrettably, seen as the same thing.

With due respect, I must disagree - simply because unless I'm mistaken, there are no TLDs with diacritics at the moment. Whether or not they *will* be seen as the same thing we don't know. It may even depend on what we do in this WG!

Looking forward to interesting discussion today,

-- Tapani Tarvainen

Michael Bauland

10:01 a.m.

Hei Tapani,

On 26.03.2025 at 10:56, Tarvainen Tapani via Gnso-latin-diacritics wrote:

...

On Wed, Mar 26, 2025 at 04:19:43AM +0000, Bill Jouris (b_jouris@yahoo.com) wrote:

...
for users of languages with diacritics, TLDs with and without the diacritics are, however regrettably, seen as the same thing.

With due respect, I must disagree - simply because unless I'm mistaken, there are no TLDs with diacritics at the moment.

This is easy to answer. There actually are: .vermögensverwaltung and .vermögensverwalter

...

Whether or not they *will* be seen as the same thing we don't know.

If I see vermogensverwaltung, I would be confused. I recognise the word (it should be Vermögensverwaltung), but then again, I would also be confused because it's lower-case. It should be Vermögensverwaltung. Seeing it with a small letter looks just as wrong as seeing it without the two dots above the o. This is just my personal view.

Cheers, Michael

Tapani Tarvainen

10:11 a.m.

On Wed, Mar 26, 2025 at 11:01:31AM +0100, Michael Bauland via Gnso-latin-diacritics (gnso-latin-diacritics@icann.org) wrote:

...

On 26.03.2025 at 10:56, Tarvainen Tapani via Gnso-latin-diacritics wrote:

...
On Wed, Mar 26, 2025 at 04:19:43AM +0000, Bill Jouris (b_jouris@yahoo.com) wrote:

...

...
...
for users of languages with diacritics, TLDs with and without the diacritics are, however regrettably, seen as the same thing.

With due respect, I must disagree - simply because unless I'm mistaken, there are no TLDs with diacritics at the moment.

This is easy to answer. There actually are:

.vermögensverwaltung and .vermögensverwalter

I stand corrected. But my point still stands: whether or not such TLDs will be seen as the same thing by users in the future is not obvious. People may well get used to seeing the difference, especially in cases where the ASCII and diacritic version of the word are both meaningful but with different meanings. -- Tapani Tarvainen

Tarvainen Tapani

9:48 a.m.

Comments inline. On Tue, Mar 25, 2025 at 04:17:01PM +0000, Bill Jouris (b_jouris@yahoo.com) wrote:

...

So let's look at the type case which motivated our PDP: Quebec vs Québec.

That is really not a representative example, as "quebec" is not a word.

...

Clearly e and é are not visually confusable

I tend to think the very notion of "visual" confusability is a misguided attempt to find a language-independent way of measuring similarity. It won't work. For example, I can tell from experience that many if not most Finns *will* confuse é and è, even though a Frenchman would not. Just like pretty much all Chinese characters look confusingly similar to me. Similarity is indeed in the eyes of the beholder.

...

They [Quebec and Québec] are not visually confusable, but rather functionally interchangeable/confusable.

Agreed. But that's not true with diacritics in general.

...

Now there are doubtless some cases, in some languages, where the addition of a diacritic does result in a different word.

Indeed there are. Lots of them in lots of languages.

To pester you with Finnish again, ä and ö are common letters, whose similarity to a and o is merely coincidental, and removing the dots will frequently produce a completely different word, even if it sometimes produces a non-word whose intent can be guessed. The same can occur with š and ž, although they're much less common.

To pick a random example, a Finn who doesn't speak English would find it hard to understand they'd have to write "saari" (island) when they mean "sääri" (leg). I can come up with lots of examples like that from Finnish alone. Others have already given some from other languages.

There are also cases where the ASCII version is a word with, er, negative connotations.

Not to mention proper names. A multilingual example: Sjoberg, Sjöberg and Sjøberg are all extant family names, and domains sjoberg.us, sjöberg.fi and sjøberg.no exist, along with sjöberg.se, sjøberg.dk, sjoberg.com &c.

As an aside, if you want to get an idea how ASCIIfying such languages feels, you can try writing English without any two common vowels, replacing them with whatever.

...

But the question before us is, I think, do we try to go thru all the hundreds of languages which use the Latin script, trying to identify those cases?

We obviously can't do that.

...

Or do we make a rule which can be applied across all cases involving a diacritic, whether they are real words or not?

We clearly should come up with a general rule. But it really cannot be done without linguistic considerations. If we try to do that, we'll just end up reinforcing the special status of English.

For well-known historical reasons English is the dominant language on the Internet. Users of other languages have indeed gotten used to accommodating it by abusing their language, shoehorning its words to fit into the Procrustean bed of the English alphabet.

But that is not a solution. It is a problem, indeed *the* problem IDNs attempt to solve. And it is only going to get worse as more and more non-English speakers get online.

So, while we cannot handle all languages in our small group, we can still come up with a rule that takes all languages into account.

Simple: require that whenever a new gTLD is being proposed, its possible significance in all languages must be considered.

It's not that hard: just look for it with your favourite search engine, use Google Translate with "Detect language" option, or ask ChatGPT or whatever. Then ask speakers of those languages their view.

-- Tapani Tarvainen
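Tapani's examples in this message (sääri / saari, and Sjoberg / Sjöberg / Sjøberg) can be made concrete with a small Python sketch. It folds a label by dropping combining marks after NFD decomposition. This is only one naive way of finding an "ASCII base", assumed here for illustration; it is not how the RZ-LGR or the String Similarity Review works, and the function name `strip_marks` is hypothetical.

```python
import unicodedata

def strip_marks(label: str) -> str:
    """Naively fold a label by dropping combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Two distinct Finnish words: the folded form of the first equals the second.
print(strip_marks("s\u00e4\u00e4ri"))  # input is sääri (leg)
print(strip_marks("saari"))            # input is saari (island)

# ö (U+00F6) decomposes to o + U+0308 and folds to o;
# ø (U+00F8) has no decomposition, so this fold leaves it untouched.
for name in ("sj\u00f6berg", "sj\u00f8berg", "sjoberg"):
    folded = strip_marks(name)
    print(name, "->", folded, "| pure ASCII:", folded.isascii())
```

The ø case suggests why a purely mechanical fold does not cover everything the group calls "diacritics": letters such as ø, æ, ß, ð and þ would need an explicit mapping table, and the fold says nothing about whether the resulting strings mean different things.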

Mark W. Datysgeld

10 a.m.

...

So, while we cannot handle all languages in our small group, we can still come up with a rule that takes all languages into account.

Simple: require that whenever a new gTLD is being proposed, its possible significance in all languages must be considered.

It's not that hard: just look for it with your favourite search engine, use Google Translate with "Detect language" option, or ask ChatGPT or whatever. Then ask speakers of those languages their view.

Maybe one thing to take into consideration here is that we are far from having all languages properly codified, never mind in a state advanced enough for widespread Internet usage. It's the issue we are facing at the moment in helping codify some indigenous languages of the American continent, in which some characters can't be mapped properly even if we wanted to, like the many cases where something similar to a ' (apostrophe) is present in the transliteration.

So it's unfortunately not as simple as that. We could theoretically backwards-proof this, but never future-proof it, which raises questions around whether doing that successfully is even possible.

But again, the discussions in this group are super interesting, and it will be a pleasure working through these questions with you all.

On 26 Mar 2025 06:48, Tarvainen Tapani via Gnso-latin-diacritics wrote:

...


-- Mark W. Datysgeld Director at Governance Primer [governanceprimer.com <https://governanceprimer.com>] Project Lead Developer at ICANNWiki [icannwiki.org <https://icannwiki.org/>]

Michael Bauland

10:23 a.m.

Hei Tapani,

thanks for your input.

I am not sure, however, whether this is something that falls into our PDP remit.

I fully agree that sääri and saari are different words. But so are many other examples, including ASCII-only ones like "just" and "jest".

In the context of TLD (application), I see the following possible outcomes (without our PDP), if both saari and sääri are applied for.

1. saari and sääri are not considered similar by the review panel. => both TLDs can co-exist.

2. saari and sääri are considered similar by the review panel. => The TLDs go into a contention set. Only one can be approved. The other will be rejected.

Now, if our PDP were to decide that ä vs. a is an exception case, then we have the following situation.

1. => no change, still both TLDs can co-exist.

2. There are now two sub-cases:

a) The applicant for saari is different from the applicant for sääri. => Same as previous situation: contention set, and only one can win.

b) The applicant for saari is the same as for sääri. => Depending on our exact recommendations, most likely saari and sääri can co-exist with some special requirements.

Language considerations are therefore irrelevant to our work. Language considerations may or may not be relevant to the String Similarity Review Panel, but that's out of scope. We don't know whether they consider languages when deciding whether two labels are too similar.

As a Finn you might want to reject a decision that saari and sääri are too similar (if that were to be decided by the String Similarity Panel), but this is out of scope for our PDP. We have to take the String Similarity Panel as a given fact. It exists. It will decide one way or another. We do not discuss their decisions/outcomes. We only deal with situations in which they decide: these strings should be rejected, because they are confusingly similar.

Cheers, Michael
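The case analysis in Michael's message can be written down as a small decision function. This is a rough Python sketch of the outcomes exactly as he lists them, not of any adopted policy: the names `Outcome` and `outcome` and the three boolean inputs are assumptions made for illustration, and case 2 b) is only what he calls the "most likely" result.

```python
from enum import Enum

class Outcome(Enum):
    COEXIST = "both TLDs can co-exist"
    CONTENTION = "contention set: only one can be approved"
    COEXIST_WITH_REQUIREMENTS = "most likely co-exist, with special requirements"

def outcome(found_similar: bool, exception_applies: bool, same_applicant: bool) -> Outcome:
    """Map the review panel's finding and the applicants to an outcome."""
    if not found_similar:
        # Case 1: the panel does not consider the labels similar.
        return Outcome.COEXIST
    if exception_applies and same_applicant:
        # Case 2 b): the exception exists and one applicant holds both labels.
        return Outcome.COEXIST_WITH_REQUIREMENTS
    # Case 2 without the PDP's exception, or case 2 a) with different applicants.
    return Outcome.CONTENTION

# Walk through the saari / sääri scenarios from the message.
for similar, exception, same in [
    (False, True, False),
    (True, False, True),
    (True, True, False),
    (True, True, True),
]:
    print(similar, exception, same, "->", outcome(similar, exception, same).value)
```

Note that no input to the function describes the language or meaning of either label, which is the point the message makes about scope; whether case 2 a) should stay a plain contention set is the question raised in the reply that follows.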

Tapani Tarvainen

10:42 a.m.

On Wed, Mar 26, 2025 at 11:23:21AM +0100, Michael Bauland via Gnso-latin-diacritics (gnso-latin-diacritics@icann.org) wrote:

...

Hei Tapani,

thanks for your input.

I am not sure, however, whether this is something that falls into our PDP remit.

I'm not sure either. But that's what we're discussing now, that's why I raised it under the subject "Scope issues". Looking at your analysis below, it seems to me the special requirements in case 2 b) might well depend on linguistic considerations. In case 2 a), does our mandate exclude making an exception with special requirements to allow the domains to different applicants?

...

I fully agree that sääri and saari are different words. But so are many other examples including ASCII-only like "just" and "jest".

In the context of TLD (application), I see the following possible outcomes (without our PDP), if both saari and sääri are applied for.

1. saari and sääri are not considered similar by the review panel. => both TLDs can co-exist.

2. saari and sääri are considered similar by the review panel. => The TLDs go into a contention set. Only one can be approved. The other will be rejected.

Now, if our PDP were to decide that ä vs. a is an exception case, then we have the following situation.

1. => no change, still both TLDs can co-exist.

2. There are now two sub-cases:

a) The applicant for saari is different from the applicant for sääri. => Same as previous situation: contention set, and only one can win.

b) The applicant for saari is the same as for sääri. => Depending on our exact recommendations, most likely saari and sääri can co-exist with some special requirements.

Language considerations are therefore irrelevant to our work.

-- Tapani Tarvainen

Michael Bauland

10:54 a.m.

Hi Tapani,

On 26.03.2025 at 11:42, Tapani Tarvainen via Gnso-latin-diacritics wrote:

...

On Wed, Mar 26, 2025 at 11:23:21AM +0100, Michael Bauland via Gnso-latin-diacritics (gnso-latin-diacritics@icann.org) wrote:

...
Hei Tapani,

thanks for your input.

I am not sure, however, whether this is something that falls into our PDP remit.

I'm not sure either. But that's what we're discussing now, that's why I raised it under the subject "Scope issues".

yes, that's what we're here for and will do in the call(s), too.

...

Looking at your analysis below, it seems to me the special requirements in case 2 b) might well depend on linguistic considerations.

I personally don't think it does, in particular because domain names do not need to be words within a certain language. They can be arbitrarily invented words. But yes, let's discuss this.

...

In case 2 a), does our mandate exclude making an exception with special requirements to allow the domains to different applicants?

No, I don't think it excludes this. I personally don't think it's a good idea, but we certainly have to discuss this and I might be convinced otherwise depending on the requirements and the case.

Cheers, Michael
