Diacritics definition
Dear all,

Looking at the initial report recommendation 1, it gives the chosen definition of diacritics as a fact, without explaining or justifying it in any way.

It should, so that anyone who wonders why ø or ŋ are treated differently can look at the report and find an explanation. I also note the ISOC Norway comment that wants a technical reason for the exclusion of characters used in Norwegian and Sami.

I tried to think of how to explain it.

As I recall, the argument was that the charter specifies diacritics only, that Unicode defines them that way, and that if we want to include something else we'd have to go back to the council and there's no time for that.

But does Unicode really define them that way?

I went to the source to see what Unicode actually says about the matter, and importantly also what it does *not* say.

Not an easy task - there's *a lot* of stuff in there.

But try as I might, I could not find a general definition of "diacritic" anywhere. Perhaps I've missed it - if so, feel free to point it out.

There's lots of talk about combining diacritic marks and decomposable diacritics - but no claim that those are the only kind of diacritics. It's simply that for Unicode's purposes the non-decomposable diacritics don't need any special treatment, so there's no need to even define them.

For Unicode, decomposable diacritics are special because they need special rules for canonicalization. But for our purposes that doesn't matter.

In effect we took the Unicode table of decomposable diacritics and decided to treat it as if those were the only diacritics.

This decision was not given to us by the charter nor by Unicode, it was our choice.

I see no good reason not to allow all Unicode characters described in Latin-1 Supplement or any of the Latin Extended tables as

"LATIN SMALL LETTER <x> WITH <y>"

where <x> is an ASCII letter and <y> is any glyph that modifies the base letter (which is the general meaning of "diacritic mark") in a way that makes them confusable, regardless of the existence of a decomposed canonical equivalent, as long as they are in the RZ-LGR.

My favorite example here is of course "latin small letter o with stroke", ø.

If we conclude that there is a good reason for excluding such non-decomposable diacritical characters, we should spell it out.

I don't see any good *technical* reason to exclude even characters like ð, þ, ŋ, æ etc, but they are not diacritical and thus outside of the charter, and in some cases might require additional considerations, so I guess we'd have to go back to the council for those, or wait for another PDP ("Extended Latin Characters PDP" perhaps?).

(Side note: during a coffee break discussion in Mumbai someone said old issues shouldn't be relitigated *unless* there are new facts or new viewpoints to consider. Here I think there are.)

-- Tapani Tarvainen
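For anyone who wants to check the decomposability point directly, the following is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration, and the answers depend on the Unicode version bundled with the Python interpreter. It prints the character name and the raw decomposition mapping for é, ø and ŋ.

```python
import unicodedata

def describe(ch):
    """Return (Unicode name, raw decomposition mapping) for one character.

    The mapping is an empty string when the Unicode Character Database
    lists no decomposition for the character.
    """
    return unicodedata.name(ch), unicodedata.decomposition(ch)

# e with acute, o with stroke and eng, written as escapes so the file
# encoding cannot silently change them
for ch in ("\u00e9", "\u00f8", "\u014b"):
    name, mapping = describe(ch)
    print(f"U+{ord(ch):04X} {name}: {mapping or 'no decomposition'}")
```

The name of ø follows the "LATIN SMALL LETTER <x> WITH <y>" pattern even though its decomposition mapping is empty, while ŋ satisfies neither test.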
Dear all,

I support Tapani’s point.

The current formulation is too restrictive and insufficiently justified. From a Francophone and African perspective, many characters in the Latin Extended sets, such as ø, are effectively treated as diacritic variations of base letters in real usage. What matters here is confusability and linguistic function, not whether a character is decomposable in Unicode. Unicode does not impose such a limitation. This is therefore a policy choice, and it should be clearly explained in the report. The rationale for excluding non-decomposable forms needs to be made explicit. Tapani’s suggestion to consider the “LATIN SMALL LETTER WITH” pattern offers a more practical and linguistically grounded approach.

Best regards,
Emmanuel

On Mon, 23 Mar 2026 at 09:10, Tapani Tarvainen via Gnso-latin-diacritics <gnso-latin-diacritics@icann.org> wrote:
> [...]

_______________________________________________
Gnso-latin-diacritics mailing list -- gnso-latin-diacritics@icann.org
To unsubscribe send an email to gnso-latin-diacritics-leave@icann.org
Hello all,

To expand slightly on Emmanuel's point. The fact that a particular symbol is "not decomposable" simply means that Unicode does not currently include that particular diacritic among those which are recognized as combining diacritics. That is, to be blunt, a shortcoming in Unicode. They, for whatever reason, did not choose to include it.

We may be constrained by that (depending on exactly how we interpret our mandate). But we need to be clear on why we cannot decompose something that is clearly a Latin letter plus a diacritic.

I am inclined to think we should reconsider whether decomposability is really an appropriate restriction on our work.

Bill Jouris

On Monday, March 23, 2026 at 03:13:27 AM PDT, Emmanuel Vitus via Gnso-latin-diacritics <gnso-latin-diacritics@icann.org> wrote:
> [...]
Dear all,

I would like to simplify the issue a little, skip all the difficult cases and consider just these two competing definitions of diacritical character (simplified here):

(1) Unicode-decomposable
(2) LATIN SMALL LETTER ... WITH

* Both are well-defined and objective, they could be mechanically derived from Unicode tables.
* Both make it unambiguous what is the matching ASCII letter. There is no need to evaluate confusability in any subjective way.
* Eligibility change would be equally simple to do mechanically in either case: character name is no harder to get from Unicode tables than decomposability is.
* Both are stable in the sense that they're unlikely to change in future Unicode versions (apart from possible new additions). Although I think the former is actually more likely to change (that is, Unicode could specify a decomposition for a character that doesn't have it now), that's unlikely enough not to matter.
* The latter includes more letters and supports more languages. In that sense it would also be fairer.
* The latter is more closely aligned with linguistics.
* The latter is easier for non-specialists to understand.
* The former is what we currently have and we'd save ourselves some time and work by sticking with it.

Am I missing something important?

Having read Mark's scoping exercise, I agree that including all similarly confusable characters like ŋ or æ would be more difficult and require extra work, so leaving them for later makes sense.

But it's not a binary choice, narrowest possible or widest possible, and I think option (2) above would be a good compromise.

-- Tapani Tarvainen
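To illustrate what "mechanically derived" could look like for the two definitions, here is a rough Python sketch built on the standard `unicodedata` module. It is not the working group's tool. The scan range, the exact reading of "decomposable" (NFD yields an ASCII lowercase letter followed only by nonspacing marks), the name regular expression and all function names are assumptions made for this example.

```python
import re
import unicodedata
from string import ascii_lowercase

# Assumption: scan U+0080..U+02AF only; a real run would follow the RZ-LGR repertoire.
SCAN_RANGE = range(0x0080, 0x02B0)
NAME_PATTERN = re.compile(r"^LATIN SMALL LETTER ([A-Z]) WITH .+$")

def base_by_decomposition(ch):
    """Definition (1): ASCII base letter if NFD gives base + nonspacing marks."""
    nfd = unicodedata.normalize("NFD", ch)
    if len(nfd) < 2 or nfd[0] not in ascii_lowercase:
        return None
    if all(unicodedata.category(c) == "Mn" for c in nfd[1:]):
        return nfd[0]
    return None

def base_by_name(ch):
    """Definition (2): ASCII base letter read from the character name."""
    match = NAME_PATTERN.match(unicodedata.name(ch, ""))
    return match.group(1).lower() if match else None

def eligible(base_of):
    """Map each character in SCAN_RANGE accepted by base_of to its base letter."""
    result = {}
    for cp in SCAN_RANGE:
        base = base_of(chr(cp))
        if base is not None:
            result[chr(cp)] = base
    return result

by_decomposition = eligible(base_by_decomposition)
by_name = eligible(base_by_name)
only_by_name = sorted(set(by_name) - set(by_decomposition))
print(len(by_decomposition), len(by_name), "".join(only_by_name))
```

Both functions return the matching ASCII letter without any subjective judgement. The final line prints the size of each set and the characters accepted only under definition (2), which is where ø ends up.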
Thank you for offering a clear and actionable framing, Tapani. Let's bring it for group discussion for sure.

From my side, I commit to expanding my tool to allow us to look at what that table would look like. I'd do it right now, but I need some rest before the call...

On March 25, 2026 4:56:20 AM GMT-03:00, Tapani Tarvainen via Gnso-latin-diacritics <gnso-latin-diacritics@icann.org> wrote:
> [...]
Mark W. Datysgeld Governance Primer & ICANNWiki
Hi Tapani and all.

Thanks for bringing this topic back to the table, and for the approach taken in this email.

I confess I have both sympathy and skepticism about this issue of the small o with stroke plus some other characters that we have left out because they do not fit the definition we are using.

Starting with my reservations about moving in that direction: we need to provide a clear definition of what is acceptable for these Latin diacritic bundles. We can opt for either a list or a definition with one or more criteria.

The problem with the list is: how can it evolve? Would we need a new PDP if, say, Unicode changes something? Or if it turns out we made a mistake by picking or not picking a particular character?

The problem with the definition is that so far I have not seen one that really addresses the issue. We simply cannot say “all characters” or “the widest possible set”, or use terms such as “obvious”, “similar” or “reasonable”. If “do what is reasonable” sufficed, this whole PDP would not be here ;-)

We also have to bear in mind that we are dealing with an exception process, which must by nature be restrictive. Open-ended exceptions are not exceptions, they are overrulings ;-)

Now, let me go to the other side for a moment: I cannot see any material reason to exclude, say, Latin small letter o with stroke, except, as said, that it does not fit the working definition we have.

To give a practical example from a different but closely related scenario: when establishing the **second-level** IDN tables for the LATN script for the TLDs we manage, we considered that all “Latin letter x with y” characters are “pseudo variants” of the base Latin small letter. So, for instance, Latin small letter o and o with acute, o with grave, o with tilde... and o with stroke, yes, are all part of the same bundle, and the “same entity” principle applies. We did so precisely by the method Tapani describes in his last email.

Now, the difference is that for a second-level IDN table for a given TLD, if something needs to be changed, the Registry may submit a new table to IANA and apply the changes to the Registry. The list has a clear amendment process, with all the required caveats for existing domains, if you want.

So here we still need a formal definition, simple or complex (simple is always preferable, not always possible), that does not require a PDP every now and then for a single character to be added or removed.

I am not sure whether the method Tapani proposes solves all problems or, more importantly, creates some additional ones. I have not checked the whole list, not since we prepared the IDN tables I mentioned above, over 10 years ago in most cases. But it is an area to explore.

Otherwise, if we cannot find a definition that is clear and does not create additional problems, I would be reluctant to spend six additional months on the o with stroke, quite frankly.

For your info, here are the TLDs that use the second-level IDN tables for Latin script with these “diacritic bundles”, including the famous U+00F8, o with stroke:

.barcelona; .bcn; .erni; .eurovision; .eus; .gal; .lacaixa; .madrid; .radio; .scot; .seat; .sport; .swiss

Amadeu
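The bundle idea described above can be sketched in a few lines of Python. This is only an illustration of the "same entity" principle under the name-based rule. It is not the IDN tables or registry logic of any of the TLDs listed, and `bundle_base`, `fold` and `same_entity` are hypothetical names.

```python
import re
import unicodedata

# Assumption: a character belongs to the bundle of ASCII letter x when its
# Unicode name reads "LATIN SMALL LETTER X WITH ...".
NAME_PATTERN = re.compile(r"^LATIN SMALL LETTER ([A-Z]) WITH .+$")

def bundle_base(ch):
    """Return the ASCII base letter of ch's bundle, or ch itself if none applies."""
    match = NAME_PATTERN.match(unicodedata.name(ch, ""))
    return match.group(1).lower() if match else ch

def fold(label):
    """Replace every character of a label by the base letter of its bundle."""
    return "".join(bundle_base(ch) for ch in label)

def same_entity(label_a, label_b):
    """True when two labels collapse to the same string of bundle bases."""
    return fold(label_a) == fold(label_b)

# o with acute and o with stroke land in the same bundle as plain o
print(bundle_base("\u00f3"), bundle_base("\u00f8"), fold("k\u00f8benhavn"))
```

Characters whose names do not follow the pattern, such as ŋ, are left untouched by `fold`, so they never join an ASCII bundle under this rule.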
Message sent by Tapani Tarvainen via Gnso-latin-diacritics <gnso-latin-diacritics@icann.org> on 25 March 2026 at 8:56:
> [...]
Group,

Find attached an updated list of characters reflecting what would/could be included in the "LATIN SMALL LETTER WITH" thesis. That table is contained in pages 10 and 11.

Our tool's repository has been updated and the script can now generate lists with additional criteria based on different "thesis" arguments, of which the above is the first one: https://github.com/mark-wd/ASCII-Unicode-Diacritics-Analyzer-Tool

Regards,

On 23/03/2026 07:13, Emmanuel Vitus via Gnso-latin-diacritics wrote:
> [...]
-- Mark W. Datysgeld Director at Governance Primer [governanceprimer.com <https://governanceprimer.com>] Project Lead Developer at ICANNWiki [icannwiki.org <https://icannwiki.org/>]
Hi Tapani,

thank you very much for the additional analysis and for providing the info a few days before our next call, so everybody has the chance to look at it and we can discuss it during Wednesday's call.

I would like everybody to read and think about Tapani's arguments. If I understood this correctly, the question is mainly whether diacritics are only those characters that are decomposable into an ASCII character and one or more diacritics (i.e., Combining Diacritical Marks, https://www.unicode.org/charts/nameslist/n_0300.html), or whether characters that are not decomposable but still have a name like "LATIN SMALL LETTER O WITH STROKE", which denotes a composition of ASCII and something else, should also be considered diacritics.

Talk to you all on Wednesday,
Michael

--
Dr. Michael Bauland, Software Development
Knipp Medien und Kommunikation GmbH
E-Mail: Michael.Bauland@knipp.de
Following up, based on feedback during the WG call, please find a v2 of the document with a bounded 15 items list. Our Github has also been updated.

On 23/03/2026 06:10, Tapani Tarvainen via Gnso-latin-diacritics wrote:
> [...]
-- Mark W. Datysgeld Director at Governance Primer [governanceprimer.com <https://governanceprimer.com>] Project Lead Developer at ICANNWiki [icannwiki.org <https://icannwiki.org/>]
On Wed, Mar 25, 2026 at 11:41:56AM -0300, Mark W. Datysgeld (mark@governanceprimer.com) wrote:
Following up, based on feedback during the WG call, please find a v2 of the document with a bounded 15 items list. Our Github has also been updated.
Thank you, Mark. That is most helpful.

I did a quick review of the 15 characters in your "Thesis Section: LATIN SMALL LETTER [A-Z] WITH ..."

But first I'd like to note a couple of Unicode references:

"Diacritic: Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics."
https://www.unicode.org/reports/tr44/#Diacritic

"Diacritic. (1) A mark applied or attached to a symbol to create a new symbol that represents a modified or new value. (2) A mark applied to a symbol irrespective of whether it changes the value of that symbol. In the latter case, the diacritic usually represents an independent value (for example, an accent, tone, or some other linguistic information). Also called diacritical mark or diacritical. (See also combining character and nonspacing mark.)"
https://www.unicode.org/glossary/#diacritic

Unfortunately those descriptions, or definitions if you like, don't include or reference a list of diacritics or a tool for determining them, but they make it clear enough that decomposability is not a definitive characteristic.

I don't recall any mention of these, or of anything much else in Unicode apart from the table of decomposable diacritics, in our discussion at the time. So I would not agree that the issue was thoroughly discussed.

Then to those 15 letters.

Almost all of them are listed here:
https://en.wikipedia.org/wiki/Category:Latin_letters_with_diacritics
but obviously Wikipedia is not a definitive source either.

They all have their individual Wikipedia pages, too, with some (incomplete) information on the languages and standards using them (I confess to not having checked the references, this is just cut'n'paste from Wikipedia):

ø (LATIN SMALL LETTER O WITH STROKE, U+00F8)
    Danish, Norwegian, Faroese, Southern Sámi, several languages of Africa like Lendu
đ (LATIN SMALL LETTER D WITH STROKE, U+0111)
    African reference alphabet, Kven, Latin, South Slavic languages, Sámi, Vietnamese, old Spanish
ł (LATIN SMALL LETTER L WITH STROKE, U+0142)
    Polish, Kashubian, Sorbian, Silesian, Belarusian Latin, Ukrainian Latin, Kurdish, Wymysorys, Navajo, Dëne Sųłıné, Iñupiaq, Zuni, Hupa, Sm'algya̱x, Nisga'a, Dogrib
ŧ (LATIN SMALL LETTER T WITH STROKE, U+0167)
    Sámi, Hualapai, several orthographies for African languages
ƈ (LATIN SMALL LETTER C WITH HOOK, U+0188)
    African languages such as Serer
ƙ (LATIN SMALL LETTER K WITH HOOK, U+0199)
    Hausa and Karai-karai
ƥ (LATIN SMALL LETTER P WITH HOOK, U+01A5)
    African languages such as Serer
ƭ (LATIN SMALL LETTER T WITH HOOK, U+01AD)
    Serer
ƴ (LATIN SMALL LETTER Y WITH HOOK, U+01B4)
    African languages, such as Fula, Hausa and Noon
ɓ (LATIN SMALL LETTER B WITH HOOK, U+0253)
    International African Alphabet, various languages, notably Fula, Hausa, and Giziga
ɖ (LATIN SMALL LETTER D WITH TAIL, U+0256)
    African reference alphabet, Ewe, Fon, Aja, Bassa
ɗ (LATIN SMALL LETTER D WITH HOOK, U+0257)
    African reference alphabet, Fula, Hausa
ɠ (LATIN SMALL LETTER G WITH HOOK, U+0260)
    African reference alphabet
ɲ (LATIN SMALL LETTER N WITH LEFT HOOK, U+0272)
    some African languages, e.g. Bambara and Fula
ɨ (LATIN SMALL LETTER I WITH STROKE, U+0268)
    IPA

You will notice that, like Emmanuel said, most of them are used in various African languages. Omitting them on a technicality might give the impression that ICANN doesn't care so much about Africa, even though that obviously isn't our intent.

I hope this will help in our discussion of the subject.

Regards,

-- Tapani Tarvainen
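The names and code points in the list above can be cross-checked against the Unicode data shipped with Python. The sketch below is only a sanity check, not part of Mark's tool. `LISTED` and `check` are names invented here, and the result depends on the Unicode version of the interpreter.

```python
import unicodedata

# (code point, expected name) pairs copied from the 15-item list above
LISTED = [
    (0x00F8, "LATIN SMALL LETTER O WITH STROKE"),
    (0x0111, "LATIN SMALL LETTER D WITH STROKE"),
    (0x0142, "LATIN SMALL LETTER L WITH STROKE"),
    (0x0167, "LATIN SMALL LETTER T WITH STROKE"),
    (0x0188, "LATIN SMALL LETTER C WITH HOOK"),
    (0x0199, "LATIN SMALL LETTER K WITH HOOK"),
    (0x01A5, "LATIN SMALL LETTER P WITH HOOK"),
    (0x01AD, "LATIN SMALL LETTER T WITH HOOK"),
    (0x01B4, "LATIN SMALL LETTER Y WITH HOOK"),
    (0x0253, "LATIN SMALL LETTER B WITH HOOK"),
    (0x0256, "LATIN SMALL LETTER D WITH TAIL"),
    (0x0257, "LATIN SMALL LETTER D WITH HOOK"),
    (0x0260, "LATIN SMALL LETTER G WITH HOOK"),
    (0x0272, "LATIN SMALL LETTER N WITH LEFT HOOK"),
    (0x0268, "LATIN SMALL LETTER I WITH STROKE"),
]

def check(code_point, expected_name):
    """Return (name matches, no decomposition mapping) for one list entry."""
    ch = chr(code_point)
    return (unicodedata.name(ch) == expected_name,
            unicodedata.decomposition(ch) == "")

for cp, name in LISTED:
    print(f"U+{cp:04X}", chr(cp), check(cp, name))
```

Each entry is expected to report a matching name and an empty decomposition mapping, which is exactly why these characters fall outside the decomposability-based definition.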
Tapani,
You are correct that Unicode does not define a diacritic solely by decomposability.
However, canonical decomposition is the only Unicode mechanism that encodes a relationship (and note that the word relationship is central here) between a precomposed character and a base-plus-mark sequence. In Unicode terms, characters with canonical decomposition mappings in 'UnicodeData.txt' have 'Decomposition_Type=Canonical', and normalization exposes that relationship through NFD.
By contrast, the 'Diacritic' property is informative, and Unicode expressly notes that some diacritics are not combining characters and some combining characters are not diacritics. So it's not a derivational test for whether a precomposed Latin character is structurally equivalent to a base letter plus combining mark(s). 'General_Category' has the same limitation in a different way: it classifies marks as characters, but it does not encode whether another code point is derived from a base-plus-mark sequence.
Going back to the main point of my letter: we arrived at the only normative, machine-testable Unicode mechanism that proves such a relationship exists.
I like your "Small Latin Letter +", nevertheless. And I'll keep providing tools for the WG to decide the direction that everyone finds best.
Regards,
On March 29, 2026 4:32:42 AM GMT-03:00, Tapani Tarvainen via Gnso-latin-diacritics <gnso-latin-diacritics@icann.org> wrote:
On Wed, Mar 25, 2026 at 11:41:56AM -0300, Mark W. Datysgeld (mark@governanceprimer.com) wrote:
Following up, based on feedback during the WG call, please find a v2 of the document with a bounded 15 items list. Our Github has also been updated.
-- Tapani Tarvainen _______________________________________________ Gnso-latin-diacritics mailing list -- gnso-latin-diacritics@icann.org To unsubscribe send an email to gnso-latin-diacritics-leave@icann.org
Mark W. Datysgeld Governance Primer & ICANNWiki
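The canonical decomposition test that Mark describes (a mapping with Decomposition_Type=Canonical, exposed through NFD) could be sketched in Python along the following lines. This is an illustration only: the function name `ascii_base_via_nfd` is invented here, the check that every trailing character has General_Category `Mn` is an assumption about what "base-plus-mark" should mean, and results depend on the Unicode version shipped with the interpreter.

```python
import unicodedata

def ascii_base_via_nfd(ch):
    """Return the ASCII base letter if `ch` canonically decomposes to an
    ASCII lowercase letter followed only by nonspacing marks, else None."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) < 2:
        return None  # NFD left it unchanged: no canonical decomposition
    base, marks = decomposed[0], decomposed[1:]
    if not ("a" <= base <= "z"):
        return None  # the base is not an ASCII lowercase letter
    if all(unicodedata.category(m) == "Mn" for m in marks):
        return base
    return None

print(ascii_base_via_nfd("\u00e9"))  # e with acute
print(ascii_base_via_nfd("\u1ec5"))  # e with circumflex and tilde
print(ascii_base_via_nfd("\u00f8"))  # o with stroke
```

Under this test é and ễ both resolve to `e`, while ø yields `None` because Unicode defines no decomposition mapping for it. That is the gap the rest of the thread is about.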
Tapani, Mark,
From a Council liaison perspective, I just wanted to offer a small observation that might help as you continue this discussion.
It seems like this exchange is surfacing two different, but equally valid, questions: one around how to define a boundary that is technically testable and implementable, and another around how to ensure that boundary appropriately reflects real-world language use and expectations around inclusion. The current discussion is engaging both at once, which may be making it harder to converge.
It might be helpful to separate those threads slightly and consider:
– what constitutes a clear, objective, and machine-testable baseline;
– whether that baseline is sufficient for the WG’s goals; and
– if not, what characteristics any expansion would need to have in order to remain finite, predictable, and implementable.
One way to think about this, just as a framing device, is that the group may be defining a “floor” (what can be clearly and objectively established) and considering whether there is also a need for a “ceiling” (what the group ultimately wants to include), and how those two come together. The key question then becomes how to connect those in a way that remains finite, predictable, and implementable.
It may also be useful, if helpful to the group, to have a short, focused discussion with ICANN org to clarify what would be required from an implementation standpoint for any approach the WG is considering.
Just offering this in the spirit of supporting leadership as you work toward a path that balances clarity, inclusivity, and implementability.
All the best,
Christian Dawson
Executive Director i2Coalition <http://i2coalition.com/> c: 703 623 2612
On the subject of "machine testable", it may be worth noting that this isn't an obvious thing.
For example, while Unicode may not provide a file of entries defined by "Latin Small Letter [X] With...", it is entirely possible to write code which will parse the Unicode table and identify those. How big an effort that would be, and whether we think it worth it, are debatable. But "machine testable" isn't a bar to doing it.
Bill Jouris
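To give a sense of the effort involved in what Bill describes, here is a minimal Python sketch that walks every code point known to the interpreter's `unicodedata` module and keeps those whose character name has the form "LATIN SMALL LETTER [A-Z] WITH ...". The pattern `NAME_RULE` and the function `letters_matching_name_rule` are hypothetical names chosen for this example. Note that the scan covers all of Unicode, not the Latin RZ-LGR repertoire, so a real tool would still need to intersect the result with the RZ-LGR.

```python
import re
import sys
import unicodedata

# Assumed form of the name-based rule; a single ASCII letter must be
# followed directly by " WITH ".
NAME_RULE = re.compile(r"^LATIN SMALL LETTER ([A-Z]) WITH .+$")

def letters_matching_name_rule():
    """Map each matching character to the ASCII letter given in its name."""
    found = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # The second argument is returned for code points without a name.
        match = NAME_RULE.match(unicodedata.name(ch, ""))
        if match:
            found[ch] = match.group(1).lower()
    return found

table = letters_matching_name_rule()
print(len(table), "characters match in Unicode", unicodedata.unidata_version)
print(table["\u00f8"], table["\u00e9"], "\u00fe" in table)
```

Letters such as þ (THORN), ŋ (ENG) and æ (AE) fall outside the pattern because their names contain no "[A-Z] WITH" part, which matches the distinction drawn elsewhere in the thread between diacritical letters and other extended Latin letters.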
I think Christian raises a very strong point.
Bill, the more ad-hoc our implementation is, the more challenges we create downstream for adoption in terms of Universal Acceptance. So it needs to be a balancing act.
Best,
-- Mark W. Datysgeld Director at Governance Primer [governanceprimer.com <https://governanceprimer.com>] Project Lead Developer at ICANNWiki [icannwiki.org <https://icannwiki.org/>]
On Mon, Mar 30, 2026 at 07:08:46PM -0300, Mark W. Datysgeld (mark@governanceprimer.com) wrote:
the more ad-hoc our implementation is, the more challenges we create downstream for adoption in terms of Universal Acceptance.
This is an interesting point. But while it is plausible that adding new characters will have an impact on Universal Acceptance, that is not really a good reason not to add them. That would be a defeatist, indeed almost a circular argument. The whole point of Universal Acceptance is to get all languages and their scripts universally accepted, and deciding not to support some languages because it would make it easier to get the rest accepted is rather obviously against the spirit of UA, if not the letter. -- Tapani Tarvainen
Dear all,
Let's consider three alternatives:
(1) Decomposable diacritics only.
This is a technically easy, single, straightforward rule. The downside is that it excludes several diacritical letters.
(2) Include also "SMALL LATIN LETTER [A-Z] WITH ..."
Also technically easy (how long did it take Mark to add this to his tool?), only a bit more complicated in that we'd need two different rules. It provides an unambiguous, machine-testable set of letters and relationships between the diacritics and their ASCII counterparts, and it adds 15 letters that are used in at least 35 languages.
(3) Add also letters that aren't real diacritics, like ŋ, æ, ð, þ etc, but that could still be handled the same way. These would need to be evaluated individually and automated processing would have to be based on tables. It would, however, cover even more languages.
I would argue that (2) is the closest match to our charter as it covers almost(?) all diacritics.
I would also argue that those extra letters and languages supported by (2) are at least symbolically significant.
I do not see a significant difference in technical difficulty between (1) and (2), but I am open to persuasion here. Staff input would be appreciated.
I don't think (3) would be technically all that difficult either, but it would certainly be at least stretching our mandate and it would require a significant amount of extra work.
Looking forward to an interesting discussion tomorrow,
-- Tapani Tarvainen
I've become strongly in favor of 1 + 2 from my perspective as a group member. But from a VC perspective, I naturally leave this open to the group.
The more I study 1 + 2, the more it looks like an implementable rule. We would not need to ship a library or a large table with exceptions, but only define two rules that developers can incorporate where needed. That makes adoption much more feasible at the application level, where these ASCII-IDN equivalences would need to be recognized beyond the DNS itself, for example in string comparison, validation, normalization, display, and matching.
My UA claim only really applies beyond 1 + 2, where we establish truly ad-hoc rules.
Best,
-- Mark W. Datysgeld Director at Governance Primer [governanceprimer.com <https://governanceprimer.com>] Project Lead Developer at ICANNWiki [icannwiki.org <https://icannwiki.org/>]
Hi all,
thanks to everybody who provided input to this topic. Really great discussions with interesting arguments. I won't join the discussion with my own opinion, but would just like to mention two facts:
On 31.03.2026 at 18:37, Tapani Tarvainen via Gnso-latin-diacritics wrote:
Dear all,
Let's consider three alternatives:
(1) Decomposable diacritics only.
This is technically easy, single, straightforward rule. The downside is that it excludes several diacritical letters.
(2) Include also "SMALL LATIN LETTER [A-Z] WITH ..."
Also technically easy (how long did it take Mark to add this to his tool?), only a bit more complicated in that we'd need two different rules.
We would not necessarily need to implement two rules. The rule for (2) would already include all cases of (1). All decomposable diacritics already have a corresponding name, e.g., "LATIN SMALL LETTER E WITH ACUTE", which is part of (1).
It provides unambiguous, machine-testable set of letters and relationships between the diacritics and their ASCII counterparts, and it adds 15 letters that are used in at least 35 languages.
(3) Add also letters that aren't real diacritics, like ŋ, æ, ð, þ etc, but that could still be handled the same way. These would need to be evaluated individually and automated processing would have to be based on tables. It would, however, cover even more languages.
The problem with these is not only that they aren't real diacritics (and we are restricted to diacritics); some of them also don't map clearly to an ASCII letter. Take for example þ (U+00FE, LATIN SMALL LETTER THORN): would the base ASCII letter be "p" or "b"? Both choices are equally viable.
Let's continue the discussion during our call.
Cheers,
Michael
--
Knipp Medien und Kommunikation GmbH
Technologiepark, Martin-Schmeisser-Weg 9, 44227 Dortmund, Germany
Dr. Michael Bauland, Dipl.-Informatiker, Software Development
Fon: +49 231 9703-0 Fax: +49 231 9703-200
SIP: Michael.Bauland@knipp.de E-mail: Michael.Bauland@knipp.de
Register Court: Amtsgericht Dortmund, HRB 13728
Chief Executive Officers: Dietmar Knipp, Elmar Knipp
Certified according DIN ISO/IEC 27001:2017
We would not necessarily need to implement two rules. The rule for (2) would already include all cases of (1). All decomposable diacritics already have a corresponding name, e.g., "LATIN SMALL LETTER E WITH ACUTE", which is part of (1).
What do you know, our Chair is an actual genius. He just created the easiest UA implementation rule ever.
-- Mark W. Datysgeld Director at Governance Primer [governanceprimer.com <https://governanceprimer.com>] Project Lead Developer at ICANNWiki [icannwiki.org <https://icannwiki.org/>]
Hi Mark,
On 01.04.2026 at 10:47, Mark W. Datysgeld wrote:
We would not necessarily need to implement two rules. The rule for (2) would already include all cases of (1). All decomposable diacritics already have a corresponding name, e.g., "LATIN SMALL LETTER E WITH ACUTE", which is part of (1). What do you know, our Chair is an actual genius. He just created the easiest UA implementation rule ever.
thanks for the kind words, but reading Tapani's comment, I see that I possibly have oversimplified this. It's a bit more complex:
If we just use "LATIN SMALL LETTER [ASCII] WITH [diacritic]", this would leave out the previously included letters with two diacritics. So it would need to be "LATIN SMALL LETTER [ASCII] WITH [diacritic] (AND [diacritic])".
But even then, another group of characters that would be missed (and have previously been included) are the ones that are not pre-composed. These would be the ones classified as Type 2 in Pitinan's e-mail, e.g.,
U+0061 U+0331 a̱ LATIN SMALL LETTER A + COMBINING MACRON BELOW
So, it appears the rule might be slightly more complex, in the end.
Cheers,
Michael
On Wed, Apr 01, 2026 at 11:44:33AM +0200, Michael Bauland via Gnso-latin-diacritics (gnso-latin-diacritics@icann.org) wrote:
I see that I possibly have oversimplified this. It's a bit more complex:
If we just use "LATIN SMALL LETTER [ASCII] WITH [diacritic]", this would leave out the previously included letters with two diacritics. So it would need to be "LATIN SMALL LETTER [ASCII] WITH [diacritic] (AND [diacritic])".
But even then, another group of characters that would be missed (and have previously been included) are the ones that are not pre-composed. These would be the ones classified as Type 2 in Pitinan's e-mail, e.g., U+0061 U+0331 a̱ LATIN SMALL LETTER A + COMBINING MACRON BELOW
So, it appears the rule might be slightly more complex, in the end.
Yes. There are in effect two, partially overlapping sets of diacritical letters:
* precomposed ones that have a single code point form; these would be included by LATIN SMALL LETTER [A-Z] WITH DIACRITIC(s)
* those that have a multi-character (decomposed) form, like LATIN SMALL LETTER [A-Z] + COMBINING DIACRITIC(s)
Most diacriticals actually belong to both of those sets, but some belong only to one or the other. So to cover all bases two rules would be needed.
In theory some character might be a combination of those, like LATIN SMALL LETTER A WITH ACUTE + COMBINING CARON, but I think there are none that exist only in such a form (that is, all such would also have one of the above forms).
-- Tapani Tarvainen
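The two partially overlapping sets described above could be covered by a pair of rules roughly like the following Python sketch. It is a simplified illustration under stated assumptions: `analyse` and `NAME_RULE` are names invented here, multiple marks in a name are split on " AND " as Michael suggests, the sequence rule accepts only an ASCII lowercase base followed by characters of General_Category `Mn`, and the mixed case Tapani mentions (a precomposed letter plus a further combining mark) is deliberately not handled.

```python
import re
import unicodedata

# Rule A: one precomposed code point, e.g. "... E WITH ACUTE" or
# "... E WITH CIRCUMFLEX AND TILDE".
NAME_RULE = re.compile(r"^LATIN SMALL LETTER ([A-Z]) WITH (.+)$")

def analyse(element):
    """Return (ASCII base letter, list of mark names) or None.

    `element` is either a single code point or a base letter followed
    by one or more combining marks.
    """
    if len(element) == 1:
        match = NAME_RULE.match(unicodedata.name(element, ""))
        if match:
            return (match.group(1).lower(), match.group(2).split(" AND "))
        return None
    # Rule B: ASCII base letter followed only by nonspacing marks.
    base, marks = element[0], element[1:]
    if "a" <= base <= "z" and all(unicodedata.category(m) == "Mn" for m in marks):
        return (base, [unicodedata.name(m) for m in marks])
    return None

print(analyse("\u00f8"))    # precomposed, no decomposition
print(analyse("\u1ec5"))    # precomposed, two marks in the name
print(analyse("a\u0331"))   # sequence without a precomposed form
```

A sequence whose base is not an ASCII letter, such as ɛ followed by combining marks, returns `None` here, so how such cases should be treated would still need a separate decision.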
Hi all,
For considerations (1) and (2), I'd like to point to the Latin script RZ-LGR, which may be useful for scoping the discussion, as a Latin gTLD string must be valid per the Latin script RZ-LGR. Please see the code point repertoire at https://www.icann.org/sites/default/files/packages/lgr/rz-lgr-6-latin-script....
There are two types of base character and diacritic encoding.
Type 1: The base character and the diacritic are encoded as a single code point. Examples:
U+00E0 à LATIN SMALL LETTER A WITH GRAVE
U+00E1 á LATIN SMALL LETTER A WITH ACUTE
Type 2: The base character and the diacritic(s) are encoded as separate code points. This is because there is no single code point for the character with that diacritic, and the character is required for some languages, e.g. Nuer in the examples below. Examples:
U+0061 U+0331 a̱ LATIN SMALL LETTER A + COMBINING MACRON BELOW
U+0065 U+0331 e̱ LATIN SMALL LETTER E + COMBINING MACRON BELOW
U+025B U+0331 U+0308 ɛ̱̈ LATIN SMALL LETTER OPEN E + COMBINING MACRON BELOW + COMBINING DIAERESIS
Please also note that a combining diacritic is not included in the repertoire on its own; it appears only as part of a sequence with a base character.
There are 20 cases of Type 2 encoding in the Latin RZ-LGR, supporting the following languages: Nuer, Guarani, Marshallese, Raga, Malagasy, Dinka, Cubeo, Yoruba.
Regards,
Pitinan
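The difference between the Type 1 and Type 2 encodings in Pitinan's message can be observed with NFC normalization: a sequence that has a precomposed form collapses to one code point, while a Type 2 sequence stays as it is. The Python sketch below illustrates this; `encoding_type` is a hypothetical helper, it assumes its input really is a letter carrying at least one diacritic, and it says nothing about whether the element is actually in the RZ-LGR repertoire.

```python
import unicodedata

def encoding_type(element):
    """Classify a letter-with-diacritic element by its NFC form."""
    nfc = unicodedata.normalize("NFC", element)
    if len(nfc) == 1:
        return "Type 1"  # a single precomposed code point exists
    if all(unicodedata.category(c) == "Mn" for c in nfc[1:]):
        return "Type 2"  # base letter plus combining mark(s) remain
    return "other"

for sample in ("a\u0300", "a\u0331", "\u025b\u0331\u0308"):
    code_points = " ".join(f"U+{ord(c):04X}" for c in sample)
    print(code_points, "->", encoding_type(sample))
```

Here a + COMBINING GRAVE ACCENT composes to U+00E0 and is reported as Type 1, whereas the two macron-below examples from the message remain sequences and are reported as Type 2.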
On Wed, Apr 01, 2026 at 09:49:58AM +0200, Michael Bauland via Gnso-latin-diacritics (gnso-latin-diacritics@icann.org) wrote:
(1) Decomposable diacritics only.
(2) Include also "SMALL LATIN LETTER [A-Z] WITH ..."
The rule for (2) would already include all cases of (1). All decomposable diacritics already have a corresponding name, e.g., "LATIN SMALL LETTER E WITH ACUTE", which is part of (1).
Would that include letters with multiple diacritics like ễ, ữ &c? -- Tapani Tarvainen
On Sun, Mar 29, 2026 at 07:57:19AM -0300, Mark W. Datysgeld (mark@governanceprimer.com) wrote:
You are correct that Unicode does not define a diacritic solely by decomposability.
Unfortunately I misunderstood that point when we discussed this last year.
However, canonical decomposition is the only Unicode mechanism that encodes a relationship (and note that the word relationship is central here) between a precomposed character and a base-plus-mark sequence.
I beg to differ. Or maybe I misunderstand you.
But the "SMALL LATIN LETTER ... WITH" naming makes the relationship between the named letter and its ASCII counterpart unambiguous and easy to determine automatically. It encodes the relationship in the name of the letter, and while that may sound, er, less technical, it is just as explicit and immutable in Unicode as decomposability, if not more so.
The only apparent downside would be that we'd need two criteria instead of one, but it would still be simple enough.
-- Tapani Tarvainen
participants (8)
- Amadeu Abril i Abril (CORE)
- Bill Jouris
- Christian Dawson
- Emmanuel Vitus
- Mark W. Datysgeld
- Michael Bauland
- Pitinan Kooarmornpatana
- Tapani Tarvainen