How should combining diacritic marks be handled?
Dear all,

MSR2 contains a number of combining diacritic marks, e.g. U+0323 COMBINING DOT BELOW. We may find that some of the languages that should be supported require that code point in combination with, say, "n", i.e. "U+006E U+0323". Let us assume that there is no pre-composed equivalent code point.

We can then justify the inclusion of U+0323. Will the Integration Panel then accept that code point in any context, or just in the specific context?

If the IP requires that we justify a combining diacritic mark for every context in which it will be allowed, then we have to go language by language to find all combinations to support.

If the IP agrees to include a combining diacritic mark for any context as long as it is justified for one language, then we can go code point by code point, provided that we can find justification for all Latin code points in MSR2 and we assume no more code points are needed.

If the purpose of our work is to create a Latin IDN table that supports all listed languages (EGIDS value 4 or 5 as decided), then I cannot see how we can achieve that without inspecting all those languages.

Yours,
Mats

---
Mats Dufberg
DNS Specialist, IIS
Mobile: +46 73 065 3899
https://www.iis.se/en/
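The assumption that a given base letter plus combining mark has no pre-composed equivalent can be checked mechanically. The following is a minimal sketch, not part of the original message: the helper name `has_precomposed_form` is hypothetical, and the pair "q" + U+0323 is only an illustrative stand-in for a sequence that NFC leaves uncomposed, not an attested requirement of any language. Note that for the illustrative pair "n" + U+0323, NFC does compose the sequence (to U+1E47), which is why the message treats the missing pre-composed form as an assumption.

```python
import unicodedata


def has_precomposed_form(base: str, mark: str) -> bool:
    """Return True if NFC composes base + mark into a single code point."""
    composed = unicodedata.normalize("NFC", base + mark)
    return len(composed) == 1


DOT_BELOW = "\u0323"  # U+0323 COMBINING DOT BELOW

# "n" + U+0323 composes to U+1E47, so this pair alone would not
# require the combining mark in the repertoire.
print(has_precomposed_form("n", DOT_BELOW))  # True

# Hypothetical example: "q" + U+0323 stays a two-code-point sequence in NFC.
print(has_precomposed_form("q", DOT_BELOW))  # False
```

Only sequences for which such a check returns False would need the combining mark at all.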
Hi Mats, hi all, On 16.01.2017 17:21, Mats Dufberg wrote:
MSR2 contains a number of combining diacritic marks, e.g. U+0323 COMBINING DOT BELOW. It might be that we find that some of the languages that should be supported requires that code point in combination with, say, "n", i.e. "U+006E U+0323". Let us assume that there is no pre-composed equivalent code point.
We can then justify the inclusion of U+0323. Will then the Integration Panel accept that code point in any context, or just in the specific context?
I assume that we will need to define the contexts in which those combining marks are allowed. At least we did this for the middle dot of the "ela geminada" in the Catalan language tables (see, e.g., http://www.iana.org/domains/idn-tables/tables/sap_ca_1.0.txt). But I guess Sarmad will know for certain.
If the IP requires that we justify combining diacritic marks for every context it will be allowed for, then we have to go language by language to find all combinations to support.
If the IP accepts to include a combining diacritic mark for any context as long as it is justified for one language, then we can go code point by code point as long as we can find justification for all Latin code points in MSR2 and we assume no more code points are needed.
If the purpose of our work is to create a Latin IDN table that supports all listed languages (EGIDS value 4 or 5 as decided) then I cannot see how we can achieve that without inspecting all those languages.
Going by language instead of going by character also has the advantage that we will be able to distribute the languages among the members of the group. Then everybody can work with a certain subset of all languages. If we distributed the characters, everybody would have to get acquainted with every single language.

Cheers,
Michael

--
Dr. Michael Bauland, Dipl.-Informatiker
Software Development
Knipp Medien und Kommunikation GmbH
Technologiepark, Martin-Schmeisser-Weg 9, 44227 Dortmund, Germany
Fon: +49 231 9703-284
Fax: +49 231 9703-200
SIP: Michael.Bauland@knipp.de
E-mail: Michael.Bauland@knipp.de
Register Court: Amtsgericht Dortmund, HRB 13728
Chief Executive Officers: Dietmar Knipp, Elmar Knipp
Michael,

The Middle Dot U+00B7 is special, since the IDNA standard requires it to be limited to the context between 'l' and 'l' ("U+006C U+00B7 U+006C") (https://tools.ietf.org/html/rfc5892#page-16).

Yours,
Mats
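The middle dot rule mentioned above can be sketched as follows. This is an illustrative reading of the contextual rule in RFC 5892 (U+00B7 is valid only when the code point before it and the code point after it are both U+006C), not an excerpt from any IDNA library; the function name `middle_dot_context_ok` is hypothetical, and the label is assumed to be already lowercased.

```python
MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT


def middle_dot_context_ok(label: str) -> bool:
    """Check that every U+00B7 in the label sits directly between two 'l'."""
    for i, ch in enumerate(label):
        if ch != MIDDLE_DOT:
            continue
        before = label[i - 1] if i > 0 else ""
        after = label[i + 1] if i + 1 < len(label) else ""
        if before != "l" or after != "l":
            return False
    return True


print(middle_dot_context_ok("col\u00b7legi"))  # True
print(middle_dot_context_ok("a\u00b7b"))       # False
```

A label without any middle dot passes trivially, since the rule constrains only the positions where U+00B7 occurs.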
Hi Michael, All,
I assume that we will need to define the context those combining marks are allowed. ... But I guess Sarmad will know for certain.
The IP addresses this question in Section 4.5, "Non-Spacing Combining Marks", of the Overview and Rationale document <https://www.icann.org/en/system/files/files/msr-2-overview-14apr15-en.pdf> released as part of MSR-2 <https://www.icann.org/resources/pages/msr-2015-06-21-en>. I encourage you all to review it. That section notes:

"The actual set of combining marks allowable in the LGR will be smaller than the set included in the MSR, because it will be limited to those marks that are actually required for at least one combining sequence not expressible in NFC. In addition, where the number of such attested sequences is known and limited, GPs are encouraged to enumerate the sequences where feasible, rather than adding the "bare" combining mark to the repertoire. This would serve to prevent such marks from combining with every other allowed code point in the GP's repertoire."

This suggests that where a pre-composed form is not encoded directly, it is preferred to include the combining mark only with the desired code point(s) as a sequence <https://tools.ietf.org/html/rfc7940#section-5.1>, to prevent over-generation.

Regards,
Sarmad
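The difference between enumerating sequences and adding a "bare" combining mark can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function `label_allowed`, the toy repertoire of "q" and "x", and the sequence "q" + U+0323 are hypothetical, and the greedy matching is a simplification rather than the evaluation procedure defined in RFC 7940.

```python
DOT_BELOW = "\u0323"  # U+0323 COMBINING DOT BELOW


def label_allowed(label: str, code_points: set, sequences: set) -> bool:
    """Greedily cover the label with enumerated sequences or single code points."""
    # Try longer sequences first so a sequence wins over its own first code point.
    ordered = sorted(sequences, key=len, reverse=True)
    i = 0
    while i < len(label):
        seq = next((s for s in ordered if label.startswith(s, i)), None)
        if seq is not None:
            i += len(seq)
        elif label[i] in code_points:
            i += 1
        else:
            return False
    return True


# Repertoire with the sequence enumerated: the mark is valid only after "q".
print(label_allowed("q" + DOT_BELOW + "x", {"q", "x"}, {"q" + DOT_BELOW}))  # True
print(label_allowed("x" + DOT_BELOW, {"q", "x"}, {"q" + DOT_BELOW}))        # False

# Repertoire with the bare mark: it now combines with every base letter.
print(label_allowed("x" + DOT_BELOW, {"q", "x", DOT_BELOW}, set()))         # True
```

The last call shows the over-generation the IP text warns about: once the bare mark is in the repertoire, it is accepted after any allowed code point, not only the one that justified it.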
participants (3)
- Mats Dufberg
- Michael Bauland
- Sarmad Hussain