Re: [Latingp] Homoglyphs within Latin script
Thanks Mats.

A follow up question, is there evidence that these two code points are used interchangeably in the languages the repertoire team analyzed? I ask because the IP will ask for more evidence of a variant relationship besides visual appearance. Per the Procedure “Generation Panels should ignore cases where the relation is based exclusively on aspects of visual similarity”.

-Dennis

From: Mats Dufberg <mats.dufberg@iis.se>
Date: Thursday, January 4, 2018 at 5:33 PM
To: Dennis Tan Tanaka <dtantanaka@verisign.com>
Cc: "latingp@icann.org" <latingp@icann.org>
Subject: [EXTERNAL] Re: [Latingp] Homoglyphs within Latin script

Dennis,

My conclusion is that both should be included since some languages use one and some languages the other. I think it would be reasonable to have a variant rule for those.

Yours,
Mats

---
Mats Dufberg
DNS Specialist, IIS
Mobile: +46 73 065 3899
https://www.iis.se/en/

From: "Tan Tanaka, Dennis" <dtantanaka@verisign.com>
Date: Thursday 4 January 2018 at 21:45
To: Mats Dufberg <mats.dufberg@iis.se>, "latingp@icann.org" <latingp@icann.org>
Subject: Re: [Latingp] Homoglyphs within Latin script

Mats,

Are you saying that the panel needs to do more analysis to determine which one to include in the repertoire or is the matter that both of them will be included but you are suggesting some variant rules? On the latter, IP briefly discussed about the two code points in the feedback regarding the Principles document.

-Dennis

From: Latingp <latingp-bounces@icann.org> on behalf of Mats Dufberg <mats.dufberg@iis.se>
Date: Sunday, December 31, 2017 at 12:41 PM
To: "latingp@icann.org" <latingp@icann.org>
Subject: [EXTERNAL] [Latingp] Homoglyphs within Latin script

There are many code points that are so similar that confusion is at risk, but there is at least one clear pair of homoglyphs:

0259 ə LATIN SMALL LETTER SCHWA
01DD ǝ LATIN SMALL LETTER TURNED E

They are not homoglyphs in upper case. In two of the languages that attest either of the two, it is not possible to say which because there is no upper case letter.

Mats
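The case relationship described in that first message can be checked against the Unicode character database. What follows is a minimal sketch using only Python's standard `unicodedata` module; the helper names `describe` and `case_pair` are made up for illustration and are not part of any LGR tooling.

```python
import unicodedata

SCHWA = "\u0259"     # ə LATIN SMALL LETTER SCHWA
TURNED_E = "\u01dd"  # ǝ LATIN SMALL LETTER TURNED E


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single code point."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def case_pair(ch: str) -> tuple[str, str]:
    """Describe a lower-case code point and its upper-case mapping."""
    return describe(ch), describe(ch.upper())


for ch in (SCHWA, TURNED_E):
    lower, upper = case_pair(ch)
    print(f"{lower}  ->  {upper}")
```

The two lower-case letters map to different capitals (U+018F and U+018E), which is the distinction the thread keeps returning to. In a language that has no attested upper-case form, that distinction is not available to tell the two apart.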
Hi Dennis, hi Mats,

On 05.01.2018 16:02, Tan Tanaka, Dennis via Latingp wrote:
> Thanks Mats.
>
> A follow up question, is there evidence that these two code points are
> used interchangeably in the languages the repertoire team analyzed? I
> ask because the IP will ask for more evidence of a variant relationship
> besides visual appearance. Per the Procedure “Generation Panels should
> ignore cases where the relation is based exclusively on aspects of
> visual similarity”.

I'm not sure we need additional evidence, because in this case it's not mere "visual similarity"; those two are actually the same. And I don't think we have any choice in the case of homoglyphs but to make them variants. On the contrary, I think we would have to argue our case if we wanted not to make them variants.

Michael

--
Dipl.-Informatiker Dr. Michael Bauland, Software Development
Knipp Medien und Kommunikation GmbH
Technologiepark, Martin-Schmeisser-Weg 9, 44227 Dortmund, Germany
Fon: +49 231 9703-0
Fax: +49 231 9703-200
SIP: Michael.Bauland@knipp.de
E-mail: Michael.Bauland@knipp.de
Register Court: Amtsgericht Dortmund, HRB 13728
Chief Executive Officers: Dietmar Knipp, Elmar Knipp
Hi Michael,

They are not the same character. They look alike in lower case, but are different in upper case (i.e. disunification by case property). The IP briefly discussed this case of 01DD and 0259 in their feedback to our Principles document and suggested that these two should not be variants. Hence my question about more evidence.

-Dennis
Hi Dennis,

On 05.01.2018 16:13, Tan Tanaka, Dennis wrote:
> Hi Michael,
>
> They are not the same character. They look alike in lower case, but are different in upper case (i.e. disunification by case property). The IP briefly discussed this case of 01DD and 0259 in their feedback to our Principles document and suggested that these two should not be variants. Hence my question about more evidence.

Sorry, I must have overlooked this. Which feedback are you talking about? Not the one from 2017-03-22, "GP Proposal Latin Script_Feedback_IP_V2F.docx", right? Is the document in the Latin GP Dropbox account?

Michael
Michael, we received this feedback through email (attached). Look for this paragraph:

“In the context of the Root Zone, the Procedure is quite clear in that it considers simple similarity of appearance to be outside the scope of the Root Zone LGR. In admitting exact homoglyphs, the IP has been making the argument that ‘e’ in Latin (U+0065) and ‘е’ in Cyrillic(U+0435) are not just visually indistinguishable, but that their distinct code points effectively represent a disunification by script property. (A disunification not unlike that of 01DD and 0259, which are disunified based on case, or the two sets of Arabic digits disunified largely on directional properties).”

-Dennis
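The quoted paragraph names two further disunifications: by script (Latin e versus Cyrillic е) and by directional properties (the two sets of Arabic digits). The sketch below looks at the properties involved, under two assumptions: the character name is used only as a rough stand-in for the script property, because Python's standard library does not expose Script directly, and the digit zero of each set (U+0660 and U+06F0) is picked here as a representative, since the feedback does not list individual code points. The helper name `summary` is hypothetical.

```python
import unicodedata

# Pairs that look alike but are encoded separately.
PAIRS = [
    ("\u0065", "\u0435"),  # Latin e vs Cyrillic ie (different scripts)
    ("\u0660", "\u06f0"),  # Arabic-Indic zero vs Extended Arabic-Indic zero
]


def summary(ch: str) -> dict[str, str]:
    """Collect the properties relevant to the disunification argument."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "bidi_class": unicodedata.bidirectional(ch),
    }


for left, right in PAIRS:
    for ch in (left, right):
        info = summary(ch)
        print(f'{info["code_point"]}  {info["name"]}  bidi={info["bidi_class"]}')
    print()
```

The two zeros report different bidirectional classes (AN and EN), which is one concrete reading of "disunified largely on directional properties".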
And yet, we sat there in Abu Dhabi and saw the Greek GP give a presentation where they listed as variants code points which were letter eta with a variety of diacritic marks. And the two members of the Integration Panel who were present made no complaint whatsoever about it.

So maybe, just maybe, the real criteria are not anywhere near as narrow as that quote would suggest.

Bill Jouris
Inside Products
bill.jouris@insidethestack.com
831-659-8360
925-855-9512 (direct)

Dear All,

Please find input from the Integration Panel in response to the call for comments on the principles documents.
Regards,
Sarmad

The Integration Panel (IP) has reviewed "Principles for Inclusion and Exclusion of Code Points in Latin Script for the Root Zone (Latin LGR)" and has the following comments:

The IP congratulates the Latin GP on the formulation of its "Principles for Inclusion and Exclusion of Code Points in Latin Script for the Root Zone (Latin LGR)". They appear to cover the important considerations and will likely serve the GP well in arriving at a list of proposed candidate code points. The IP would like to caution that the final decision on whether to include or exclude a code point may not be possible by rote application of these (or any other set of) principles, and that additional factors may have to be considered in individual cases. The IP is looking forward to the next stage of the Latin GP's work and to reviewing actual examples of draft code points.

Additional notes:

The IP would like to note that all entries in an LGR need to be in Unicode Normalization Form C (see RFC 7940) and further that IDNA requires NFC, even if it doesn't agree with the native typing order, or conventions regarding precomposed, decomposed or mixed composed usage. RFC 5890 states: "A "U-label" is an IDNA-valid string of Unicode characters, in Normalization Form C (NFC)". Because entries are normalized, dual encoding cannot exist. In creating the repertoire each combining sequence needs to be individually justified and should be separately enumerated; combining marks should not be individually members of the repertoire.

In applying these principles, attention must be paid to the foundational documents for this work as summarized in the "Guidelines for Developing Script-Specific Label Generation Rules for Integration into the Root Zone LGR".

Further, the exclusion principles should mention explicitly that the LGR repertoire is constrained by MSR: « A code point not in the latest version of the MSR is excluded. If there is a clear need to add one, the GP will contact the Integration Panel to assess the possibility of adding one to the MSR ».

The IP has reviewed "Analysis of Variants in the Latin Script for the Root Zone" and has the following comments:

The actual guiding principle (contained in the second paragraph of the document) appears to cover the important considerations and will likely serve the GP well in arriving at a list of proposed candidate variants. The IP would like to caution that the final decision on whether to include or exclude a variant may not be possible by rote application of this (or any other) principle, and that additional factors may have to be considered in individual cases. The IP is looking forward to the next stage of the Latin GP's work and to reviewing actual examples of draft variants.

Additional notes:

The IP has some concerns about the remainder of the document.

The procedure sets a very narrow limit on the kinds of cases that can be considered variants for the Root Zone; this is the basis of the statement by the IP that is quoted in a footnote. It might perhaps be better if this statement were incorporated into the definition of "scope". In that section, the opening remark about script mixing seems unconnected to the discussion that follows. A straight listing of which related scripts the GP will consider would be more useful.

The IP would like to point out that the example given in the document of Latin è (U+00E8) and Cyrillic ѐ (U+0450) may be moot because the final Cyrillic repertoire does not contain U+0450. In general, it is expected that the analysis of cross-script repertoires remain limited to code points that are in the respective scripts' LGRs or draft LGRs.

The general discussion of "classes of variants" may be "of interest to the reader", but it isn't helpful in understanding which principles the Latin GP will follow in deciding whether something is a variant or not -- most of the items discussed are not applicable in the context of the Root Zone LGR.

In the context of the Root Zone, the Procedure is quite clear in that it considers simple similarity of appearance to be outside the scope of the Root Zone LGR. In admitting exact homoglyphs, the IP has been making the argument that ‘e’ in Latin (U+0065) and ‘е’ in Cyrillic (U+0435) are not just visually indistinguishable, but that their distinct code points effectively represent a disunification by script property. (A disunification not unlike that of 01DD and 0259, which are disunified based on case, or the two sets of Arabic digits disunified largely on directional properties).

In the context of other script LGRs for the Root Zone, the IP has argued strongly against embodying rules intended to deal with spelling issues. Therefore, any orthographic variation (spelling differences) would require a very compelling case being made; the examples given may not rise to that level. For instance, ‘ss’ (U+0073 U+0073) and ‘ß’ (U+00DF) are separately available on the second level, in the .de ccTLD (and presumably others). This would strongly argue against the claim that German usage would require them to be variants - in fact the opposite might be concluded.

Consideration of established practice in existing Latin-based IDNs ought to be an important principle. The procedure makes reference to the "Least Astonishment Principle". This principle argues against solutions that produce unexpected or surprising behavior. Having the Root Zone exhibit fundamentally different design decisions with respect to variants than those found on the second level would have to be justified by strong arguments based on factors special to the Root Zone. Absent such factors, the expectation would be that the various levels are more or less compatible in their treatment of IDN labels for a given script.

Finally, the claimed normalization exceptions appear based on a misunderstanding of the normalization algorithm. In normalizing to precomposed form (Normalization Form C), the first step is to fully decompose the input string and then to reorder all combining marks in a canonical order. Because of that, the two examples of e with grave and dot below would become identical at that stage of normalization. In the final stage of the algorithm, as much of the sequence as possible is composed. But because both inputs have the same fully decomposed and reordered form, their final NFC form is identical. Or, put differently, only one of the two forms is in NFC, the other is unnormalized and as such not admissible in the LGR.

_______________________________________________
Latingp mailing list
Latingp@icann.org
https://mm.icann.org/mailman/listinfo/latingp
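The normalization argument at the end of the IP feedback can be reproduced with Python's standard `unicodedata.normalize`. This is only a sketch: the exact sequences used in the Latin GP's document are not quoted in this thread, so the two spellings of "e with grave and dot below" chosen here (`form_a` and `form_b`) are assumptions, and `code_points` is a hypothetical helper.

```python
import unicodedata

# Two assumed ways of writing "e with grave and dot below".
form_a = "\u00e8\u0323"  # è (precomposed e-grave) + COMBINING DOT BELOW
form_b = "\u1eb9\u0300"  # ẹ (precomposed e-dot-below) + COMBINING GRAVE ACCENT


def code_points(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Step 1: full decomposition plus canonical reordering of combining marks.
nfd_a = unicodedata.normalize("NFD", form_a)
nfd_b = unicodedata.normalize("NFD", form_b)

# Step 2: recomposition, which is what NFC adds on top of NFD.
nfc_a = unicodedata.normalize("NFC", form_a)
nfc_b = unicodedata.normalize("NFC", form_b)

print("NFD a:", code_points(nfd_a))
print("NFD b:", code_points(nfd_b))
print("NFC a:", code_points(nfc_a))
print("NFC b:", code_points(nfc_b))
print("a already in NFC:", form_a == nfc_a)
print("b already in NFC:", form_b == nfc_b)
```

Both inputs share one decomposed form and therefore one NFC form. With these particular inputs only `form_b` is already in NFC, which matches the IP's point that the other spelling is unnormalized rather than a second valid encoding.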
In lower case, they are equal. My interpretation of "security" is that we must include some variant or contextual rules that prevent two TLDs that differ only in those two code points.

I do not think we should try to interpret what the IP thinks. We should propose a solution we think is correct.

Mats
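One way to picture the kind of rule suggested in that last message is a check that treats two labels as conflicting when they differ only in code points from the same variant set. The sketch below is purely illustrative: it is not the LGR format defined in RFC 7940, and `VARIANT_SETS`, `build_map`, `variant_key` and `conflicts` are hypothetical names introduced here.

```python
# Hypothetical variant sets; only the pair discussed in this thread is listed.
VARIANT_SETS = [
    {"\u0259", "\u01dd"},  # ə SCHWA and ǝ TURNED E
]


def build_map(variant_sets: list[set[str]]) -> dict[str, str]:
    """Map every member of a variant set to one representative code point."""
    mapping: dict[str, str] = {}
    for members in variant_sets:
        representative = min(members)  # lowest code point, chosen arbitrarily
        for cp in members:
            mapping[cp] = representative
    return mapping


CANONICAL = build_map(VARIANT_SETS)


def variant_key(label: str) -> str:
    """Replace each code point with its representative to get a comparison key."""
    return "".join(CANONICAL.get(ch, ch) for ch in label)


def conflicts(label_a: str, label_b: str) -> bool:
    """True if two distinct labels differ only in variant code points."""
    return label_a != label_b and variant_key(label_a) == variant_key(label_b)


print(conflicts("az\u0259r", "az\u01ddr"))  # differ only in ə vs ǝ
print(conflicts("azer", "azar"))            # differ in ordinary letters
```

Whether such a relationship would be expressed as a blocked variant or through some contextual rule is exactly the open question in the thread; the sketch takes no position on that.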
participants (4): Bill Jouris; Mats Dufberg; Michael Bauland; Tan Tanaka, Dennis