Re: [Latingp] Draft variants principles document - Dotless I
Bill, thanks for this. I have to question, though, the relevancy of your experiment. Is a visual test of running text relevant for internet identifiers? And on the subject of visual similarity, I believe this has been discussed extensively and this panel has agreed that visual similarity is outside the scope of our work.

The case of the “small dotless I” and “small letter I” is interesting because of the treatment under different locale settings. The focus of our analysis should be on that, taking into account the needs and expectations of different internet users, including the Turkish community.

-Dennis

From: Bill Jouris <bill.jouris@insidethestack.com>
Reply-To: Bill Jouris <bill.jouris@insidethestack.com>
Date: Thursday, April 5, 2018 at 2:05 PM
To: Dennis Tan Tanaka <dtantanaka@verisign.com>, Mats Dufberg <mats.dufberg@iis.se>, Michael Bauland <michael.bauland@knipp.de>
Cc: Mirjana Tasić <mirjana.tasic@rnids.rs>
Subject: [EXTERNAL] Re: [Latingp] Draft variants principles document - Dotless I

I've given some more thought to the Dotless I question. It occurred to me that there are actually two approaches to the question: analysis and experiment. So I ran an experiment. Here are the results:

A dozen subjects were tested. All were well-educated native speakers of English. Approximately 1/3 are involved in IT, but none are network experts and none are involved in ICANN.

The subjects were given a paragraph to read (on the subject of variants). In one word, the lower case I was replaced by a dotless I. The number of subjects who noticed when reading the paragraph: Zero.

The subjects were then told that the substitution had been made, that it was in the first sentence, and shown the dotless I for information. Half managed to locate the substitution in 1 or 2 re-reads of the sentence; half took 3 or more tries to spot the substitution – even though they knew what the substitution was and knew that it was there to find.
In short, misreading is the expected result of a substitution. Accordingly, it is again recommended that U+0069 and U+0131 be determined to be blocked variants.

Happily, the results are the same as the analysis. I have updated the document with this information.

Bill Jouris
Inside Products
bill.jouris@insidethestack.com
831-659-8360
925-855-9512 (direct)

________________________________
From: "Tan Tanaka, Dennis via Latingp" <latingp@icann.org>
To: "Tan Tanaka, Dennis via Latingp" <latingp@icann.org>
Sent: Thursday, April 5, 2018 9:41 AM
Subject: [Latingp] Draft variants principles document

Need assistance with developing the sections for special cases:
https://docs.google.com/document/d/1IrT_kfildf1SumYUqjkaIkMT-TYx9IRqtuPMV4Yv...

Thanks,
Dennis

_______________________________________________
Latingp mailing list
Latingp@icann.org
https://mm.icann.org/mailman/listinfo/latingp
Dennis,

I believe that a visual test of running text is relevant for Internet identifiers. Internet identifiers are often found in running text. What Bill's test shows (which is no surprise to us) is that people have a hard time noticing minute differences of no relevance.

I think we all agree that homoglyph pairs should be handled by variant rules. And that is because they are visually similar to the extreme. At the ICANN 61 meeting, the IP actually proposed to the Japanese GP that some pairs of characters from different Unicode scripts (but from a Japanese perspective belonging to the Japanese script) be treated as variants even though those character pairs are not homoglyphs.

Yes, SMALL LETTER I and SMALL LETTER DOTLESS I are interesting because of the complexity of up-casing and down-casing in different locales. But if we are allowed to take upper case into consideration, there are other interesting cases. The upper-case forms of LATIN SMALL LETTER D WITH STROKE (U+0111), LATIN SMALL LETTER ETH (U+00F0) and LATIN SMALL LETTER D WITH TAIL (U+0256) are homoglyphs, which opens the door to injection of "false" domains.

Mats

---
Mats Dufberg
DNS Specialist, IIS
Mobile: +46 73 065 3899
https://www.iis.se/en/

From: "Tan Tanaka, Dennis" <dtantanaka@verisign.com>
Date: Thursday 5 April 2018 at 20:27
To: Bill Jouris <bill.jouris@insidethestack.com>, Mats Dufberg <mats.dufberg@iis.se>, Michael Bauland <michael.bauland@knipp.de>
Cc: Mirjana Tasić <mirjana.tasic@rnids.rs>, ICANN Latin GP <latingp@icann.org>
Subject: Re: [Latingp] Draft variants principles document - Dotless I

Bill, thanks for this. I have to question, though, the relevancy of your experiment. Is a visual test of running text relevant for internet identifiers? And on the subject of visual similarity, I believe this has been discussed extensively and this panel has agreed that visual similarity is outside the scope of our work.
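The upper-casing point in Mats's message can be checked directly. The following is a minimal sketch, assuming Python's built-in `str.upper()` (the locale-independent Unicode default case mapping) is an acceptable stand-in for whatever case folding a given application applies; the helper name `upper_forms` is made up for illustration. It only lists the upper-case code points; whether they actually render as homoglyphs depends on the font.

```python
import unicodedata

# The three lower-case code points named in the message above.
LOWER = ["\u0111", "\u00f0", "\u0256"]

def upper_forms(chars):
    """Return (lower, upper, upper code point, upper name) for each character."""
    rows = []
    for ch in chars:
        up = ch.upper()  # Unicode default case mapping, not locale-aware
        rows.append((ch, up, f"U+{ord(up):04X}", unicodedata.name(up)))
    return rows

for low, up, cp, name in upper_forms(LOWER):
    print(f"U+{ord(low):04X} {low} -> {cp} {up}  {name}")
```

The three results are distinct code points (U+0110, U+00D0, U+0189), so a comparison at the code-point level would not flag them as equal even where their glyphs look alike.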
Hi Mats, hi all,

sorry to drop in so late. I was sick for a week and then there was loads of work to catch up with.

On 06.04.2018 11:21, Mats Dufberg wrote:
Dennis,
I believe that a visual test of running text is relevant for Internet identifiers. Internet identifiers are often found in running text.
Without looking at the specific example, I agree with Dennis that running text should not be relevant. The human mind is trained to look at whole words when reading, rather than at every single character. There are many tricks you can do with letters/words, while the reader is still able to read the sentence without problems, and might not even notice there is something "wrong."

However, I disagree with Dennis when he said:
this panel has agreed that visual similarity is outside the scope of our work.
As stated in the draft document:

"A Latin code point will be deemed a variant with a cross-script code point when the two code points or sequence of code point are visually identical or nearly identical when the difference can be attributed to different font styles."

"nearly identical" to me is visual similarity. But of course, we need to look at all the cases to decide whether something is nearly identical or not. For the dotless I, I would vote they are not nearly identical (otherwise we could also have u and ü being nearly identical, and we shouldn't go that way).

Still, the dotless I remains in discussion because of the complexity of up-casing and down-casing, as Mats said.

Best regards,

Michael

--
Knipp Medien und Kommunikation GmbH
Technologiepark
Martin-Schmeisser-Weg 9
44227 Dortmund
Germany

Dipl.-Informatiker
Dr. Michael Bauland
Software Development

Fon: +49 231 9703-0
Fax: +49 231 9703-200
SIP: Michael.Bauland@knipp.de
E-mail: Michael.Bauland@knipp.de

Register Court: Amtsgericht Dortmund, HRB 13728
Chief Executive Officers: Dietmar Knipp, Elmar Knipp
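The up-casing and down-casing complexity referred to in this thread can be sketched in a few lines. This is only an illustration: Python's `str.upper()` and `str.lower()` are not locale-aware, so the Turkish-style rules are applied by hand in two hypothetical helpers, `upper_tr` and `lower_tr`, which are assumptions for this example and not part of any library or of the panel's documents.

```python
DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
CAPITAL_DOT_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

def upper_tr(s):
    """Upper-case with Turkish-style rules: i -> U+0130, U+0131 -> I."""
    return s.replace("i", CAPITAL_DOT_I).replace(DOTLESS_I, "I").upper()

def lower_tr(s):
    """Lower-case with Turkish-style rules: U+0130 -> i, I -> U+0131."""
    return s.replace(CAPITAL_DOT_I, "i").replace("I", DOTLESS_I).lower()

# Default (locale-independent) mapping: both letters share one upper-case
# form, so a round trip through upper case turns dotless i into dotted i.
default_round_trip = DOTLESS_I.upper().lower()

# Turkish-style mapping: the two letters stay distinct in both cases.
turkish_round_trip = lower_tr(upper_tr(DOTLESS_I))

print(repr(default_round_trip), repr(turkish_round_trip))
```

Under the default mapping the two lower-case letters collapse once a label passes through upper case, while under the Turkish-style mapping they remain separate, which is the locale dependence the discussion keeps returning to.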
Hi Michael,

Glad you are back. My remarks were made in the context of the I and dotless I, both of which are Latin. Within a script, only exact-match cases (or alternate-use cases) will be considered. For cross-script, the IP advised that slight font differences may be enough to consider two cross-script code points homoglyphs. This is because font design is more consistent within a script than across scripts.

-Dennis

On 4/19/18, 8:10 AM, "Michael Bauland" <Michael.Bauland@knipp.de> wrote:

Hi Mats, hi all,

sorry to drop in so late. I was sick for a week and then there was loads of work to catch up with.

On 06.04.2018 11:21, Mats Dufberg wrote:
> Dennis,
>
> I believe that a visual test of running text is relevant for Internet
> identifiers. Internet identifiers are often found in running text.

Without looking at the specific example, I agree with Dennis that running text should not be relevant. The human mind is trained to look at whole words when reading and not looking at every single character. There are many tricks you can do with letters/words, while the reader is still able to read the sentence without problems, and might not even notice there is something "wrong."

However, I disagree with Dennis, when he said:

> this panel has agreed that visual similarity is outside the scope of our work.

As stated in the draft document:

"A Latin code point will be deemed a variant with a cross-script code point when the two code points or sequence of code point are visually identical or nearly identical when the difference can be attributed to different font styles."

"nearly identical" to me is visual similarity. But of course, we need to look at all the cases to decide whether something is nearly identical or not. For the dotless I I would vote they are not nearly identical (otherwise we could also have u and ü being nearly identical, and we shouldn't go that way).