Dear GP members,
Please find enclosed the proposed agenda for our call. Any comments, corrections and suggestions are welcome.
Regards Mirjana
_______________________________________________________________________________________________
AGENDA for the GP call on May the 7th 2020, 16:00 UTC
1. Latin Small Letter I (i) and Latin Small Letter Dotless I (ı): the analysis is already documented and only requires the final conclusion. Please find the relevant section of the proposal below.
2. Next tasks: discuss the priorities:
* Thorough reading and finalization of In-Script variant analysis
* Review proposed confusables for Similarity review
* Report finalization
* XML production and correction of possible errors
* Test data production
3. Next meeting May the 14th 2020, 16:00 UTC
4. AOB
D.5.2 Latin Small Letter Dotless I (ı) 0131
There are four Latin code points that have a special case (upper case/lower case) relationship:
· U+0069 Latin Small Letter I ("i")
· U+0049 Latin Capital Letter I ("I")
· U+0131 Latin Small Letter Dotless I ("ı")
· U+0130 Latin Capital Letter I with Dot Above ("İ")
In most locales, Latin Small Letter I (U+0069) is the lower case of Latin Capital Letter I (U+0049) and, conversely, Latin Capital Letter I (U+0049) is the upper case of Latin Small Letter I (U+0069). In those locales, Latin Capital Letter I (U+0049) is also the upper case of Latin Small Letter Dotless I (U+0131). These relationships are summarized in the following table:
Table D.8. Case Relationships for 0069, 0049, 0130, and 0131
Character | Process | Resulting Character | Process | Resulting Character
----------|---------|---------------------|---------|--------------------
Latin Small Letter I (U+0069) | up case → | Latin Capital Letter I (U+0049) | down case → | Latin Small Letter I (U+0069)
Latin Small Letter Dotless I (U+0131) | up case → | Latin Capital Letter I (U+0049) | down case → | Latin Small Letter I (U+0069)
Latin Capital Letter I with Dot Above (U+0130) | down case → | Latin Small Letter I (U+0069) | up case → | Latin Capital Letter I (U+0049)
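The default (locale-independent) relationships in Table D.8 can be observed with Python's built-in string methods. This is only an illustrative sketch; the helper name `roundtrip` is hypothetical. Note one assumption that differs from the table: Python applies the Unicode full case mapping, so lower-casing U+0130 yields U+0069 followed by U+0307 (Combining Dot Above), whereas the table shows the simple mapping to U+0069 alone.

```python
# Locale-independent case mapping, as implemented by Python's str methods.
SMALL_I = "\u0069"        # i
CAPITAL_I = "\u0049"      # I
DOTLESS_I = "\u0131"      # dotless i
CAPITAL_I_DOT = "\u0130"  # capital I with dot above


def roundtrip(ch: str) -> str:
    """Upper-case and then lower-case a string (hypothetical helper)."""
    return ch.upper().lower()


# Dotless i does not survive the round trip: it comes back as plain i.
print(roundtrip(DOTLESS_I) == SMALL_I)

# Full case mapping: U+0130 lower-cases to "i" + U+0307, not to "i" alone.
print([hex(ord(c)) for c in CAPITAL_I_DOT.lower()])
```

The round trip through upper case is the point of interest: the distinction between the two small letters is lost as soon as a label passes through upper case.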
In two locales, Turkish and Azeri, the case relationship is different. In those locales, Latin Small Letter I and Latin Capital Letter I with Dot Above are the lower case and upper case of each other, and so are Latin Small Letter Dotless I and Latin Capital Letter I. These relationships are summarized in the following table:
Table D.9. Case Relationships in Turkish and Azeri Locales
Character | Process | Resulting Character | Process | Resulting Character
----------|---------|---------------------|---------|--------------------
Latin Small Letter I | up case → | Latin Capital Letter I with Dot Above | down case → | Latin Small Letter I
Latin Small Letter Dotless I | up case → | Latin Capital Letter I | down case → | Latin Small Letter Dotless I
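Python's standard string methods have no locale parameter, so the Turkish and Azeri relationships in Table D.9 have to be modelled explicitly. The following is a minimal sketch under that assumption; the names `tr_lower` and `tr_upper` are hypothetical and only the four code points discussed here are special-cased.

```python
# Turkish/Azeri overrides for the four code points in Table D.9.
_TR_LOWER = {0x0049: 0x0131,  # I -> dotless i
             0x0130: 0x0069}  # capital I with dot above -> i
_TR_UPPER = {0x0069: 0x0130,  # i -> capital I with dot above
             0x0131: 0x0049}  # dotless i -> I


def tr_lower(text: str) -> str:
    """Lower-case text with Turkish/Azeri handling of I (hypothetical helper)."""
    return text.translate(_TR_LOWER).lower()


def tr_upper(text: str) -> str:
    """Upper-case text with Turkish/Azeri handling of i (hypothetical helper)."""
    return text.translate(_TR_UPPER).upper()


print(tr_lower("TIT") == "t\u0131t")   # down-cased with a dotless i
print(tr_upper("ice") == "\u0130CE")   # up-cased with a dotted capital I
```

Unlike the default mapping, both small letters survive a round trip through upper case here, because each has its own capital counterpart.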
If we look at the repertoire of Latin code points for the root zone, as proposed by the Latin Generation Panel, Latin Small Letter I and Latin Small Letter Dotless I are included, whereas the capital letters are excluded. Capital letters are not even valid in IDNA2008, so the question is whether the case relationship described here is a problem, or even relevant.
Before IDNA2008, there was IDNA2003. Even though IDNA2003 has been replaced by IDNA2008, it is still implemented. For example, the web browser Google Chrome to date remains IDNA2003 compliant but not fully IDNA2008 compliant. In IDNA2003, domain names go through a pre-processing step, normalization, before conversion to Punycode. That normalization includes down casing of Latin characters. For ASCII labels there is already an equivalence between upper case and lower case letters. This is what users, based on decades of experience, expect to happen.
In an IDNA2003-compliant web browser it is expected that "EXÄMPEL" and "EXAMPLE" are equivalent to "exämpel" and "example", respectively. In an IDNA2008 browser "EXAMPLE" must be accepted, but "EXÄMPEL" could be rejected since "Ä" is not valid. That is not, however, how e.g. Mozilla Firefox and Apple Safari have been designed to handle the problem: they also down case before the formal IDNA2008 process.
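The IDNA2003 behaviour described above can be tried with Python's built-in "idna" codec, which implements the IDNA2003 ToASCII operation (RFC 3490) including the Nameprep case mapping. This is only a sketch of the mapping step, not of any browser's implementation; the function name `to_a_label_idna2003` is hypothetical.

```python
def to_a_label_idna2003(label: str) -> bytes:
    """Convert a label with Python's built-in IDNA2003 codec (hypothetical wrapper)."""
    return label.encode("idna")


upper = to_a_label_idna2003("EX\u00c4MPEL")  # upper case, with A-diaeresis
lower = to_a_label_idna2003("ex\u00e4mpel")  # lower case, with a-diaeresis

# Nameprep down-cases the non-ASCII label, so both forms give the same A-label.
print(upper == lower)
print(upper.startswith(b"xn--"))
```

Pure ASCII labels are not modified by the codec; their case insensitivity comes from DNS matching rules rather than from the IDNA mapping.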
Even though down casing is not part of the formal IDNA2008 process, one of the IDNA2008 documents, RFC 5894, states that the user interface of an application, before IDNA2008 processing, can do normalization. The down casing in IDNA2008 browsers should probably be seen in that light.
While "TÄT" will probably be down cased to "tät" in the browser, what should the browser do with "TIT"? Depending on the locale that the browser is running in, it may be down cased to either "tit" or "tıt" (with or without the dot).
The casing, in an application, is expected to go in one direction, from upper case to lower case. When domain names are presented in text, however, it is common that domain names are presented in upper or mixed case. So "ice" might become "Ice" or "İce".
It is quite obvious from the text above that a case shift of dotted or dotless I could lead to an erroneous lookup, but the question is how large a threat this would be to users. Since applications are expected to go from upper case to lower case when they handle domain names, we should consider the situation where down casing could result in different lower case letters, i.e. when Latin Capital Letter I is down cased.
With a non-Turkish and non-Azeri locale, a Latin Capital Letter I in a domain name is either down cased to Latin Small Letter I (IDN label) or equivalent to Latin Small Letter I (ASCII label).
With a Turkish or Azeri locale, a Latin Capital Letter I is expected to be down cased to Latin Small Letter Dotless I, but in an ASCII label in a domain name it is still expected to be equivalent to Latin Small Letter I, because that is what the DNS standards say.
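The conflict between the two expectations can be sketched as follows. The functions `dns_ascii_equal` and `turkish_lower` are hypothetical helpers written for this illustration: the first models the case-insensitive comparison of ASCII labels in the DNS, the second models down casing in a Turkish or Azeri locale.

```python
def dns_ascii_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison that is only defined for ASCII labels."""
    return a.isascii() and b.isascii() and a.lower() == b.lower()


def turkish_lower(label: str) -> str:
    """Down-case with Turkish/Azeri rules for I (hypothetical helper)."""
    return label.replace("I", "\u0131").replace("\u0130", "i").lower()


typed = "TIT"

# DNS view: the ASCII label "TIT" is the same label as "tit".
print(dns_ascii_equal(typed, "tit"))

# Turkish/Azeri down casing produces a different, non-ASCII label instead.
lowered = turkish_lower(typed)
print(lowered == "t\u0131t", lowered.isascii())
```

The same typed string therefore resolves to one of two distinct labels, depending on whether the application down cases it with locale rules before lookup.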
There is an obvious risk that, in a Turkish or Azeri locale, the two letters are confused or mistreated due to the case folding, and that this confusion could be misused. To be on the safe side, Latin Small Letter I and Latin Small Letter Dotless I should be variants. Accordingly, the following variant set could be the optimal solution:
Table D.10. Possible Variant Relationships for 0069 and 0131
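Group: Dotless i vs. i

Target Code Point | Target Glyph | Target Name | Source Code Point | Source Glyph | Source Name | Variant Candidate [Yes/No] | Disposition [Allocatable/Blocked] | Rationale
------------------|--------------|-------------|-------------------|--------------|-------------|----------------------------|-----------------------------------|----------
0069 | i | Latin Small Letter I | 0131 | ı | Latin Small Letter Dotless I | YES | Blocked | Risk of confusion due to inconsistent case folding
0131 | ı | Latin Small Letter Dotless I | 0069 | i | Latin Small Letter I | YES | Blocked | Risk of confusion due to inconsistent case folding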
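The effect of a blocked variant relationship between 0069 and 0131 can be illustrated with a small sketch. It is not how the LGR XML is processed by any actual tool; the names `VARIANT_MAP`, `variant_key` and `is_blocked_by` are hypothetical, and the sketch assumes that two labels are variants of each other when they become identical after dotless i is replaced by i.

```python
# Symmetric variant pair from Table D.10, reduced to one canonical form.
VARIANT_MAP = {0x0131: 0x0069}  # dotless i -> i


def variant_key(label: str) -> str:
    """Map a label to a canonical form shared by all of its variants."""
    return label.translate(VARIANT_MAP)


def is_blocked_by(candidate: str, delegated: set) -> bool:
    """True if a different, already delegated label is a variant of candidate."""
    key = variant_key(candidate)
    return any(d != candidate and variant_key(d) == key for d in delegated)


delegated_labels = {"tit"}
print(is_blocked_by("t\u0131t", delegated_labels))  # variant of "tit"
print(is_blocked_by("tat", delegated_labels))       # unrelated label
```

With the disposition "Blocked", only one label of such a pair could ever be delegated, which removes the possibility of exploiting the case folding ambiguity.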