FW: [latin-vip] Loudly spoken opinions - input for today's meeting
I wonder what others think about the ideas in this document posted to the latin team. Apologies to Harald for putting you in the spot.
__ Francisco
On 8/25/11 7:25 AM, "Harald Alvestrand" <harald@alvestrand.no> wrote:
I don't know if I can make today's meeting (conflicts), but I thought I'd write some opinions.
This is written as starkly as possible in order to make it very clear what I think - it may come across as more strident than is really warranted. Others may have different opinions.
Anyway, it's a starting point.
At 07:45 02/09/2011, Francisco Arias wrote:
I wonder what others think about the ideas in this document posted to the latin team. Apologies to Harald for putting you in the spot.
Dear Francis,

I looked at Harald's contribution carefully.

1. I fully support his "DO NOT DO VARIANTS FOR LATIN SCRIPT" position when considering the issues from an internet technology point of view (i.e. IDNA2008, RFC 5890-5894).

2. However, I do not share his limitations when I consider the same issues from an IUI (Intelligent Use Interface) point of view (i.e. from what is exemplified, in a very limited way, by RFC 5895: a smart middle networkware between users' plug and socket, i.e. an intelligent fringe addition). The IESG has decided that the IUI was research, at least in part. The IAB made clear that it did not belong to the IETF area (it is actually multitechnology), but it is active in considering, solely from the internet and DNS PoV, some of the same issues that I am exploring in order to support them through the IUI in a universal perspective.

This is why any ICANN debate or WG like this WG/VIP, IMHO, should:

2.1. refer to the IAB's RFCs and Drafts:
- http://wiki.tools.ietf.org/rfc/rfc5507
- http://wiki.tools.ietf.org/rfc/rfc6055.txt
- http://tools.ietf.org/id/draft-iab-dns-applications-02.txt

2.2. position ICANN in the global semiotic area. In other words: internet domain names are usually names of domains/people, more or less adapted to the internet orthotypography (i.e. the script syntax). Depending on the linguistic and semantic context, there might be an interest or a need for variants, i.e. a pragmatic equivalence between different strings. This means four main needs:
- an algorithm to avoid homography.
- the (partial) support of the linguistic orthotypography within the internet orthotypography.
- the aliasing of certain strings.
- a mechanism to prevent alias/string conflicts.

3. The mission of the IETF (as defined by Harald - RFC 3935) is to influence those who design, use, and manage the Internet (i.e. ourselves) for it to work better. I am an Internet User (IUse). We IUsers were in direct opposition at the WG/IDNABis on three issues, but we were nonetheless able to reach the IDNA2008 consensus as these issues were positively resolved. The issues were:

3.1. the location of the support of variants (mapping) by the DNS. The Unicode/IETF people wanted a limited mapping in the IDNA technology. This broke the neutrality of the Internet and opened a Pandora's box. The charter disfavored mapping. RFC 5895's proposed text addressed the issue.

3.2. orthotypography. I wanted IDNA2008 to acknowledge the orthotypography issue and propose a way to address it. This was not even opposed, it was outright disregarded. The problem that I brought to everyone's attention was the semantic impact of French (and Latin) majuscules. IDNA2008 does not even allude to them. It was not bad that I was defeated on this point because this makes the ML-DNS mandatory (i.e. multilayer/multilingual variant support). I reported why to the IESG, which accepted IDNA2008 as a result (http://tools.ietf.org/id/draft-iucg-afra-reports-00.txt).

3.3. ICANN's role. The initial idea was that post-IDNA2008 users' issues would be addressed by ICANN. I opposed that (I have not experienced, ever, a situation when ICANN would want to represent me!), and ICANN did not indicate interest in spite of my mails to the BoD members and ICANN reps (Tina and Cary) at the WG. Therefore, the idea was forgotten.

3.4. Homographic confusion. We never were able to discuss it.

The result is that the digital support of the universal linguistic diversity of domain naming (i.e. the designation of anything intellectual, operational, industrial, commercial, physical, emotional, etc.) is to be carried by external subsidiarity, in a way that meets IDNA2008 rules when the internet is involved.

4. As a result, one sees that:

4.1. syntax, semantic, pragmatic, scripting and multilinguistic issues are concerned, i.e. every semiotic (at large) aspect. This belongs to what we call the Intersem (Semiotic Internet) layers.

4.2. variants are by their essence related to users and not to the internet technology. This means that the same variants will be required by users of every technology, and that they will be described, analyzed and documented with the same terminology. This means that the context will be, at the end of the day:

4.2.1. either a Semiotic Digital Use Authority coordinating the cooperation of all the semiotic technology oriented endeavors and deciding about a user/digital ecosystem semiotic strategy to which everyone should adhere.

4.2.2. or each of us proceeds independently and trusts that people's use will stabilize the tools and usages.
-- ICANN does it based upon IETF RFCs and its possible solutions. However, I fear ICANN does not have the technological capacity to address the points it raises.
-- IUsers do it based upon ICANN-ICP-3, IETF RFCs and the IUI architecture, and by helping the development of a test technology.
-- Without forgetting the IETF parallel needs on IRIs and protocols (WG/PRECIS).
With a possible further cooperation to build a coordination?

5. What to do now? My suggestion would be that:

5.1. ICANN/VIP should set out the needs that it identifies in terms of variants from a user semiotic PoV, publishing its vision of Users' needs. It should carefully consider the IAB/IETF experience and recommendations, understanding that what the IETF cannot support (cf. Harald) would have to be supported by the ML-DNS within the IUI layers' framework.

5.2. In any case, on our IUser side, we will continue exploring the IUI architecture with plans for three research actions:

5.2.1. semiotic experimentation through Project.FRA, i.e. an open French ontology with the ".fra" TLD using its name space as a commonly drafted taxonomy.

5.2.2. consolidation, experimentation, and support of an IUI technology, including an ML-DNS and the use of the unique virtual root names matrix.

5.2.3. a documentation set of the IUI architecture based on that experimentation and a common multilingual internet terminology.

We all may have our opinions, positions, pet ideas. However, I think this is really a case for the IAB to provide serious guidance, or for a more global architecture to emerge for the whole digital ecosystem, supported by an appropriately extended model.

Best,
jfc
One thing we also have are two words spelled the same, but pronounced differently, that mean different things. Example (in Swedish):

kista [chi:sta] : A suburb of Stockholm, where lots of IT industry is located
kista [chista] : A coffin

That leads to "interesting" race conditions when people want to register domain names.

We also in Sweden have two cities named Habo and Håbo. They have been in a domain dispute about the name habo.se, which was registered correctly by Håbo (before IDN existed). Both lost (they were told to use a shared webpage -- see http://habo.se/).

Patrik

On 2 sep 2011, at 07:45, Francisco Arias wrote:
I wonder what others think about the ideas in this document posted to the latin team. Apologies to Harald for putting you in the spot.
__ Francisco
On 8/25/11 7:25 AM, "Harald Alvestrand" <harald@alvestrand.no> wrote:
I don't know if I can make today's meeting (conflicts), but I thought I'd write some opinions.
This is written as starkly as possible in order to make it very clear what I think - it may come across as more strident than is really warranted. Others may have different opinions.
Anyway, it's a starting point.
<issues-my-notes.doc>
At 09:00 03/09/2011, Patrik Fältström wrote:
One thing we also have are two words spelled the same, but pronounced differently, that mean different things. Example (in Swedish):
kista [chi:sta] : A suburb of Stockholm, where lots of IT industry is located
kista [chista] : A coffin
At 09:01 03/09/2011, Patrik Fältström wrote:
Let me also say explicitly that I support the findings in this document.
In French:
State = Government
state = status

In the roman world at least:
xxxx = word
Xxxx = Family name.

I reviewed again the response of Harald, considering the way he and Patrik seem to understand the ICANN questionnaire as engineers, when I try to understand it as a multilinguist (multilinguistics: the cybernetics of linguistic diversity, i.e. how languages can and will technically coexist, conflict, or mutually assist). I have also had some feedback from the IUCG controverted Glossary effort.

As a result, the main questions seem to be:
1. What is a variant?
2. How does it differ from the ICANN definition of alias?
3. How does ICANN name an IETF alias?
4. How does that relate to the Webster definition of alias (": otherwise called : otherwise known as")?

The difficulty here seems to be "otherwise called/known" by whom? Depending on inputs, it seems that this "who" can be the registry, the protocol, the UI, the IUI, the doxa, the multilinguistic rules or algorithm, ICANN, a global semiotic authority, the TM owners, the towns, the Governments, Courts of Justice, Chambers of Commerce, Academies, etc.

The main source of experience on this should be Fast Track: is there already some information from its real registrations and from registrant/user comments?

jfc

PS. In the ICANN document I would change "concept/idea" into "concept/notion" for clarity's sake.
Quoting Patrik (under a different subject heading):
One thing we also have are two words spelled the same, but pronounced differently, that mean different things.
Example (in Swedish):
kista [chi:sta] : A suburb of Stockholm, where lots of IT industry is located
kista [chista] : A coffin
This is a perfect illustration of the concern that attaches to homographs. The textbook definition of that term is, "two different words in a language that are spelled the same." If we clarify that in the basic terms of our own discussion, we might add, "written with the same sequence of abstract characters and instantiated with the same sequence of glyphs."

The two words Patrik uses are normally disambiguated by an upper-case initial letter in the first of them -- "Kista" -- but that device is not available in IDNA2008. It could otherwise be argued that the upper-case distinction means that "Kista" and "kista" are not true homographs in the textbook sense, but in the discussion of IDNs the extra degree of freedom is useful.

Since a label has no intrinsic attribute of language and there is no protocol restriction on the number of scripts that may appear in it, it is also possible to write "kista" using a Cyrillic, rather than a Latin, final letter. That gives "kistа", and since the CYRILLIC SMALL LETTER A and the LATIN SMALL LETTER A are commonly represented with the same glyph, the Swedish and the hybrid strings are visually identical. Not confusable -- identical.

If we are comfortable in freeing the term "homograph" from the requirement that it applies to words in the same language, and are further willing to drop the requirement of the objects of comparison being words at all, then "kista" and "kistа" may also be termed homographs. This new sense of that term has become deeply entrenched in the discussion of IDNs, but I would like to call it into question. By using it as a general descriptor for several different forms of the variance that we are addressing, we are obscuring pivotal distinctions among them.

I urgently suggest that we expand our descriptive terminology with the term "homoglyph" to designate situations such as the one used in the Cyrillic/Latin illustration above. Two sequences of identical glyphs used to represent different sequences of code points can and do appear in the IDN space. There is no attribute either of script uniformity or language imposed on them. The separate labeling of them as homoglyphs allows for their immediate differentiation from cases where there really is a homographic concern in the accepted textbook sense.

Establishing this distinction may prove of particular utility when focusing on what may be the most urgent issue confronting us. That is the one that arises when two users with identically labeled keyboards, typing the same sequence of abstract characters, producing the same sequence of displayed glyphs, have nonetheless generated two different sequences of code points.

/Cary
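The Cyrillic/Latin illustration can be checked mechanically. The following is a minimal sketch in Python, assuming only the standard unicodedata module; the variable and function names are chosen here for illustration. It shows that the two visually identical strings are different sequences of code points, and that compatibility normalization (NFKC) does not merge them.

```python
import unicodedata

latin_label = "kista"         # five Latin letters
hybrid_label = "kist\u0430"   # final letter is CYRILLIC SMALL LETTER A


def describe(label):
    """Return (code point, Unicode character name) pairs for a label."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in label]


# The strings render alike but do not compare equal.
print(latin_label == hybrid_label)

# Only the last character differs.
print(describe(latin_label)[-1])
print(describe(hybrid_label)[-1])

# Normalization does not unify the two letters either.
print(unicodedata.normalize("NFKC", hybrid_label) == latin_label)
```

Comparing the output of describe() for both labels makes the homoglyph case visible even where the displayed glyphs are indistinguishable.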
At 13:08 03/09/2011, Cary Karp wrote:
I urgently suggest that we expand our descriptive terminology with the term "homoglyph" to designate situations such as the one used in the Cyrillic/Latin illustration above.
No! :-)

Cary, I fully understand your point and I support it. But not at the price of added confusion. Glyphs are definitely out of our scope. Characters are signs that are graphed. The way they are graphed is not an issue for computer protocols and registries. The problem we meet here is that we use Unicode/ISO 10646, which distinguishes between the graphed signs on premises that are related neither to signs nor to graphs.

ISO 10646/Unicode has pros and cons. One of these cons is that it introduces confusion in the use of some signs. To address this "unisoconfusable" characters issue we need an anti-homographic canonicalization algorithm. This algorithm may be based on unigraph (graphic signs) or unisign (general semiotic) tables or correspondences, or on any other idea you might have.

In the current IUse work, we start from 63,000+ 16x16 or 16x8 bitmaps in an Excel table. An immediate sort shows around 10,000 strictly unisoconfusable graphs (same bitmap). Our problem is to find a complete code point description table, fill it with bitmap representations, work on their positioning (for example, all of them locked in one of the four corners and centered), comparabilities from human indications, etc., come up with different tables corresponding to degrees of confusability, and check the results through experimentation in real operations. The confusability algorithm should then be amended from the experimentation inputs.

Once we have obtained this, string confusability should be added through human inputs to IANA. This is why the happiana mailing list is concerned. The resulting registry may be quite important in size (and therefore in terms of traffic), and the registration/validation process will be an industry issue and probably a perpetual battle if confusables are not also displayed in a culturally appropriate manner.

This method should then also be applicable to check logo confusability, etc.

jfc
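The "immediate sort" step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the IUse tooling: the tiny 4x4 bitmaps in BITMAPS are invented stand-ins for real 16x16 glyph data, and the function names are hypothetical. Code points with strictly identical bitmaps are grouped, and each group is mapped to its lowest code point as a canonical representative.

```python
from collections import defaultdict

# Hypothetical toy font: one integer per bitmap row (4x4 instead of 16x16).
BITMAPS = {
    0x0061: (0b0110, 0b0001, 0b0111, 0b0111),  # LATIN SMALL LETTER A
    0x0430: (0b0110, 0b0001, 0b0111, 0b0111),  # CYRILLIC SMALL LETTER A, same bitmap here
    0x0062: (0b1000, 0b1110, 0b1001, 0b1110),  # LATIN SMALL LETTER B
}


def identical_bitmap_groups(bitmaps):
    """Group code points whose bitmaps are strictly identical."""
    groups = defaultdict(list)
    for code_point, bitmap in bitmaps.items():
        groups[bitmap].append(code_point)
    return [sorted(cps) for cps in groups.values() if len(cps) > 1]


def canonical_map(groups):
    """Map every code point in a group to the group's lowest code point."""
    return {cp: group[0] for group in groups for cp in group}


def canonicalize(label, mapping):
    """Replace each character by its canonical representative, if it has one."""
    return "".join(chr(mapping.get(ord(ch), ord(ch))) for ch in label)


groups = identical_bitmap_groups(BITMAPS)
mapping = canonical_map(groups)
print(groups)
print(canonicalize("kist\u0430", mapping) == canonicalize("kista", mapping))
```

Choosing the lowest code point as representative is an arbitrary design choice for the sketch; graded degrees of confusability, as opposed to strict bitmap identity, would need a similarity measure rather than exact grouping.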
On 9/3/11 4:08 AM, "Cary Karp" <ck@nic.museum> wrote:
Since a label has no intrinsic attribute of language and there is no protocol restriction on the number of scripts that may appear in it, it is also possible to write "kista" using a Cyrillic, rather than a Latin, final letter. That gives "kistа", and since the CYRILLIC SMALL LETTER A and the LATIN SMALL LETTER A are commonly represented with the same glyph, the Swedish and the hybrid strings are visually identical. Not confusable -- identical. If we are comfortable in freeing the term "homograph" from the requirement that it applies to words in the same language, and are further willing to drop the requirement of the objects of comparison being words at all, then "kista" and "kistа" may also be termed homographs.
Besides protocol restrictions there could also be policy restrictions. For example, there could be a policy prohibiting TLD labels that have code points from Latin and Cyrillic scripts.

Thoughts?

__ Francisco
Quoting Francisco:
Besides protocol restrictions there could also be policy restrictions. For example, there could be a policy prohibiting TLD labels that have code points from Latin and Cyrillic scripts.
ICANN requires that a gTLD adopts the following policy before freeing it from a general contractual prohibition on accepting IDN registration:

"All code points in a single label will be taken from the same script as determined by the Unicode Standard Annex #24: Script Names <http://www.unicode.org/reports/tr24>. Exceptions to this guideline are permissible for languages with established orthographies and conventions that require the commingled use of multiple scripts. Even in the case of this exception, visually confusable characters from different scripts will not be allowed to co-exist in a single set of permissible code points unless a corresponding policy and character table is clearly defined."

Of course, if there were any language that is so regularly written with mixed Cyrillic and Latin letters that a justifiable TLD label could be proposed with them, root policies might require additional constraint. Asking again -- is there any such everyday writing system? If not, putting this on our agenda is redundant at the very least.

/Cary
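The single-script rule quoted above can be approximated in a few lines. This is a rough Python sketch, not a conforming implementation: the standard library does not expose the UAX #24 Script property, so the sketch assumes that the first word of a character's Unicode name ("LATIN", "CYRILLIC", ...) is an acceptable stand-in for its script. The COMMON set and the function names are likewise assumptions made for illustration.

```python
import unicodedata

# Characters treated as script-neutral in this sketch (digits and hyphen).
COMMON = set("0123456789-")


def scripts_in_label(label):
    """Approximate the set of scripts in a label from Unicode character names.

    Assumption: the first word of the character name stands in for the
    UAX #24 Script property. unicodedata.name() raises ValueError for
    characters that have no name.
    """
    scripts = set()
    for ch in label:
        if ch in COMMON:
            continue
        scripts.add(unicodedata.name(ch).split()[0])
    return scripts


def is_single_script(label):
    """True if all non-neutral characters appear to share one script."""
    return len(scripts_in_label(label)) <= 1


print(is_single_script("kista"))        # all Latin
print(is_single_script("kist\u0430"))   # Latin plus a Cyrillic final letter
print(scripts_in_label("kist\u0430"))
```

A production check would use real Script property data and would also need to handle the exception the policy allows for orthographies that legitimately mix scripts.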
Francisco,

This is a truly existential question for ICANN. This ICANN/WG/VIP can consider itself as being:

1. either an ICANN/WG with the ICANN policy in the background, considering the variants in this sole framework. In such a case, this would only impact the ICANN documents and permit other root management systems to propose solutions that ICANN would in turn not accept. This can be a guarantee of better quality of service that is favorable to the ICANN world, or a set of limitations that will lead TLD candidates to use a system other than the ICANN/NTIA root.

2. or an Internet Users leading WG, trying to "influence (mission of the IETF [RFC 3935] in terms of protocols) those who design, use, and manage the Internet for the IDNA to work better". In that case, the target would be to show ICANN as an open, common sense leader, proposing a naming netiquette that everyone will want to respect. That will be a source of trust for TLD candidates, who will think: ICANN is more serious and secure; if they lead the consensus they will probably also lead the process' quality.

I know that this is a vision of the DNS and naming that is different than ICANN's vision. However, this is the DNS vision that is actually documented by the RFCs. Jon Postel, ICANN, and simpler management have all led to a constrained reading of these RFCs and of the resulting Internet architectural deployment. IDNA2008's consensus was reached because its RFCs do not constrain the true nature of the internet, i.e. subsidiary unique virtual root and its distributed structure.

The true issue is not to sell TLDs. The true issue is a stable, technically sound, legally acceptable, operationally proficient and conflict free set of systems/services permitting one to freely operate their root name/TLD (software, machine, control, registration, new services, lawyers, etc.) on a VPN, an externet (e.g. class supported vision of the Internet), the entire Internet, or throughout the whole digital ecosystem.

When ICANN decides to sell a TLD for $185,000, it must correctly understand what it is selling and the resulting presentation strategy that it must adopt. What ICANN is selling at $185,000 is not a TLD that can be freely operated on the Internet. It is selling TLDs that have been technically, financially, and legally screened, stamped and sponsored by ICANN and the USG. There are people and corporations that are obviously ready, for many reasons, to pay that $185,000 for the ICANN label on their TLD. However, there are many more who are not willing to or cannot pay ICANN when cheaper commercial offers and free FLOSS systems are available. The entire necessary program set, except for the installation and some NIC management SQLite routines, is already in operation - no big deal for some investor to take over the name space, or for Google to do so using its public DNS service (IPs 8.8.8.8 and 8.8.4.4).

ICANN must not confuse its TLD business plan and the decline of the wrong open root solution. Open roots propose (sell) alternative domain names. Here, what is at stake is the market acceptance of crossing the 300 TLDs threshold. Fewer than 300 TLDs was in some way a stable perception of the Internet domain name system: a few global TLD names to memorize, and one per country. ITLDs and gTLD sales are going to switch usage from a very limited set of known TLDs being used by each user to the common experience that TLDs are just another part of the domain name, with thousands of them.

Either ICANN will influence the acceptance of a certain TLD netiquette that the users will all be familiar with (distrusting the non conformant TLDs) or it will not. In the latter case, TLDs will be totally free form, because no one is able to visually distinguish an ICANN sponsored TLD from a non ICANN sponsored one when reading a URL.

The decision is yours. This is the future of ICANN.

jfc

At 09:17 07/09/2011, Cary Karp wrote:
Quoting Francisco:
Besides protocol restrictions there could also be policy restrictions. For example, there could be a policy prohibiting TLD labels that have code points from Latin and Cyrillic scripts.
ICANN requires that a gTLD adopts the following policy before freeing it from a general contractual prohibition on accepting IDN registration:
"All code points in a single label will be taken from the same script as determined by the Unicode Standard Annex #24: Script Names <http://www.unicode.org/reports/tr24>. Exceptions to this guideline are permissible for languages with established orthographies and conventions that require the commingled use of multiple scripts. Even in the case of this exception, visually confusable characters from different scripts will not be allowed to co-exist in a single set of permissible code points unless a corresponding policy and character table is clearly defined."
Of course, if there were any language that is so regularly written with mixed Cyrillic and Latin letters that a justifiable TLD label could be proposed with them, root policies might require additional constraint. Asking again -- is there any such everyday writing system? If not, putting this on our agenda is redundant at the very least.
/Cary
Let me also say explicitly that I support the findings in this document.

Patrik

On 2 sep 2011, at 07:45, Francisco Arias wrote:
I wonder what others think about the ideas in this document posted to the latin team. Apologies to Harald for putting you in the spot.
__ Francisco
On 8/25/11 7:25 AM, "Harald Alvestrand" <harald@alvestrand.no> wrote:
I don't know if I can make today's meeting (conflicts), but I thought I'd write some opinions.
This is written as starkly as possible in order to make it very clear what I think - it may come across as more strident than is really warranted. Others may have different opinions.
Anyway, it's a starting point.
<issues-my-notes.doc>
Quoting Francisco:
I wonder what others think about the ideas in this document posted to the latin team. Apologies to Harald for putting you in the spot.
I also agree with Harald’s emphatic conclusion. There is, however, a single case in the Latin code point repertoire where different abstract characters at separate code points are commonly represented with the same glyph, thus properly requiring special consideration in the VIP study. This is:

U+01DD ; ǝ # LATIN SMALL LETTER TURNED E
U+0259 ; ə # LATIN SMALL LETTER SCHWA

Although no everyday writing system that includes one of them also includes the other, since they are both PVALID it is to be expected that some registries will support them both. (Remember that gTLDs rarely have any specific language nexus and serve a polyglot registrant base.) They are, however, conceptually distinct and cannot be treated as variants of each other in any algorithmic (or, for that matter, linguistic) sense.

To be sure, this necessitates particularly rigorous consideration before adding either of them to a local repertoire of supported code points. That need can be flagged in any exhaustive tabulation of PVALID code points for the Latin script. This would require an informational annotation, and likely some form of narrative recommendation, but there is no stronger "variant" issue that needs to be accommodated.

The set of counterarguments that can be typified by, "Yes, but I sure think that 'æ' and 'ae' are confusingly similar and therefore expect that other users would, as well", all require contextual modulation. The 'æ' is an atomic element of the Norwegian alphabet (and others) and any suggestion that the VIP study might make of need for its canonical equation to 'ae' would be met with a dismissive giggle at best, and taken as sheer cultural effrontery at worst. In either case, it would risk our output being disregarded.

/Cary
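Both points can be illustrated with the Unicode character data itself. The following minimal Python sketch, assuming only the standard unicodedata module (the variable and helper names are chosen here for illustration), shows that the turned e and the schwa are separately named code points that normalization does not unify, and that 'æ' is likewise not equated with 'ae' by normalization.

```python
import unicodedata

turned_e = "\u01dd"   # LATIN SMALL LETTER TURNED E
schwa = "\u0259"      # LATIN SMALL LETTER SCHWA
ae = "\u00e6"         # LATIN SMALL LETTER AE


def nfkc(text):
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# Each character has its own code point and its own name.
for ch in (turned_e, schwa, ae):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Normalization treats the two look-alikes as distinct characters.
print(nfkc(turned_e) == nfkc(schwa))

# 'æ' is atomic: it does not normalize to the two-letter sequence "ae".
print(nfkc(ae) == "ae")
```

Any equivalence between these characters would therefore have to come from an explicit registry table or policy, not from the character data.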
This is a PS to my previous note in this thread and I hope it's spotted before anyone pounces on me for the slip I'm now correcting --
'æ' is an atomic element of the Norwegian alphabet (and others) and any suggestion that the VIP study might make of need for its canonical equation to 'ae' would be met with a dismissive giggle at best, and taken as sheer cultural effrontery at worst. In either case, it would risk our output being disregarded.
The reaction would be within the language communities to which the distinction is important. I'm not predicting global response to this particular case, although corresponding risk does exist elsewhere and could be similarly illustrated.

/Cary
participants (4):
- Cary Karp
- Francisco Arias
- JFC Morfin
- Patrik Fältström