Ram,

Agreed. Indeed, we quite deliberately made absolutely no attempt to define a machine-readable format when the idea of these tables was introduced. One reason was that we saw no chance of getting consensus on a format, partially for the reasons Ram identifies. But the more important reason was that we didn't want to encourage anyone to casually import a table without studying it and, indeed, having enough script (and language?) expertise around to be sure that they understood it. I think many of us can remember the degree of upset and concern that arose when one registry decided to allow a script that they didn't understand well because they knew that a language that used the script was used in that country. It is not clear whether importing a table that was equally poorly understood would make that situation worse, but it seems to me it would be very unlikely to make it better.

Jaap's recollection is correct and a machine-readable form would be useful in some cases. However, the risk of mindless import of tables actually creates a tradeoff between "...Having a machine readable format that allows the tables to be imported and repurposed aids this greatly" and the possibility of real harm. My sense would be that, especially if the intent is to capture not just character lists but variant concepts, we would be much better off with a list of characters and explanations.

I believe that the character lists and explanations that we would be likely to see fall into three categories:

(1) Rather short lists, under a hundred or so characters. This would be the case for most alphabetic scripts, including Latin. It is not clear that a standardized table format helps with that case because anyone trying to import a table is really going to need considerable explanation about what is going on (e.g., the ability to represent "oe" as a variant of o-dieresis does not imply that importing that relationship would be wise).
(2) Lists that involve a great deal of complexity because of concerns about look-alike characters, characters that are sometimes used interchangeably, or distinctly different uses (and rendering styles) for the same script. All three of those situations have been illustrated in the ASIWG work on Arabic (some of which is incorporated into the Arabic variant information project team), but I have every confidence that there are other scripts out there with one or more of those issues (some would argue that the Japanese - Chinese situation poses problems not much different from the [Western] Arabic - Perso-Arabic one). The complexity of those situations cannot be easily represented in a simple table, whether in XML or otherwise.

(3) Chinese (Han) script. The tables there are large enough to make automatic import useful, but the CDNC table requires a specialized table because of the paired preferred variants and, as we know, neither Japanese nor Korean requires variant treatment (although they may have other issues). It would make lots of sense to have a standardized machine-readable format for the Chinese use of Han script, but that involves a different set of issues than such a format for other scripts and may or may not be the correct format for a Japanese or Korean table.

As a general comment, it seems to me that people on this mailing list and the technical and script communities have repeatedly told ICANN, its staff, and various "constituencies" that Han script is fundamentally different from the collection of alphabet-phonetic scripts and that trying to treat them as the same just yields one problem after another (for one group or the other). That advice has been consistently ignored in ICANN's efforts, including this one.
I don't know if the reason is determined ignorance (which I certainly would not expect from Kim) or that the advice is politically inconvenient but, from my point of view, every decision made on the basis that they were really the same just makes ICANN look silly and puts the predictability and stability of the Internet at risk. Certainly it would be more convenient if they were really the same -- same variant issues, same order of magnitude of relevant characters, same relationships of "character" to sounds and meanings, and so on -- but the odds of everyone who thinks they would like one being awarded a pony and the wherewithal to keep it are much higher.

best,
   john

--On Wednesday, March 14, 2012 10:45 -0400 Ram Mohan <rmohan@afilias.info> wrote:
Kim,
I am not certain registries would want to use an automated/machine-readable mechanism for importing tables from other registry IDN implementations. Implementations vary widely from one registry to another; each IDN implementation often requires hand-verification of allowed code points combined with business policies.
Let's try an example:
CDNC regularly publishes and updates the set of valid codepoints in the Han script, along with contextual and variant generation rules/guidelines. Publishing that set of rules in this rich format would be useful; however, registries will still need to manually verify both the codepoints and the rules for variant generation, and create business rules for allocation and delegation of applied-for strings.
Feels like a nice to have, not a must have.
-Ram
-------------------------------------------------------------- -------------------
Ram Mohan
(o) +1.215.706.5700 x103 (m) +1.215.431.0958 (f) +1.215.706.5701
rmohan@afilias.info | Skype: gliderpilot30 | Twitter @rmohan123
-------------------------------------------------------------- -------------------
*From:* Kim Davies [mailto:kim.davies@icann.org]
*Sent:* Monday, March 12, 2012 9:24 AM
*To:* Jaap Akkerhuis
*Cc:* idna-update@alvestrand.no; Dillon, Chris; Abdulrahman I. ALGhadir; vip@icann.org
*Subject:* Re: Draft on IDN Tables in XML
Hi Jaap,
On Mar 12, 2012, at 3:34 AM, Jaap Akkerhuis wrote:
As far as I know, the idea of the tables has always been to provide a public central place where the registries could list which characters they support and which they don't in their registry policies. It is a "for your information only" registry and meant for human consumption.
And Kim, do correct me if this isn't the case anymore.
It is certainly correct that the notion of tables is to publicly share registry policy as it relates to code points that are accepted for registration. I think, however, it is a bit beyond merely informational, as a key driver in publishing them has been to allow sharing and re-use by other registries. Having a machine readable format that allows the tables to be imported and repurposed aids this greatly.
One of the reasons I feel this is a good initiative is that an increasing number of tables appear to be published as PDF files with various contextual rules described in paragraphs of normative text. It would be nice to reverse this trend and have a format rich enough that it can express most if not all registry policies in a common way using a set of agreeable primitives.
kim