Draft on IDN Tables in XML
Colleagues,

I have posted a first draft of a format that could be used for representing IDN Tables in XML to the I-D Repository:

https://datatracker.ietf.org/doc/draft-davies-idntables/

After discussion with a number of folks who felt this would be good work to undertake, I've put together a first cut which is not comprehensive, but which I think goes some way toward a potential format. Unless there is interest in this being a more formal activity, my assumption is to aim to publish the final result independently as an Informational RFC. However, the mechanism of publication is secondary to coming up with something useful that would benefit TLD registries and other implementors.

A list of design goals, from the document, is as follows:

• MUST be in a format that can be implemented in a reasonably straightforward manner in software;
• The format SHOULD be able to be checked for formatting errors, such that common mistakes can be caught;
• An IDN Table MUST be able to express the set of valid code points that are allowed for registration under a specific zone administrator's policies;
• MUST be able to express computed alternatives to a given domain name based on a one-to-one, or one-to-many relationship. These computed alternatives are commonly known as "IDN variants";
• IDN Variants SHOULD be able to be tagged with specific categories, such that the categories can be used to support registry policy (such as whether to list the computed variant in the zone, or to merely block it from registration);
• IDN Variants MUST be able to be stipulated based on contextual information. For example, specific variants may only be applicable when they follow another specific code point, or when the code point is displayed in a specific presentation form;
• The data contained within the table MUST be unambiguous, such that independent implementations that utilise the contents will arrive at the same results;
• IDN Tables SHOULD be suitable for comparison and re-use, such that one could easily compare the contents of two or more to see the differences, to merge them, and so on.
• As many existing IDN Tables as practicable SHOULD be able to be migrated to the new format with all applicable logic retained.

It is explicitly NOT the goal of this format to:

• Stipulate what code points should be listed in an IDN Table by a zone administrator. What registration policies are used for a particular zone is outside the scope of this memo.
• Stipulate what a consumer of an IDN Table must do when they determine a particular domain is valid or invalid; or arrive at a set of computed IDN variants. IDN Tables are only used to describe rules for computing code points, but do not prescribe how registries and other parties utilise them.

I'd appreciate any feedback.

cheers,
kim
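To make the design goals above a little more concrete, here is a minimal Python sketch of a table that lists permitted code points and one-to-many variant mappings, each tagged with a category. It is not the draft's XML format or its reference code: the names `TABLE`, `is_valid` and `variants`, the category string "block", and the sample mappings are all hypothetical, and contextual rules are not modelled at all.

```python
from itertools import product

# Hypothetical in-memory table (NOT the draft's XML format):
# permitted code points, plus variant mappings of the form
# code point -> [(variant code point, category)].
TABLE = {
    "codepoints": {0x61, 0x62, 0x101},   # a, b, a with macron
    "variants": {
        0x61: [(0x101, "block")],        # a -> a with macron
        0x101: [(0x61, "block")],        # a with macron -> a
    },
}

def is_valid(label, table=TABLE):
    """True if every code point in the label is permitted by the table."""
    return all(ord(ch) in table["codepoints"] for ch in label)

def variants(label, table=TABLE):
    """Return the set of computed variant labels, excluding the label itself.

    The category attached to each mapping is stored but ignored here;
    a registry policy layer would decide what to do with it.
    """
    choices = []
    for ch in label:
        alternatives = [chr(cp) for cp, _category in table["variants"].get(ord(ch), [])]
        choices.append([ch] + alternatives)
    return {"".join(combo) for combo in product(*choices)} - {label}
```

With this toy table, `variants("ab")` yields a single alternative label, while a label with two variant-bearing code points yields three, which illustrates how quickly one-to-many mappings multiply.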
Dear colleagues,

I have been reading this correspondence with Chinese in mind and would like to raise some questions. This is a case where there are two major forms of a script: Simplified Chinese (used, for example, in mainland China and Singapore) and Traditional Chinese (used elsewhere). Many characters are the same everywhere, but a subset of characters have been abbreviated in Simplified Chinese. There is an additional complication in the form of a small number of characters that are only used in particular places, e.g. in Hong Kong (for Cantonese) or Singapore. What would be the best way to include those?

In the RFC3743-style tables at http://www.iana.org/domains/idn-tables/ typically Simplified Chinese Preferred Variants and Traditional Chinese Preferred Variants have their own columns. http://tools.ietf.org/html/rfc5646 gives the following example tags for Chinese; which should be standard for Chinese in this XML-based system?

==
Language subtag plus Script subtag:
zh-Hant (Chinese written using the Traditional Chinese script)
zh-Hans (Chinese written using the Simplified Chinese script)

Extended language subtags and their primary language subtag counterparts:
zh-cmn-Hans-CN (Chinese, Mandarin, Simplified script, as used in China)
cmn-Hans-CN (Mandarin Chinese, Simplified script, as used in China)
zh-yue-HK (Chinese, Cantonese, as used in Hong Kong SAR)
yue-HK (Cantonese Chinese, as used in Hong Kong SAR)

Language-Script-Region:
zh-Hans-CN (Chinese written using the Simplified script as used in mainland China)
==

A problem that many tables share is that one sees only Unicode numbers, no characters, so when humans work with the tables they often need to turn Unicode codes into characters or characters into Unicode codes. Is there any way that the XML could contain both (I think there are Unicode fonts containing nearly all the characters)?

I would be grateful for the answers to any or all of these questions.

Regards,
Chris Dillon.
== Research Associate in Linguistic Computing, Dept of Information Studies, UCL, Gower St, London WC1E 6BT Tel +44 20 7679 1599 (int 31599) ucl.ac.uk/dis/people/chrisdillon
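For readers comparing the example tags in the message above, the following is a rough Python sketch of how such tags break down into subtags. It is an assumption-laden simplification, not a conforming RFC 5646 parser: the function name `split_language_tag` is hypothetical, and variants, extensions, private-use and grandfathered tags are deliberately ignored.

```python
def split_language_tag(tag):
    """Split a simple language tag into language, extlang, script, region.

    Simplified sketch: subtags are classified only by length and by
    whether they are letters or digits.
    """
    subtags = tag.split("-")
    parts = {"language": subtags[0].lower(), "extlang": None,
             "script": None, "region": None}
    for subtag in subtags[1:]:
        if (len(subtag) == 3 and subtag.isalpha()
                and parts["script"] is None and parts["region"] is None):
            parts["extlang"] = subtag.lower()      # e.g. cmn, yue
        elif len(subtag) == 4 and subtag.isalpha():
            parts["script"] = subtag.title()       # e.g. Hans, Hant
        elif (len(subtag) == 2 and subtag.isalpha()) or \
             (len(subtag) == 3 and subtag.isdigit()):
            parts["region"] = subtag.upper()       # e.g. CN, HK
    return parts
```

Applied to `zh-cmn-Hans-CN`, this separates the primary language `zh`, the extended language `cmn`, the script `Hans` and the region `CN`; a bare `zh` leaves everything but the language unset.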
Hi Chris,

On Mar 5, 2012, at 4:22 AM, Dillon, Chris wrote:

> In the RFC3743-style tables at http://www.iana.org/domains/idn-tables/ typically Simplified Chinese Preferred Variants and Traditional Chinese Preferred Variants have their own columns. http://tools.ietf.org/html/rfc5646 gives the following example tags for Chinese; which should be standard for Chinese in this XML-based system?

I would assume simply "zh" would be sufficient. It is not a requirement to stipulate the script in a language tag. Also, the entire tag is discretionary: if, for example, you created a fictitious table that had no bearing on any specific language or script, you would not be required to specify one.

> A problem that many tables share is that one sees only Unicode numbers, no characters, and so when humans work with the tables, they often need to turn Unicode codes into characters or characters into Unicode codes. Is there any way that the XML could contain both (I think there are Unicode fonts containing nearly all the characters)?

Creating a tool that takes the code points and turns them into something readable should be a trivial exercise, precisely because of the standardised format. I think it would be best to avoid superfluous descriptions of the individual codepoints in the spec itself, and would rather encourage tools that present the XML file in such a way as to be readable (as a web page, etc.).

For example, I can print human-readable representations from the XML table as follows very simply:

kim@gumleaf:idntables[master*]$ python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import idntables, unicodedata
>>> table = idntables.load("samples/nz_Latn_1.0.xml")
>>> for char in sorted(table._codepoints):
...     print "%s [U+%04X] %s" % (unichr(char), char, unicodedata.name(unichr(char)))
...
0 [U+0030] DIGIT ZERO
1 [U+0031] DIGIT ONE
2 [U+0032] DIGIT TWO
3 [U+0033] DIGIT THREE
4 [U+0034] DIGIT FOUR
5 [U+0035] DIGIT FIVE
6 [U+0036] DIGIT SIX
7 [U+0037] DIGIT SEVEN
8 [U+0038] DIGIT EIGHT
9 [U+0039] DIGIT NINE
a [U+0061] LATIN SMALL LETTER A
b [U+0062] LATIN SMALL LETTER B
c [U+0063] LATIN SMALL LETTER C
d [U+0064] LATIN SMALL LETTER D
e [U+0065] LATIN SMALL LETTER E
f [U+0066] LATIN SMALL LETTER F
g [U+0067] LATIN SMALL LETTER G
h [U+0068] LATIN SMALL LETTER H
i [U+0069] LATIN SMALL LETTER I
j [U+006A] LATIN SMALL LETTER J
k [U+006B] LATIN SMALL LETTER K
l [U+006C] LATIN SMALL LETTER L
m [U+006D] LATIN SMALL LETTER M
n [U+006E] LATIN SMALL LETTER N
o [U+006F] LATIN SMALL LETTER O
p [U+0070] LATIN SMALL LETTER P
q [U+0071] LATIN SMALL LETTER Q
r [U+0072] LATIN SMALL LETTER R
s [U+0073] LATIN SMALL LETTER S
t [U+0074] LATIN SMALL LETTER T
u [U+0075] LATIN SMALL LETTER U
v [U+0076] LATIN SMALL LETTER V
w [U+0077] LATIN SMALL LETTER W
x [U+0078] LATIN SMALL LETTER X
y [U+0079] LATIN SMALL LETTER Y
z [U+007A] LATIN SMALL LETTER Z
ā [U+0101] LATIN SMALL LETTER A WITH MACRON
ē [U+0113] LATIN SMALL LETTER E WITH MACRON
ī [U+012B] LATIN SMALL LETTER I WITH MACRON
ō [U+014D] LATIN SMALL LETTER O WITH MACRON
ū [U+016B] LATIN SMALL LETTER U WITH MACRON
kim
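For readers without access to the `idntables` module used in the session above, the same kind of listing can be approximated in Python 3 from a plain collection of code points. This is only a sketch under that assumption: the helper name `describe` and the sample input are hypothetical, and nothing here reads the XML format itself.

```python
import unicodedata

def describe(codepoints):
    """Return one 'char [U+XXXX] NAME' line per code point, in code point order."""
    lines = []
    for cp in sorted(codepoints):
        ch = chr(cp)
        # Fall back to a placeholder for code points without a Unicode name.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append("%s [U+%04X] %s" % (ch, cp, name))
    return lines

# Hypothetical input: a few code points taken from the listing above.
for line in describe([0x0101, 0x0030, 0x0061]):
    print(line)
```

Because the code points are sorted before printing, the output order does not depend on how the table happens to store them.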
At 01:56 07/03/2012, Kim Davies wrote:
Ä [U+0101] LATIN SMALL LETTER A WITH MACRON Ä" [U+0113] LATIN SMALL LETTER E WITH MACRON Ä« [U+012B] LATIN SMALL LETTER I WITH MACRON Å [U+014D] LATIN SMALL LETTER O WITH MACRON Å« [U+016B] LATIN SMALL LETTER U WITH MACRON
kim
Seems there is a problem with the display. jfc
Hi Jean-François, On Mar 6, 2012, at 5:32 PM, JFC Morfin wrote:
At 01:56 07/03/2012, Kim Davies wrote:
Ä™ [U+0101] LATIN SMALL LETTER A WITH MACRON Ä" [U+0113] LATIN SMALL LETTER E WITH MACRON Ä« [U+012B] LATIN SMALL LETTER I WITH MACRON ÅŸ [U+014D] LATIN SMALL LETTER O WITH MACRON Å« [U+016B] LATIN SMALL LETTER U WITH MACRON
kim
Seems there is a problem with the display. jfc
Or your email program. It looks like it treats the text encoding as ISO 8859-1, rather than UTF-8 as it appears in the headers. For example, U+012B is represented as the two octets 0xC4 0xAB in UTF-8, and in ISO 8859-1 0xC4 = Ä and 0xAB = «. kim
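The effect described here can be reproduced in a few lines of Python 3. This is an illustrative sketch only; the helper name `misread_as_latin1` is hypothetical, and it assumes the receiving program decodes the octets strictly as ISO 8859-1.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the resulting octets as ISO 8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")

original = "\u012b"                      # LATIN SMALL LETTER I WITH MACRON
octets = original.encode("utf-8")        # the two octets 0xC4 0xAB
garbled = misread_as_latin1(original)    # one character per octet
# The damage is reversible as long as no octet was dropped or altered.
restored = garbled.encode("iso-8859-1").decode("utf-8")
```

Each UTF-8 octet is shown as its own ISO 8859-1 character, which is why every macron letter in the quoted listing turns into a two-character sequence.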
On 7 mar 2012, at 02:32, JFC Morfin wrote:
At 01:56 07/03/2012, Kim Davies wrote:
Ä™ [U+0101] LATIN SMALL LETTER A WITH MACRON Ä" [U+0113] LATIN SMALL LETTER E WITH MACRON Ä« [U+012B] LATIN SMALL LETTER I WITH MACRON ÅŸ [U+014D] LATIN SMALL LETTER O WITH MACRON Å« [U+016B] LATIN SMALL LETTER U WITH MACRON
kim
Seems there is a problem with the display.
Yes, problem with the display, but it was a correctly encoded email message: Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Patrik
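As a sketch of what a mail client has to do with those two headers, the Python 3 snippet below first undoes the base64 transfer encoding and then decodes the octets with the declared charset. The function name `decode_body` and the one-character payload are hypothetical; a real client would take both values from the parsed message rather than from literals.

```python
import base64

def decode_body(payload, charset):
    """Undo base64 transfer encoding, then decode using the declared charset."""
    return base64.b64decode(payload).decode(charset)

# Hypothetical one-character body: U+012B encoded as UTF-8, then base64.
payload = base64.b64encode("\u012b".encode("utf-8")).decode("ascii")

correct = decode_body(payload, "utf-8")        # honours the declared charset
wrong = decode_body(payload, "iso-8859-1")     # ignores it, as in the garbled quote
```

The transfer encoding step is the same in both cases; only the choice of charset in the final step decides whether the text comes out intact.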
Dear Kim,

Good to hear that there is a tool which will mass-convert from Unicode U+ notation to characters!

(Incidentally, in case anyone is interested in converting ad hoc codes, this is the easiest method I have discovered: To go from U+ notation to character, type the code without U+ in a recent version of Microsoft Word and press Alt-x (at least in the PC version). To go from character to U+ notation, type the character in the Characters field (below Mixed input) on the following website, click Convert and it gives you at least every code known to mankind: http://www.rishida.net/tools/conversion )

In this/these XML table(s), it would be good to require that zh be stipulated, but something is also needed to indicate where Simplified Chinese and Traditional Chinese Preferred Variants are stored and, possibly, the relatively small number of characters that may only be used e.g. in Singapore, in Hong Kong or in Taiwan.

Regards,
Chris.

== Research Associate in Linguistic Computing, Dept of Information Studies, UCL, Gower St, London WC1E 6BT Tel +44 20 7679 1599 (int 31599) ucl.ac.uk/dis/people/chrisdillon
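The ad hoc conversions between U+ notation and characters mentioned in this message can also be scripted. The Python 3 sketch below is one way to do it; the function names `from_u_notation` and `to_u_notation` are hypothetical, and input validation is left out.

```python
def from_u_notation(code):
    """'U+0101' (or just '0101') -> the corresponding character."""
    text = code.strip()
    if text.upper().startswith("U+"):
        text = text[2:]
    return chr(int(text, 16))

def to_u_notation(char):
    """A single character -> 'U+XXXX', padded to at least four hex digits."""
    return "U+%04X" % ord(char)
```

Code points beyond the Basic Multilingual Plane simply come out with five or six hex digits, which matches the usual U+ convention.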
participants (4)
- Dillon, Chris
- JFC Morfin
- Kim Davies
- Patrik Fältström