[lgr] RootLGR questions
Dear IP members and RootLGR staff,

I have two questions regarding RootLGR.

(1) Why is mixing Japanese scripts and the alphabet denied in RootLGR?

During the RootLGR Workshop on Oct 15th, I asked whether a Japanese rule can consist of jpan (hani+hira+kata) + alphabet. The answer from Asmus was "No", but I couldn't get the reason at that time. Would you please give me the reason why jpan + alphabet (latn) can't be mixed?

(2) How does the language tag work in RootLGR?

A language LGR specifies its language tag in the <language> element. How does it work in RootLGR? For example, if CGP defined variants for U+767C and U+73FE as follows:

  <language>und-Hani</language>
  <char cp="767C" tag="sc:Hani">
    <var cp="53D1" type="simp" />
    <var cp="5F42" type="block" />
    <var cp="767C" type="trad" comment="identity" />
    <var cp="9AEA" type="block" />
    <var cp="9AEE" type="block" />
  </char>
  <char cp="73FE" tag="sc:Hani">
    <var cp="73B0" type="simp" />
    <var cp="73FE" type="trad" comment="identity" />
  </char>

and if JGP defined variants for U+7670 and U+73FE as follows:

  <language>und-Jpan</language>
  <char cp="767C" tag="sc:Jpan">
    <var cp="767C" type="alloc" comment="identity" />
    <var cp="767A" type="alloc" />
  </char>
  <char cp="73FE" tag="sc:Jpan">
    <var cp="73FE" type="alloc" comment="identity" />
  </char>

How would the integrated RootLGR look? How does the language tag in RootLGR work to generate variant labels?

I assume that if an applicant applied for U+767C U+73FF as und-Hani, the allocatable variant labels are U+767C U+73FE and U+53D1 U+73B0, and all other variant labels are blocked. On the other hand, I assume that if an applicant applied for U+767C U+73FF as und-Jpan, the allocatable variant labels are U+767C U+73FE and U+767A U+73FE, and all other variant labels are blocked.

I read draft-davies-idntables-08, but I couldn't understand how to merge two LGRs into one LGR, or how to select variants depending on the language tag.

Regards,

--
Yoshiro YONEYA <yoshiro.yoneya@jprs.co.jp>

_______________________________________________
lgr mailing list
lgr@icann.org
https://mm.icann.org/mailman/listinfo/lgr
On 11/5/2014 5:05 PM, Yoshiro YONEYA wrote:
> Dear IP members and RootLGR staff,
>
> I have two questions regarding RootLGR.
>
> (1) Why is mixing Japanese scripts and the alphabet denied in RootLGR?
>
> During the RootLGR Workshop on Oct 15th, I asked whether a Japanese rule can consist of jpan (hani+hira+kata) + alphabet. The answer from Asmus was "No", but I couldn't get the reason at that time. Would you please give me the reason why jpan + alphabet (latn) can't be mixed?

Dear Yoneya-san,

I am discussing the reply to your first question with the other IP members, so let me get back to you separately on that.

Just out of interest, can you give me an example of a label where jpan+latn needs to be mixed?

On your second question, let me try to clarify the issue. I hope I will succeed. Feel free to ask any follow-up questions:

> (2) How does the language tag work in RootLGR?
>
> A language LGR specifies its language tag in the <language> element. How does it work in RootLGR? For example, if CGP defined variants for U+767C and U+73FE as follows:
>
>   <language>und-Hani</language>
>   <char cp="767C" tag="sc:Hani">
>     <var cp="53D1" type="simp" />
>     <var cp="5F42" type="blocked" />
>     <var cp="767C" type="trad" comment="identity" />
>     <var cp="9AEA" type="blocked" />
>     <var cp="9AEE" type="blocked" />
>   </char>
>   <char cp="73FE" tag="sc:Hani">
>     <var cp="73B0" type="simp" />
>     <var cp="73FE" type="trad" comment="identity" />
>   </char>
First, this definition of variants is incomplete. An actual LGR would need to have <char> entries for all the variants, with their own sets of mappings so that the full set of mappings is both symmetric and transitive. So let's assume you left these out for simplicity but that they are, in fact, specified in the LGR for und-Hani, as required.
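To illustrate the requirement, here is a minimal Python sketch of such a completeness check. The dictionary layout and the function name `missing_mappings` are assumptions made for this illustration only; they are not taken from the XML-LGR draft or from any IP tooling. The table holds just the two <char> entries as posted.

```python
# Hypothetical in-memory form of a variant table (an assumption for this
# sketch): each source code point maps to the set of its variant targets.
HANI_AS_POSTED = {
    0x767C: {0x53D1, 0x5F42, 0x767C, 0x9AEA, 0x9AEE},
    0x73FE: {0x73B0, 0x73FE},
}


def missing_mappings(table):
    """Return (source, target) pairs whose absence breaks symmetry or transitivity.

    Reflexive (identity) mappings are ignored. An empty result means the
    table is both symmetric and transitive.
    """
    missing = set()
    for src, targets in table.items():
        for tgt in targets:
            if tgt == src:
                continue
            back = table.get(tgt, set())
            # Symmetry: src -> tgt requires tgt -> src.
            if src not in back:
                missing.add((tgt, src))
            # Transitivity: src -> tgt and tgt -> other require src -> other.
            for other in back:
                if other != src and other not in targets:
                    missing.add((src, other))
    return missing


gaps = missing_mappings(HANI_AS_POSTED)
for src, tgt in sorted(gaps):
    print(f"missing: U+{src:04X} -> U+{tgt:04X}")
```

For the two entries as posted, the check reports five missing reverse mappings (one for each non-identity variant), which is the incompleteness described above. The function reports one round of gaps; after adding them it would have to be run again until nothing is reported.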
> and if JGP defined variants for U+7670 and U+73FE as follows:

I assume this is a typo for U+767C.

>   <language>und-Jpan</language>
>   <char cp="767C" tag="sc:Jpan">
>     <var cp="767C" type="alloc" comment="identity" />
>     <var cp="767A" type="alloc" />
>   </char>
>   <char cp="73FE" tag="sc:Jpan">
>     <var cp="73FE" type="alloc" comment="identity" />
>   </char>

First of all, in the integrated LGR, the full set of mappings would have to be present; that is, there are many mappings of type "block" (or "blocked") that would have to be added because of integration with the Chinese LGR.

  <language>und-Jpan</language>
  <char cp="767C" tag="sc:Jpan">
    <var cp="53D1" type="blocked" />
    <var cp="5F42" type="blocked" />
    <var cp="767C" type="alloc" comment="identity" /> <!-- Jpan -->
    <var cp="767A" type="alloc" /> <!-- Jpan -->
    <var cp="9AEA" type="blocked" />
    <var cp="9AEE" type="blocked" />
  </char>
  <char cp="73FE" tag="sc:Jpan">
    <var cp="73B0" type="blocked" />
    <var cp="73FE" type="alloc" comment="identity" /> <!-- Jpan -->
  </char>

I have marked the Japanese-specific entries with a <!-- Jpan --> comment.

However, if <var cp="767A" type="alloc" /> is added, then the Chinese LGR would have an additional entry:

  <var cp="767A" type="blocked" />

(BTW, in the root zone LGR we will use "blocked" instead of "block", even though the XML-LGR draft uses "block"; they mean the same thing.)
> How would the integrated RootLGR look?

The integrated LGR contains the additional variant mappings needed to

a) make the set transitive and symmetric, and
b) ensure that each "variant cluster" (or set of code points that are mutually variant) is the same in each script LGR.

While the clusters must be the same, the type values can be chosen based on the tag.
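The two requirements can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout, the names `variant_clusters` and `integrate`, and the choice to give every added mapping the type "blocked" are all hypothetical simplifications. In a real LGR the types of added mappings are chosen deliberately, and code points outside a script's repertoire need additional marking that this sketch leaves out.

```python
# Hypothetical in-memory tables (an assumption for this sketch):
# source code point -> {target code point: variant type}.
HANI = {
    0x767C: {0x53D1: "simp", 0x5F42: "blocked", 0x767C: "trad",
             0x9AEA: "blocked", 0x9AEE: "blocked"},
    0x73FE: {0x73B0: "simp", 0x73FE: "trad"},
}
JPAN = {
    0x767C: {0x767C: "alloc", 0x767A: "alloc"},
    0x73FE: {0x73FE: "alloc"},
}


def variant_clusters(*tables):
    """Group code points into clusters of mutually variant code points."""
    parent = {}

    def find(cp):
        parent.setdefault(cp, cp)
        while parent[cp] != cp:
            parent[cp] = parent[parent[cp]]  # path halving
            cp = parent[cp]
        return cp

    for table in tables:
        for src, variants in table.items():
            for tgt in variants:
                parent[find(src)] = find(tgt)

    groups = {}
    for cp in list(parent):
        groups.setdefault(find(cp), set()).add(cp)
    return list(groups.values())


def integrate(table, clusters, default_type="blocked"):
    """Complete a table so that it covers every cluster in full."""
    result = {}
    for cluster in clusters:
        for src in cluster:
            own = table.get(src, {})
            result[src] = {
                tgt: own.get(tgt, default_type)
                for tgt in sorted(cluster)
                if tgt != src or tgt in own  # keep identity only if defined
            }
    return result


CLUSTERS = variant_clusters(HANI, JPAN)
FULL_HANI = integrate(HANI, CLUSTERS)
FULL_JPAN = integrate(JPAN, CLUSTERS)
```

With these inputs, both completed tables cover the same two clusters, the entry for U+767C in `FULL_JPAN` carries the four added "blocked" mappings next to its two "alloc" mappings, and `FULL_HANI` gains a "blocked" mapping from U+767C to U+767A.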
> How does the language tag in RootLGR work to generate variant labels? I assume that if an applicant applied for U+767C U+73FF as und-Hani,

That appears to be a typo for U+767C U+73FE.

> the allocatable variant labels are U+767C U+73FE and U+53D1 U+73B0, and all other variant labels are blocked.

To achieve this, the following <action> elements must be defined in the XML:

  <action disp="blocked" any-variant="blocked" />
  <action disp="allocatable" only-variants="simp both" />
  <action disp="allocatable" only-variants="trad both" />
  <action disp="blocked" any-variant="simp trad" />
  <action disp="allocatable" comment="catch-all" />

The top-most action that matches the condition on variant mappings is the one that will set the disposition.

Any variant label being created from any "blocked" variant will be blocked by the first action.

Any variant label being created from only "simp" (or "both", which we don't have in the example) will be set to allocatable by the second action (U+53D1 U+73B0).

The same applies to "trad" in the third action (U+767C U+73FE).

The fourth action would block mixed labels such as U+767C U+73B0 and U+53D1 U+73FE.

And finally, the last action exists to allow allocatable labels that do not use reflexive (identity) mappings; we don't have them in this example (but see below).
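The first-match evaluation described above can be sketched in Python. The data layout and the names `disposition` and `variant_labels` are assumptions made for this illustration, and the handling of "any-variant" and "only-variants" is a simplified reading of the explanation in this message, not an implementation of the XML-LGR draft.

```python
from itertools import product

# Hypothetical in-memory form of the und-Hani example:
# source code point -> {target code point: variant type}.
HANI = {
    0x767C: {0x53D1: "simp", 0x5F42: "blocked", 0x767C: "trad",
             0x9AEA: "blocked", 0x9AEE: "blocked"},
    0x73FE: {0x73B0: "simp", 0x73FE: "trad"},
}

# (disposition, condition, variant types), in the order given above.
HANI_ACTIONS = [
    ("blocked", "any-variant", {"blocked"}),
    ("allocatable", "only-variants", {"simp", "both"}),
    ("allocatable", "only-variants", {"trad", "both"}),
    ("blocked", "any-variant", {"simp", "trad"}),
    ("allocatable", None, None),  # catch-all
]


def disposition(types_used, actions):
    """Return the disposition set by the first matching action.

    types_used holds the variant type applied at each position of a label,
    or None where a code point was kept without applying any mapping.
    """
    used = [t for t in types_used if t is not None]
    for disp, condition, types in actions:
        if condition is None:
            return disp
        if condition == "any-variant" and any(t in types for t in used):
            return disp
        if (condition == "only-variants" and used
                and len(used) == len(types_used)
                and all(t in types for t in used)):
            return disp
    return None


def variant_labels(label, table, actions):
    """Map every permuted variant label to its disposition."""
    choices = []
    for cp in label:
        options = list(table.get(cp, {}).items())
        if cp not in table.get(cp, {}):
            options.append((cp, None))  # no reflexive mapping defined
        choices.append(options)
    return {
        tuple(cp for cp, _ in combo): disposition([t for _, t in combo], actions)
        for combo in product(*choices)
    }


RESULT = variant_labels((0x767C, 0x73FE), HANI, HANI_ACTIONS)
ALLOCATABLE = {label for label, disp in RESULT.items() if disp == "allocatable"}
```

For the applied-for label U+767C U+73FE, this yields ten permuted labels, of which only U+767C U+73FE and U+53D1 U+73B0 end up allocatable; the mixed and "blocked" permutations are blocked.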
> On the other hand, I assume that if an applicant applied for U+767C U+73FF as und-Jpan,

That appears to be a typo for U+767C U+73FE again.

> the allocatable variant labels are U+767C U+73FE and U+767A U+73FE, and all other variant labels are blocked.

Correct.

In fact, it is not necessary for the Japanese example to use reflexive mappings. As long as there are only two types, meaning "blocked" and "allocatable", these two <action> statements are sufficient:

  <action disp="blocked" any-variant="blocked" />
  <action disp="allocatable" comment="catch-all" />

(The other three actions could be defined, but wouldn't get triggered.)

So the example would simplify to:

  <language>und-Jpan</language>
  <char cp="767C" tag="sc:Jpan">
    <var cp="53D1" type="blocked" />
    <var cp="5F42" type="blocked" />
    <var cp="767A" type="alloc" /> <!-- Jpan -->
    <var cp="9AEA" type="blocked" />
    <var cp="9AEE" type="blocked" />
  </char>
  <char cp="73FE" tag="sc:Jpan">
    <var cp="73B0" type="blocked" />
  </char>
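As a small Python sketch of the simplified Japanese case: the table below has no reflexive mappings, so a code point that is kept as-is contributes no variant type and the label falls through to the catch-all. The names `jpan_disposition` and `jpan_variant_labels` and the dictionary layout are assumptions made for this illustration.

```python
from itertools import product

# Hypothetical in-memory form of the simplified und-Jpan example:
# source code point -> {target code point: variant type}, no identity entries.
JPAN = {
    0x767C: {0x53D1: "blocked", 0x5F42: "blocked", 0x767A: "alloc",
             0x9AEA: "blocked", 0x9AEE: "blocked"},
    0x73FE: {0x73B0: "blocked"},
}


def jpan_disposition(types_used):
    """Apply the two actions in order; the first match wins."""
    # <action disp="blocked" any-variant="blocked" />
    if "blocked" in types_used:
        return "blocked"
    # <action disp="allocatable" comment="catch-all" />
    return "allocatable"


def jpan_variant_labels(label):
    """Map every permuted variant label to its disposition."""
    # At each position: keep the code point (type None) or apply one mapping.
    choices = [[(cp, None)] + sorted(JPAN.get(cp, {}).items()) for cp in label]
    return {
        tuple(cp for cp, _ in combo): jpan_disposition([t for _, t in combo])
        for combo in product(*choices)
    }


RESULT = jpan_variant_labels((0x767C, 0x73FE))
ALLOCATABLE = {label for label, disp in RESULT.items() if disp == "allocatable"}
```

Applied to U+767C U+73FE, the sketch produces twelve permuted labels, with U+767C U+73FE and U+767A U+73FE allocatable and the rest blocked, in line with the expectation confirmed above.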
> I read draft-davies-idntables-08, but I couldn't understand how to merge two LGRs into one LGR, or how to select variants depending on the language tag.

If you assume that it is possible to create a single XML file for the *integrated* LGR, it will be hard to understand. That is because it is indeed not possible to create such a single, integrated file: in the XML, the selection by <language> tag is a per-file issue.

Instead, what the IP will produce is a set of files that are guaranteed to produce consistent results.

Each application will be processed against a single file (selected by <language> tag, such as "und-Jpan"). The files will contain the full variant clusters so that the full permutation of variant labels can be generated. This full set of permuted labels can then be checked against all existing registrations (no matter whether they were registered under "und-Jpan" or under "und-Hani"). If no conflicts are found, the application can proceed to the next stage and the subset of allocatable variants will be determined.

If a label U+767A U+73FE has been allocated (which is possible under und-Jpan), then under und-Hani it is not possible to apply for U+767C U+73FE or U+53D1 U+73B0.

By ensuring that all files in the LGR are mutually consistent, one only needs a single file to process the application.

The consistent files need the additional "blocked" variant mappings, but they also need entries for the target code points of these mappings. Where such code points are not part of the repertoire, they are identified with special reflexive variant mappings. For example, assume 53D1, as a simplified ideograph, is not part of the Japanese repertoire.

Because of consistency, the LGR for und-Jpan would need this entry:

  <language>und-Jpan</language>
  <char cp="53D1">
    <var cp="53D1" type="out-of-repertoire-var" comment="identity" />
    <var cp="767C" type="blocked" />
    <var cp="5F42" type="blocked" />
    .....
  </char>

The MSR already contains a default action

  <action disposition="invalid" any-variant="out-of-repertoire-var" comment="any variant label with a code point out of repertoire is invalid"/>

which would eliminate any original label containing U+53D1, so that one could not apply for U+53D1 U+73FE, for example, in the "und-Jpan" case.

Anyway, this is the technical meat of creating the set of mutually consistent files for integration.

A./
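The processing steps described above can be sketched in Python. Everything here is an assumption made for illustration: the cluster data is taken from the example in this thread, the two repertoires are hypothetical, and the out-of-repertoire test is a plain set-membership check standing in for the reflexive "out-of-repertoire-var" mapping and its action.

```python
from itertools import product

# Variant clusters shared by all files (taken from the example in this thread).
CLUSTERS = [
    {0x767C, 0x767A, 0x53D1, 0x5F42, 0x9AEA, 0x9AEE},
    {0x73FE, 0x73B0},
]
CLUSTER_OF = {cp: cluster for cluster in CLUSTERS for cp in cluster}

# Hypothetical repertoires, for illustration only.
JPAN_REPERTOIRE = {0x767C, 0x767A, 0x73FE}
HANI_REPERTOIRE = {0x767C, 0x53D1, 0x73FE, 0x73B0}


def permuted_labels(label):
    """Return every label reachable by swapping code points within clusters."""
    options = [sorted(CLUSTER_OF.get(cp, {cp})) for cp in label]
    return set(product(*options))


def check_application(label, repertoire, registered):
    """Classify an applied-for label as 'invalid', 'conflict', or 'proceed'."""
    # Stand-in for the out-of-repertoire-var action: reject original labels
    # that contain a code point outside the selected file's repertoire.
    if any(cp not in repertoire for cp in label):
        return "invalid"
    # The full permutation is checked against all registrations, regardless
    # of the language tag under which they were registered.
    if permuted_labels(label) & registered:
        return "conflict"
    return "proceed"


# U+767A U+73FE was allocated earlier under und-Jpan.
REGISTERED = {(0x767A, 0x73FE)}
```

With this registration in place, `check_application((0x767C, 0x73FE), HANI_REPERTOIRE, REGISTERED)` and the same call for `(0x53D1, 0x73B0)` both report a conflict, while `check_application((0x53D1, 0x73FE), JPAN_REPERTOIRE, set())` reports the label as invalid. Only a label that passes both checks would go on to the step that determines its allocatable variants.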
Asmus-san,
> Just out of interest, can you give me an example of a label where jpan+latn needs to be mixed?

The following are existing Japanese JP domain names that include alphabetic characters:

  JR東日本.jp
  日本IBM.jp
  株式会社NTTドコモ.jp

Some companies use such mixed-script names. I'm not sure whether they want to register such names as TLDs, but I don't have a reason to prohibit it.

I'll reply to your answer to my 2nd question separately.

Regards,

--
Yoshiro YONEYA <yoshiro.yoneya@jprs.co.jp>
On 11/11/2014 1:16 PM, Yoshiro YONEYA wrote:
> Asmus-san,
>
> > Just out of interest, can you give me an example of a label where jpan+latn needs to be mixed?
>
> The following are existing Japanese JP domain names that include alphabetic characters:
>
>   JR東日本.jp
>   日本IBM.jp
>   株式会社NTTドコモ.jp
>
> Some companies use such mixed-script names. I'm not sure whether they want to register such names as TLDs, but I don't have a reason to prohibit it.
>
> I'll reply to your answer to my 2nd question separately.
>
> Regards,

Yoneya-san,

The text in B.3.2 of the Procedure anticipates only two cases where script mixing is allowed. They are:

a) mixing of Common/Inherited with certain scripts, and
b) the mixture of East Asian scripts, such as the Hiragana+Katakana+Kanji trio for "und-Jpan".

Beyond that, including the common "usage" of mixing ASCII in various cultures, no mixing is anticipated. This describes the default position.

The Integration Panel is aware of the Japanese practices using Romaji, but is not taking action towards making an exception for them. The bar for the IP to make any exception to the status quo on script mixing suggested in the Procedure would be very high.

The absence of such an exception does make the root more restrictive than second-level domains; it also affects other cultures' more occasional use of ASCII letters. However, the restriction of not allowing digits in the root already affects some writing systems more severely than the absence of ASCII mixing (disallowing plurals, for example). Overall, this restrictiveness is built firmly into the course of action suggested by the Procedure, and is in turn aligned with the provision in RFC 6912 that zones higher in the DNS tree tend to have more restrictive rules.

That is the best I can do at this point to give you more background on my short reply during the ICANN51 meeting.

Perhaps you can get a further understanding of this by discussing it with Marc Blanchet, who, I believe, is at the same IETF meeting as you.

Cheers,
A./
Asmus-san,

Thank you for your explanation. I also talked with Marc-san about why ASCII mixing is not allowed.

I personally understand the IP's opinion and think it is rational. Perhaps most Japanese TLD applicants will not apply for strings that mix Japanese and ASCII.

On the other hand, there will be an argument about what the criteria are for overcoming the bar for making an exception. RFC 6912 doesn't explicitly prohibit mixed scripts. B.3.2 of the Procedure document also doesn't explicitly provide criteria for permitting mixed scripts.

It would be very much appreciated if the IP declared that ASCII mixing is permitted only for the Latin script. It is very hard to persuade the Japanese community without external evidence.

My apologies for the difficult request.

Best regards,

--
Yoshiro YONEYA <yoshiro.yoneya@jprs.co.jp>
Dear Asmus-san,

Sorry for my delayed response on my second question. I read this thread (QiChao's question and your answers) and understood.

> > I read draft-davies-idntables-08, but I couldn't understand how to merge two LGRs into one LGR, or how to select variants depending on the language tag.
>
> If you assume that it is possible to create a single XML file for the *integrated* LGR, it will be hard to understand. That is because it is indeed not possible to create such a single, integrated file: in the XML, the selection by <language> tag is a per-file issue.
>
> Instead, what the IP will produce is a set of files that are guaranteed to produce consistent results.

Oh, I see that there will be no integrated (single-file) LGR! "Integrated" means "integrated file per language tag"! This was the most confusing point for me.

I understand that each GP has to generate two (phased) LGRs: 1) an LGR which reflects the GP's original repertoire and variants, then 2) an LGR which contains the repertoire and variants that overlap with other GPs'. The IP will check whether 2) is consistent with the other GPs' LGRs, and confirm it as one of the set of files.

The following questions are quoted from your answer to QiChao.

> Do you think that JGP will have technical difficulties converting the second file into the third?

No, I don't. But it depends on coordination among the GPs' second files (especially CGP's and JGP's). Each GP's LGR will affect the others, so the definition of the second file will take a while.

> Do you think it is useful for the Japanese community to see the effect of integration directly expressed like this?

Yes, I do. But I think it is too complicated for the Japanese community to understand, because the community has to understand two different LGRs simultaneously. One is the original Japanese LGR, which reflects Japanese opinion. The other is the merged (integrated) Japanese LGR, which includes the overlap with other GPs' (mostly CGP's) repertoire and variants.

> Do you think it would be useful for JGP to publish both the "core" file (2nd example) and the "full" file (3rd file example)?

Ditto. The "core" file is mandatory to express JGP's concept, and the "full" file is mandatory to express how "und-jpan" works. It will be very difficult to explain the difference between them to the Japanese community...

Best regards,

--
Yoshiro YONEYA <yoshiro.yoneya@jprs.co.jp>