Dear Qichao,
I would say that the tables with compatible variant mapping for C, J and K are a sort of general rule, although the mappings will be more complicated than the tables that describe the mappings only for one language. As you say, the CN table will not be the general rule, and not just because Korean may define variants.
Presuming Japanese does not define variants, relatively minor changes (e.g. adding 国字 [characters like 辻, made in Japan], 栞 etc., which exist in the JP table but not in CN) could be made to make the system work slightly better for Japanese.
I think if .XX壱 is delegated, .XX一 is blocked as a separate TLD, irrespective of where the application is coming from, as the key tables are the compatible ones.
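The first come, first served behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variant set is the 一/壱/壹/弌 example from this thread, and `VARIANT_SETS`, `label_key` and `Registry` are hypothetical names, not part of any LGR tool.

```python
# Hypothetical compatible variant set, taken from the 一/壱/壹/弌 example in this thread.
VARIANT_SETS = [{"一", "壱", "壹", "弌"}]


def canonical(ch):
    """Map a character to a stable representative of its variant set."""
    for variant_set in VARIANT_SETS:
        if ch in variant_set:
            return min(variant_set)  # any deterministic representative works
    return ch


def label_key(label):
    """Two labels share a key exactly when they are variants of each other."""
    return "".join(canonical(ch) for ch in label)


class Registry:
    """Toy registry: the first label delegated in a variant set blocks the rest."""

    def __init__(self):
        self._delegated = {}  # label key -> label that was delegated first

    def apply(self, label):
        key = label_key(label)
        if key in self._delegated:
            return "blocked"
        self._delegated[key] = label
        return "delegated"


registry = Registry()
first = registry.apply("XX壱")   # applied for first, so it is delegated
second = registry.apply("XX一")  # variant of an already delegated label
```

In this sketch the outcome does not depend on which language community applies, which is the point being made about the compatible tables.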
I’m also looking forward to discussing these things in Singapore!
Regards,
Chris.
--
Research Associate in
Linguistic Computing, Centre for Digital Humanities, UCL, Gower St, London WC1E
6BT Tel +44 20 7679 1599 (int 31599) www.ucl.ac.uk/dis/people/chrisdillon
From: 齐超 [mailto:qichao@cnnic.cn]
Sent: 13 January 2015 08:16
To: Dillon, Chris; Sarmad Hussain
Cc: Yoshiro YONEYA; ChineseGP@icann.org; hotta@jprs.co.jp
Subject: Re: RE: [ChineseGP] proposal to eliminate the divergence between us
Dear Chris,
Thank you for your feedback.
I don't think the CN rule will become the general rule, especially if the 'rule' is the Label Generation Rule for C, J or K.
'Variant mapping' is a format for the LGR tool, not a general rule for allocated/blocked dispositions.
C, J and K may each define their own rules for allocated/blocked. Variants in language CN may not block labels with variants for language KR or JP.
As I understand it, if a Chinese user has applied for the TLD '.XX壱', then '.XX一' is blocked for other Chinese applicants; but if a Japanese user has applied for the TLD '.XX壱', whether '.XX一' would be blocked for Japanese users is determined by the rule for language JP.
So J and K can define their rules separately while accepting a unified variant mapping format.
All of the above is my understanding, and it should be clarified by the IP.
齐超
From: Dillon, Chris
Sent: Friday, 9 January 2015, 8:16 PM
To: 齐超; Sarmad Hussain
Subject: RE: Re: [ChineseGP] proposal to eliminate the divergence between us
Dear Qichao,
Effectively the CN rule becomes a general rule once the mappings are made compatible. Variants in any language LGR may block labels even if they come from an LGR where they would not be blocked.
I like that particular example especially, as it indicates a first come first served situation where a label including a low frequency character (and which is not a variant in all the language LGRs) blocks a label with a high frequency character.
Again this is presuming the “each language has a descriptive LGR of its own situation (with or without variant mappings) and then another LGR (with possibly adopted* variant mappings) which is compatible with other LGRs for the script” model.
*I have been using the word “inherited” for this, but “adopted” (i.e. from other languages using the script) may be better.
Regards,
Chris.
From: 齐超 [mailto:qichao@cnnic.cn]
Sent: 09 January 2015 03:21
To: Dillon, Chris; Sarmad Hussain
Cc: Yoshiro YONEYA; ChineseGP@icann.org; hotta@jprs.co.jp
Subject: Re: Re: [ChineseGP] proposal to eliminate the divergence between us
Dear Chris,
I agree with your understanding: if there is a variant mapping in any language, the mapping will exist in all languages. So the mappings are formally unified, but C, J and K could still apply different rules for the allocated/blocked case.
Your case is a nice example, but it contains a minor error: the situation 'where .XX壱 was allocated, disallowing the separate allocation of .XX一' is a rule only for language tag CN, while '.XX壱' and '.XX一' could both be allocated for language tag JP or KR; these have a mapping similar to that of tag CN but a different disposition rule.
Maybe my understanding is wrong, and I hope the IP can give a common example for C, J and K.
Best Regards,
齐超
From: Dillon, Chris
Sent: Thursday, 8 January 2015, 9:04 PM
To: Sarmad Hussain
Subject: Re: [ChineseGP] proposal to eliminate the divergence between us
Dear Sarmad,
That is certainly an example of what I mean. The possible issue with CJK is that there will be many such examples. We will only know how many when we have tables for SC, TC, J and K.
The CJK tables with compatible mappings* (i.e. not the tables which describe the situation in SC, TC, J and K) will probably end up with:
“If it’s a variant in any language, it will be a variant in the tables with compatible mappings*.”
*If my understanding is right, there will be no single amalgamated table.
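One way to read the "variant in any language, variant in the compatible tables" rule is as a merge of the per-language variant sets, where sets that share a character are joined. The Python sketch below shows that reading with a small union-find. The table contents are illustrative assumptions only (the SC set echoes the 髮/發/发 example in this thread, and the singleton "J" sets stand for a language that defines no variants), and `merge_variant_sets` is a hypothetical helper.

```python
def merge_variant_sets(tables):
    """Merge per-language variant sets; sets sharing a character are joined."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for variant_sets in tables.values():
        for variant_set in variant_sets:
            members = sorted(variant_set)
            find(members[0])  # register singletons too
            for other in members[1:]:
                union(members[0], other)

    groups = {}
    for ch in list(parent):
        groups.setdefault(find(ch), set()).add(ch)
    return list(groups.values())


# Illustrative input only, not real table data.
tables = {
    "SC": [{"发", "髮", "發"}],
    "J": [{"髮"}, {"發"}, {"辻"}],
}
merged = merge_variant_sets(tables)
```

Here 髮 and 發 end up in one compatible set because one language treats them as variants, even though the other language lists them separately.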
From a Japanese (Traditional Chinese and Korean) perspective, it is weird to have, for example, 髮 ‘hair’ and 發 ‘send’ as variants (as they are both 发 in Simplified Chinese) and on slide 8 of Professor Kim’s presentation, there was a possible situation where .XX壱 was allocated, disallowing the separate allocation of .XX一, although 一 is a much more basic character. However, I am yet to discover cases (and there would need to be quite a few of them) which could be party-stoppers.
The Procedure is very well made and I believe it will be possible to follow it.
Regards,
Chris.
From: Sarmad Hussain [mailto:sarmad.hussain@icann.org]
Sent: 08 January 2015 11:01
To: Dillon, Chris
Cc: ChineseGP@icann.org; Wang Wei; yoshiro.yoneya@jprs.co.jp; hotta@jprs.co.jp
Subject: RE: [ChineseGP] proposal to eliminate the divergence between us
Dear Chris,
If I understand this correctly, here is an example from Arabic script which could be relevant:
U+06A9 (ک) and U+06AA (ڪ) are distinct letters (not variants) in the Sindhi language (see http://www.omniglot.com/writing/sindhi.htm).
However, these code points are considered variants of U+0643 (ك) by other language communities (e.g. Arabic language).
Therefore, they are being considered as variants by the ArabicGP (see Table 3 in https://community.icann.org/download/attachments/47253587/Arabic%20Variant%20Analysis%20for%20LGR%200.8.pdf?version=2&modificationDate=1419700233000&api=v2).
Regards,
Sarmad
From: Dillon, Chris [mailto:c.dillon@ucl.ac.uk]
Sent: Monday, January 05, 2015 3:09 PM
To: Wang Wei; yoshiro.yoneya@jprs.co.jp; hotta@jprs.co.jp
Cc: ChineseGP@icann.org; Sarmad Hussain
Subject: RE: [ChineseGP] proposal to eliminate the divergence between us
Dear colleagues,
新年快樂
明けましておめでとうございます
새 해 福 많이 바드세요
Or: Happy New Year!
I am wondering whether there may be a way of making the proposal below work, without the JGP’s having to define variant sets and mappings (well, only a small number in scenario 2).
Scripts used by many languages, for example Cyrillic and Arabic (I’ll leave out Latin as it is used by so many languages it may cause confusion), may be in a situation where some implementations of the script define variants (cf. SC and TC) and some don’t (cf. Japanese). One possible approach could be that languages which don’t define variants inherit the variant sets and mappings from the languages using the script that do define variants. I’m copying Sarmad in on this email, as this is a phenomenon which may have occurred in the work of one of the other GPs.
I reckon this approach would work for cases 1, 3 and 4 below. (Actually 5 too as long as there are no examples of it…)
That only leaves us with cases in scenario 2 such as 栞 (a variant which only exists in the Japanese table) for which a mapping to 刊 and 刋 would need to be created. For all other cases, the SC/TC mappings would be inherited.
Regards,
Chris.
From: chinesegp-bounces@icann.org [mailto:chinesegp-bounces@icann.org] On Behalf Of Wang Wei
Sent: 29 December 2014 07:54
To: yoshiro.yoneya@jprs.co.jp; hotta@jprs.co.jp
Cc: ChineseGP@icann.org
Subject: [ChineseGP] proposal to eliminate the divergence between us
Dear Yoneya San and Hotta San,
Please kindly accept my belated but best wishes for Christmas and the New Year.
Recently we carried out the following work, which I outline here for your comments:
Every Hanzi in the CGP repertoire belongs to a variant mapping set (the minimum set size is 1, which means there is no variant for the code point) under the current rules borrowed from CDNC; and any Kanji code point in the JGP repertoire may also belong to some variant mapping set (we acknowledge that there are no variants in JPRS practice so far, but we assume that there will be some kind of variant mapping definition in the JGP repertoire).
All the variant mapping sets can be divided into FIVE scenarios:
1. The variant mapping set in JPRS ⊂ the variant mapping set in CDNC
   In CGP:
   愛 611B (0);爱(86),愛(886);愛(0),爱(0);
   爱 7231 (0);爱(86),愛(886);愛(0),爱(0);
   In JGP:
   愛 611B(2,3);611B(2,3);

2. The variant mapping set in CDNC ⊂ the variant mapping set in JPRS
   In CGP:
   刊 520A (0);刊520A(86),刊520A(886);刊(0),刋(0);
   刋 520B (0);刊520A(86),刊520A(886);刊(0),刋(0);
   In JGP*:
   刊 520A(2,3);520A(2,3);
   刋 520B(2,3);520B(2,3);
   栞 681E(2,3);681E(2,3);
   *: this example is ONLY an assumption

3. The variant mapping set in CDNC = the variant mapping set in JPRS
   In CGP:
   一 4E00 (0); 一4E00(86),一4E00(886); 一(0),壱(0),壹(0),弌(0);
   壱 58F1 (0); 壹58F9(86),壹58F9(886); 一(0),壱(0),壹(0),弌(0);
   壹 58F9 (0); 壹58F9(86),壹58F9(886); 一(0),壱(0),壹(0),弌(0);
   弌 5F0C (0); 一4E00(86),一4E00(886); 一(0),壱(0),壹(0),弌(0);
   In JGP*:
   一 4E00(2,3);4E00(2,3);
   壱 58F1(2,3);58F1(2,3);
   壹 58F9(2,3);58F9(2,3);
   弌 5F0C(2,3);5F0C(2,3);
   *: this example is ONLY an assumption

4. The variant mapping set in CDNC ∩ the variant mapping set in JPRS = ∅
   The code point exists UNIQUELY in the JGP table:
   辻 8FBB(2,3);8FBB(2,3);

5. The variant mapping set in CDNC ∩ the variant mapping set in JPRS ≠ ∅, and the variant mapping set in CDNC ≠ the variant mapping set in JPRS
   No specific example so far
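The five scenarios above are plain set relations, so they can be checked mechanically. The following Python sketch is one possible reading, not the method used for the analysis in this message. `classify` is a hypothetical helper, the JGP sets are the assumed ones from the examples, and scenario 5 is taken to mean "overlapping, but neither set contains the other".

```python
def classify(cgp_set, jgp_set):
    """Return the scenario number (1 to 5) for one pair of variant mapping sets."""
    cgp, jgp = set(cgp_set), set(jgp_set)
    if cgp == jgp:
        return 3  # identical sets
    if not (cgp & jgp):
        return 4  # nothing in common, e.g. a code point only in the JGP table
    if jgp < cgp:
        return 1  # JPRS set strictly contained in the CDNC set
    if cgp < jgp:
        return 2  # CDNC set strictly contained in the JPRS set
    return 5      # partial overlap


examples = {
    1: ({"愛", "爱"}, {"愛"}),
    2: ({"刊", "刋"}, {"刊", "刋", "栞"}),
    3: ({"一", "壱", "壹", "弌"}, {"一", "壱", "壹", "弌"}),
    4: (set(), {"辻"}),
    5: ({"B", "C"}, {"A", "B"}),  # placeholders; no real example is known
}
results = {n: classify(cgp, jgp) for n, (cgp, jgp) in examples.items()}
```

Running a check like this over both tables would give the per-scenario counts, provided the JGP variant sets have been decided.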
In the past we discussed the variants problem many times, but mainly in terms of two types: allocatable and blocked. However, we think another type ("out-of-repertoire") in the XML draft, which was recommended in Asmus' mail, may help resolve the conflict between JGP and CGP.
The basic principle is "any variant label with a code point out-of-repertoire is invalid".
We think this “out-of-repertoire” type and the consequent “invalid” action will tremendously decrease the complexity of variant mapping coordination between us.
For scenario 1:
In CGP
愛 611B (0);爱(86),愛(886);愛(0),爱(0);
爱 7231 (0);爱(86),愛(886);愛(0),爱(0);
In JGP
愛 611B(2,3);611B(2,3);
JGP takes 爱 7231 into its variant mapping set, but marks it as “out-of-repertoire” and takes the “invalid” action in the WLG process, which means 爱 7231 will never be generated in labels.
JGP LGR:
<language>und-Jpan</language>
<char cp="611B" tag="sc:Hani">
<var cp="611B" type="alloc" comment="identity" />
<var cp="7231" type="out-of-repertoire-var" /> <!--Hans, JGP should exist.-->
</char>
WLE rules:
<action disp="invalid" any-variant="out-of-repertoire-var"
comment="any variant label with a code point out of repertoire is invalid"/>
<action disp="allocatable" all-variants="alloc" />
CGP LGR:
<language>und-Hani</language>
<char cp="611B" tag="sc:Hani">
<var cp="611B" type="trad" comment="identity" /> <!-- Jpan -->
<var cp="7231" type="simp" />
</char>
<char cp="7231" tag="sc:Hani">
<var cp="611B" type="trad" /> <!-- Jpan -->
<var cp="7231" type="simp" comment="identity" />
</char>
WLE rules:
<action disp="blocked" any-variant="block" />
<action disp="allocatable" only-variants="simp both" />
<action disp="allocatable" only-variants="trad both" />
<action disp="blocked" any-variant="simp trad" />
<action disp="allocatable" comment="catch-all" />
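To make the effect of the JGP rules for scenario 1 concrete, here is a minimal Python sketch of how the two actions could be evaluated in order. It is a simplification under stated assumptions, not the LGR tool's algorithm: `JGP_VARIANTS`, `variant_labels` and `disposition` are hypothetical names, and code points without an entry are treated as having only an identity mapping of type "alloc".

```python
from itertools import product

# Hypothetical in-memory form of the JGP <char>/<var> entries for scenario 1.
JGP_VARIANTS = {
    "愛": [("愛", "alloc"), ("爱", "out-of-repertoire-var")],
}


def variant_labels(label, table):
    """Yield (variant label, set of variant types used to build it)."""
    choices = [table.get(ch, [(ch, "alloc")]) for ch in label]
    for combo in product(*choices):
        text = "".join(cp for cp, _ in combo)
        types = {variant_type for _, variant_type in combo}
        yield text, types


def disposition(types):
    """Apply the two actions in document order; the first match wins."""
    if "out-of-repertoire-var" in types:  # any-variant="out-of-repertoire-var"
        return "invalid"
    if types <= {"alloc"}:                # every variant type is "alloc"
        return "allocatable"
    return "unresolved"                   # no action matched in this sketch


results = {
    text: disposition(types)
    for text, types in variant_labels("愛", JGP_VARIANTS)
}
```

Under these assumptions the original label 愛 stays allocatable, while the variant label containing 爱 resolves to invalid and is never generated for JGP.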
For scenario 2:
In CGP:
刊 520A (0);刊520A(86),刊520A(886);刊(0),刋(0);
刋 520B (0);刊520A(86),刊520A(886);刊(0),刋(0);
In JGP:
刊 520A(2,3);520A(2,3);
刋 520B(2,3);520B(2,3);
栞 681E(2,3);681E(2,3);
Now it is CGP’s turn to take 栞 681E into its variant mapping set, but mark it as “out-of-repertoire” and take the “invalid” action in the WLG process, which means 栞 681E will never be generated in labels.
CGP LGR:
<language>und-Hani</language>
<char cp="520A" tag="sc:Hani">
<var cp="520A" type="both" comment="identity" />
<var cp="520B" type="block" />
<var cp="681E" type="out-of-repertoire-var" /> <!-- Jpan -->
</char>
<char cp="520B" tag="sc:Hani">
<var cp="520A" type="both" />
<var cp="520B" type="block" comment="identity" />
<var cp="681E" type="out-of-repertoire-var" /> <!-- Jpan -->
</char>
<char cp="681E" tag="sc:Hani"> <!-- Jpan -->
<var cp="520A" type="block" />
<var cp="520B" type="block" />
<var cp="681E" type="out-of-repertoire-var" comment="identity"/>
</char>
WLE rules:
<action disp="invalid" any-variant="out-of-repertoire-var"
comment="any variant label with a code point out of repertoire is invalid"/>
<action disp="blocked" any-variant="block" />
<action disp="allocatable" only-variants="simp both" />
<action disp="allocatable" only-variants="trad both" />
<action disp="blocked" any-variant="simp trad" />
<action disp="allocatable" comment="catch-all" />
JGP LGR:
<language>und-Jpan</language>
<char cp="520A" tag="sc:Hani">
<var cp="520A" type="alloc" comment="identity" />
<var cp="520B" type="block" />
<var cp="681E" type="block" />
</char>
<char cp="520B" tag="sc:Hani">
<var cp="520A" type="block" />
<var cp="520B" type="alloc" comment="identity" />
<var cp="681E" type="block" />
</char>
<char cp="681E" tag="sc:Hani">
<var cp="520A" type="block" />
<var cp="520B" type="block" />
<var cp="681E" type="alloc" comment="identity"/>
</char>
WLE rules:
<action disp="blocked" any-variant="block" />
<action disp="allocatable" all-variants="alloc" />
For Scenario 3:
In CGP
一 4E00 (0); 一4E00(86),一4E00(886); 一(0),壱(0),壹(0),弌(0);
壱 58F1 (0); 壹58F9(86),壹58F9(886); 一(0),壱(0),壹(0),弌(0);
壹 58F9 (0); 壹58F9(86),壹58F9(886); 一(0),壱(0),壹(0),弌(0);
弌 5F0C (0); 一4E00(86),一4E00(886); 一(0),壱(0),壹(0),弌(0);
In JGP:
一 4E00(2,3);4E00(2,3);
壱 58F1(2,3);58F1(2,3);
壹 58F9(2,3);58F9(2,3);
弌 5F0C(2,3);5F0C(2,3);
JGP needs to create its own mapping set, including all 4 code points above, and the corresponding rules; otherwise this case falls into scenario 1.
For Scenario 4:
For example, a UNIQUE code point that exists ONLY in the JGP table:
辻 8FBB(2,3);8FBB(2,3);
CGP will probably not include this code point in its repertoire.
No extra work or rules are needed.
For Scenario 5:
Actually, we have not found any code points which fit this scenario.
But the solution will follow scenario 1 or 2, like:
For JGP, “C” will be included but marked as “out-of-repertoire”
For CGP, “A” will be included but marked as “out-of-repertoire”
In conclusion, the “out-of-repertoire” type and the “invalid” action provide us with a conservative and simple way to reach a consensus on the variant mapping and rules.
According to our analysis of the CGP table and the JPRS table:
4983 code points fit Scenario 1
840 code points fit Scenario 3
170 code points fit Scenario 4
Since JGP has not yet decided whether variant relationships exist in the JGP repertoire, we don’t have analytical numbers for scenario 3 and scenario 5. But we believe that the above solution can also be applied to scenarios 3 and 5, no matter what kind of variant mapping JGP produces.
All of the above is our proposal to settle the divergence at minimum cost for both of us.
What do you think about it?
Looking forward to your reply.
Best Regards,
Wei Wang