T/T Call 11 Materials, Unicode discussion

April 4, 2017

Dear T/T IRT,

Please find attached the latest version of our decision tree. You'll note
that, in addition to updating the language in the boxes, the script
discussion has been "greened" to indicate IRT support for the final
provisions we discussed in Copenhagen. There are surely a few details to iron
out on these, but we'll work them out later in the interest of moving on to
the language discussion.

RE: automated script detection: we received some helpful feedback from an
expert on Unicode. I've copy-pasted an abridged version of our discussion
below.

For our call on Thursday, let's work through the language discussion as much
as we can. We can also discuss options should we hit the implementation
obstacles we've been anticipating given this project's complexity and
nuance.

All best,

Brian 

Unicode Discussion
...
Q: 
The IRT has suggested that a minimum or standard version of Unicode should be
mandated for parties using automated script detection. In your expert opinion,
does this sound like a reasonable provision for this aspect of our policy? Do
you have any suggestions or advice to pass to them?
...
A: 
Unlike the detection of a language, determining the script for a run of
characters is relatively straightforward -- as long as the data are words or
names.
...
The vast majority of characters are used with a single script.
Combining marks essentially must always follow another combining mark or a
base letter. Their script is formally "Inherited", that is, the same as that
of the base letter.
Finally, some characters, mostly punctuation, digits and marks, will have a
script value of "Common". These can be used with any script. For your
purposes, you might treat some or all of them as if they had the script of a
preceding character, unless you would like to treat them on their own.
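The rule described above, where Common and Inherited characters take the
script of a preceding character, could be sketched roughly as follows. This
is a minimal illustration and not part of the expert's reply: it assumes each
character has already been tagged with its script value (using the ISO 15924
codes Zyyy for Common and Zinh for Inherited), and the function names
resolve_scripts and script_runs are hypothetical.

```python
from itertools import groupby

COMMON = "Zyyy"     # ISO 15924 code for the Unicode script value Common
INHERITED = "Zinh"  # ISO 15924 code for the Unicode script value Inherited


def resolve_scripts(tagged):
    """Replace Common/Inherited tags with the script of a preceding character.

    `tagged` is a list of (character, script_code) pairs. A Common or
    Inherited character with no explicit-script predecessor keeps its tag.
    """
    resolved = []
    previous = None  # most recent explicit script seen so far
    for char, script in tagged:
        if script in (COMMON, INHERITED):
            if previous is not None:
                script = previous
        else:
            previous = script
        resolved.append((char, script))
    return resolved


def script_runs(tagged):
    """Group the resolved characters into (script_code, text) runs."""
    return [
        (script, "".join(char for char, _ in group))
        for script, group in groupby(resolve_scripts(tagged), key=lambda p: p[1])
    ]


# Hand-tagged input: Latin letters, a combining diaeresis, a hyphen, two Han.
example = [
    ("Z", "Latn"), ("o", "Latn"), ("e", "Latn"), ("\u0308", INHERITED),
    ("-", COMMON), ("\u6771", "Hani"), ("\u4eac", "Hani"),
]
runs = script_runs(example)
```

Here the hyphen is absorbed into the Latin run; treating such characters on
their own, as the reply also allows, would mean leaving their Common tag
untouched.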
Unicode defines a property called "script extension" (Script_Extensions) that, for many
characters, gives a list of all scripts that they are normally used with. This
might allow you to more confidently assign a given script to a "common"
character.
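As a rough illustration of how script extensions might help, the sketch below
assigns a script to a Common character only when a neighbouring character's
script appears in that character's extension list. The table is a hand-entered
one-entry excerpt for illustration (a real implementation would load
ScriptExtensions.txt from the Unicode Character Database), and the function
name assign_common is hypothetical.

```python
# Hand-entered excerpt for illustration only; not the full property data.
SCRIPT_EXTENSIONS = {
    "\u30fc": {"Hira", "Kana"},  # KATAKANA-HIRAGANA PROLONGED SOUND MARK
}


def assign_common(char, neighbour_script):
    """Return a script for a Common character, or None if undecided."""
    extensions = SCRIPT_EXTENSIONS.get(char)
    if extensions is None:
        return None  # no extension data: leave the character as Common
    if len(extensions) == 1:
        return next(iter(extensions))  # only one script it is used with
    if neighbour_script in extensions:
        return neighbour_script  # neighbour confirms one of the candidates
    return None  # neighbour script is not one the character is used with
```

With this table, the prolonged sound mark next to a Katakana character would
be assigned Kana, while next to a Latin character it would stay undecided.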
There might be the question of how fine-grained your script detection needs to
be. Is it on the field level, the word level, or the run level? That would
determine how you handle transitions between script runs.
For versioning, there is no reason for any implementation to not support the
latest version of the Unicode Standard at the time of publication of your
requirements. (Unicode 10.0 is currently in preparation and the data set will
be final around June).
[Failing that, I would pick the latest version supported by key tool/library
vendors, whatever that is.]
I would recommend UAX #24, Unicode Script Property
(http://unicode.org/reports/tr24).
Q: 
What you present does solve many of the cases - addressing how to deal with
common and inherited code points.
A challenge would be to detect Japanese (or Korean) written in Han script,
unless it mixes in Hiragana/Katakana (or Hangul). Otherwise it becomes a
language identification problem (no different from Persian and Urdu both using
Arabic script, or French and Spanish using Latin script), which can be solved by
the many statistical LangID methods. Would this be possible with the
Unicode code point script method?
A: 
There are two definitions of script. The Unicode definition (see UAX#24) does
not separately identify "composite writing systems" as scripts, the way ISO
15924 does.
To know whether some field is in one of the composite writing systems ("Jpan"
or "Kore") would mean detecting combinations of element scripts.
Han only - not enough information, could be Hani, Jpan or Kore
Hangul (with or without Han present) - Kore
Hiragana or Katakana (with or without Han present) - Jpan
For certain Han only fields, there is no way of identifying whether the field
is in Chinese, Korean Hanja, or Japanese Kanji, because the Han characters are
from the shared subset. (Just like the word "also" is a word in more than one
language, albeit with different meanings).
However, if a field contains only Han characters, that field is *compatible*
with being Jpan or Kore. So, if some related fields are specifically Jpan or
Kore, such compatible fields could be assigned a more specific composite
writing system identifier to match.
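The combination rules and the compatibility idea above might be sketched as
follows. This is an illustration under stated assumptions, not the expert's
own method: fields are represented by the set of ISO 15924 element script
codes found in them, the function names are hypothetical, and the handling of
cases the reply does not cover (other scripts present, or kana and Hangul
together) is an assumption.

```python
def classify_field(scripts):
    """Map the element scripts found in one field to "Jpan" or "Kore".

    Returns None when Han alone leaves the answer open, or when the
    combination is not covered by the rules in the reply.
    """
    scripts = set(scripts)
    if not scripts <= {"Hani", "Hira", "Kana", "Hang"}:
        return None  # assumption: other scripts present, leave undecided
    has_kana = bool(scripts & {"Hira", "Kana"})
    has_hangul = "Hang" in scripts
    if has_kana and has_hangul:
        return None  # assumption: kana mixed with Hangul is left undecided
    if has_kana:
        return "Jpan"
    if has_hangul:
        return "Kore"
    return None  # Han only (or empty): could be Hani, Jpan or Kore


def classify_record(fields):
    """Classify related fields, promoting Han-only fields when unambiguous.

    `fields` maps a field name to the set of element scripts it contains.
    """
    labels = {name: classify_field(s) for name, s in fields.items()}
    specific = {label for label in labels.values() if label is not None}
    if len(specific) == 1:
        (target,) = specific
        for name, scripts in fields.items():
            # A Han-only field is compatible with the related fields' system.
            if labels[name] is None and set(scripts) == {"Hani"}:
                labels[name] = target
    return labels


record = classify_record({"name": {"Hani"}, "address": {"Hani", "Kana"}})
```

In this example the Han-only name field is assigned Jpan to match the address
field; without a related Jpan or Kore field it would remain undecided.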
The best approach would depend on the purpose of the script identification.
For labels, or proposed labels, it is possible to use the repertoire defined
for Japanese or Korean label generation rules. Any label that's not within the
given subset would not be compatible with being part of the composite writing
system.

Brian Aitchison
