Dear T/T IRT,
Please find attached the latest version of our decision tree. You'll note that, in addition to updating the language in the boxes, the script discussion has been "greened" to indicate IRT support for the final provisions we discussed in Copenhagen. There are surely a few details to iron out on these, but we'll work them out later in the interest of moving on to the language discussion.
RE: automated script detection: we received some helpful feedback from an expert on Unicode. I've copy-pasted an abridged version of our discussion below.
For our call on Thursday, let's work through the language discussion as much as we can. We can also discuss options should we hit the implementation obstacles we've been anticipating given this project's complexity and nuance.
All best,
Brian
Unicode Discussion
Q: The IRT has suggested that a minimum or standard version of Unicode should be mandated for parties using automated script detection. In your expert opinion, does this sound like a reasonable provision for this aspect of our policy? Do you have any suggestions or advice to pass to them?
A: Unlike the detection of a language, determining the script for a run of characters is relatively straightforward -- as long as the data are words or names.
The vast majority of characters are used with a single script.
Combining marks must essentially always follow a base letter or another combining mark. Their script is formally "Inherited", that is, the same as that of the base letter.
Finally, some characters, mostly punctuation, digits, and marks, will have a script value of "Common". These can be used with any script. For your purposes, you might treat some or all of them as if they had the script of a preceding character, unless you would like to treat them on their own.
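As a rough illustration of the run handling described above, here is a minimal Python sketch. It rests on loud assumptions: Python's standard library does not expose the Unicode Script property, so the hypothetical helper `approx_script` guesses a script from the character name and general category. It treats every combining mark as "Inherited" and every non-letter as "Common", which is a simplification of the real property data. A real implementation would read the Script property from the Unicode Character Database instead. The function names and the prefix table are illustrative only.

```python
import unicodedata

# Hypothetical prefix table: first words of a character name -> script label.
# Assumption: an approximation only, not the real Unicode Script property.
NAME_PREFIXES = {
    "LATIN": "Latin",
    "GREEK": "Greek",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "HEBREW": "Hebrew",
    "CJK ": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "HANGUL": "Hangul",
}


def approx_script(ch):
    """Guess a script label for one character (heuristic, see assumptions)."""
    category = unicodedata.category(ch)
    if category.startswith("M"):
        # Simplification: all combining marks are treated as Inherited.
        return "Inherited"
    if category.startswith("L"):
        name = unicodedata.name(ch, "")
        for prefix, script in NAME_PREFIXES.items():
            if name.startswith(prefix):
                return script
        return "Unknown"
    # Simplification: digits, punctuation, symbols, spaces are all Common.
    return "Common"


def script_runs(text):
    """Split text into (script, substring) runs.

    Common and Inherited characters are attached to the preceding run,
    which is one of the options mentioned in the discussion.
    """
    runs = []
    for ch in text:
        script = approx_script(ch)
        if runs and (script in ("Common", "Inherited") or runs[-1][0] == script):
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


if __name__ == "__main__":
    print(script_runs("abc-123 Привет"))
```

Under these assumptions, a leading "Common" character still forms its own run, because there is no preceding character to borrow a script from. Whether that is acceptable depends on the granularity chosen for detection.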
Unicode defines a property called "Script_Extensions" that, for many characters, gives a list of all scripts that they are normally used with. This might allow you to more confidently assign a given script to a "Common" character.
There might be the question of how fine-grained your script detection needs to be. Is it on the field level, the word level, or the run level? That would determine how you handle transitions between script runs.
For versioning, there is no reason for any implementation to not support the latest version of the Unicode Standard at the time of publication of your requirements. (Unicode 10.0 is currently in preparation and the data set will be final around June).
[Failing that, I would pick the latest version supported by key tool/library vendors, whatever that is.]
I would recommend UAX #24, Unicode Script Property (http://unicode.org/reports/tr24).
Q: What you present does solve many of the cases - addressing how to deal with common and inherited code points.
A challenge would be to detect Japanese (or Korean) written in Han script, unless it mixes in Hiragana/Katakana (or Hangul). Otherwise it becomes a language identification problem (no different from Persian and Urdu both using Arabic script, or French and Spanish both using Latin script), which can be addressed by the many statistical LangID methods. Would this be possible with the Unicode code point script method?
A: There are two definitions of script. The Unicode definition (see UAX#24) does not separately identify "composite writing systems" as scripts, the way ISO 15924 does.
To know whether some field is in one of the composite writing systems ("Jpan" or "Kore") would mean detecting combinations of element scripts.
Han only - not enough information, could be Hani, Jpan or Kore
Hangul (with or without Han present) - Kore
Hiragana or Katakana (with or without Han present) - Jpan
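The three rules above can be sketched as a small Python function. This is only an illustration under stated assumptions: the input is a set of script names already detected in a field, the function name `composite_candidates` is hypothetical, and the handling of cases the rules do not cover (other scripts present, or Hangul mixed with kana) is my own choice to return an empty set.

```python
def composite_candidates(scripts):
    """Return the ISO 15924 codes a field is compatible with.

    `scripts` is a set of script names found in the field. Common and
    Inherited are ignored. An empty set means the three rules do not
    apply (an assumption, not something the rules themselves specify).
    """
    found = set(scripts) - {"Common", "Inherited"}
    has_kana = bool(found & {"Hiragana", "Katakana"})
    has_hangul = "Hangul" in found
    other = found - {"Han", "Hiragana", "Katakana", "Hangul"}

    if other or (has_kana and has_hangul):
        return set()
    if has_hangul:
        return {"Kore"}  # Hangul, with or without Han
    if has_kana:
        return {"Jpan"}  # Hiragana or Katakana, with or without Han
    if "Han" in found:
        return {"Hani", "Jpan", "Kore"}  # Han only: not enough information
    return set()


if __name__ == "__main__":
    print(composite_candidates({"Han", "Hiragana"}))
```

Returning a set of candidates rather than a single code keeps the "not enough information" case explicit, so a later step can narrow it down if more context is available.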
For certain Han-only fields, there is no way of identifying whether the field is in Chinese, Korean Hanja, or Japanese Kanji, because the Han characters are from the shared subset (just as the word "also" is a word in more than one language, albeit with different meanings).
However, if a field contains only Han characters, that field is *compatible* with being Jpan or Kore. So, if some related fields are specifically Jpan or Kore, such compatible fields could be assigned a more specific composite writing system identifier to match.
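One way to picture the "compatible field" idea is the following Python sketch. It assumes each field already has a set of candidate ISO 15924 codes and that the codes assigned to related fields are known. The function name `resolve_with_related` and the choice to return `None` when the related fields do not point to exactly one answer are assumptions for illustration, not part of the advice above.

```python
def resolve_with_related(candidates, related_codes):
    """Pick a writing system code for a field using related fields.

    `candidates` is the set of codes the field is compatible with, for
    example {"Hani", "Jpan", "Kore"} for a Han-only field.
    `related_codes` lists the codes already assigned to related fields.
    Returns a single code, or None if the field stays undetermined.
    """
    if len(candidates) == 1:
        # Already unambiguous; nothing to borrow from related fields.
        return next(iter(candidates))
    specific = {code for code in related_codes if code in ("Jpan", "Kore")}
    matches = candidates & specific
    if len(matches) == 1:
        return next(iter(matches))
    # No related Jpan/Kore field, or conflicting ones: leave undetermined.
    return None


if __name__ == "__main__":
    han_only = {"Hani", "Jpan", "Kore"}
    print(resolve_with_related(han_only, ["Jpan", "Latn"]))
```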
The best approach would depend on the purpose of the script identification.
For labels, or proposed labels, it is possible to use the repertoire defined for Japanese or Korean label generation rules. Any label that's not within the given subset would not be compatible with being part of the composite writing system.