The "Invisible Separator Characters" Issue

Nicholas Ostler

July 28, 2011

4:06 p.m.

First of all, I apologize if I am using non-standard terminology. What I mean specifically is this: Unicode code points that have no specific rendering of their own, but may affect the rendering of neighbouring code points in the string. Specifically we have been talking about U+200D ZWJ (zero-width joiner) and U+200C ZWNJ (zero-width non-joiner), both of which are actively used for some languages that use the Arabic and the Devanagari scripts. http://en.wikipedia.org/wiki/Zero-width_non-joiner shows that they can also make a difference in German spelling, and in Hebrew. Unicode calls these "special characters", which is not very helpful.

People have been talking explicitly about their necessity for Persian and some other Indo-Iranian languages written in Arabic script, e.g.
behnam@esfahbod.info - [arabic-vip] Typographical Complexity of Arabic Script - 19/07/2011
and for Nepali, among languages written in Devanagari, e.g.
bkbal@ltk.org.np - Re: [Devanagari-vip] Document for Hindi Language based on Policy for ccTLD .bharat - 03/07/2011

By their very nature, these characters, when placed where they would have no effect on rendering, allow identical renderings to be associated with distinct strings of abstract characters. They go beyond the use of other "combining" code points (such as accents and cedillas, which, taken with other neighbouring code points, also sometimes result in a glyph indistinguishable from another code point), in that they can have this effect on any string of code points. So U+0301 [´] COMBINING ACUTE ACCENT combined with U+0065 [e] LATIN SMALL LETTER E can result in a glyph indistinguishable from U+00E9 [é] LATIN SMALL LETTER E WITH ACUTE; but ZWNJ or ZWJ placed anywhere in a string (as long as its neighbours are not combining characters) will always give a string outwardly indistinguishable from one without it.

Seeing this danger of their use in URL identifiers (essentially an open invitation to spoofing), the Indian approach has been simply to outlaw them:
akshatj@cdac.in - Document for Hindi Language based on Policy for ccTLD .bharat - 01/07/2011

However, a less radical, but more complicated, approach, which allows ZWJ and ZWNJ to be used where necessary, is laid out by Unicode at http://unicode.org/review/pr-96.html
Public Review Issue #96 - Allowing Special Characters in Identifiers - Revision 3 - 04-19-2007
(It generalizes the issue slightly, to bring in Mongolian separators.) This has involved looking at the use of the characters in a wide variety of languages (going beyond our 5 case-studies), trying to characterize objectively the environments where ZWJ or ZWNJ could make a difference to rendering, and allowing them in identifiers only in these environments.

Those concerned about this issue for their languages (notably Nepali, Persian etc.) may wish to consider this approach as a concrete option.

--
Nicholas Ostler
nicholas@ostler.net
+44 (0)1225-852865, (0)7720-889319
Chairman: Foundation for Endangered Languages www.ogmios.org
Author: Empires of the Word (2005), Ad Infinitum (2007), The Last Lingua Franca (2010) www.nicholasostler.com
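The contrast drawn in this message between combining marks and join controls can be checked with a short Python sketch. It uses only the standard unicodedata module; the sample word "example" is an arbitrary assumption and does not come from the discussion. Canonical normalization (NFC) folds e + U+0301 into U+00E9, but it leaves an inserted ZWNJ in place, so two strings that may render identically still compare as different.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# A combining sequence and its precomposed form are distinct strings...
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
assert decomposed != precomposed
# ...but NFC normalization makes them identical.
assert unicodedata.normalize("NFC", decomposed) == precomposed

# Between two Latin letters a join control has nothing to join or separate,
# so it normally leaves the rendering unchanged. NFC does not remove it.
plain = "example"              # arbitrary sample label (assumption)
spoof = "exam" + ZWNJ + "ple"
same_after_nfc = (
    unicodedata.normalize("NFC", spoof) == unicodedata.normalize("NFC", plain)
)

print(len(plain), len(spoof))  # 7 8
print(same_after_nfc)          # False
print(unicodedata.name(ZWNJ))  # ZERO WIDTH NON-JOINER
```

Normalization alone therefore does not close the spoofing gap; some explicit rule about where join controls may appear is still needed.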

Cary Karp

July 2011

4:16 p.m.

Quoting Nicholas:

...

U+200D ZWJ zero-width joiner and U+200C ZWNJ zero-width non-joiner ... Unicode calls these "special characters", which is not very helpful.

They are often referred to as "join controls". /Cary

Patrik Fältström

6:37 a.m.

On 28 jul 2011, at 18.06, Nicholas Ostler wrote:

...

Those concerned about this issue for their languages (notably Nepali, Persian etc.) may wish to consider this approach as a concrete option.

This is also what is recommended in RFC 5892 section 2.8 and appendix A.1 and A.2. What is problematic is how to handle the case when one can *not* tie the label to a language (because of implicit contexts), as the DNS does not have language context. The DNS has only three parameters in a query: {owner, type, class}. Patrik

Nicholas Ostler

9:22 a.m.

On 29/07/2011 07:37, Patrik Fältström wrote:

...

On 28 jul 2011, at 18.06, Nicholas Ostler wrote:

...
Those concerned about this issue for their languages (notably Nepali, Persian etc.) may wish to consider this approach as a concrete option.

This is also what is recommended in RFC 5892 section 2.8 and appendix A.1 and A.2.

Thanks for that. I note that RFC 5892 is restricted to the Arabic and Devanagari script cases, whereas the Unicode reference I gave (http://unicode.org/review/pr-96.html) is more general (including other Brahmi-derived scripts of India and South-east Asia, notably Sinhalese, Khmer, and Malayalam, and the Mongolian "Uyghur" script, U+1800-18AF).

What is problematic is how to handle the case when one can *not* tie the label to a language (because of implicit contexts), as the DNS does not have language context. The DNS has only three parameters in a query: {owner, type, class}.

In fact, this may not be an impossible problem. The rules given by Unicode and RFC 5892 refer to properties of characters (e.g. Canonical_Combining_Class, Joining_Type:{L,D}, Joining_Type:{R,D}), not to languages as such. So, in principle, the rules would apply universally, regardless of local registries' languages.

Evidently, the radical solution of simply banning their use (as per the Indian approach to ccTLD in Devanagari) could be followed by some registries (if they chose, or were so constrained by national authorities). But they would have to reconcile themselves to finding the characters in other registries' TLDs.

Nicholas

--
Nicholas Ostler
nicholas@ostler.net
+44 (0)1225-852865, (0)7720-889319
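To make the point about character properties concrete, here is a minimal Python sketch of one such property-based check. It covers only the virama clause (a join control is accepted when the preceding code point has Canonical_Combining_Class 9, "Virama"). Python's standard library does not expose Joining_Type, so the Arabic-script clause for ZWNJ is deliberately left out, which makes this check stricter than the RFC 5892 rule for ZWNJ. The function names are illustrative assumptions.

```python
import unicodedata

ZWNJ = "\u200c"
ZWJ = "\u200d"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"

def preceded_by_virama(label: str, index: int) -> bool:
    """True if the code point before label[index] has combining class Virama."""
    return index > 0 and unicodedata.combining(label[index - 1]) == VIRAMA_CCC

def join_controls_ok(label: str) -> bool:
    """Accept a label only if every ZWJ/ZWNJ directly follows a virama.

    Assumption: only the virama clause is checked; the Joining_Type clause
    that RFC 5892 also allows for ZWNJ is not implemented here.
    """
    return all(
        preceded_by_virama(label, i)
        for i, ch in enumerate(label)
        if ch in (ZWJ, ZWNJ)
    )

# DEVANAGARI KA + VIRAMA + ZWJ + SSA: the joiner follows a virama.
print(join_controls_ok("\u0915\u094d\u200d\u0937"))  # True
# Same letters without the virama: the joiner has no licensed context.
print(join_controls_ok("\u0915\u200d\u0937"))        # False
# A ZWNJ between Latin letters is rejected too.
print(join_controls_ok("exam\u200cple"))             # False
```

Nothing in the check refers to a language: it looks only at properties of the neighbouring code points, which is why it could in principle be applied by any registry.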

Andrew Sullivan

August 2011

3:11 p.m.

New subject: [arabic-vip] The "Invisible Separator Characters" Issue

Dear colleagues, On Fri, Jul 29, 2011 at 10:22:26AM +0100, Nicholas Ostler wrote:

...

On 29/07/2011 07:37, Patrik Fältström wrote:

...

...
The DNS has only three parameters in a query: {owner, type, class}.

In fact, this may not be an impossible problem. The rules given by Unicode and RFC 5892 refer to properties of characters (e.g. Canonical_Combining_Class, Joining_Type:{L,D}, Joining_Type:{R,D}), not to languages as such. So, in principle, the rules would apply universally, regardless of local registries' languages.

While this is true, it misses the core of Patrik's point. There is no practical way to perform these checks at lookup time in all but the most specialized zones (and certainly, they can't be performed at the root).

Recall that the actual lookup is not for a U-label, but for an A-label. In order to perform the necessary context checks, a DNS server would need to perform the A-label/U-label transformation, and then perform the context check.

Now, DNS lookups need to happen in milliseconds. These days, in order to deal with things like IPv6 transition mechanisms (the so-called "happy eyeballs" approach), DNS timeouts by clients have been scaled back practically to nothing. The old one or two second long timeout is a thing of the past. So the overhead would be too great for clients to wait for it.

In addition, in order to get this benefit, any DNS authority server (and maybe any cache, but let's say not) that was serving a zone with these joiners would also need to be upgraded (either with a shim in front or else as part of the software itself) in order to perform this transform-and-check step. But the whole point of IDNA is supposed to be that we can do the job without touching the DNS servers or protocol. If we can open that can of worms, we could surely come up with something better than IDNA. Regardless, it would be asking far too much that the root server infrastructure be modified in this way, at least in the near term.

So, in short, resolution-time checks will not be possible. This leaves only strong rules restricting registration of labels with the joiner code points.

In respect of registration rules, in large, delegation-only or delegation-mostly zones (like the root and most TLDs), it is always preferable to start with the most conservative policy and gradually relax the rule. This is because it is very hard to remove a working delegation, but it is always easy to add a new one. Consider that removal of delegation from the root is incredibly rare. Consider, too, that the ICANN-agreement gTLDs all have a large number of grace policies designed to ensure the maximal probability of a registration being made correct (and properly paid for) instead of having the delegation get lost. These are both evidence for the proposition that it is easier to add a new delegation than to remove an existing one, and therefore reasons for caution in expanding registration rules at the root (or indeed, in any TLD).

Best regards,

A

--
Andrew Sullivan
ajs@anvilwalrusden.com
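The transform-and-check step described in this message can be sketched in Python as follows. This is an illustration of the extra work only, not of how any DNS server behaves. It uses Python's raw punycode codec (rather than the built-in idna codec, which implements the older IDNA2003 rules) and performs no IDNA validation. The function names and the Persian sample label are assumptions; the A-label xn--jean-franois-sdb appears later in this thread.

```python
ACE_PREFIX = "xn--"
JOIN_CONTROLS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ

def a_label_to_u_label(label: str) -> str:
    """Decode one A-label to its Unicode form; other labels pass through."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

def u_label_to_a_label(label: str) -> str:
    """Encode one label; pure-ASCII labels pass through (no validation)."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def needs_context_check(a_label: str) -> bool:
    """True if the decoded label contains a join control."""
    return any(ch in JOIN_CONTROLS for ch in a_label_to_u_label(a_label))

print(a_label_to_u_label("xn--jean-franois-sdb"))   # jean-françois
print(needs_context_check("xn--jean-franois-sdb"))  # False

# A sample Persian word containing ZWNJ (assumed example).
u_label = "\u0646\u0627\u0645\u0647\u200c\u0627\u06cc"
a_label = u_label_to_a_label(u_label)
print(a_label_to_u_label(a_label) == u_label)       # True
print(needs_context_check(a_label))                 # True
```

What arrives in a query is the ASCII form, so a join control only becomes visible after decoding. That is the step a registry can afford at registration time and a name server cannot afford at resolution time.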

JFC Morfin

1:41 p.m.

At 11:22 29/07/2011, Nicholas Ostler wrote:

...

...
The DNS has only three parameters in a query: {owner, type, class}.

Incorrect. It also has the prefix and the TLD that can be used to support presentation elements. I think it is time to introduce the concept of a "zonale" definition file that documents the parameters of a TLD relational space. For example, the .FRA zonale will document the sensitivity of .FRA domain names to majuscule. jfc

Andrew Sullivan

1:58 p.m.

On Tue, Aug 02, 2011 at 03:41:04PM +0200, JFC Morfin wrote:

...

At 11:22 29/07/2011, Nicholas Ostler wrote:

...
...
The DNS has only three parameters in a query: {owner, type, class}.

Incorrect. It also has the prefix and the TLD that can be used to support presentation elements.

What do you mean by "prefix" and "TLD"? I hope you're not trying to suggest that every DNS lookup is of the form [prefix].TLD. Certainly, that is false unless you have a very idiosyncratic interpretation of the DNS.

Zone cuts in the DNS are there for the administrative convenience _of the DNS_, and are not in themselves any kind of information about administrative boundaries for policy. The misunderstanding of this distinction, for instance, is a primary reason that http cookies are subject to so many woeful security problems, and why we have ended up with preposterous mechanisms like publicsuffix.org.

The reason policy is important and unusual at or near the root is not because those zones are somehow special, but because they mostly do delegation out to other operators, so innovations at those points in the tree are places that can affect a large number of other zones.

Therefore,

...

I think it is time to introduce the concept of a "zonale" definition file that documents the parameters of a TLD relational space. For example, the .FRA zonale will document the sensitivity of .FRA domain names to majuscule.

if what you are suggesting is that it needs to be possible to track down certain policy rules about a zone by looking up the location of such a policy in the DNS, along with rules about what to do if a zone doesn't come with a policy, then I might agree (and indeed, I committed during IETF week to put out a draft along these lines). If you're suggesting instead that we use the mere fact of the fully qualified domain name's labels to entail different formatting conventions, then I predict widespread failure from such simple inferences. Best, A -- Andrew Sullivan ajs@anvilwalrusden.com

JFC Morfin

4:37 p.m.

At 15:58 02/08/2011, Andrew Sullivan wrote:

...

On Tue, Aug 02, 2011 at 03:41:04PM +0200, JFC Morfin wrote:

...
At 11:22 29/07/2011, Nicholas Ostler wrote:

...
...
The DNS has only three parameters in a query: {owner, type, class}.

Incorrect. It also has the prefix and the TLD that can be used to support presentation elements.

Dear Andrew,

I am afraid my proposition is not as tricky as you fear. But it is better that I explain it, so that you can criticize it if I am wrong. I am also interested in your draft. It relates to only a few of the things that I propose and that we have tested for more than a decade as "quiest", and also to my point that Unicode is inadequate for networking. None of them has any impact on the DNS.

By TLD I mean TLD, and by prefix I mean whatever label(s) someone may want to conventionally introduce prior to the domain name he/she uses. I quote them together because both can be used together.

1. "zonale" (and netlocale about the . This only means that a TLD manager should maintain a possibly empty information center at zonale.tld, just as it usually operates a nic.tld or a www.tld, and that this ZIC provides presentation-oriented information and IDN tables. This will be simpler, faster and more equal to everyone than the IANA, where only $ 187.000+ TLDs and ISO 3166 ccTLDs can be documented (please remember that my logic for a third of a century has been ICANN/ICP-3-Part 5).

2. "prefix". I personally hate the WHOIS, which violates every privacy law throughout the world and is of very low interest in terms of the data it may include. For years I have advocated the optional quiest.domain.name formula for people to say what they want about themselves, or nothing. There is no difficulty in having a list of such registries, sub-hosts, or pages that can be used to give formatted information on the host, its relational spaces, its policy, its zonale, etc. Obviously the prefix.tld or prefix.zic.tld or prefix.nic.tld or prefix.www.tld can be used to give TLD zone information.

...

Zone cuts in the DNS are there for the administrative convenience _of the DNS_, and are not in themselves any kind of information about administrative boundaries for policy.

:-) this is something you should explain to ICANN@ 185.000 per zone. Actually I would suggest that every TLD supports an icann.tld name where it would document its relations with ICANN, the amount it pays to ICANN, etc. And obviously a variants.TLD if they are not documented in the zonale file.

...

The misunderstanding of this distinction, for instance, is a primary reason that http cookies are subject to so many woeful security problems, and why we have ended up with preposterous mechanisms like publicsuffix.org.

However, a very interesting initiative in terms of its relation to the DNS organization and monetary value.

...

The reason policy is important and unusual at or near the root is not because those zones are somehow special, but because they mostly do delegation out to other operators, so innovations at those points in the tree are places that can affect a large number of other zones.

Sure, but you know that I have a heterarchical vision of the top zone. So, it is technically slightly less important to me than to you. But it can certainly be highly confusing for many. This is why variants should be carefully treated in coordination with all the other uses of names, in every technology.

...

Therefore,

...
I think it is time to introduce the concept of a "zonale" definition file that documents the parameters of a TLD relational space. For example, the .FRA zonale will document the sensitivity of .FRA domain names to majuscule.

if what you are suggesting is that it needs to be possible to track down certain policy rules about a zone by looking up the location of such a policy in the DNS, along with rules about what to do if a zone doesn't come with a policy, then I might agree (and indeed, I committed during IETF week to put out a draft along these lines).

This is interesting news. The question is about the DNS itself being used as a data repository, or a DDDS, or a linked registry, or some SQLite engine. I am personally interested in investing some effort this summer in ISO 11169, to see how to amalgamate the concepts involved in all this. I am considering building from scratch a multi-registry server as part of the ML-DNS support.

...

If you're suggesting instead that we use the mere fact of the fully qualified domain name's labels to entail different formatting conventions, then I predict widespread failure from such simple inferences.

I am not sure I understand what you mean, unless you are afraid that I am trying to infer some information simply from the way an FQDN commonly looks. This is not the case. Best jfc

Patrik Fältström

8:40 a.m.

On 2 aug 2011, at 18:37, JFC Morfin wrote:

...

By TLD I mean TLD and by prefix I mean whatever label(s) someone may want to conventionally introduce prior to the domain name he/she uses.

A domain name is whatever set of one or more labels you for example can use in a DNS query that together with type and class form the triple on which matching is based. What you seem to talk about is one label in a domain name, but not a domain name. Patrik

JFC Morfin

3:33 p.m.

At 10:40 22/08/2011, Patrik Fältström wrote:

...

On 2 aug 2011, at 18:37, JFC Morfin wrote:

...
By TLD I mean TLD and by prefix I mean whatever label(s) someone may want to conventionally introduce prior to the domain name he/she uses.

A domain name is whatever set of one or more labels you for example can use in a DNS query that together with type and class form the triple on which matching is based. What you seem to talk about is one label in a domain name, but not a domain name.

Dear Patrik,

Please remember that when we started the WG/IDNA2008 work, I asked Vint if the intent was to address the needs of the users or the needs of the Internet, as per the charter. The response from James Seng, confirmed by Vint, was: the needs of the Internet. I then said I would develop the ML-DNS to suit the needs of the users, based upon the IDNA2008 outcome if this was possible. At a given time the debate showed it would not be possible, and we opposed for that reason. I was therefore fired with some "Jefsey's disciples", etc.

At the end of the day we surprisingly were able to reach a consensus because it turned out that the Internet technology was able, in an "unusual" way (RFC 5895):

1. to support IDNA2008 on the network side - as per the WG. Hence the consensus, with our mental restriction concerning orthotypographic metadata (e.g. French majuscules).
2. to support IDNA2008 on the user side through what I call "IUI" (Internet/Intelligent Use Interface) with the users' machine and its applications. Work to be carried somewhere.

I asked IESG/IAB where. They made clear that the possible IUI definition, experimentation, etc. was of potential interest to the IETF but was no part of its scope. Hence the need to initiate a place to carry that work (IUTF, etc.) and a way to liaise with the IETF as an "IETF client". The IUCG, which helped the IDNA2008 consensus on our side, seems to be the proper place for that since it is a user's need.

Now, I have a personal difficulty.

1. The IUI is an architectural component which has to be neutrally defined on a general architectural basis.
2. Then I have my own architectural project (I name Internet PLUS) to develop and support an IUI. There may be others. However, the IUI concept being a new one, my own experimental implementation will validate it.

I therefore have to be careful not to confuse what is intrinsic to a subsidiary extension of the Internet architecture (I call the SDNS [subsidiary DNS] in the DNS case) with what belongs to my own SDNS implementation, which I carry through ALFA. In *my* implementation, the necessary dialog between the user and its network piece of intelligence located at the participating IUIs (local, host, cybship - i.e. all the other processes controlled by the user whatever the machine) can simply be carried by a "netix" command set also delivered by what I name "prefixes".

This being said. In my personal project, if we take the "xyz.icann.org" domain name, if I can differentiate "xyz", I can consider "icann.org" as the registered domain name of ICANN and "xyz" as a netix "prefix" which has been introduced prior to that domain name - however it looks as being a part of it so it goes through to the ML-DNS. The way this prefix is to be used is no part of the Internet and of the Internet DNS, but will be filtered and applied by the ML-DNS.

For example, I can say that "9--2.icann.org" is to be understood by the ML-DNS as standing for a class 2 "icann.org" FQDN; the ML-DNS will make the DNS resolve it as such. This means that "9--2.xn--jean-franois-sdb.jefsey.com" will be "jean-françois.jefsey.com" in CLASS 2. However, please remember that this is only because the ML-DNS acts as a smart middleware IDNApplication (cf. the scheme in http://iucg.org/wiki/IDNS_Common_Glossary), and with an ML-DNS, Internet PLUS experimental implementation.

Why is this important to the IETF? For consistency reasons. Post IDNA2008 solutions should be consistent (RFC 1958) to be stable. For the time being we have at least seven entities considering them:

- IAB
- UNICODE
- PRECIS at protocol level
- ICANN at DNS governance level
- emerging IUTF at IUI level
- IUCG at user satisfaction level
- and me at research and experimental level.

I am sure there are others. Hence, my effort to get a common glossary and communication to help the mutual understanding of problems,

jfc
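The "9--2" example in this message can be read as a middleware step that strips a leading label and sets the query class before resolution. The Python sketch below is purely illustrative and is not the ML-DNS implementation. The pattern 9--<number> is an assumption generalized from the single example given, and the function name, return shape and default class are hypothetical.

```python
import re

# Assumed prefix syntax, generalized from the one example "9--2".
CLASS_PREFIX = re.compile(r"^9--([0-9]+)$")

def split_class_prefix(fqdn: str, default_class: int = 1):
    """Return (name with A-labels decoded, DNS class) for a prefixed name.

    Hypothetical helper: if the first label matches the assumed prefix
    pattern, it is removed and its number is used as the query class.
    """
    labels = fqdn.split(".")
    dns_class = default_class
    match = CLASS_PREFIX.match(labels[0])
    if match and len(labels) > 1:
        dns_class = int(match.group(1))
        labels = labels[1:]
    decoded = [
        label[4:].encode("ascii").decode("punycode")
        if label.lower().startswith("xn--") else label
        for label in labels
    ]
    return ".".join(decoded), dns_class

print(split_class_prefix("9--2.icann.org"))
# ('icann.org', 2)
print(split_class_prefix("9--2.xn--jean-franois-sdb.jefsey.com"))
# ('jean-françois.jefsey.com', 2)
print(split_class_prefix("www.icann.org"))
# ('www.icann.org', 1)
```

All of the interpretation happens before any query is sent, which is consistent with the statement that the prefix convention is no part of the Internet DNS itself.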

Patrik Fältström

6:06 p.m.

On 22 aug 2011, at 17.33, JFC Morfin wrote:

...

This being said. In my personal project, if we take the "xyz.icann.org" domain name, if I can differentiate "xyz", I can consider "icann.org" as the registered domain name of ICANN and "xyz" as a netix "prefix" which has been introduced prior to that domain name - however it looks as being a part of it so it goes through to the ML-DNS.

My only point, which I reiterate, is that xyz.icann.org is a domain name, as is icann.org and org. Three different domain names. Please do not use the term "domain name" for anything else. Patrik
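Patrik's terminology can be illustrated with a minimal Python sketch: every suffix sequence of labels is itself a domain name, and a query is matched on the triple {owner, type, class}. The class and function names below are illustrative assumptions, not part of any DNS library.

```python
from typing import NamedTuple

class Question(NamedTuple):
    """The triple on which DNS query matching is based."""
    owner: str   # a domain name: one or more labels
    rtype: str   # e.g. "A"
    rclass: str  # e.g. "IN"

def domain_names(name: str) -> list[str]:
    """List a name and each of its ancestors (root excluded)."""
    labels = name.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]

names = domain_names("xyz.icann.org")
print(names)  # ['xyz.icann.org', 'icann.org', 'org']

# Three different owners give three different questions.
questions = {Question(owner, "A", "IN") for owner in names}
print(len(questions))  # 3
```

In these terms "xyz" is a label, while "xyz.icann.org", "icann.org" and "org" are each domain names in their own right.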

JFC Morfin

12:06 a.m.

At 20:06 22/08/2011, Patrik Fältström wrote:

...

On 22 aug 2011, at 17.33, JFC Morfin wrote:

...
This being said. In my personal project, if we take the "xyz.icann.org" domain name, if I can differentiate "xyz", I can consider "icann.org" as the registered domain name of ICANN and "xyz" as a netix "prefix" which has been introduced prior to that domain name - however it looks as being a part of it so it goes through to the ML-DNS.

My only point, which I reiterate, is that xyz.icann.org is a domain name, as is icann.org and org. Three different domain names. Please do not use the term "domain name" for anything else.

Dear Patrik,

However happy I would be to defer to your wishes, I am at a loss to understand what they are.

1. xyz.icann.org is a domain name, as is icann.org and org. Three different domain names.
2. in these domain names "org" is their TLD or suffix or first level label.
3. xyz, icann, and org are respectively their upper level label.

I claim the capacity to call them their prefixes. If you have a problem with that I can also call them their ULL, and their TLD their FLL. However, this sounds less multilinguistic and unnecessarily sophisticated?

Best.

jfc
