First of all, I apologize if I am using non-standard terminology. What I mean specifically is this: Unicode code points that have no rendering of their own, but may affect the rendering of neighbouring code points in the string.

Specifically, we have been talking about U+200D ZWJ (zero-width joiner) and U+200C ZWNJ (zero-width non-joiner), both of which are actively used for some languages written in the Arabic and Devanagari scripts. http://en.wikipedia.org/wiki/Zero-width_non-joiner shows that they can also make a difference in German spelling, and in Hebrew. Unicode calls these "special characters", which is not very helpful.

People have been talking explicitly about their necessity for Persian and some other Indo-Iranian languages written in the Arabic script, e.g.
  behnam@esfahbod.info - [arabic-vip] Typographical Complexity of Arabic Script - 19/07/2011
and for Nepali, among languages written in Devanagari, e.g.
  bkbal@ltk.org.np - Re: [Devanagari-vip] Document for Hindi Language based on Policy for ccTLD .bharat - 03/07/2011

By their very nature, these characters, when placed where they have no effect on rendering, allow identical renderings to be associated with distinct strings of abstract characters. They go beyond other "combining" code points (such as accents and cedillas, which, taken with neighbouring code points, also sometimes result in a glyph indistinguishable from another code point), in that they can have this effect on any string of code points. So U+0301 [´] COMBINING ACUTE ACCENT combined with U+0065 [e] LATIN SMALL LETTER E can result in a glyph indistinguishable from U+00E9 [é] LATIN SMALL LETTER E WITH ACUTE; but ZWNJ or ZWJ placed anywhere in a string (as long as its neighbours are not combining characters) will always give a string outwardly indistinguishable from one without it.
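
To make the contrast concrete, here is a minimal Python sketch using only the standard unicodedata module. The sample strings, the variable names and the helper has_joiner_controls are my own illustrative assumptions, not part of any registry policy. It suggests that the accent case is folded away by Unicode normalization, while a ZWNJ between two plain letters is left in place, so normalization alone would not remove this kind of look-alike.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# Case 1: a combining accent. NFC normalization composes
# e + COMBINING ACUTE ACCENT into the single code point U+00E9.
decomposed = "e\u0301"
precomposed = "\u00e9"
accent_folds = unicodedata.normalize("NFC", decomposed) == precomposed

# Case 2: a joiner control between two Latin letters (illustrative sample).
# Neither NFC nor NFKC removes it, so the two strings stay distinct
# even though they would normally be drawn identically.
plain = "ab"
with_zwnj = "a" + ZWNJ + "b"
zwnj_survives = all(
    unicodedata.normalize(form, with_zwnj) != unicodedata.normalize(form, plain)
    for form in ("NFC", "NFKC")
)


def has_joiner_controls(label: str) -> bool:
    """Hypothetical helper: report whether a label contains ZWJ or ZWNJ."""
    return ZWNJ in label or ZWJ in label


print(accent_folds, zwnj_survives, has_joiner_controls(with_zwnj))
```

A check like has_joiner_controls corresponds only to the bluntest policy (flag every occurrence); it says nothing about whether the joiner actually changes the rendering in its context.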

Seeing the danger of their use in URL identifiers (essentially an open invitation to spoofing), the Indian approach has been simply to outlaw them:
  akshatj@cdac.in - Document for Hindi Language based on Policy for ccTLD .bharat - 01/07/2011

However, a less radical but more complicated approach, which allows ZWJ and ZWNJ to be used where necessary, is laid out by Unicode at
  http://unicode.org/review/pr-96.html
  Public Review Issue #96 - Allowing Special Characters in Identifiers - Revision 3 - 04-19-2007
(It generalizes the issue slightly, to bring in Mongolian separators.)

This has involved looking at the use of the characters in a wide variety of languages (going beyond our 5 case-studies), trying to characterize objectively the environments where ZWJ or ZWNJ could make a difference to rendering, and allowing them in identifiers only in those environments.

Those concerned about this issue for their languages (notably Nepali, Persian etc.) may wish to consider this approach as a concrete option.

--
Nicholas Ostler
nicholas@ostler.net
+44 (0)1225-852865, (0)7720-889319
Chairman: Foundation for Endangered Languages www.ogmios.org
Author: Empires of the Word (2005), Ad Infinitum (2007), The Last Lingua Franca (2010)
www.nicholasostler.com