On 4/26/2019 8:07 PM, John Levine wrote:
 * How to handle illegal sequences in complex scripts?
 Don't Do That.

Guidance might consist of telling people that there are things like illegal sequences.

Sure. I don't want to be cruel; I want to encourage people to recognize when something isn't likely to work so they can reject it early.


I don't think cruelty enters the discussion here. What does enter is that most people lack the background to understand that they are deep out on the moors and to know where to find secure footing.


 we can avoid a lot of pain by starting with the assumption that EAI
 addresses will be something that looks plausible in some natural
 language.

Tricky one, that. You don't want to go into spelling rules. Some languages don't use "w", for example, but people who speak those languages may have names that start with W. For complex scripts, certain pairings make no sense because they don't correspond to any realistic sound in the languages that use the script, but they are in fact no worse than "nonsense" words in ASCII. I'm not sure whether I would call "phhhbzz@example.com" plausible in any language, but we can agree it still works as a mnemonic for users of English.

Plausible may not be the right word, but I think you get the idea. Perhaps say it's something that a speaker of a language would be able to type and to read or spell over the phone without undue effort.

That's not a bad description. The "structural nonsense" that I have in mind is something that would either not render properly or be worse than unpronounceable. What, for example, would be the effect of applying a vowel modifier on a tone mark? (That example, actually, could result in corrective display in some systems, where they'd assume you meant the marks in the correct order, which leads to bad aliasing for identifiers.)


ph)xxt@example.com is indeed more marginal.

Actually, it's invalid because local parts that contain parens need to be quoted.  On the other hand, braces don't need to be quoted so this is valid but I wouldn't be too upset if a web form didn't accept it:

 ph}xxt@example.com
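
To make the paren/brace distinction concrete, here is a minimal sketch of the unquoted (dot-atom) test, assuming the ASCII "atext" set from RFC 5322 section 3.2.3. The name needs_quoting is made up for illustration, and the sketch ignores the non-ASCII characters that EAI additionally allows.

```python
import re

# ASCII atext characters per RFC 5322 section 3.2.3.
# Assumption: EAI (SMTPUTF8) non-ASCII characters are deliberately left out.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# A dot-atom is one or more atext runs separated by single dots.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def needs_quoting(local_part: str) -> bool:
    """Return True if local_part is not a plain dot-atom (hypothetical helper)."""
    return DOT_ATOM.fullmatch(local_part) is None


print(needs_quoting("ph)xxt"))  # True: parentheses are not atext
print(needs_quoting("ph}xxt"))  # False: braces are atext
```

A web form that wants to be stricter than the grammar could layer its own, narrower character set on top of a check like this.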

Worse if you plant an accent mark on it.

Now we're into fuzzy match territory.  I think that for robustness an MTA should accept local part characters with wrong or missing diacriticals, and mail system admins shouldn't assign names that differ only in diacriticals.

That's possible, but remember that what looks like a diacritic in one language is treated as an independent letter in others. A-with-ring is sorted after Z in all Scandinavian languages, while A is sorted before B. If you equate them, where do you show them in a user list?
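
As an illustration of what "accepting wrong or missing diacriticals" typically amounts to in code, here is a rough sketch of the common folding trick: decompose to NFD and drop nonspacing marks. The function name fold_diacritics and the sample names are my own; nothing here is a recommendation, it only shows that the same fold that equates "naïve" with "naive" also equates A-with-ring with plain A.

```python
import unicodedata


def fold_diacritics(s: str) -> str:
    """Decompose to NFD and drop nonspacing marks (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The diaeresis case that people usually have in mind.
print(fold_diacritics("na\u00efve") == "naive")               # True

# The same fold silently equates a-with-ring (U+00E5) with plain "a".
print(fold_diacritics("p\u00e5l") == fold_diacritics("pal"))  # True
```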

Nukta, in the Indic scripts, is considered a mark that is "closest" to an actual diacritic - as opposed to a distinct (modifier) letter or independent mark that simply combines graphically.

Nukta happens to be the only one that is treated as "optional" in some scripts, mainly because it's not very prominent and because it is used in unexpected ways in some minority languages, risking misinterpretation (or being overlooked) by the majority.

What feels natural, or what strikes the right balance between security and behaving as (naively) expected, isn't something that can be covered by a simplistic approach - some differentiation as to target audience will be required.

For example, if you ignored the vowel signs in Indic scripts (because graphically, they behave like diacritics) then you would conflate words that differ the way "bat", "bet", "bit", "bot" and "but" differ in English! That may be robust, but the behavior would be rather unexpected and not at all the same as accepting "naive" both with and without the diaeresis.
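
A small sketch of that conflation, using Devanagari as the example script. It assumes a naive matcher that drops every code point in a mark category (Mn, Mc, Me); the helper name drop_all_marks and the four sample strings are illustrative only.

```python
import unicodedata


def drop_all_marks(s: str) -> str:
    """Drop every code point whose general category is a mark (Mn, Mc, Me)."""
    return "".join(ch for ch in s if not unicodedata.category(ch).startswith("M"))


# Four distinct strings: the same consonants KA and LA, differing only
# in the vowel sign (or its absence) attached to the first consonant.
candidates = [
    "\u0915\u0932",        # KA, LA
    "\u0915\u093e\u0932",  # KA + VOWEL SIGN AA, LA
    "\u0915\u0940\u0932",  # KA + VOWEL SIGN II, LA
    "\u0915\u0941\u0932",  # KA + VOWEL SIGN U, LA
]

folded = {drop_all_marks(c) for c in candidates}
print(len(set(candidates)), len(folded))  # 4 1
```

Note that these vowel signs have canonical combining class 0, so a filter built on unicodedata.combining() would behave differently from one built on the general category; which of the two a given system uses is exactly the kind of detail guidelines would have to spell out.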

In some Indic scripts the vowel signs (at least some of them) are rather prominent, but in others, some or all of them look more like decorations.

So the challenge would be to create guidelines that are useful.

The example of the bracket + accent was intended in a different direction: most accent marks (in European scripts) should simply be forbidden, since 99% of the combinations needed are precomposed. A dozen or so explicitly enumerated combinations cover the remainder of what's needed in modern languages.

Everything else is not "plausible" and should simply not be admitted.
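
Roughly, the rule I have in mind for European scripts could be sketched as follows: normalize to NFC so that precomposed forms are used wherever they exist, then reject any combining mark that is left over unless the (base, mark) pair is on an explicit list. The list below is a placeholder with a single entry (j + combining acute, which has no precomposed form); the real enumeration would need to be worked out, and the function name is invented for this example. The check is only meant for Latin-style repertoires; applied as-is it would reject the vowel signs of Indic scripts.

```python
import unicodedata

# Placeholder allowlist of (base, combining mark) pairs that survive NFC.
# Assumption: the single entry is illustrative, not a vetted enumeration.
ALLOWED_SEQUENCES = {("j", "\u0301")}


def has_only_permitted_marks(local_part: str) -> bool:
    """Return True if, after NFC, every remaining mark is explicitly allowed."""
    nfc = unicodedata.normalize("NFC", local_part)
    prev = None
    for ch in nfc:
        if unicodedata.category(ch).startswith("M"):
            # A mark with no base, on a disallowed base, or stacked on
            # another mark is rejected.
            if prev is None or (prev, ch) not in ALLOWED_SEQUENCES:
                return False
        prev = ch
    return True


print(has_only_permitted_marks("rene\u0301"))     # True: composes to e-acute
print(has_only_permitted_marks("ph}\u0301xxt"))   # False: accent on a brace
print(has_only_permitted_marks("bij\u0301"))      # True: pair is on the list
```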

Yes, it makes it impossible for people to write usernames in IPA. But why support something just to tickle the vanity of a vanishingly small set of users fluent in that particular notation, while making the experience worse for everyone else?

Those are tradeoffs that are worthy of some analysis and description.

A./