Re: [UA-EAI] UA-EAI@icann.org mailing list reminder
Following on from recent discussions in the UA-Discuss list, I’d like to build a Good Practice Guide for Mailbox Administrators so these administrators can see some of the issues that they should consider. Topics might include:
* Transformation of mailbox name from non-ascii to an ascii alias when encountering a non-EAI ready MTA
* How to display IDNs
* How to display non-ASCII mailbox names
* How to address fuzzy matching when receiving non-exact match emails
* How to manage case folding
* How to manage variants in mailbox names. Are korero@ and kōrero@ going to the same mailbox?
* Do you allow mixing of scripts in mailbox names? If yes, sometimes, what are the criteria.
* And more
Your suggestions are most welcome.
Don
Don Hollander Secretary General – UASG
Making a start:
* How to handle illegal sequences in complex scripts?
* How to handle non-standard sequences in complex scripts?
A./
In article <2189ae2a-adcd-119d-9d0d-d19d16544110@ix.netcom.com> you write:
Making a start:
* How to handle illegal sequences in complex scripts?
Don't Do That.
* How to handle non-standard sequences in complex scripts?
Don't Do That, either.
We know that it is technically valid to put a close paren in an ASCII mailbox name, but no sensible person expects that to work, and I can't get too upset at web forms or whatever that don't allow them.*
I think we can avoid a lot of pain by starting with the assumption that EAI addresses will be something that looks plausible in some natural language.
R's, John
* - on the other hand, hyphens and + signs are fine
On 4/26/2019 5:43 PM, John Levine wrote:
In article <2189ae2a-adcd-119d-9d0d-d19d16544110@ix.netcom.com> you write:
Making a start:
* How to handle illegal sequences in complex scripts? Don't Do That.
Guidance might consist of telling people that there are things like illegal sequences. (For examples see chapter 12 of the Unicode Standard and look for "Do not use" - those particular ones are cut&dried)
* How to handle non-standard sequences in complex scripts? Don't Do That, either.
Guidance might consist of telling people what such sequences are. (Beyond the "do not use" there is another layer of context rules for complex scripts, like the ones you find in RZ-LGR-3, that would rule out additional sequences that are structurally invalid or, if you want, by definition meaningless for a given script.) In some cases, alternative sequences exist for legacy reasons and, for mailboxes, might be appropriate for fuzzy matching. Guidance might consist of identifying these (or giving examples plus sources for more complete info).
We know that it is technically valid to put a close paren in an ASCII mailbox name, but no sensible person expects that to work, and I can't get too upset at web forms or whatever that don't allow them.* I think we can avoid a lot of pain by starting with the assumption that EAI addresses will be something that looks plausible in some natural language.
Tricky one, that. You don't want to go into spelling rules. Like some languages don't use "w", but then, people who speak that language may have names that start with W. For complex scripts, certain pairings make no sense, because they don't correspond to any realistic sound in the languages that use it, but are in fact no worse than "nonsense" words in ASCII. I'm not sure whether I would call "phhhbzz@example.com" plausible in any language, but we can agree it still works as mnemonic for users of English. ph)xxt@example.com is indeed more marginal. Worse if you plant an accent mark on it. That kind of thing approaches the structural issues seen in complex scripts. A./
R's, John
* - on the other hand, hyphens and + signs are fine
* How to handle illegal sequences in complex scripts? Don't Do That.
Guidance might consist of telling people that there are things like illegal sequences.
Sure. I don't want to be cruel, I want to encourage people to know when something isn't likely to work so reject it early.
we can avoid a lot of pain by starting with the assumption that EAI addresses will be something that looks plausible in some natural language.
Tricky one, that. You don't want to go into spelling rules. Like some languages don't use "w", but then, people who speak that language may have names that start with W. For complex scripts, certain pairings make no sense, because they don't correspond to any realistic sound in the languages that use it, but are in fact no worse than "nonsense" words in ASCII. I'm not sure whether I would call "phhhbzz@example.com" plausible in any language, but we can agree it still works as mnemonic for users of English.
Plausible may not be the right word, but I think you get the idea. Perhaps say it's something that a speaker of a language would be able to type and to read or spell over the phone without undue effort.
ph)xxt@example.com is indeed more marginal.
Actually, it's invalid because local parts that contain parens need to be quoted. On the other hand, braces don't need to be quoted so this is valid but I wouldn't be too upset if a web form didn't accept it: ph}xxt@example.com
Worse if you plant an accent mark on it.
Now we're into fuzzy match territory. I think that for robustness an MTA should accept local part characters with wrong or missing diacriticals, and mail system admins shouldn't assign names that differ only in diacriticals. Regards, John Levine, john.levine@standcore.com Standcore LLC
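The point above about parentheses needing quoting while braces do not can be checked mechanically. The following is a minimal sketch, not a full address parser: it tests whether a local part is a valid unquoted dot-atom using the `atext` character set from RFC 5322. The function name `needs_quoting` is hypothetical, and accepting every non-ASCII character is a simplifying assumption standing in for the EAI extension of `atext`.

```python
import string

# atext per RFC 5322 section 3.2.3: letters, digits, and these specials.
ATEXT_SPECIALS = "!#$%&'*+-/=?^_`{|}~"
ATEXT = set(string.ascii_letters + string.digits + ATEXT_SPECIALS)


def needs_quoting(local_part: str) -> bool:
    """Return True if local_part is not a valid unquoted dot-atom."""
    if not local_part:
        return True
    # A dot-atom is one or more non-empty atoms separated by single dots.
    for atom in local_part.split("."):
        if not atom:
            return True  # leading, trailing, or doubled dot
        for ch in atom:
            # Assumption: any non-ASCII character is accepted (EAI).
            if ch not in ATEXT and ord(ch) <= 0x7F:
                return True
    return False


print(needs_quoting("ph)xxt"))  # True: ")" is not in atext
print(needs_quoting("ph}xxt"))  # False: "}" is in atext
```

A web form that wants to be stricter than the syntax allows, as suggested above, could apply a narrower character set after this check rather than instead of it.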
On 4/26/2019 8:07 PM, John Levine wrote:
* How to handle illegal sequences in complex scripts? Don't Do That.
Guidance might consist of telling people that there are things like illegal sequences.
Sure. I don't want to be cruel, I want to encourage people to know when something isn't likely to work so reject it early.
I don't think cruelty enters the discussion here. What does enter is that most people lack the background to understand that they are deep out on the moors and to know where to find secure footing.
we can avoid a lot of pain by starting with the assumption that EAI addresses will be something that looks plausible in some natural language.
Tricky one, that. You don't want to go into spelling rules. Like some languages don't use "w", but then, people who speak that language may have names that start with W. For complex scripts, certain pairings make no sense, because they don't correspond to any realistic sound in the languages that use it, but are in fact no worse than "nonsense" words in ASCII. I'm not sure whether I would call "phhhbzz@example.com" plausible in any language, but we can agree it still works as mnemonic for users of English.
Plausible may not be the right word, but I think you get the idea. Perhaps say it's something that a speaker of a language would be able to type and to read or spell over the phone without undue effort.
That's not a bad description. The "structural nonsense" that I have in mind is something that would either not render properly or be worse than unpronounceable. What, for example, would be the effect of applying a vowel modifier on a tone mark? (That example, actually, could result in corrective display in some systems, where they'd assume you meant it in the correct order - which leads to bad aliasing for identifiers).
ph)xxt@example.com is indeed more marginal.
Actually, it's invalid because local parts that contain parens need to be quoted. On the other hand, braces don't need to be quoted so this is valid but I wouldn't be too upset if a web form didn't accept it:
ph}xxt@example.com
Worse if you plant an accent mark on it.
Now we're into fuzzy match territory. I think that for robustness an MTA should accept local part characters with wrong or missing diacriticals, and mail system admins shouldn't assign names that differ only in diacriticals.
That's possible, but remember, what looks like a diacritic in one language, is treated as an independent letter in others. A-with-ring is sorted after Z in all Scandinavian languages. While A is sorted before B. If you equate them, where do you show them in a user list?
Nukta, in the Indic scripts, is considered a mark that is "closest" to an actual diacritic - as opposed to a distinct (modifier) letter or independent mark that simply combines graphically. Nukta happens to be the only one that is treated as "optional" in some scripts. Mainly because it's not very prominent, and it happens to be used in unexpected ways in some minority languages - risking misinterpretation (or being overlooked) by the majority.
What feels natural, or what strikes the right balance between security and behaving as (naively) expected, isn't something that can be covered by a simplistic approach - some differentiation as to target audience will be required. For example, if you ignored the vowel signs in Indic scripts (because graphically, they behave like diacritics) then you would conflate words that differ the way "bat", "bet", "bit", "bot" and "but" differ in English! That may be robust, but the behavior would be rather unexpected and not at all the same as accepting "naive" both with and without the diaeresis. In some Indic scripts the vowel signs (at least some of them) are rather prominent, but in others, some or all of them look more like decorations. So the challenge would be to create guidelines that are useful.
The example of the bracket + accent was intended in a different direction: most accent marks (in European scripts) should simply be forbidden --- 99% of the combinations needed are precomposed. A dozen or so explicitly enumerated combinations covers the remainder of what's needed in modern languages. Everything else is not "plausible" and should simply not be admitted. Yes, it makes it impossible for people to write usernames in IPA. But why support something, to tickle the vanity of a vanishingly small set of users fluent in that particular notation while making the experience worse for everyone else? Those are tradeoffs that are worthy of some analysis and description.
A./
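The rule suggested above for European scripts (accept precomposed letters, plus a short enumerated list of combining sequences, and reject everything else) might be sketched as follows. This is an illustration under stated assumptions: the function name `leftover_marks` is hypothetical, "accent mark" is approximated by the Combining Diacritical Marks block (U+0300 to U+036F), and the single allowlist entry is a placeholder, not a vetted list.

```python
import unicodedata

# Placeholder allowlist of (base letter, combining mark) pairs.
# Assumption: a real deployment would enumerate the dozen or so
# sequences its users actually need; this entry is only illustrative.
ALLOWED_SEQUENCES = {
    ("m", "\u0304"),  # m + combining macron
}


def is_european_accent(ch: str) -> bool:
    """Approximation: character is in the Combining Diacritical Marks block."""
    return 0x0300 <= ord(ch) <= 0x036F


def leftover_marks(local_part: str) -> list:
    """Return (base, mark) pairs that survive NFC and are not allowlisted."""
    # NFC composes every base+mark pair that has a precomposed form.
    nfc = unicodedata.normalize("NFC", local_part)
    rejected = []
    for i, ch in enumerate(nfc):
        if is_european_accent(ch):
            base = nfc[i - 1] if i > 0 else ""
            if (base, ch) not in ALLOWED_SEQUENCES:
                rejected.append((base, ch))
    return rejected


print(leftover_marks("cafe\u0301"))    # [] since e + acute composes to é
print(leftover_marks("ph}\u0301xxt"))  # [('}', '́')] brace + acute is rejected
```

An empty result means the name passes this particular check. The check deliberately says nothing about Indic vowel signs or nukta, which the message above treats as a separate problem.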
On Fri, 26 Apr 2019, Asmus Freytag (c) wrote:
Now we're into fuzzy match territory. I think that for robustness an MTA should accept local part characters with wrong or missing diacriticals, and mail system admins shouldn't assign names that differ only in diacriticals.
That's possible, but remember, what looks like a diacritic in one language, is treated as an independent letter in others. A-with-ring is sorted after Z in all Scandinavian languages. While A is sorted before B. If you equate them, where do you show them in a user list? ...
Fuzzy matching isn't case folding or variants. You can think of it as spelling correction or aliases. My point is that if a system has an address bøb, it should accept mail addressed to bob and böb as well, even if those are different letters in the local language. It would be a good idea not to assign bób or bòb and maybe also to allow them as aliases for bøb. The user's address is still bøb which sorts however it sorts. Regards, John Levine, john.levine@standcore.com Standcore LLC
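The bøb example can be sketched in code. This is a rough illustration, assuming a Latin-script system that has chosen the lenient policy described above: `fuzzy_key` and `resolve` are hypothetical names, and the ø-to-o mapping is added explicitly because ø has no canonical decomposition and so is untouched by NFD.

```python
import unicodedata
from typing import Iterable, Optional

# Letters that NFD does not decompose need an explicit fold.
# Assumption: fold to a single base letter, as in the example above.
EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O"})


def fuzzy_key(local_part: str) -> str:
    """Strip combining marks after NFD, then apply the explicit folds."""
    decomposed = unicodedata.normalize("NFD", local_part)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_FOLDS)


def resolve(local_part: str, mailboxes: Iterable[str]) -> Optional[str]:
    """Return the canonical mailbox whose fuzzy key matches, if any."""
    index = {fuzzy_key(name): name for name in mailboxes}
    return index.get(fuzzy_key(local_part))


mailboxes = ["bøb"]
for attempt in ["bob", "böb", "bób", "bòb", "bøb"]:
    print(attempt, "->", resolve(attempt, mailboxes))  # each resolves to bøb
```

The canonical address stays bøb for display and sorting; only lookup uses the key. Stripping every character with a non-zero combining class is a blunt rule that suits this Latin example and would be wrong for scripts where such marks carry distinct meaning.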
On 27 Apr 2019, at 7:10, John Levine <john.levine@standcore.com> wrote:
On Fri, 26 Apr 2019, Asmus Freytag (c) wrote:
Now we're into fuzzy match territory. I think that for robustness an MTA should accept local part characters with wrong or missing diacriticals, and mail system admins shouldn't assign names that differ only in diacriticals.
That's possible, but remember, what looks like a diacritic in one language, is treated as an independent letter in others. A-with-ring is sorted after Z in all Scandinavian languages. While A is sorted before B. If you equate them, where do you show them in a user list? ...
Fuzzy matching isn't case folding or variants. You can think of it as spelling correction or aliases.
My point is that if a system has an address bøb, it should accept mail addressed to bob and böb as well, even if those are different letters in the local language. It would be a good idea not to assign bób or bòb and maybe also to allow them as aliases for bøb. The user's address is still bøb which sorts however it sorts.
I suspect anyone who’s fluent in any Nordic language would strongly disagree with this guidance, which is what Asmus is hinting at. Paul
Regards, John Levine, john.levine@standcore.com Standcore LLC _______________________________________________ UA-EAI mailing list UA-EAI@icann.org https://mm.icann.org/mailman/listinfo/ua-eai
On Monday 29 April 2019 04:15:36 CEST, Paul Borokhov via UA-EAI wrote:
I suspect anyone who’s fluent in any Nordic language would strongly disagree with this guidance, which is what Asmus is hinting at.
I am fluent, and I don't strongly disagree with the guidance. It's not marvellously good, but it's not terrible either. Committee stuff. For example ø conventionally maps to oe, not o, but that's hardly a mistake worth getting excited about. Minor enough that the (very capable) .no TLD authority requires domain registrants to do the same, while acknowledging that it is unfortunate. Now, accepting mail fuzzily may be a bad idea, or a good idea, but if you disregard that general question, then the application to the Nordic letters isn't awful from a native-speaker perspective. The advice can be applied for other purposes where it's neutral or even a very minor feature. Recipient completion in a UI. Arnt
In article <f18ac70a-da1a-4b10-8a20-d0699617cf65@gulbrandsen.priv.no> you write:
On Monday 29 April 2019 04:15:36 CEST, Paul Borokhov via UA-EAI wrote:
I suspect anyone who’s fluent in any Nordic language would strongly disagree with this guidance, which is what Asmus is hinting at.
I am fluent, and I don't strongly disagree with the guidance.
It's not marvellously good, but it's not terrible either. Committee stuff.
The right rules depend on who the users of a mail system are and who they correspond with. Mail addresses are very different from IDNs in that only the system that handles the mail does the address matching so every system has different matching rules. (This is how it is now, EAI doesn't change that.) If you had a mail system where you expected all of the users and their correspondents to be writing in Nordic languages, it could make sense to match vowels strictly since everyone would know the difference between å and ä or ø and ö. On the other hand, if the users correspond with people elsewhere in Europe or the Americas who know the Latin script but not the Nordic languages, it would make sense to do more fuzzing for, e.g., people who copied the address off a business card but got the accents wrong. I think we can express general principles like don't have two addresses that differ only in confusables that go to different mailboxes, while leaving the degree of fuzz up to the individual system managers. It doesn't seem to confuse people too much that Gmail ignores dots in its ASCII addresses while nobody else does. That's the degree of variation we're already used to. R's, John
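The general principle above (no two addresses that the system's own matching rule treats as the same may go to different mailboxes, with the degree of fuzz left to each system) might look like this in outline. The class `MailboxTable` and both key functions are hypothetical names; the dot-ignoring rule is included only because it is the example mentioned above.

```python
import unicodedata
from typing import Callable, Dict, Optional


def strict_key(local_part: str) -> str:
    """Match on the NFC-normalized form only."""
    return unicodedata.normalize("NFC", local_part)


def dotless_key(local_part: str) -> str:
    """Additionally ignore dots, as one large provider is said to do."""
    return strict_key(local_part).replace(".", "")


class MailboxTable:
    """Mailbox registry that refuses names colliding under its key function."""

    def __init__(self, key_fn: Callable[[str], str]) -> None:
        self._key_fn = key_fn
        self._by_key: Dict[str, str] = {}

    def assign(self, local_part: str) -> None:
        key = self._key_fn(local_part)
        existing = self._by_key.get(key)
        if existing is not None and existing != local_part:
            raise ValueError(
                f"{local_part!r} collides with existing {existing!r}")
        self._by_key[key] = local_part

    def lookup(self, local_part: str) -> Optional[str]:
        return self._by_key.get(self._key_fn(local_part))


table = MailboxTable(dotless_key)
table.assign("john.smith")
print(table.lookup("johnsmith"))  # john.smith
```

Using the same key function for assignment and for delivery is the design choice that matters here: whatever fuzz the administrator picks, a sender's near-miss can only ever reach one mailbox.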
On Monday 29 April 2019 16:10:30 CEST, John Levine wrote:
The right rules depend on who the users of a mail system are [...]
Yes, and your guidance is acceptable IMO. But you, as author of such a document, should know and use the right terminology. The Nordic languages don't have accents, they have additional letters (ignoring weird cases like the author who spells his name with an accent I've never seen on another word). If an example or sentence mentions Swedish, then ö is a letter, if German, then ö is an accented letter and the letter is o. Getting these things right in prose is a challenge, but if you get them wrong, you'll make people assume that your substance is as unreliable as your phrasing. Arnt
On 4/29/2019 9:02 AM, Arnt Gulbrandsen wrote:
On Monday 29 April 2019 16:10:30 CEST, John Levine wrote:
The right rules depend on who the users of a mail system are [...]
Yes, and your guidance is acceptable IMO.
But you, as author of such a document, should know and use the right terminology. The Nordic languages don't have accents, they have additional letters (ignoring weird cases like the author who spells his name with an accent I've never seen on another word). If an example or sentence mentions Swedish, then ö is a letter, if German, then ö is an accented letter and the letter is o.
While this is true for the three letters that sort after "z", it is actually incorrect to say that there is no orthographic use of accents in the Scandinavian languages. You will find a certain amount of usage for "é", without that being considered a distinct letter. This is getting in the weeds, but it is indicative of how hard it is to make generalized statements.
Getting these things right in prose is a challenge, but if you get them wrong, you'll make people assume that your substance is as unreliable as your phrasing.
A fine line between getting the general idea across to people without knowledge of the details while retaining your credibility with those that do. A./
Arnt
participants (5)
- Arnt Gulbrandsen
- Asmus Freytag
- Asmus Freytag (c)
- John Levine
- Paul Borokhov