Regular Expression


Don Hollander

Sept. 12, 2017

6:44 p.m.

Please note that this is a Geeky post - so carry on if that’s not you.

Email validation is an area where many websites fall short, as we found in our study on Website UA Readiness (nearing publication).

The technologies behind these websites generally use a Regular Expression as their first line of defence against rubbish data. The issue is that most of these RegExs are overly restrictive.

As an appendix to the Website review, we looked at some of the technologies behind the websites to see if there were common denominators for good and bad experiences.

One RegEx has stood out as being simple and correct. I’d like the UASG to consider recommending this in our documentation. Toward that end, this thread is for discussion.

/^.+@(?:[^.]+\.)+(?:[^.]{2,})$

Regular expression check in Javascript. This accepts any Unicode characters, only insisting that the domain must have more than one label and the TLD is 2 characters or longer.

Your thoughts?

Don Hollander Universal Acceptance Steering Group Skype: don_hollander
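To make the behaviour of the proposed expression concrete, here is a small sketch that runs it against a few sample strings. The closing "/" delimiter is added so the literal parses, and the sample addresses are made up for illustration; they do not come from the study.

```javascript
// The expression from the post, with the closing "/" delimiter added.
const proposed = /^.+@(?:[^.]+\.)+(?:[^.]{2,})$/;

// Illustrative sample inputs (assumptions, not data from the UA study).
const samples = [
  "user@example.com",  // plain ASCII address
  "用户@例子.广告",      // Unicode local part and domain labels
  "user@example.c",    // one-character last label
  "user@localhost",    // only one label after the "@"
  "user-example.com",  // no "@" at all
];

for (const s of samples) {
  console.log(s, proposed.test(s));
}
```

The first two samples match; the last three do not, because the expression insists on an "@", on at least two labels in the domain, and on a last label of two or more characters.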


David Conrad

September 2017

6:52 p.m.

Don, Perhaps also pointing people at https://davidcel.is/posts/stop-validating-email-addresses-with-regex/? Regards, -drc


Rubens Kuhl

8:31 p.m.

...

On Sep 12, 2017, at 3:44 PM, Don Hollander <don.hollander@icann.org> wrote:

Please note that this is a Geeky post - so carry on if that’s not you.

Email validation is an area where many websites fall short as we found in our study on Website UA Readiness (nearing publication)

The technologies behind these websites generally use a Regular Expression as their first line of defence against rubbish data. The issue is that most of these RegExs are overly restrictive.

As an appendix to the Website review, we looked at some of the technologies behind the websites to see if there were common denominators for good and bad experiences.

One RegEx has stood out as being simple and correct. I’d like the UASG to consider recommending this in our documentation. Toward that end, this thread is for discussion.

/^.+@(?:[^.]+\.)+(?:[^.]{2,})$

Regular expression check in Javascript. This accepts any Unicode characters, only insisting that the domain must have more than one label and the TLD is 2 characters or longer.

Your thoughts?

Single-character IDN TLDs for some scripts are being considered for subsequent procedures, so I would allow 1 or more characters and avoid the same UA challenges that TLDs from previous rounds are suffering.

Rubens

Don Hollander

8:43 p.m.

Thanks Rubens. Which raises the question as to when the validation takes place: before or after a punycode transformation?

And David, thanks for the article. The UASG has long advocated turning validation off - but very few active practitioners seem willing to think outside that box.

D


Rubens Kuhl

8:59 p.m.

...

On Sep 12, 2017, at 5:43 PM, Don Hollander <don.hollander@icann.org> wrote:

Thanks Rubens. Which raises the question as to when the validation takes place. Before or after a punycode transformation.

Since you mentioned Unicode, I assumed before the punycode transformation. If after, then at least 2 ASCII characters are indeed to be expected, since even a one-character Unicode TLD like .點 would map into .xn--c1y.

Rubens
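The difference between checking before and after conversion can be sketched as follows. This assumes an environment with the WHATWG `URL` class (browsers, Node.js) and uses its host parsing as a stand-in for a dedicated IDNA library; that parser's conversion may differ in edge cases from a strict IDNA2008 implementation. The helper names and the sample address are hypothetical.

```javascript
// The expression proposed at the start of the thread.
const proposed = /^.+@(?:[^.]+\.)+(?:[^.]{2,})$/;

// Hypothetical helper: convert a domain to its ASCII (A-label) form by
// letting the URL parser do the work.
function domainToALabels(domain) {
  return new URL(`http://${domain}/`).hostname;
}

// Hypothetical helper: convert the domain first, then apply the expression.
function checkAfterConversion(address) {
  const at = address.lastIndexOf("@");
  if (at < 1) return false; // no "@", or nothing before it
  const local = address.slice(0, at);
  const domain = domainToALabels(address.slice(at + 1));
  return proposed.test(`${local}@${domain}`);
}

const address = "user@example.點"; // one-character Unicode TLD
console.log(proposed.test(address));        // checked before conversion
console.log(checkAfterConversion(address)); // checked after conversion
console.log(domainToALabels("example.點")); // last label now starts with "xn--"
```

Before conversion the one-character TLD fails the `{2,}` requirement; after conversion the last label starts with "xn--", so it is always longer than two ASCII characters.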

Chaals McCathie Nevile

10:01 p.m.

On Tue, 12 Sep 2017 22:43:09 +0200, Don Hollander <don.hollander@icann.org> wrote:

...

Thanks Rubens. Which raises the question as to when the validation takes place. Before or after a punycode transformation.

I would generally like validation to take place after punycode conversion, first because there are strings that match the regex but not punycode constraints. Likewise, I agree with Rubens that assuming TLDs are not domains and email must go to a subdomain seems less than prescient with hindsight.

...

And David, thanks for the article. The UASG has long advocated turning validation off - but very few active practitioners seem willing to think outside that box.

I'm not entirely convinced by that approach either. I think there is value in validation - first, to determine whether an email address is real - if it isn't, you are probably better off getting a warning than trying to send it.

Second, I find it very helpful, including as a protection against phishing emails, to be told if an email is not recognised as a contact to whom I have *sent* an email, which is a stricter validation check. Applications that do that for me - especially for scripts I don't read fluently like Chinese - are common, and I would be upset if they were to stop validating.

On the other hand, incorrect validation, e.g. of an address in a form, with no punycode conversion run first and no reason not to accept an internationalised email, is clearly a bad idea - largely since it fails to actually validate whether something is a valid email address.

A given application or toolchain may be incapable of handling some valid email addresses, but I think a campaign to convince developers to produce a statement like "this application is second-rate and obsolete" would face significant challenges. Whether it is worth pushing for such applications to state that they do not yet support appropriate standards may be worth considering...

cheers

Chaals


-- Chaals is Charles McCathie Nevile find more at http://yandex.com

Andrew Sullivan

5:50 a.m.

On Wed, Sep 13, 2017 at 12:01:07AM +0200, Chaals McCathie Nevile wrote:

...

I would generally like validation to take place after punycode conversion. First because there are strings that match the regex but not punycode constraints.

In an IDNA2008 world, there are no "Punycode constraints". A U-label corresponds to exactly one A-label, and conversely. You can do it on either side of the conversion divide.

Of course, if you're using the broken-for-years IDNA2003, you have a problem.

A

-- Andrew Sullivan ajs@anvilwalrusden.com

Vittorio Bertola

8:33 a.m.

...

On 13 September 2017 at 0.01, Chaals McCathie Nevile <chaals@yandex.ru> wrote:

On Tue, 12 Sep 2017 22:43:09 +0200, Don Hollander <don.hollander@icann.org> wrote:

I think there is value in validation - first, to determine whether an email address is real - if it isn't, you are probably better off getting a warning than trying to send it.

But this is nothing you can do just with a regexp. The regexp could allow you to intercept blatant mistakes - e.g., there are national keyboards where typing "@" requires pressing Alt or some uncommon combination of keys, so it's easy to mistype it and you can easily warn the user that their entered string does not have a "@" - but anything beyond that is unnecessary, because, if a user mistypes an email address in any other way, it's very likely that he will still end up entering a valid email address that no regexp will be able to tell as non-existing, or that could even exist but belong to someone else.

On the other hand, if you try to implement a complex regexp, and especially if you try to figure it out on your own, it's almost certain that you will mark as invalid several valid email addresses that are corner cases but should be accepted, as well as many future developments of the standards which are invalid now but will be valid in the future.

...

Second, I find it very helpful, including as a protection against phishing emails, to be told if an email is not recognised as a contact to whom I have *sent* an email, which is a stricter validation check. Applications that do that for me - especially for scripts I don't read fluently like Chinese - are common, and I would be upset if they were to stop validating.

But this, again, is a validation that cannot be done via a regexp (can you write a regexp representing your entire contact book?) and that, on the other hand, poses an additional stricter condition than just "the email address is valid". We are just discussing how to check that the email address is syntactically valid, any other checks could still be implemented however appropriate.

The point here is that you should not try to determine whether an email address is valid by checking its syntax, other than checking that it has a "@" and possibly a "." on the right of it (but even this latter condition is already too strict, as the ideographic full stop "。" should be accepted in place of the ASCII dot, if you check strings in IDN form). Anything beyond that is going to exclude some valid addresses while not increasing in any significant way your chances of intercepting user input error at this stage - and you will still intercept any user error a few seconds later, when you send the validation/confirmation message.

Regards,

-- Vittorio Bertola | Research & Innovation Engineer vittorio.bertola@open-xchange.com Open-Xchange Srl - Office @ Via Treviso 12, 10144 Torino, Italy
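A minimal sketch of the light check described above: reject only when there is no "@", and treat a missing dot on the right-hand side as a hint rather than an error. The function name, the return shape, and the hint wording are assumptions made for illustration. Besides the ASCII dot and the ideographic full stop mentioned in the message, the separator set also includes the fullwidth and halfwidth forms (U+FF0E, U+FF61), which IDNA treats as label separators too.

```javascript
// Label separators: ASCII dot, ideographic full stop, fullwidth full stop,
// halfwidth ideographic full stop.
const LABEL_SEPARATORS = /[.\u3002\uFF0E\uFF61]/;

// Hypothetical helper: only a missing "@" is an error. A missing separator
// after the "@" produces an advisory hint and does not block the input.
function lightCheck(input) {
  const at = input.lastIndexOf("@");
  if (at === -1) {
    return { accept: false, hint: 'No "@" found; was it mistyped?' };
  }
  const right = input.slice(at + 1);
  const hint = LABEL_SEPARATORS.test(right)
    ? null
    : 'No dot after the "@"; please double-check.';
  return { accept: true, hint };
}

console.log(lightCheck("user.example.com"));
console.log(lightCheck("用户@例子。中国"));
console.log(lightCheck("user@localhost"));
```

Anything accepted here would still be confirmed by sending the validation/confirmation message, which is where real errors surface.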

Mark Svancarek

4:20 p.m.

I believe that validation should be as light as possible. "Contains '@'" is about the extent of it unless you are willing to look at bidi and IFS. Just capture the string and send a test message.

Mark Svancarek

4:26 p.m.

Depending on your mail client, you may have experienced a linkification error in my response. Weird.

Don Hollander

5:06 p.m.

Mark. What would such a RegEx look like? D


Andrew Sullivan

5:48 a.m.

On Tue, Sep 12, 2017 at 05:31:55PM -0300, Rubens Kuhl wrote:

...

Single IDN TLDs for some scripts is something being considered for subsequent procedures, so I would think of 1 or more and prevent the same UA challenges previous rounds TLDs are suffering.

Wait, what? Like "localpart@IDN-TLD"? That won't work. It would require an update of the mail standards. (I explained this to people many years ago when they were preparing marketing materials about 'someone@brand', and was of course ignored by people who wanted to cash in on snake oil sales, but it still doesn't work.) A -- Andrew Sullivan ajs@anvilwalrusden.com

Rubens Kuhl

5:39 p.m.

...

On 13 Sep 2017, at 02:48:000, Andrew Sullivan <ajs@anvilwalrusden.com> wrote:

On Tue, Sep 12, 2017 at 05:31:55PM -0300, Rubens Kuhl wrote:

...
Single IDN TLDs for some scripts is something being considered for subsequent procedures, so I would think of 1 or more and prevent the same UA challenges previous rounds TLDs are suffering.

Wait, what? Like "localpart@IDN-TLD"?

Nope, like localpart@domain.TLD, but where TLD is a single-character IDN in a script where one character expresses a whole concept.

...

That won't work. It would require an update of the mail standards. (I explained this to people many years ago when they were preparing marketing materials about 'someone@brand', and was of course ignored by people who wanted to cash in on snake oil sales, but it still doesn't work.)

I do agree that a dotless address violates https://tools.ietf.org/html/rfc5321, but on a quick look through 5321 I couldn't find a requirement on TLD length.

Rubens

Werner Staub (axone)

8:46 a.m.

Hi Don, On 2017-09-12 20:44 , Don Hollander wrote:

...

only insisting that the domain must have more than one label and the TLD is 2 characters or longer.

This happens to contain an example of how we can get blinded by our habits.

Requiring *2 characters* for a TLD has a small effect for the Latin script. But for syllabic scripts - such as Chinese, Japanese and Korean - it requires that a TLD be at least two *syllables* long. That is a huge effect. And there are scripts where the number of Unicode code points used to represent a character has little in common with the number of key strokes the user has to type, or even with what the reader would perceive to be one or multiple characters.

With all the effort we undertake for universal acceptance, it would be tragic if that very effort were to hard-wire unintentional forms of discrimination against particular scripts or cultures.

Compare this with an example from the past: ICANN's initial versions of the Applicant Guidebook required a length of 3 characters for any gTLD. That included Chinese, Japanese and Korean - in which most frequent dictionary words are shorter. Reducing the length requirement to 2 (rather than 1) for IDN gTLDs was a temporary compromise. At the time, ICANN had almost no staff from countries where non-Latin scripts are used.

Regards,

Werner
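What "2 characters" means to a JavaScript regex can be sketched with three sample labels. Without the `u` flag the engine counts UTF-16 code units; with it, code points; neither matches what a reader perceives as one character. The sample strings are chosen for illustration only and are not real TLDs.

```javascript
// The last-label part of the proposed expression, without and with the "u" flag.
const twoOrMore = /^[^.]{2,}$/;   // counts UTF-16 code units
const twoOrMoreU = /^[^.]{2,}$/u; // counts Unicode code points

// Illustrative labels (assumptions, not delegated TLDs).
const samples = {
  han: "點",                              // 1 code point, 1 code unit
  devanagari: "\u0915\u094D\u0937\u093F", // क्षि: 4 code points, read as one syllable
  astral: "\u{20000}",                    // 1 code point, 2 code units
};

for (const [name, s] of Object.entries(samples)) {
  console.log(name, {
    codeUnits: s.length,
    codePoints: [...s].length,
    passesWithoutU: twoOrMore.test(s),
    passesWithU: twoOrMoreU.test(s),
  });
}
```

The single Han character fails both variants, the one-syllable Devanagari cluster passes both, and the supplementary-plane character passes only without the `u` flag, so the length rule treats scripts unevenly.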

Andre Schappo

10:52 a.m.

I have thought about this many times over the years and here are some of my thoughts -

A lot depends on when the validation is done. If it is done at registration time then I think it is possible to use regex to validate. Take

http://datamail.in
http://电邮.在线
http://डाटामेल.भारत

The user is explicitly selecting a language/script and thus constraining the Unicode characters which can be used for the local part. For a particular language/script the IDN is fixed for that language/script, so that leaves just the local part/mailbox name to validate.

One could use a simple regex with a modern Unicode-aware regex engine. For a Chinese local part, something like:

\p{Han}+

or, for a Hindi/Devanagari local part, something like:

\p{Devanagari}+

André Schappo
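In JavaScript the script-restricted patterns above could look like the sketch below. JavaScript does not accept the bare `\p{Han}` shorthand: script property escapes need the `u` flag and the `Script=` prefix. The lookup table and function name are hypothetical, and the sample local parts are made up.

```javascript
// Hypothetical per-script local-part patterns.
const localPartByScript = new Map([
  ["Han", /^\p{Script=Han}+$/u],
  ["Devanagari", /^\p{Script=Devanagari}+$/u],
]);

// Returns true when every character of the local part belongs to the
// script the user selected at registration time.
function isLocalPartInScript(localPart, script) {
  const pattern = localPartByScript.get(script);
  return pattern !== undefined && pattern.test(localPart);
}

console.log(isLocalPartInScript("用户", "Han"));
console.log(isLocalPartInScript("डाटामेल", "Devanagari"));
console.log(isLocalPartInScript("user", "Han"));
```

Note that digits and most punctuation belong to the Common script, so patterns this strict reject them; a real service would probably need to widen the character classes.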

Andre Schappo

3:16 p.m.

... so, regex advice could be given as something like:

"When there are suitable constraints, as determined by the service provider or the user or a combination of both, email validation can be readily achieved using regex. Suitable constraints include restriction to a single language/script."

André Schappo

PS. Just thinking about a web form where a user enters name, postal address, telephone number, email address ...etc... Email address validation could be performed in conjunction with the user. The form could interact with the user to determine the constraints, if any, for email validation. This could take the form of Q&A or presentation of radio buttons for choices ...etc... So, involve the user in the process.

Jim Hague

3:44 p.m.

On 12/09/2017 19:44, Don Hollander wrote:

...

One RegEx has stood out as being simple and correct. I’d like the UASG to consider recommending this in our documentation. Toward that end, this thread is for discussion.

/^.+@(?:[^.]+\.)+(?:[^.]{2,})$

Regular expression check in Javascript. This accepts any Unicode characters, only insisting that the domain must have more than one label and the TLD is 2 characters or longer.

Note that this is in the context of an in-browser check. I only examined a small random subset of the sites surveyed in the main evaluation, and obviously without access to server code could only examine client-side operations. In all the sites I examined, the only check performed was against one (or in one case two) regular expression(s). No decomposition of the email address was attempted, and certainly no translation of the domain to Punycode.

It was in that context that I highlighted the above regex, on the basis that it's probably the only sensible option to suggest to organisations as a low-impact UA improvement (I won't say fix) at the moment. If a future evaluation exercise verifies that an existing Javascript module does the right thing, that would be a better alternative, but that would involve more substantial modifications to site code.

I agree that modifying it to allow 1 character TLDs would be sensible.

I also agree with the page referenced at the start of the thread (which I read before working on the report) that just checking for '@' is about all one should attempt, certainly client-side.

Turning again to the above regex, of course, being a proposed regex for validating email addresses, it's got an obvious deficiency. It needs to add support for other label separators (e.g. open dot).

Mark Svancarek raised the excellent point of bidi in the domain. Personally I'm not confident I understand the bidi rules. But if the regex requires at least one label separator character in the domain and non-empty labels, will that work, given that if the regex allows 1 character TLDs then a valid TLD is simply a non-empty label?

-- Jim Hague - jim@sinodun.com Never trust a computer you can't lift.
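The two modifications discussed above (one-character last labels, and label separators other than the ASCII dot) could be combined roughly as follows. This is a sketch rather than a vetted recommendation: the separator set is the ASCII dot plus the ideographic, fullwidth, and halfwidth full stops, and it does nothing about the bidi question.

```javascript
// Label separators: U+002E, U+3002, U+FF0E, U+FF61.
const SEP = ".\u3002\uFF0E\uFF61";

// Same shape as the proposed expression, but any separator may divide the
// labels and the last label only has to be non-empty.
const relaxed = new RegExp(`^.+@(?:[^${SEP}]+[${SEP}])+[^${SEP}]+$`);

// Illustrative inputs (assumptions).
const samples = [
  "user@example.點",   // one-character last label
  "用户@例子。中国",     // ideographic full stop as separator
  "user@localhost",    // no separator at all
  "user@example.",     // empty last label
];

for (const s of samples) {
  console.log(s, relaxed.test(s));
}
```

The first two samples match; the last two do not, because the expression still requires at least one separator and non-empty labels on both sides of it.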

Don Hollander

4:58 p.m.

Thanks Jim.

The BiDi issue, with raw data input, is which side has the domain side.

Usually you’ll encounter mailbox@domainname.tld

But in Arabic or Hebrew you’ll encounter tld.domainname@mailbox

Don


Rubens Kuhl

5:27 p.m.

The BiDi issue suggests to me that even enforcing the non-dotless rule is too much for a simple regex, as shabaka.example@don is a valid Arabic EAI, while the same ASCII combination is not valid even if a .don TLD gets delegated. [non-empty]@[non-empty] looks better to me.

Rubens
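For comparison, the [non-empty]@[non-empty] rule can be written as a very small pattern. This is only a sketch of that suggestion; the name `minimal` is invented here, and the `s` (dotAll) flag is an assumption about the target environment (ES2018 or later).

```javascript
// At least one character, an "@", then at least one character.
// The "s" flag lets "." match any character, including line terminators.
const minimal = /^.+@.+$/s;

console.log(minimal.test("shabaka.example@don")); // true: no dot required after "@"
console.log(minimal.test("user@localhost"));      // true
console.log(minimal.test("@example.com"));        // false: empty left side
console.log(minimal.test("user@"));               // false: empty right side
console.log(minimal.test("no-at-sign"));          // false
```

Because nothing is assumed about which side carries the dots, the check gives the same answer whichever way round the parts were typed.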

...

On 14 Sep 2017, at 13:58, Don Hollander <don.hollander@icann.org> wrote:

Thanks Jim.

The BiDi issue, with raw data input, is which side has the domain side.

usually you’ll encounter mailbox@domainname.tld

But in Arabic or Hebrew you’ll encounter tld.domainname@mailbox

Don


Jordyn Buchanan

5:38 p.m.

Also worth remembering that "works according to the universe at the moment the RegExp was written" is how we got into a lot of today's UA mess in the first place. Just because dotless domains or some other rule is in place today, I'd want to avoid encoding them into a regexp that we tell people to use since the rules may change again and I don't want to have another group following along in our wake 10 years from now trying to undo the code that we told everyone to write.

Jordyn


Paul Stahura

5:46 p.m.

I totally agree with Jordyn and Mark: "Just capture the string and send a test message."


M3 Sweatt

6:07 p.m.

Adding this back to the main thread (sorry)

From: Jordyn Buchanan Sent: Thursday, September 14, 2017 10:51 AM To: M3 Sweatt Cc: Rubens Kuhl Subject: Re: [UA-discuss] Regular Expression

I have no idea, and that's the point. But probably people writing the current regexps weren't anticipating EAI or even IDN TLDs generally, and when there were only a handful of gTLDs that hadn't changed since RFC 1591 was written it probably seemed reasonable to hard code the white list.

So I'd rather avoid baking in assumptions about what types of addresses are likely to work in the future where the implementation is separate from the infrastructure that actually handles the delivery of mail. That way we don't increase the number of lines of code we have to touch next time we want to change the set of valid e-mail addresses.

Jordyn

On Thu, Sep 14, 2017 at 1:43 PM, M3 Sweatt wrote:

@Jordyn Buchanan, thanks for the perspective. What do you mean that the “rules may change again”?


Paul Stahura

6:43 p.m.

I agree with Jordyn because it happens time and time again. Remember when SLDs and TLDs couldn’t start with a digit? Then 3m.com came along and the RFC was changed. Remember the acceptance problem .info had when it went live because it was 4 characters? Now we are discussing the 2 character issue and guess what… IDNs. Maybe we’ll have 1 character TLDs soon. What about all-numeric TLDs? Can’t have that currently (I forget which RFC) because of IP addresses, right? What if the SLD is not solely digits? Then maybe. Who knows. Tim Berners-Lee regrets “http://” for god's sake. Maybe that will change too someday.

Jon Postel said be liberal in what you accept from the universe and conservative in what you send out to the universe (I forget the exact quote). I think that applies here. Do not restrict what’s accepted. The restriction will change and then we’re screwed… again.


Asmus Freytag

5:41 p.m.

On 9/14/2017 10:27 AM, Rubens Kuhl wrote:

...

The BiDi issue suggests to me that even enforcing the non-dotless rule is too much for a simple regex, as shabaka.example@don is a valid Arabic EAI , while the same ASCII combination is not valid even if a .don TLD gets delegated. [non-empty]@[non-empty] looks better to me.

Isn't the bidi limited to the display side, that is, in back storage there should not be alternative ordering of host and local parts? A./


Andre Schappo

10:26 a.m.

Yes indeed. The directionality of the text affects the display ordering and not the memory ordering. So memory order will always be (or should be according to current best practice) mailbox@domainname.tld and so one processes/validates as usual without regard for the display order. That is the current best practice. When presenting to the user, one then uses display order.

Which reminds me of a blog article I wrote some time ago because with appropriate HTML/CSS one can determine how it is displayed 😀 http://schappo.blogspot.co.uk/2016/10/computer-science-internationalization....

Oh ...and... http://schappo.blogspot.co.uk/2016/03/arabic-email-addresses.html

André Schappo
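The point about memory order can be illustrated with a short sketch. The address below is hypothetical and is written with \u escapes so that the source itself shows the stored (logical) order rather than whatever a bidi-aware editor would display. In memory the mailbox comes first, then "@", then the domain, even though a right-to-left rendering shows the parts in the opposite visual order.

```javascript
// Hypothetical Arabic-script address in memory (logical) order:
// four letters of mailbox, "@", then two Arabic-script labels.
const address =
  "\u0628\u0631\u064A\u062F@\u0645\u062B\u0627\u0644.\u0645\u0648\u0642\u0639";

const at = address.lastIndexOf("@");
const mailbox = address.slice(0, at);  // the four code units before "@"
const domain = address.slice(at + 1);  // everything after "@"
const labels = domain.split(".");      // the last entry is the TLD label

console.log(at);            // 4
console.log(labels.length); // 2
```

Splitting on the stored string therefore needs no knowledge of the bidi algorithm; display order only matters when the address is rendered.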

Andre Schappo

1:01 p.m.

Before we leave the topic of Regular Expressions - If one does use regex then I consider it better working practice, when possible, to work at the Unicode level rather than the encoding level. One example previously given in this discussion thread was

"^([a-zA-Z0-9.!#$%&'*+/=?^_`{|}~\u00A0-\uD7FF\uE000-\uFFFF-]|([\uD800-\uDBFF][\uDC00\uDFFF]))+$"

This regex is working at the encoding level, specifically UTF-16. It encompasses nearly every Unicode character, including unassigned Unicode codepoints and Private Use Area (PUA) characters. I would not allow unassigned or PUA characters in an identifier.

I consider it better to work at the Unicode level. I previously gave a simple example of working at the Unicode level:

"\p{Devanagari}+"

which will match with one or more Devanagari Script Unicode characters. In this case I do not need to concern myself with codepoints, encodings, unassigned codepoints, additions that may be made in newer versions of Unicode...etc... The regex engine and the Unicode consortium do that for me.

It can and frequently does get more complicated than the simple example I have given. See http://www.unicode.org/reports/tr31/

André Schappo
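As a sketch of what working at the Unicode level looks like in JavaScript specifically: property escapes need the `u` flag, and a script has to be named as `Script=Devanagari` (or `sc=Devanagari`) rather than the bare `\p{Devanagari}` form used in some other regex engines. The variable names and sample strings below are made up for illustration.

```javascript
// One or more characters whose Unicode Script property is Devanagari.
const devanagari = /^\p{Script=Devanagari}+$/u;

// Any character that is unassigned (Cn) or private use (Co),
// i.e. the kinds of character one might refuse in an identifier.
const unassignedOrPrivate = /[\p{Cn}\p{Co}]/u;

const namaste = "\u0928\u092E\u0938\u094D\u0924\u0947"; // Devanagari letters and signs

console.log(devanagari.test(namaste));            // true
console.log(devanagari.test("namaste"));          // false: Latin script
console.log(unassignedOrPrivate.test("\uE000"));  // true: U+E000 is private use
console.log(unassignedOrPrivate.test("abc"));     // false
```

Which codepoints count as unassigned depends on the Unicode version built into the regex engine, which is exactly the maintenance burden being delegated to the engine.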


25 comments

14 participants

participants (14)

Andre Schappo
Andrew Sullivan
Asmus Freytag
Chaals McCathie Nevile
David Conrad
Don Hollander
Jim Hague
Jordyn Buchanan
M3 Sweatt
Mark Svancarek
Paul Stahura
Rubens Kuhl
Vittorio Bertola
Werner Staub (axone)