Progress on HTML and email...
Hi, just to note that this week we reached a provisional agreement to change the HTML email input type, and make it accept a unicode value before the "@" instead of the current restriction to ASCII. Once we get this change made and published - hopefully in a few weeks - we would appreciate help encouraging browsers to implement it fast. cheers chaals -- Chaals is Charles McCathie Nevile find more at http://yandex.com
Dear chaals, Very good news! That is a great job done. Thanks for your great contribution! Jiankang Yao
-----Original Message----- From: "Chaals McCathie Nevile" <chaals@yandex.ru> Sent: 2017-11-12 11:13:54 (Sunday) To: "ua-discuss@icann.org" <ua-discuss@icann.org> Cc: Subject: [UA-discuss] Progress on HTML and email...
Hi,
just to note that this week we reached a provisional agreement to change the HTML email input type, and make it accept a unicode value before the "@" instead of the current restriction to ASCII. Once we get this change made and published - hopefully in a few weeks - we would appreciate help encouraging browsers to implement it fast.
cheers
chaals
-- Chaals is Charles McCathie Nevile find more at http://yandex.com
That is indeed good news. Are there other constraints to the email input type? Does it allow Unicode on the other side of the "@"? Does it include any validation? And field length constraints? D
Don Hollander Universal Acceptance Steering Group Skype: don_hollander
This is a great step. Is there a document that sets out the context for this spec and provides the full details? It would be useful for outreach.
satish
On Nov 12, 2017 9:24 AM, "Don Hollander" <don.hollander@icann.org> wrote:
That is indeed good news. Are there other constraints to the email input type? Does it allow Unicode on the other side of the "@"? Does it include any validation? And field length constraints? D
+1 on the kudos as well as a hint on where those of us new to the standards relating to the email input type can find details.
Richard Merdinger VP, Domains e: rmerdinger@godaddy.com s: richard.merdinger
From: UA-discuss [mailto:ua-discuss-bounces@icann.org] On Behalf Of Satish Babu Sent: Saturday, November 11, 2017 10:04 PM To: Don Hollander <don.hollander@icann.org> Cc: ua-discuss <UA-discuss@icann.org> Subject: Re: [UA-discuss] Progress on HTML and email...
This is a great step. Is there a document that sets out the context for this spec and provides the full details? It would be useful for outreach. satish
On Mon, 13 Nov 2017 05:04:48 +0100, Richard Merdinger <rmerdinger@godaddy.com> wrote:
+1 on the kudos as well as a hint on where those of us new to the standards relating to the email input type can find details.
Someone else already posted the link, but what is under discussion is the definition of the "email" input type in HTML, with the current draft being: https://w3c.github.io/html/sec-forms.html#email-state-typeemail
We *agreed* to update this so that it accepts Unicode on the LHS, but we have not yet done the editorial work to make it happen.
Andrew's comments have been closest to describing what I believe our spec intends - so at the very least there is some more editorial work to do on expressing this more clearly.
Contributions are welcome (although of course there is the requirement for IPR commitment according to W3C's patent policy). Best way to contribute is through github, if you have an account: https://github.com/w3c/html/issues
It is also possible to email comments to public-html@w3.org - which is publicly archived.
cheers
Chaals
On Wed, 15 Nov 2017 12:06:40 +0100, Chaals McCathie Nevile <chaals@yandex.ru> wrote:
Someone else already posted the link, but what is under discussion is the definition of the "email" input type in HTML, with the current draft being: https://w3c.github.io/html/sec-forms.html#email-state-typeemail
For what it is worth, this is listed as "under consideration" by the Microsoft Edge team - https://developer.microsoft.com/en-us/microsoft-edge/platform/status/interna... - where they note that Firefox supports it but other browsers don't. It seems people can add a vote in favour of (or perhaps against?) a feature. I doubt that such a random mechanism is as effective as the right customer approaching the right person in the organisation with an explanation of their requirement, but I believe it is taken into consideration.
Now, to see if I can move the work forward in the HTML spec...
cheers
Chaals
Great news, appreciate it. If there will be a test pilot for that, I am happy to see it.
This is a good initiative. I have a few queries, chaals.
1. After the "@", are you not checking / validating? What's the plan?
2. Is there any specific reason the team decided to consider only the part before the "@"?
Thanks. AD
Here’s the applicable portion of the spec: https://w3c.github.io/html/sec-forms.html#valid-e-mail-address
The assumption is that all local parts are ASCII letters-digits and that IDNs in the domain part should be expressed in punycode. This is of course doubly broken, because humans can’t really use punycode.
Fixing the local part is a good start and I encourage it. But if the domain name part continues to prohibit Unicode, it won’t actually help anyone. Is the plan to fix the local part in v5.2 now and fix the domain name part in v5.3?
From: UA-discuss [mailto:ua-discuss-bounces@icann.org] On Behalf Of Dr. Ajay Data Sent: Sunday, November 12, 2017 9:08 PM To: chaals@yandex.ru; ua-discuss@icann.org Subject: Re: [UA-discuss] Progress on HTML and email...
This is a good initiative. I have a few queries, chaals. 1. After the "@", are you not checking / validating? What's the plan? 2. Is there any specific reason the team decided to consider only the part before the "@"? Thanks. AD
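To make the restriction Mark describes concrete, here is a rough Python sketch of an ASCII-only address check. The pattern is modelled on the one given in the HTML specification's "valid e-mail address" section, but it should be read as an approximation rather than a quotation of the spec, and `is_valid_ascii_email` is a made-up name for illustration.

```python
import re

# Approximation of the ASCII-only syntax; the linked spec holds the normative pattern.
_LABEL = r"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
_ASCII_EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@" + _LABEL + r"(?:\." + _LABEL + r")*"
)


def is_valid_ascii_email(address: str) -> bool:
    """Return True if the whole address fits the ASCII-only pattern above."""
    return _ASCII_EMAIL.fullmatch(address) is not None


print(is_valid_ascii_email("example@xn--d1acpjx3f.xn--p1ai"))  # True
print(is_valid_ascii_email("example@яндекс.рф"))               # False
print(is_valid_ascii_email("пример@example.com"))              # False
```

Under a check like this, both a Unicode local part and a Unicode domain are rejected, which is the "doubly broken" situation described above.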
On Mon, Nov 13, 2017 at 05:47:20AM +0000, Mark Svancarek via UA-discuss wrote:
The assumption is that all local parts are ASCII letters-digits and that IDNs in the domain part should be expressed in punycode. This is of course doubly broken, because humans can’t really use punycode.
Fixing the local part is a good start and I encourage it. But if the domain name part continues to prohibit Unicode, it won’t actually help anyone.
The domain part does not actually prohibit Unicode in the strict sense. There's a note in the document that says:
"This syntax allows e-mail addresses with Internationalised Domain Names using punycode, such as example@xn--d1acpjx3f.xn--p1ai. A user agent should represent that in the user interface as example@яндекс.рф"
So, the user-agent is asked to do the IDNA transformation from A-label to U-label for display purposes. Since under IDNA2008 that ought to be a 1:1 and fully reversible operation, it shouldn't be a big deal. It's true that this input restriction won't produce an EAI address, but that is trivial to fix, also, if you do the IDNA transformation.
Best regards,
A
-- Andrew Sullivan ajs@anvilwalrusden.com
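The A-label to U-label transformation Andrew mentions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python's built-in `idna` codec implements the older IDNA2003 rules, not IDNA2008 (the third-party `idna` package is the usual choice for IDNA2008), and the function names are hypothetical.

```python
def domain_to_u_label(address: str) -> str:
    """Rewrite the domain part of an address from A-label to U-label form."""
    local, _, domain = address.rpartition("@")
    return local + "@" + domain.encode("ascii").decode("idna")


def domain_to_a_label(address: str) -> str:
    """Rewrite the domain part of an address from U-label to A-label form."""
    local, _, domain = address.rpartition("@")
    return local + "@" + domain.encode("idna").decode("ascii")


wire = "example@xn--d1acpjx3f.xn--p1ai"   # the example from the spec note
shown = domain_to_u_label(wire)           # form intended for display
print(shown)
print(domain_to_a_label(shown) == wire)   # the round trip is lossless here
```

The local part is left untouched in both directions; only the domain is converted.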
My interpretation is that the user is a human who must enter a string of text into a web form, where it is cast as type Email (which can subsequently be converted into ULABELs if the typed-in string includes ALABELs). It's the first part, where the human user is typing an ALABEL into a web form, that concerns me. Is this the wrong interpretation?
-----Original Message----- From: UA-discuss [mailto:ua-discuss-bounces@icann.org] On Behalf Of Andrew Sullivan Sent: Sunday, November 12, 2017 10:34 PM To: ua-discuss@icann.org Subject: Re: [UA-discuss] Progress on HTML and email...
Mark, I can’t speak for the spec itself, but there should be some normalization that takes place so that the user can input either the u-label or the a-label form. If the spec doesn’t allow for this flexibility, I think it should be augmented to do so.
Richard Merdinger VP, Domains rmerdinger@godaddy.com
To me, personally speaking, both a-label and u-label should be "universally accepted" in forms. In a utopian world, it should just be the u-label always, but that assumes everything works correctly everywhere with ux, forms and what not. It has been my experience that if someone is entering an a-label they are not the standard user, and they are doing so to override ux concerns and eai. They have some understanding of how a/u works, have something functional with the domain name, and are wanting to ensure that the domain they entered works correctly, bypassing the ux / other systems (eai) that may not be working as expected. If a person filling out a form takes the initiative on entering the appropriate registry's a-label, they should not be penalized for it.
On Nov 13, 2017 15:56, "Richard Merdinger" <rmerdinger@godaddy.com> wrote:
Mark, I can’t speak for the spec itself, but there should be some normalization that takes place so that the user can input either the u-label or the a-label form. If the spec doesn’t allow for this flexibility, I think it should be augmented to do so.
Richard Merdinger VP, Domains rmerdinger@godaddy.com
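The "accept either form" idea raised above could look roughly like the following Python sketch. It assumes the built-in `idna` codec (IDNA2003, so not the IDNA2008 behaviour discussed in the thread), and `normalise_domain` is a hypothetical helper, not something taken from the spec or from any browser.

```python
from typing import Tuple


def normalise_domain(domain: str) -> Tuple[str, str]:
    """Return (a_label_form, u_label_form) for a domain typed in either form."""
    # ASCII labels, including existing A-labels, pass through the encoder unchanged.
    a_form = domain.encode("idna").decode("ascii")
    u_form = a_form.encode("ascii").decode("idna")
    return a_form, u_form


a_form, u_form = normalise_domain("xn--d1acpjx3f.xn--p1ai")
print(a_form, u_form)
print(normalise_domain(u_form) == (a_form, u_form))  # same pair from either input
```

With a step like this, a form can store or transmit whichever representation it needs without caring which one the user typed.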
On Mon, Nov 13, 2017 at 04:23:29PM +0000, Mark Svancarek wrote:
My interpretation is that the user is a human who must enter a string of text into a web form, where it is cast as type Email (which can subsequently be converted into ULABELs if the typed-in string includes ALABELs).
No, I don't think that's it. This is the specification for HTML, not for the UI. The user agent can do transformations. So I _think_ it means that an email type of an input element, when it is _sent_ as input, has to be in this form; but that the input method could be different and the user agent could do a transformation on it so that Unicode user input (which could be in any form, recall) is transformed into a valid U-label/A-label pair before it becomes HTML form input. (One feels the need for another word for "stuff that comes from the user in the UI" vs "stuff that ends up in the form as 'input' formally so defined". There may be a term of art already in the specifications for this, but I'm not going to dig it out just now.)
For the purposes of wire transmission and storage the server-part is A-labels, but for the purposes of display they're U-labels. Presumably, for the purposes of input they're whatever the user might input. After all, Windows doesn't even use UTF-8 input natively, so it would literally be impossible for a Windows user to input correct UTF-8 at all.
We probably need someone who is working directly on browser code to say more about how this works in practice. Maybe Shawn Steele knows?
I suspect this is slightly more obscure in the specification than I at least would like because of some of the WHATWG/W3C politics around HTML5. (Some of the principals in WHATWG don't believe that IDNA2008 is a thing. I will leave divining the consequences of using an IDNA specification that does not have a perfect 1:1 A-label/U-label mapping as an exercise for the reader, but I note that IDNA2008 doesn't solve the need for mappings: upper case characters aren't allowed in IDNA2008 U-labels.)
Best regards,
A
-- Andrew Sullivan ajs@anvilwalrusden.com
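As a small illustration of the mapping step mentioned in that last note, a user agent might fold what the user typed before treating it as a U-label. The sketch below is a deliberate simplification (NFC normalisation plus lower-casing); a fuller mapping such as the one in UTS #46 covers many more cases, and `map_user_input` is a hypothetical name.

```python
import unicodedata


def map_user_input(domain: str) -> str:
    """Simplified mapping step: NFC-normalise and lower-case the typed domain."""
    return unicodedata.normalize("NFC", domain).lower()


typed = "ЯНДЕКС.РФ"              # upper-case input, not usable as a U-label as typed
mapped = map_user_input(typed)
print(mapped)                     # яндекс.рф
```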
Shawn is on the DL, but adding him explicitly for clarification.
-----Original Message----- From: UA-discuss [mailto:ua-discuss-bounces@icann.org] On Behalf Of Andrew Sullivan Sent: Monday, November 13, 2017 4:29 PM To: ua-discuss@icann.org Subject: Re: [UA-discuss] Progress on HTML and email...
Um. Something's confused about that statement. "After all, Windows doesn't even use UTF-8 input natively, so it would literally be impossible for a Windows user to input correct UTF-8 at all."
That's not how input works on any browser. People type things on their keyboard (or soft keyboard or whatever) and those get translated into whatever characters the browser's using for their input boxes (hopefully Unicode). On Windows basically the "input" from the user to the browser is UTF-16. All of that's irrelevant as far as the HTML spec is concerned.
When the user submits the form, then it's up to the browser to send it to the server in the correctly negotiated encoding - hopefully that's UTF-8 for most sites and all browsers, though some negotiations could've stuck it in some really stupid limited codepage.
Hopefully the user's entering Unicode@Unicode email addresses, the browser sees that and sends it to the server in UTF-8 (assuming that's the negotiated encoding). The server then sticks it in their database, hopefully in Unicode. When some process actually sends the mail, then something low level's going to have to use Punycode encoding on the domain in order to resolve the name so the mail can be sent to the right server, but hopefully most of the stack is oblivious to that hack. On a Windows box I don't think the application would actually need to deal with the Punycode at all (unless they wanted to do some sort of manual validation of the domain themselves).
I'd expect the email input type to be pretty much the same as the text input type - except for the extra validation that a browser might do for sanity checking. (Which, I suppose could even include pinging the DNS to find out if it's a real mail server).
-Shawn
-----Original Message----- From: Mark Svancarek Sent: Tuesday, November 14, 2017 7:49 AM To: Andrew Sullivan <ajs@anvilwalrusden.com>; ua-discuss@icann.org; Shawn Steele <Shawn.Steele@microsoft.com> Subject: RE: [UA-discuss] Progress on HTML and email...
Shawn is on the DL, but adding him explicitly for clarification.
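The path Shawn describes (Unicode in the input box, UTF-8 on the wire, Unicode in storage, Punycode only at lookup time) can be walked through in a short Python sketch. Everything here is an assumption made for illustration: the address is invented, the form field is called `email`, and Python's built-in `idna` codec (IDNA2003) stands in for whatever the mail-sending layer really uses.

```python
from urllib.parse import parse_qs, urlencode

typed = "пример@яндекс.рф"  # hypothetical address typed into the form

# Browser side: the form body is sent as percent-encoded UTF-8.
body = urlencode({"email": typed}, encoding="utf-8")

# Server side: decode the body and keep the address as Unicode text.
stored = parse_qs(body, encoding="utf-8")["email"][0]

# Mail-sending side: only the DNS lookup needs the Punycode form of the domain.
lookup_name = stored.rpartition("@")[2].encode("idna").decode("ascii")

print(body)
print(stored == typed)
print(lookup_name)
```

Nothing above the final step has to know about Punycode, which matches the point that most of the stack can stay oblivious to it.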
Here's the definition of the Email Input Type. https://w3c.github.io/html/sec-forms.html#valid-e-mail-address My assertion was that the Email Input Type was mostly applicable to text entered by the user from a keyboard/IME. I acknowledge that it could be piped in from other sources, but it seems to me that human entry from keyboard/IME is the primary use case. Based on that assertion, I believe that the intention of the spec is to support only ALABELs as form inputs, regardless what additional processing may occur once the input is submitted. Since it is a nongoal for ALABELs to be human-friendly, it will be unacceptably hard for any nontechnical human to use a Form based on the Email Input Type if they want to Submit an address with a Unicode domain name part, in spite of the availability of punycode converters. Am I confused? -----Original Message----- From: Shawn Steele Sent: Tuesday, November 14, 2017 10:28 AM To: Mark Svancarek <marksv@microsoft.com>; Andrew Sullivan <ajs@anvilwalrusden.com>; ua-discuss@icann.org Subject: RE: [UA-discuss] Progress on HTML and email... Um. Something's confused about that statement. "After all, Windows doesn't even use UTF-8 input natively, so it would literally be impossible for a Windows user to input correct UTF-8 at all." That's not how input works on any browser. People type things on their keyboard (or soft keyboard or whatever) and those get translated into whatever characters the browser's using for their input boxes (hopefully unicode). On Windows basically the "input" from the user to the browser is UTF-16. All of that's irrelevant as far as the HTML spec is concerned. When the user submits the form, then it's up to the browser to send it to the server in the correctly negotiated encoding - hopefully that's UTF-8 for most sites and all browsers, though some negotiations could've stuck it in some really stupid limited codepage. 
Hopefully the user's entering Unicode@Unicode email addresses, the browser sees that and sends it to the server in UTF-8 (assuming that's the negotiated encoding). The server then sticks it in their database, hopefully in Unicode. When some process actually sends the mail, then something low level's going to have to use Punicode encoding on the domain in order to resolve the name so the mail can be sent to the right server, but hopefully most of the stack is oblivious to that hack. On a Windows box I don't think the application would actually need to deal with the Punicode at all (unless they wanted to do some sort of manual validation of the domain themselves). I'd expect the email input type to be pretty much the same as the text input type - except for the extra validation that a browser might do for sanity checking. (Which, I suppose could even include pinging the DNS to find out if it's a real mail server). -Shawn -----Original Message----- From: Mark Svancarek Sent: Tuesday, November 14, 2017 7:49 AM To: Andrew Sullivan <ajs@anvilwalrusden.com>; ua-discuss@icann.org; Shawn Steele <Shawn.Steele@microsoft.com> Subject: RE: [UA-discuss] Progress on HTML and email... Shawn is on the DL, but adding him explicitly for clarification. -----Original Message----- From: UA-discuss [mailto:ua-discuss-bounces@icann.org] On Behalf Of Andrew Sullivan Sent: Monday, November 13, 2017 4:29 PM To: ua-discuss@icann.org Subject: Re: [UA-discuss] Progress on HTML and email... On Mon, Nov 13, 2017 at 04:23:29PM +0000, Mark Svancarek wrote:
My interpretation is that the user is a human who must enter a string of text into a web form, where it is cast as type Email (which can subsequently be converted into ULABELs if the typed-in string includes ALABELs).
No, I don't think that's it. This is the specification for HTML, not for the UI. The user agent can do transformations. So I _think_ it means that an email type of an input element, when it is _sent_ as input, has to be in this form; but that the input method could be different and the user agent could do a transformation on it so that Unicode user input (which could be in any form, recall) is transformed into a valid U-label/A-label pair before it becomes HTML form input. (One feels the need for another word for "stuff that comes from the user in the UI" vs "stuff that ends up in the form as 'input' formally so defined". There may be a term of art already in the specifications for this, but I'm not going to dig it out just now.)

For the purposes of wire transmission and storage the server-part is A-labels, but for the purposes of display they're U-labels. Presumably, for the purposes of input they're whatever the user might input. After all, Windows doesn't even use UTF-8 input natively, so it would literally be impossible for a Windows user to input correct UTF-8 at all.

We probably need someone who is working directly on browser code to say more about how this works in practice. Maybe Shawn Steele knows? I suspect this is slightly more obscure in the specification than I at least would like because of some of the WHATWG/W3C politics around HTML5. (Some of the principals in WHATWG don't believe that IDNA2008 is a thing. I will leave divining the consequences of using an IDNA specification that does not have a perfect 1:1 A-label/U-label mapping as an exercise for the reader, but I note that IDNA2008 doesn't solve the need for mappings: upper case characters aren't allowed in IDNA2008 U-labels.)

Best regards,
A
--
Andrew Sullivan ajs@anvilwalrusden.com
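As an illustration of the flow Shawn describes in the quoted message above (Unicode throughout, with Punycode applied only to the domain when the name has to be resolved), here is a minimal Python sketch. The function name `domain_for_dns` and the sample address are made up for illustration. Python's built-in `idna` codec implements IDNA2003 rather than IDNA2008, so the exact mapping rules should be treated as an assumption, not as what any browser or mail server does.

```python
def domain_for_dns(address: str) -> str:
    """Return the ASCII (A-label) form of the domain part of an address.

    Hypothetical helper: only the part after the last "@" is converted,
    the local part is never Punycoded.
    """
    _local_part, _, domain = address.rpartition("@")
    # Built-in codec: IDNA2003 (RFC 3490), not IDNA2008.
    return domain.encode("idna").decode("ascii")


# Illustrative address with Unicode on both sides of the "@".
address = "用户@bücher.example"

# What would go over the wire from the form, assuming UTF-8 was negotiated.
wire_bytes = address.encode("utf-8")

# What a low-level resolver step would look up.
lookup_name = domain_for_dns(address)
print(lookup_name)
```

The rest of the stack keeps handling `address` as Unicode text; only the lookup step sees the A-label.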
That's the discussion: To fix the email input type to allow Unicode.

Clearly humans can't be expected to enter ALABELS, so the current email address validation spec is a non-starter.

The EAI RFCs are quite clear that Punycode is to be avoided and apps are supposed to use Unicode for email addresses. The expectation is that Punycoding only needs to occur during actual mail delivery when the server needs to do the DNS resolution step. And then (obviously) only for the domain name part.

I expect that without other guidance, browsers wanting to support EAI would extend the validation to allow Unicode characters > U+007F. Clearly it would be best to formalize that in an updated spec.

-Shawn

-----Original Message-----
From: Mark Svancarek
Sent: Tuesday, November 14, 2017 1:41 PM
To: Shawn Steele <Shawn.Steele@microsoft.com>; Andrew Sullivan <ajs@anvilwalrusden.com>; ua-discuss@icann.org
Subject: RE: [UA-discuss] Progress on HTML and email...

Here's the definition of the Email Input Type. https://w3c.github.io/html/sec-forms.html#valid-e-mail-address

My assertion was that the Email Input Type was mostly applicable to text entered by the user from a keyboard/IME. I acknowledge that it could be piped in from other sources, but it seems to me that human entry from keyboard/IME is the primary use case.

Based on that assertion, I believe that the intention of the spec is to support only ALABELs as form inputs, regardless what additional processing may occur once the input is submitted. Since it is a nongoal for ALABELs to be human-friendly, it will be unacceptably hard for any nontechnical human to use a Form based on the Email Input Type if they want to Submit an address with a Unicode domain name part, in spite of the availability of punycode converters. Am I confused?
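To make the "allow characters > U+007F" idea in Shawn's reply concrete, here is a rough Python sketch. The strict pattern is my transcription of the ASCII-only rule in the linked "valid e-mail address" definition. The relaxed variant is purely an assumption about what such an extension might look like: it is not taken from any spec or browser, and the names `build_pattern`, `STRICT`, and `RELAXED` are hypothetical.

```python
import re

# Character sets from the ASCII-only definition (transcribed, may not be exact).
LOCAL_CHARS = r"a-zA-Z0-9.!#$%&'*+/=?^_`{|}~"
LABEL_CHARS = r"a-zA-Z0-9"
# Assumed extension: every code point above U+007F.
NON_ASCII = r"\u0080-\U0010FFFF"


def build_pattern(extra: str) -> re.Pattern:
    """Build an address pattern, optionally widening each character class."""
    local = "[" + LOCAL_CHARS + extra + "-]+"
    edge = "[" + LABEL_CHARS + extra + "]"
    # Note: {0,61} counts characters here, not octets of the encoded label.
    inner = "[" + LABEL_CHARS + extra + "-]{0,61}"
    label = edge + "(?:" + inner + edge + ")?"
    return re.compile("^" + local + "@" + label + r"(?:\." + label + ")*$")


STRICT = build_pattern("")          # ASCII on both sides of the "@"
RELAXED = build_pattern(NON_ASCII)  # hypothetical Unicode-tolerant check

sample = "用户@例え.jp"
print(bool(STRICT.match(sample)), bool(RELAXED.match(sample)))
```

A check like this only filters obvious garbage. It says nothing about whether a label is a valid U-label, which is a separate question raised elsewhere in the thread.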
On Tue, Nov 14, 2017 at 06:28:28PM +0000, Shawn Steele wrote:
Um. Something's confused about that statement. "After all, Windows doesn't even use UTF-8 input natively, so it would literally be impossible for a Windows user to input correct UTF-8 at all."
Not confused, just probably a little too glib.
That's not how input works on any browser. People type things on their keyboard (or soft keyboard or whatever) and those get translated into whatever characters the browser's using for their input boxes (hopefully unicode). On Windows basically the "input" from the user to the browser is UTF-16. All of that's irrelevant as far as the HTML spec is concerned.
This was my point.
When the user submits the form, then it's up to the browser to send it to the server in the correctly negotiated encoding - hopefully that's UTF-8 for most sites and all browsers, though some negotiations could've stuck it in some really stupid limited codepage.
Yes, but again my point was that if the HTML spec now says that the server-part of an email address in input has to be LDH (which is what the draft spec says), and the user has put some Unicode characters in the server-part of an email address, then a browser running on Windows has to do the "something to UTF-8" transformation before it can even get around to doing the U-label to A-label transformation, because the user-supplied input isn't in UTF-8 normalized with NFC _anyway_, and that's the starting line for being a candidate U-label. The same, actually, would be true of the local-part of the email address, since EAI addresses are required to be in UTF-8 but the user-supplied input won't necessarily be that (won't be on MacOS, either, since the native form on OSX is NFD, not NFC). So there's nothing strange about a browser having to do transformations of user-supplied information here, I think. Best regards, A -- Andrew Sullivan ajs@anvilwalrusden.com
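A small Python sketch of the normalization step Andrew is describing: whatever form the typed text arrives in, it has to be brought to NFC and encoded as UTF-8 before it can be a candidate U-label. The helper name `to_wire_form` and the sample strings are illustrative assumptions, not part of any specification.

```python
import unicodedata


def to_wire_form(text: str) -> bytes:
    """Normalize to NFC, then encode as UTF-8 (hypothetical helper)."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


# The same visible word in two different Unicode forms.
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (NFD style)
composed = "caf\u00e9"     # precomposed 'é' (NFC style)

# The raw strings differ, but the wire forms are identical.
print(decomposed == composed)
print(to_wire_form(decomposed) == to_wire_form(composed))
```

This covers only normalization and encoding. The further U-label to A-label step for the server-part is a separate transformation.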
On 11/14/2017 2:18 PM, Andrew Sullivan wrote:
then a browser running on Windows has to do the "something to UTF-8" transformation before it can even get around to doing the U-label to A-label transformation,
Except that library functions written for Windows may well allow A-labels to be generated from UTF-16 in a single step, possibly applying normalization as well. UTF-16 <--> UTF-8 being loss-less and reversible for well-formed strings... A./
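The lossless round trip Asmus mentions can be checked with a short Python sketch. The helper names and the sample string are illustrative assumptions. The sample includes a character outside the Basic Multilingual Plane so that a UTF-16 surrogate pair is exercised.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode little-endian UTF-16 bytes as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


def utf8_to_utf16(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as little-endian UTF-16."""
    return data.decode("utf-8").encode("utf-16-le")


text = "用户@bücher.example \U0001F600"  # last character needs a surrogate pair
utf16 = text.encode("utf-16-le")
utf8 = utf16_to_utf8(utf16)

# Well-formed input survives the round trip byte for byte.
print(utf8_to_utf16(utf8) == utf16)
```

The guarantee holds only for well-formed strings, as the message says. An unpaired surrogate in the UTF-16 input makes the decode step raise an error instead of converting.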
On Tue, Nov 14, 2017 at 02:55:41PM -0800, Asmus Freytag wrote:
Except that library functions written for Windows may well allow A-labels to be generated from UTF-16 in a single step, possibly applying normalization as well.
Sure. There are lots of ways to implement this, which is why IETF standards normally specify wire formats. What you do in the privacy of your own implementation has no interoperability implications. But if you want to hand a U-label to anyone else, it had better be UTF-8 in NFC, or else it can't be a U-label. A -- Andrew Sullivan ajs@anvilwalrusden.com
Maybe this is semantics, but I don't think of it as "the browser" doing the conversions. It would need to be some JS coded into the page. The browser is just the sandbox the page's script would be running inside.
On Tue, Nov 14, 2017 at 10:57:23PM +0000, Mark Svancarek wrote:
Maybe this is semantics, but I don't think of it as "the browser" doing the conversions. It would need to be some JS coded into the page. The browser is just the sandbox the page's script would be running inside.
Strictly of course you're right, which is why the specification talks about "user agents". The point is that the thing the user interacts with is not necessarily the data that qualifies as "input". If those must be 1:1, then there is obviously a problem in the spec because users can't reliably enter A-labels or Punycode-form IDNA2003/UTS#46 labels or even NFC-normalized UTF-8. Alternatively, maybe the specification is doing some filtering but not actually producing valid email addresses, in which case the spec has a serious problem as written in not permitting Unicode in the server-part. But since it explicitly allows the user agent to do transformations for display, that doesn't sound like what's intended (at least to me). I've now said enough about this :) A -- Andrew Sullivan ajs@anvilwalrusden.com
participants (12)
- Abdalmonem Tharwat Galila
- Andrew Sullivan
- Asmus Freytag
- Chaals McCathie Nevile
- Don Hollander
- Dr. Ajay Data
- Jiankang Yao
- Jothan Frakes
- Mark Svancarek
- Richard Merdinger
- Satish Babu
- Shawn Steele