Hi,

On Mon, Nov 06, 2017 at 04:23:50PM -0800, Jim DeLaHunt wrote:
So, RFC5895 "Mapping Characters for Internationalized Domain Names in Applications (IDNA) 2008" <https://www.rfc-editor.org/rfc/rfc5895.txt>, section 2 "The General Procedure", says,
4. If an implementation of this mapping is also performing the step of separation of the parts of a domain name into labels by using the FULL STOP character (U+002E), the IDEOGRAPHIC FULL STOP character (U+3002) can be mapped to the FULL STOP before label separation occurs. There are other characters that are used as "full stops" that one could consider mapping as label separators, but their use as such has not been investigated thoroughly.
Yes.
And UTS #46 "Unicode IDNA Compatibility Processing" <http://www.unicode.org/reports/tr46/>, section 2.3 "Notation", says,
In this document, a label is a substring of a domain name. That substring is bounded on both sides by either the start or the end of the string, or any of the following characters, called label-separators:
1. U+002E ( . ) FULL STOP
2. U+FF0E ( ． ) FULLWIDTH FULL STOP
3. U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
4. U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP
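To make the comparison concrete, here is a minimal Python sketch of the two ways of arriving at label boundaries: splitting directly on the four UTS #46 label separators listed above, and applying the width mapping of RFC 5895 (step 2 of its general procedure) followed by the IDEOGRAPHIC FULL STOP mapping of step 4 quoted earlier. The function names (split_labels_uts46, width_map, split_labels_rfc5895) are hypothetical and come from neither document. The sketch deliberately omits RFC 5895's lowercasing and NFC steps, and it performs none of the validation that IDNA2008 or UTS #46 require, so it illustrates label separation only.

```python
import unicodedata

# The four label separators listed in UTS #46, section 2.3.
UTS46_SEPARATORS = "\u002E\uFF0E\u3002\uFF61"


def split_labels_uts46(domain):
    """Split a domain name on any of the UTS #46 label separators."""
    table = {ord(ch): "." for ch in UTS46_SEPARATORS}
    return domain.translate(table).split(".")


def width_map(ch):
    """Replace a character by its <wide> or <narrow> decomposition, if any.

    This approximates RFC 5895 step 2; other decomposition types are
    left untouched.
    """
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] in ("<wide>", "<narrow>"):
        return "".join(chr(int(cp, 16)) for cp in parts[1:])
    return ch


def split_labels_rfc5895(domain):
    """Width-map (step 2), map U+3002 to U+002E (step 4), then split."""
    mapped = "".join(width_map(ch) for ch in domain)
    mapped = mapped.replace("\u3002", ".")
    return mapped.split(".")


if __name__ == "__main__":
    name = "www\uFF0Eexample\u3002co\uFF61jp"
    print(split_labels_uts46(name))
    print(split_labels_rfc5895(name))
```

For these four characters the two routes give the same labels: U+FF0E decomposes to U+002E, and U+FF61 decomposes to U+3002, which step 4 then maps to U+002E. The RFC 5895 route is the more general one because the width mapping applies to every fullwidth and halfwidth character, not only to the full stops.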
Note that these would be covered by RFC 5895 too, by step 2 (where the fullwidth and halfwidth characters are decomposed), but it's a more general mechanism than that outlined by UTS #46.

I think it is worth pointing out that UTS #46 is a pretty serious burr under the saddle in the relationship between the UTC and the IETF. This is partly because UTS #46 explicitly permits a number of labels that are clearly not permitted under IDNA2008 (see e.g., "For transitional use, the Compatibility Processing also allows domain names containing symbols and punctuation that were valid in IDNA2003, such as √.com (which has an associated web page). Such domain names containing symbols will gradually disappear as registries shift to IDNA2008."). In the IETF, when we have transition mechanisms we are generally required to specify how they work, or else they are regarded as hand-waving. There is basically no mechanism for such transition in UTS #46 ("registries shift to IDNA2008" is the very same transition as "implement IDNA2008", so it's not a mapping at all).

The UTC is plainly the expert in the relevant character encodings and how that all functions within applications, but it is also plainly deficient in expertise in the area of network protocols, and the gap shows. The fact that the UTC and the IETF have been so far incapable of collaborating on this topic is IMO a problem.

Part of the disagreement comes from a different stance: the IETF's general belief is that, if you're going to fail, declare failure early and then replace the bad protocol (and break stuff if you have to). UTC's approach maximises stability, which means that once something is out in the world you're more or less stuck with it (with a few limited exceptions). IDNA2008 was intended to break certain cases early on the grounds that we could already see they were a problem; the most obvious ones were nailing the protocol to a version of Unicode and the expansion of the repertoire beyond LDH analogues. UTS #46's approach is, alas, delaying the reckoning with that damage, and may well have put it off forever (the WHATWG's approach to all of this hasn't helped).
From my point of view as a UASG explainer, this is good and sufficient grounding for a recommendation that apps treat U+3002 as a label separator. I would go further and warn people that this list might grow; that U+FF0E and U+FF61 may be on their way.
That's reasonable, yes, but I would not go too far. It's worth remembering that domain names are, at bottom, protocol elements. There's only so much munging one can do to protocol elements without introducing ambiguities that can be exploited by attackers.
(Interesting, I just noticed that UASG007 also recommends treating the Arabic full stop character “۔” (U+06D4) as a label separator. UTS #46 and RFC5895 don't mention that.)
Yeah, it hadn't been generally studied at the time, and I'm still not sure that the recommendation is ideal. I have heard, but am not sure, that in some Arabic-using writing systems (not the majority ones) there is some problem in the handling of that code point. I'm not clear on the details, but the population of non-Arabic languages written with Arabic characters is way larger than in the Han case.

Best regards,

A

-- 
Andrew Sullivan
ajs@anvilwalrusden.com