Suggested meta-questions to think about
Hi,
I am sending this as an interested individual, and not as SSAC Chair...
I have a few times this weekend already tried to explain my view on "variants", and after doing that in a chat, I felt it start to (for me) make sense, so I wanted to share it with you.
We have, I think, a problem divided into two different questions. And unfortunately many people think of the solution only in the form of "answers to the second question". Let me try to explain.
First, whether something is a variant or not (note: the word is undefined) is actually a grayscale from "yes" to "no". There are various shades of gray there. For example:
A.1. Two characters in Unicode really are to be treated as being equivalent. I presume one could say that the Hangul SC/TC issues fall in that category.
A.2. Two different spellings of the same word in the same script and same language, like color/colour.
A.3. Same word in the same language in two different scripts (bulgarian)
A.4. Same word in two different languages
And then there are many A.1.1, A.1.2, A.2.1 etc, and I did even hear today people say "two variants are two different accepted spellings of the same word that _sound_ the same". I do not even know where to put that.
But one thing that I, because of that, think should and could be done is for people to list all the different "variants" they can come up with... Then one draws the line: what is a variant and what is not? Is the line drawn at A.1.1232 or A.2.56?
Ok, given we have some agreement on what is a variant and not, we have to discuss what implications it has. I here also see a number of different questions to be answered. For example:
B.1. Should an application with more TLDs than one be counted as one application if the TLDs in the application are variants of each other? And if so, should there be only one fee per application?
B.2. Should two different variants be able to be managed by two different registries or not, and if not, what should happen with the variants? One primary and others like the bundling tactics in some TLDs (i.e. choice between "yes delegation" or "just block for other to register")?
And then there might be a technical question in there...
C.1. Given two domain names are variants of each other, is there something that can be done in the DNS from a technical point of view to express that, or can we only do delegations?
The really tricky question is of course to really draw the line between variants and not variants. I think the line from a technical point of view, AND the implications on the second question, should, for the new TLD approval process, be as conservative as possible.
Default answer: If someone wants two domain names, just send in two applications.
Exception: As you desperately need both and not only one of the domain names, you will get both treated as one application.
Then ICANN asks IETF formally "can you please let us know if it is possible to have some kind of solution to _technically_ link two TLDs with each other, in a safe and stable way". Via a letter to IAB.
Until and unless IETF gives such a solution, ICANN only has the following two alternatives for the ones that do get two variants approved:
1. Get both delegated
2. Get one delegated and the other blocked
Then MAYBE there will be a third option:
3. Get both with some alias solution
But these are things which are implications given a definition of what "variants" are, and that discussion is in the future -- although I am pretty sure some parties really would like to have certain solutions to the problem...
Patrik
Patrik,
You have a good summary of the terminology/topic definition problem we keep stumbling upon. Here are my comments, biased towards Cyrillic and Bulgarian (I claim to have sufficient understanding of them):
On Jun 20, 2011, at 14:44, Patrik Fältström wrote:
A.1. Two characters in Unicode really are to be treated as being equivalent. I presume one could say that the Hangul SC/TC issues fall in that category.
This is a problem with how Unicode was defined. The initial definition of Unicode could be simplistically described as "here is a bunch of code tables that various companies, scripts, languages and countries have designed and use; let's stick them together in a common larger table and call it Unicode". Later, and recently, Unicode has improved, but not much for some "stable" scripts such as Cyrillic. It remains a collection of interlinked (independent) code tables. There is probably no real alternative anyway.
One example, related to Bulgarian/Cyrillic: as you remember, the Bulgarian IDN ccTLD application was refused on the basis of alleged display similarity with (as someone anonymous suggested) .br. This came as a shock to our community and experts, because there is really no similarity between those letters. We did some research afterwards and found out that this really is a font confusion issue. In Bulgaria, the Cyrillic script is presented to children and students using a graphical representation much different from the one found in some computer fonts. We therefore asked experts, and they identified a Unicode character that graphically represents the 'cyrillic ghe' properly. It is included in the Cyrillic script in Unicode 6.0 (we can discuss the technical details separately). Our experts say there are 11 characters in Bulgarian/Cyrillic that need fixing in Unicode. However, the Unicode modification process is (very) slow, and it is unrealistic to expect that those two characters will be marked as 'equivalent' any time soon.
The font issue is very big and it relates to 'variants' very much. We need to address it somehow.
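To make the encoding side of this concrete, here is a minimal Python sketch. It assumes the two labels being compared are the Cyrillic "бг" and the Latin "br" (the message names only .br explicitly), and it uses only the standard `unicodedata` module. The helper name `describe` is made up for this illustration. The sketch shows that the two labels consist of unrelated code points and that Unicode normalization does not unify them, so any similarity is a matter of rendering, not of encoding.

```python
import unicodedata

def describe(label: str) -> list[str]:
    """Return 'U+XXXX NAME' for each character in the label."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in label]

cyrillic = "\u0431\u0433"  # "бг" (assumed to be the label in question)
latin = "br"

for label in (cyrillic, latin):
    print(label, describe(label))

# Normalization does not unify the two labels: they remain distinct strings.
same_after_nfkc = (
    unicodedata.normalize("NFKC", cyrillic)
    == unicodedata.normalize("NFKC", latin)
)
print("equal after NFKC:", same_after_nfkc)
```

Nothing here measures visual similarity; that depends on the font, which is exactly the point made above.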
A.2. Two different spellings of the same word in the same script and same language, like color/colour.
This brings up the main issue I have with 'variants'. Again, my take is on Cyrillic. In my opinion, in Cyrillic the variants are label (word) based and not character based. In the Bulgarian language, there is a direct 1:1 relationship between how a word is pronounced and how it is spelled. The Cyrillic script was originally designed to have this attribute. Therefore, such cases are rare. Where they exist, they are well known. I understand it has been the perception for years that we look at character variants, but for many scripts these rarely have any meaning. Perhaps we should review this in depth for each individual script and possibly treat different scripts differently.
A.3. Same word in the same language in two different scripts (bulgarian)
There is only one script for Bulgarian: Cyrillic. I would love to learn your source for this information. There are, however, issues as you describe with Serbian, where both Latin and Cyrillic are official scripts and there are examples of distorted words written in both scripts (we are not looking into this), or the same word written in both scripts, which is very common. Probably with other languages/scripts as well.
A.4. Same word in two different languages
I don't believe our small community can resolve this or even touch the subject below the surface. Especially, given the extremely short time frame.
And then there are many A.1.1, A.1.2, A.2.1 etc, and I did even hear today people say "two variants are two different accepted spellings of the same word that _sound_ the same". I do not even know where to put that.
You do not speak Bulgarian, nor do you write in Cyrillic. If you did, this would make perfect sense to you. I could imagine this issue is present in other non-Latin based scripts as well -- or even in some Latin based languages. To sum it up, your questions make me think you too consider variants to be label (word) based.
There is one more case. In the Russian language, for example, there is the E character with two dots. With the mass usage of computers it was largely ignored, and now many write the words that originally contained this character with the "ordinary" E. Russian speaking people can recognize these cases and read the "ordinary" E as they read the E with dots, because the human brain actually reads the word, not the characters. We should consider and probably document the many cases of such word simplification that came with the mass introduction of computers in our life.
Now, there are Russian words that contain both the E with dots and the "normal" E. In these words, it is not appropriate to use these characters interchangeably, which is yet another proof that variants are word based. Finally, the E with dots is not going to go away, so we cannot just ignore the issue and suggest everyone should stop using it in this Internet era. My observation is that in recent years, as computers became more powerful, humans started to make fewer sacrifices with their languages/scripts.
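The difference between character-based and word-based variants can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word "ещё" is chosen here as an example of a word containing both letters, the `ACCEPTED` table is a hypothetical hand-maintained list, and the function names are invented for the sketch. A character-level rule that swaps Е and Ё independently at every position produces strings that no reader would accept, whereas a word-level table lists only the accepted spellings.

```python
from itertools import product

# Character-level rule: е and ё are interchangeable at every position.
SWAPS = {"\u0435": "\u0435\u0451", "\u0451": "\u0435\u0451"}

def character_level_variants(label: str) -> set[str]:
    """Every string obtained by swapping е/ё independently at each position."""
    choices = [SWAPS.get(ch, ch) for ch in label]
    return {"".join(combo) for combo in product(*choices)}

def word_level_variants(label: str, accepted: dict[str, set[str]]) -> set[str]:
    """Only the spellings listed for this word in a hand-maintained table."""
    return accepted.get(label, {label})

word = "\u0435\u0449\u0451"  # "ещё", which contains both е and ё
# Hypothetical table entry: "ещё" and its simplified spelling "еще".
ACCEPTED = {word: {word, "\u0435\u0449\u0435"}}

print(len(character_level_variants(word)), "character-level candidates")
print(sorted(word_level_variants(word, ACCEPTED)), "word-level variants")
```

The character-level rule yields four candidates for this three-letter word, and the count doubles with every further Е or Ё in a label; the word-level table stays at the two spellings a reader would recognize.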
B.1. Should an application with more TLDs than one be counted as one application if the TLDs in the application are variants of each other? And if so, should there be only one fee per application? B.2. Should two different variants be able to be managed by two different registries or not, and if not, what should happen with the variants? One primary and others like the bundling tactics in some TLDs (i.e. choice between "yes delegation" or "just block for other to register")?
As far as I understood, our study groups are not to touch any policy issues. These questions are very interesting and important, and must be considered in detail, but they are probably outside the scope of this study, especially considering the short time frame.
Another comment I have here is that the bias of our study seems to be towards TLD variants. It is my belief that our work will be extremely useful at any DNS level, especially when it comes to your next question. DNS is to be treated the same at each level. Therefore, I believe we need to drop the "TLD" part of the terms and treat all labels generically -- any variant application at TLD level might be subject to additional policy if we do not have a good working technical design/implementation.
And then there might be a technical question in there... C.1. Given two domain names are variants of each other, is there something that can be done in the DNS from a technical point of view to express that, or can we only do delegations?
Many things can be done. Perhaps the most significant would be to amend DNS to use Unicode. Or plug in IDNA (keeping the protocol simple and stable, as Unicode is far from simple and stable). In my opinion, only a technical solution at the protocol level makes sense in the long run for implementing variants. In any case, what a variant is, how to identify it, how to keep the list of variants up to date (languages change), etc. has to be worked out first.
Daniel Kalchev
Register.BG
Just a few comments:
A.2. Two different spellings of the same word in the same script and same language, like color/colour.
This brings up the main issue I have with 'variants'. Again, my take is on Cyrillic. In my opinion, in Cyrillic the variants are label (word) based and not character based. In the Bulgarian language, there is a direct 1:1 relationship between how a word is pronounced and how it is spelled. The Cyrillic script was originally designed to have this attribute. Therefore, such cases are rare. Where they exist, they are well known.
I understand it has been the perception for years that we look at character variants, but for many scripts these rarely have any meaning. Perhaps we should review this in depth for each individual script and possibly treat different scripts differently.
In the Bulgarian language we have the so-called doublet forms of words, just like color and colour. They are divided into many subgroups, from words where only the accent changes (with no visual change, as in Bulgarian the accent is written only in academic texts) to the etymological doublets, which have the same root but were accepted into Bulgarian from different languages. The most common doublet forms are the phonetic ones, but they are all recorded in lists and, as Daniel says, they are well known, and I seriously doubt that any of them could be used as a TLD. One example is the word for "lunch", which can be written as обед or обяд. I think that if we start to review all the possible doublet cases just in Bulgarian, we will miss the deadline. We therefore have to draw the line somewhere and decide what we should actually take on and what we should leave outside of the project.
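As a rough illustration of why such doublets are unrelated names as far as the DNS is concerned, the following Python sketch converts the two spellings of "lunch" to their ASCII-compatible (xn--) forms. It relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules; that is an assumption made for convenience and not a statement about which IDNA version a registry would use. The helper name `to_a_label` is invented for the sketch.

```python
def to_a_label(u_label: str) -> str:
    """Convert a Unicode label to its ASCII-compatible (xn--) form."""
    return u_label.encode("idna").decode("ascii")

doublets = ["\u043e\u0431\u0435\u0434", "\u043e\u0431\u044f\u0434"]  # обед, обяд
a_labels = [to_a_label(label) for label in doublets]

for u_label, a_label in zip(doublets, a_labels):
    print(u_label, "->", a_label)

# The DNS compares the ASCII forms, so the two spellings are different names.
print("same name in the DNS:", a_labels[0] == a_labels[1])
```

Any link between the two spellings therefore has to come from registry policy or from something added on top of the encoding, since the encoding itself carries no notion of "same word".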
A.3. Same word in the same language in two different scripts (bulgarian)
There is only one script for Bulgarian: Cyrillic. I would love to learn your source for this information.
There are however issues as you describe with Serbian, where both Latin and Cyrillic are official scripts and there are examples of distorted words written in both scripts (we are not looking into this), or the same word written in both scripts, which is very common. Probably with other languages/scripts as well.
Since 2006, Cyrillic has been the only official script in Serbia, but Latin is widely used (even more than Cyrillic).
Iliya Bazlyankov
UNINET
Hi Patrik,
It is so refreshing and so welcome to see someone of high standing within the ICANN establishment, especially someone reared in the Latin alphabet, show genuine and spontaneous interest in this set of problems. I am intentionally talking about a 'set of problems' and not a monolithic 'Variant Problem', because talking about 'the variant problem' is like talking about 'the rest of the world' versus 'us, the normal'.
To me, the present ICANN VIP endeavor is something between a postponement tactic and a benign academic exercise. At best, the only outcome of this project would be raising the awareness of the Latin script community of the existence of the 'problem'. Wasting time on 'definitions' is especially useless. This shows ignorance of the diversity of the problems we face. All we need are 'functional working definitions' to ensure precision of communication; let us leave the rest to academia.
The important thing to note is that not all IDN problems are created equal: some are crucial and immediate from the end-user standpoint, some can be managed over time. There is no single answer to whether variant strings should be treated as one application or multiple applications, etc. And now to a concrete question you raise:
And then there might be a technical question in there... C.1. Given two domain names are variants of each other, is there something that can be done in the DNS from a technical point of view to express that, or can we only do delegations?
The really tricky question is of course to really draw the line between variants and not variants. I think the line from a technical point of view, AND the implications on the second question, should, for the new TLD approval process, be as conservative as possible.
Default answer: If someone wants two domain names, just send in two applications.
Exception: As you desperately need both and not only one of the domain names, you will get both treated as one application.
Then ICANN asks IETF formally "can you please let us know if it is possible to have some kind of solution to _technically_ link two TLDs with each other, in a safe and stable way". Via a letter to IAB.
This is different from the statement coming from Dennis Jennings, made in an assured tone, that a solution is 25 years away. We have been using DNames at second level for .ir for six years with no problems. I am not claiming this will necessarily work at the root level, nor that it should be the universal solution. But a solution for us is an absolute necessity, because delegating only one of the two alternate A-labels would amount to rendering useless half of the keyboards in use in the country. We can't help thinking cynically that if Iran had the political clout of China, then ICANN would be more open to actively seeking a solution.
Regards,
Siavash
On 21 jun 2011, at 12.15, Siavash Shahshahani wrote:
We have been using DNames at second level for .ir for six years with no problems. I am not claiming this will necessarily work at the root level, nor that it should be the universal solution.
What I think *personally* is that DNAME can be a piece of the solution (or some *name) but only with a massive amount of context, for example (possibly) "only pointers within a zone", "not a pointer to another pointer" etc. But as I said, that is something for future discussions. Patrik
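The kind of context mentioned here can be expressed as checks over a set of alias records. The sketch below is hypothetical throughout: the `Alias` type, the function names and the example names under "example" are invented, and the two rules are only the two examples given in the message ("only pointers within a zone", "not a pointer to another pointer"). It also simplifies real DNAME behavior, which redirects a whole subtree and not only a single name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alias:
    owner: str   # name that carries the DNAME-style pointer
    target: str  # name it points to

def normalize(name: str) -> str:
    """Lower-case a name and drop a trailing dot for comparison."""
    return name.rstrip(".").lower()

def in_zone(name: str, zone: str) -> bool:
    """True if name is the zone apex or sits below it."""
    name, zone = normalize(name), normalize(zone)
    return name == zone or name.endswith("." + zone)

def check_aliases(aliases: list[Alias], zone: str) -> list[str]:
    """Return human-readable violations of the two example constraints."""
    owners = {normalize(a.owner) for a in aliases}
    problems = []
    for a in aliases:
        if not (in_zone(a.owner, zone) and in_zone(a.target, zone)):
            problems.append(f"{a.owner} -> {a.target}: leaves zone {zone}")
        if normalize(a.target) in owners:
            problems.append(f"{a.owner} -> {a.target}: pointer to a pointer")
    return problems

ZONE = "example"
ok = [Alias("variant-b.example", "variant-a.example")]
chained = ok + [Alias("variant-c.example", "variant-b.example")]
outside = [Alias("variant-b.example", "variant-a.other")]

for case in (ok, chained, outside):
    print(check_aliases(case, ZONE))
```

A real deployment would need many more rules than these two, which is presumably what "a massive amount of context" refers to.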
This is actually a very good list to start discussing the problem of variants. In the ICANN context, people often lumped all these layers together and discussed the issues around them as one single class. And if we draw vertical lines down, these issues could further be divided by the different etymologies of the scripts, or in our case, divided into the different languages that are currently having issues.
I would strongly argue that we should strictly focus on A1 and maybe to some extent A2 for the context of variants in our group. I doubt we should be dealing with anything lower than that.
2011/6/20 Patrik Fältström <patrik@frobbit.se>:
A.1. Two characters in Unicode really are to be treated as being equivalent. I presume one could say that the Hangul SC/TC issues fall in that category. A.2. Two different spellings of the same word in the same script and same language, like color/colour.
In between A2 & A3, I would insert "two ways to describe the same concept in the same script and language, like shop and mall"
A.3. Same word in the same language in two different scripts (bulgarian) A.4. Same word in two different languages
Dear Patrik and James,
This would indeed seem to be an elegant and practical approach. We have a list of situations that may be referred to as "variants", and we collect examples in the wiki, in as many languages as possible, that fall into those situations, adding new situations as we go if necessary. We hope to end up with only a small number of situations that we officially call variants (or, better, some other word!).
There may be a small precedent for whether a variant should be treated as one application or two: what happened when the PRC applied for China in SC and TC under the ccTLD Fast Track?
Regards,
Chris.
==
Faculty Information Support Officer for Arts & Humanities and Laws, Arts & Humanities Faculty Office, Andrew Huxley Building, UCL, Gower St, London WC1E 6BT
Tel 020 7679 1599 (int. 31599)
http://www.ucl.ac.uk/isd/staff/fiso/ah
________________________________________
From: vip-bounces@icann.org [vip-bounces@icann.org] on behalf of James Seng [james.seng@gmail.com]
Sent: 21 June 2011 05:59
To: Patrik Fältström
Cc: vip@icann.org
Subject: Re: [vip] Suggested meta-questions to think about
[...]
Dear All,
Allow me also to share my views on 'variants' in general, and thanks to Patrik for triggering this discussion on the list.
I believe 'variants' means some sort of bundling or linking of a set of domain names together, meaning that:
- delegation of one occurrence of a specific set means either delegation/activation or reservation/blocking of all other occurrences of the same set, and
- I believe release, if ever, of one occurrence of this set would also mean release of the whole set!
*It's worth noting here that this:
- is a personal understanding
- of visual variants
- from an Arabic language background
and I'm sure other views exist even within the Arabic case study (as we haven't yet come up with our views as a group).
On what to do should variants occur, I believe:
1. that in the case of 'identical/exact matches' (where there is no way to differentiate without looking into the Unicode assignments) there needs to be some means of automatic bundling. This would of course be from a pure security point of view, where no one can afford having such occurrences activated and delegated to different registries.
2. whereas other cases would fall into a 'similar/confusingly similar' category, where different communities may tend to have different views, and where bundling would be made optional and, furthermore, activation would also be made optional. I believe two types of needs fall within this category:
2.1. community need (where activation will be requested), and
2.2. defensive registrations (where activation may or may not be requested)
Hope this adds to the discussion.
Kind regards
--Manal Ismail
Egyptian National Telecom Regulatory Authority
Egypt GAC Representative
________________________________
From: vip-bounces@icann.org on behalf of Dillon, Chris
Sent: Wednesday 22/06/2011 05:33 AM
To: James Seng; Patrik Fältström
Cc: vip@icann.org
Subject: Re: [vip] Suggested meta-questions to think about
[...]
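The bundling behavior described in Manal's message (delegating one member of a set either activates or blocks every other member, and releasing one releases the whole set) can be sketched as a small state machine. Everything in this Python sketch is hypothetical: the `Bundle` class, the state names and the placeholder labels are invented for illustration and do not describe any registry's actual system.

```python
from enum import Enum

class State(Enum):
    AVAILABLE = "available"
    DELEGATED = "delegated"
    BLOCKED = "blocked"

class Bundle:
    """A set of variant labels that is allocated and released as one unit."""

    def __init__(self, labels):
        self.states = {label: State.AVAILABLE for label in labels}

    def delegate(self, label, also_activate=()):
        """Delegate one label; every other member is activated or blocked."""
        if label not in self.states:
            raise KeyError(label)
        if any(s is not State.AVAILABLE for s in self.states.values()):
            raise ValueError("bundle is already allocated")
        for other in self.states:
            active = other == label or other in also_activate
            self.states[other] = State.DELEGATED if active else State.BLOCKED

    def release(self):
        """Releasing any member releases the whole set."""
        for label in self.states:
            self.states[label] = State.AVAILABLE

bundle = Bundle(["label-a", "label-b", "label-c"])  # placeholder labels
bundle.delegate("label-a", also_activate=["label-b"])
after_delegate = dict(bundle.states)
bundle.release()
after_release = dict(bundle.states)
print(after_delegate)
print(after_release)
```

The optional `also_activate` argument corresponds to the cases where activation of additional variants is requested; leaving it empty gives the "block everything else" behavior.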
Hello everyone,
this is my first post to the list -- thank you for all the discussion so far. I am from the UA ccTLD (we have had some experience with pseudo-variants since 1997), but this is a personal comment regarding technical implementations of variants in DNS and registries --
On Jun 20, 2011, at 9:44 AM, Patrik Fältström wrote: [...]
Ok, given we have some agreement on what is a variant and not, we have to discuss what implications it has. I here also see a number of different questions to be answered. For example:
B.1. Should an application with more TLDs than one be counted as one application if the TLDs in the application are variants of each other? And if so, should there be only one fee per application? B.2. Should two different variants be able to be managed by two different registries or not, and if not, what should happen with the variants? One primary and others like the bundling tactics in some TLDs (i.e. choice between "yes delegation" or "just block for other to register")?
And then there might be a technical question in there... C.1. Given two domain names are variants of each other, is there something that can be done in the DNS from a technical point of view to express that, or can we only do delegations?
[...]
Then ICANN asks IETF formally "can you please let us know if it is possible to have some kind of solution to _technically_ link two TLDs with each other, in a safe and stable way". Via a letter to IAB.
Actually this question has to be asked right away, for if there is no easy way to handle, say, two variant names managed by two different registries (and to the best of my knowledge that is the case), then many of the options we may consider good are impossible to implement in the current DNS. Otherwise it would just be a waste of time to invent solutions which cannot be implemented.
Ok, given we have some agreement on what is a variant and not, we have to discuss what implications it has. I here also see a number of different questions to be answered. For example:
B.1. Should an application with more TLDs than one be counted as one application if the TLDs in the application are variants of each other? And if so, should there be only one fee per application? B.2. Should two different variants be able to be managed by two different registries or not, and if not, what should happen with the variants? One primary and others like the bundling tactics in some TLDs (i.e. choice between "yes delegation" or "just block for other to register")?
Here is another way to look at this:
B.1. If variants are needed for resolution due to script, language and cultural factors, or due to the "accidents" of encoding and/or localization (e.g. keyboard variation) in the past, it is not really the fault of the applicant of a TLD. So why should the applicant pay for the extra variant strings? If Unicode encoded two A's, A1 and A2, both of which looked exactly alike, should a TLD applicant be expected to pay for A1pple separately from A2pple if both strings rendered indistinguishably as "Apple" to users, and half the users had A1 on their keyboards and the others had A2, thus generating both encodings of the same string? Of course, one could always argue that ICANN does not insure against "natural" calamities, tough luck! But what would be fair?
B.2. If variants are needed for blocking (which means that the applicant does not really need them for usability reasons, but to prevent them from being registered by somebody else), they should be paid for by the applicant. Interestingly, this logic will push for more (instead of fewer) delegations, which seems counterintuitive if one looks at it from an operational viewpoint.
B.3. There could also be another motivation for blocking, which is to prevent malicious use, thus securing the users. If this is the need and motivation behind blocking, the extra TLDs should not be charged for (think of mark variation in Arabic).
A sub-categorization of A.1.x... is needed to address how variants may be categorized in B.1-B.3 above.
regards,
Sarmad
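The A1pple/A2pple situation can be made concrete with real code points. The message does not name specific characters, so the pair used below (ARABIC LETTER KAF, U+0643, and ARABIC LETTER KEHEH, U+06A9, which different keyboard layouts produce for the same letter) is an assumption chosen for illustration, as are the `CANONICAL` table and the `bundle_key` helper. The sketch shows that the two typed strings differ at the code point level, that Unicode normalization does not unify them, and that a registry-side table is what makes them land in the same bundle.

```python
import unicodedata

# Hypothetical variant table: each code point maps to one chosen representative.
CANONICAL = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}

def bundle_key(label: str) -> str:
    """Map every character to its representative to get one key per bundle."""
    return "".join(CANONICAL.get(ch, ch) for ch in label)

typed_on_keyboard_1 = "\u0643\u062A\u0627\u0628"  # kaf, teh, alef, beh
typed_on_keyboard_2 = "\u06A9\u062A\u0627\u0628"  # keheh, teh, alef, beh

print(unicodedata.name(typed_on_keyboard_1[0]))
print(unicodedata.name(typed_on_keyboard_2[0]))
print("identical strings:", typed_on_keyboard_1 == typed_on_keyboard_2)
print("unified by NFKC:",
      unicodedata.normalize("NFKC", typed_on_keyboard_1)
      == unicodedata.normalize("NFKC", typed_on_keyboard_2))
print("same bundle:", bundle_key(typed_on_keyboard_1) == bundle_key(typed_on_keyboard_2))
```

Which code point serves as the representative is a policy choice; the sketch picks one direction arbitrarily.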
Hello all,
Please accept my apologies if I’ll be reiterating what has been discussed, and furiously argued, over the previous sessions that I missed. Having read this thread starting from Patrik’s post, here are some thoughts; I hope these can help.
There are a number of very broad issues being discussed here, and it may make sense to try and ring-fence those that this work group can address within reasonable time. I agree with James Seng that A1 and A2 should be the priority.
I hate to come back to the definitions here, but it’s important to agree on what we’re trying to regulate. For example, Patrik’s post mentions case A4, same word in two different languages. From a linguistic point of view, this is not possible, as each language is a standalone system, so cross-language similarities should probably be kept out of scope. It’d be interesting to see a rare case where two TLD applications claim the same string but in different languages.
Also, Daniel says that in Cyrillic, variants are word-based rather than character-based and gives an example of E and Ё in one word. I’m not sure I follow the example and the logic and tend to disagree, although the exception of “обед” given later on makes a point. I’ll be happy to have a separate discussion with the Cyrillic group to clarify this, but as a linguist and native Russian speaker, I do not see a problem with Ё and E forming variant domain names. There is always a character layer, pure spelling with no pronunciation issues, and that’s what we need to focus on, as that’s what makes up an FQDN.
So, taking on board Siavash’s advice, I’ve made up a short list of working definitions for the purpose of this discussion, just to make myself clear. For me, the atomic unit here is a specific character within a specific language, and the variations this character produces when forming a (domain) name. Then a “variant” can be a string of characters that is similar to and interchangeable with another string; all “variant” strings form a “bundle”, an atomic domain unit that can be treated as one (cf. the SC & TC treatments in ccTLD registries). If two strings are similar but one cannot be mistaken for the other, they are not variants. I don’t know what to call similar strings, as in a language they are just “different words”, and no-one defines the degree of differentiation. I’ll use a random word like “pancakes” to mean unique strings that are similar but not interchangeable. Pancake cases may be useful where two words differ only in diacritics.
So, from my standpoint, there are several layers here:
1. Ring-fencing character variants within different scripts, with sub-groups for specific languages where needed (for example, where the same character is used differently in different languages; cf. Arabic Alif and Cyrillic Yer (Hard sign), explained below). Any pancakes need to be identified and not mixed with variants.
2. Determining policies to define all variants of a specific character forming a bundle, its Unicode representations, font implications within a language, and any cross-language specifics.
3. Where possible, forming recommendations on technical implementations of those policies within the DNS or at higher levels.
Ok, so I have my own terms and my own plan of action. Starting with point one and looking at the practical experiences presented at the Wednesday session, here are my initial thoughts. This is part one of a series of rants, and I plan to continue with French and more thoughts on Cyrillic in a separate email. I don’t speak Arabic and mostly base my assumptions on online sources (presentations, wiki, etc.). Please accept my apologies if I’m wrong; I’ll gladly stand corrected.
From what I see, most “variants” in Arabic scripts stem from the optional tashkeel diacritics, which modify consonant letters to show which vowels to read them with. Tashkeel are optional and vary between scripts, thus it is impossible to distinguish between words written with and without diacritics. That’s why ccTLD registries in the region treat them as variants and block the possible options once one variant is registered. To me, this sounds reasonable, although policy work could help determine how these variants are managed, and what can be done to simplify and improve management of shadow-domains.
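The blocking approach described in the paragraph above can be approximated by stripping tashkeel before comparing labels. The following Python sketch is an assumption-laden illustration, not any registry's actual rule: it treats only the eight marks from U+064B (fathatan) to U+0652 (sukun) as tashkeel, and the function names and the sample word are chosen for the example.

```python
# Fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}

def strip_tashkeel(label: str) -> str:
    """Remove the optional vowel marks, keeping only the base letters."""
    return "".join(ch for ch in label if ch not in TASHKEEL)

def same_blocking_group(a: str, b: str) -> bool:
    """Two labels fall in one group if they differ only in tashkeel."""
    return strip_tashkeel(a) == strip_tashkeel(b)

# The same four-letter word with and without marks (sample chosen for the sketch).
with_marks = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
without_marks = "\u0645\u062D\u0645\u062F"

print(len(with_marks), "code points with marks,", len(without_marks), "without")
print("same blocking group:", same_blocking_group(with_marks, without_marks))
```

Cases such as Hamza, where a mark may not be optional, would need separate handling and are deliberately left out of this sketch.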
Perhaps there’s a special case for the Arabic Hamza, a glottal stop separating two syllables, which can be represented as a diacritic or use a carrier. If Hamza is required and cannot be omitted, then should words without it be treated as variants of the word with Hamza?
By the way, in Russian, there’s a similar glottal stop situation with the old character Yer or Hard Sign, ъ, often replaced by an apostrophe in modern Russian. No other language using the Cyrillic alphabet has this character except Bulgarian, where it denotes a specific sound. For Russian IDNs, should the spelling with no Yer be a variant of the spelling with it, and vice versa?
There are a number of other characters in Russian that are somehow “special”, including the mentioned Ё or characters that in some fonts may be confusingly similar to other letters. In some cases, it is not reasonable to treat these similarities as variants; instead, the confusion can be avoided by prohibiting registration of names that can be confusingly similar to a canonical string that has already been registered.
Perhaps Vladimir Shadrunov from the .tel Registry could share Telnic’s experiences in defining language policies for Russian and other supported IDN languages in .tel.
Kind regards, Nadya Morozova
Hello all,
Agreed with the view of Nadya Morozova on the point of variants.
With regards, Satyendra Kumar Pandey
Nadya, Thanks for your comments. On Jun 23, 2011, at 18:46 , Nadya Morozova wrote:
Also, Daniel says that in Cyrillic, variants are word-based rather than character-based and gives an example of E and Ё in one word. I’m not sure I follow the example and logic and tend to disagree, although the exception of “обед” given later on makes a point. I’ll be happy to have a separate discussion with the Cyrillic group to clarify this, but as a linguist and native Russian speaker, I do not see a problem with Ё using E forming variant domain names. There is always a character layer, pure spelling with no pronunciation issues, and that’s what we need to focus on, as that’s what makes up an FQDN.
So, taking on board Siavash’s advice, I’ve made up a short list of working definitions for the purpose of this discussion, just to make myself clear. For me, an atomic unit here is a specific character within a specific language, and the variations this character produces when forming a (domain) name. Then “variant” can be a string of characters that is similar and interchangeable with another string; all “variant” strings form a “bundle”, an atomic domain unit that can be treated as one – cf. the SC & TC treatments in ccTLD registries. If two strings are similar but one cannot be mistaken for another, they are not variants.
I still feel a need to clarify the purpose of my examples. Their purpose is to demonstrate that, at least in Bulgarian (Cyrillic) and I believe also in Russian, variants are label (word) based. Defining variant characters makes no sense, because the variants are language specific and a language is defined by words and not the alphabet alone (otherwise, you should consider Bulgarian, Russian and Serbian one and the same language). Therefore, in my opinion it is useless to try to define variant characters within a script (say, Bulgarian), because what is a variant for one language is not a variant for another, and what might be a variant for the spelling of one word might not be a variant for another. Doublets in Bulgarian are yet another form of word variants, and they definitely cannot be described by any form of character variants.
I see that although you disagree with my suggestion that variants are label (word) based, you later refer to variants as word based in your comments. I just wanted to clarify this.
Daniel Kalchev Register.BG
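The difference between the two positions in this exchange can be made concrete with a small sketch. It is only an illustration: the rule that Е and Ё are always interchangeable, the one-entry label table, and the function names are all assumptions made for the example, not anyone's proposed policy. The Russian words used are "все" (all) and "всё" (everything), which are different words, and "ёлка", which is commonly also spelled "елка".

```python
from itertools import product

IE, IO = "\u0435", "\u0451"  # Cyrillic small letters ie (е) and io (ё)

# Character-based rule (assumption): ie and io are always interchangeable.
CHAR_VARIANTS = {IE: (IE, IO), IO: (IO, IE)}


def char_based_variants(label: str) -> set[str]:
    """Every label reachable by swapping ie/io at any position."""
    choices = [CHAR_VARIANTS.get(ch, (ch,)) for ch in label]
    return {"".join(combo) for combo in product(*choices)}


# Label-based table (hypothetical): only explicitly listed words are bundled.
LABEL_BUNDLES = [frozenset({IO + "лка", IE + "лка"})]  # ёлка / елка


def label_based_variants(label: str) -> set[str]:
    """The listed bundle containing the label, or the label on its own."""
    for bundle in LABEL_BUNDLES:
        if label in bundle:
            return set(bundle)
    return {label}


vse = "вс" + IE  # "все"; swapping the last letter gives the distinct word "всё"
print(char_based_variants(vse))   # both spellings land in one bundle
print(label_based_variants(vse))  # the word stays on its own
```

The character rule is simple to state but merges two distinct words, while the label table avoids that at the cost of having to list every bundled word explicitly.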
Hello Nadya, Just one remark about a part of your comments (since you asked to be corrected about this): On Thu, 23 Jun 2011 11:46:17 +0100, Nadya Morozova <nad.morozova@gmail.com> wrote:
I don’t speak Arabic ...
Nor do I and some 300 million other people who use the so-called Arabic script, but not the Arabic language. And that is a major part of the variant problem in Arabic script. Just as Latin script is used by a number of varied languages, so is Arabic script.
From what I see, most “variants” in Arabic scripts stem from the optional tashkeel diacritics modifying consonant letters to show which vowels to read them with. ...
No, there are more vital problems. There are even consonants with different Unicode code points that look exactly the same when they occur in certain parts of the word (e.g., at the beginning and middle, but not at the end of a word). And what is more, some dozen country names considered as IDN TLDs contain one or more of these characters, hence the need for variant TLDs.
Regards, Siavash
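A short sketch of the kind of case described here. The message does not name specific characters, so the two pairs below (Farsi Yeh / Arabic Yeh and Keheh / Kaf, which in most fonts render alike at the beginning and middle of a word) are an assumption chosen for illustration, as are the table `PAIRS` and the function `fold`.

```python
import unicodedata

# Assumed example pairs; the message above does not name specific characters.
PAIRS = {
    "\u06cc": "\u064a",  # ARABIC LETTER FARSI YEH -> ARABIC LETTER YEH
    "\u06a9": "\u0643",  # ARABIC LETTER KEHEH -> ARABIC LETTER KAF
}


def fold(label: str) -> str:
    """Map each listed code point to its counterpart so labels can be compared."""
    return "".join(PAIRS.get(ch, ch) for ch in label)


persian = "\u0627\u06cc\u0631\u0627\u0646"  # alef, farsi yeh, reh, alef, noon
arabic = "\u0627\u064a\u0631\u0627\u0646"   # same word typed with Arabic yeh

# Report the positions where the two labels use different code points.
for a, b in zip(persian, arabic):
    if a != b:
        print(unicodedata.name(a), "vs", unicodedata.name(b))

print(persian == arabic, fold(persian) == fold(arabic))
```

The two labels are different strings as far as the DNS is concerned, yet a reader would normally see the same word, which is why a registry might want to treat them as variants of each other.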
Siavash, Daniel,
Thank you for your clarifications, I will research both problems further. I've also contacted my former professor of linguistics to ask for her advice and guidance on terminology and available data. I hope to access some formalised knowledge on variants in different languages and scripts and will naturally share any discoveries and data with the mailing list.
Regards, Nadya
Hi, On Thu, Jun 23, 2011 at 10:00:28PM +0100, Nadya Morozova wrote:
Siavash, Daniel,
and guidance on terminology and available data. I hope to access some formalised knowledge on variants in different languages and scripts and will naturally share any discoveries and data with the mailing list.
Thanks for this. Something that is quite important to pay attention to is the special meaning of "variants" we have here. In particular, it is related to but quite different from the meaning of "variant" in Unicode. It's unfortunate that the term got overloaded historically. For some more detail, see http://tools.ietf.org/wg/appsawg/draft-ietf-appsawg-rfc3536bis/ -- which I suggested as useful reading in my short remarks last Saturday, and which is also mentioned in the documents we've put out. Note that we originally referred to the -01 draft, but -02 is out and is currently in last call. It would be great if people who read that and found it helpful -- or especially if not -- sent some remarks to the authors or to the working group.
A
-- Andrew Sullivan ajs@anvilwalrusden.com
I am not sure if the earlier message sent by Andrew got the traction it needed. I would request the members of the Arabic script team to look at the document urgently, especially Section 7 (and the definition of variants in Section 7.2) and comment, as the document is in the last call. There are a few categories of variants in Arabic script which are missing from the suggested list in Section 7.2 (ref. my earlier mail on variants in Arabic).
regards, Sarmad
Dear Dr Hussain, colleagues, On Fri, Jun 24, 2011 at 05:48:16AM -0700, Dr.Sarmad Hussain wrote:
I am not sure if the following message sent by Andrew got the traction it needed. I would request the members of the Arabic script team to look at the document urgently, especially Section 7 (and the definition of variants in Section 7.2) and comment, as the document is in the last call. There are a few categories of variants in Arabic script which are missing from the suggested list in Section 7.2 (ref. my earlier mail on variants in Arabic).
Thanks very much. Let me also urge again team members to look at that document, especially section 7. Feedback can be sent to the IETF apps-discuss@ietf.org WG mailing list. You need to be a member of the list to post; but if you don't want to join, feel free to send your notes to me, with an explicit request to forward them to the relevant IETF list as part of the last call, and I'll forward your remarks.
For those not familiar with the IETF, note that in the IETF, people normally post as themselves and not as a representative of any constituency, organization, or company. In addition, even quite informal remarks are welcome. But comments that are received after publication are impossible to address: a "quick fix" to an RFC, once published, is mostly impossible except for errata. The RFC series is archival, and so once something is published a new RFC needs to be written to take its place (if an update is needed). So get those comments in soon, if they're coming.
The last call for this draft ends 2011-06-30.
A
-- Andrew Sullivan ajs@anvilwalrusden.com
Dear colleagues,
I'm replying to myself in order to change the subject: line, which I foolishly failed to do before. But given the deadline for last call comments (on Thursday), I thought it important to raise this issue to as high a prominence as possible.
Best,
A
-- Andrew Sullivan ajs@crankycanuck.ca
Dear colleagues,
I have been comparing the definitions in section 7 with Patrik's list.
The original definition in section 7 (referring to Traditional Characters and Simplified Characters in Chinese) is similar to Patrik's A.1.
"Visually similar" or "visually confusable" would be an addition to Patrik's list, perhaps A.5.
"Characters assigned more than one Unicode code point because of some special property" is for me near A.1, perhaps a sub-point of it.
Numerals are not on Patrik's list. Let us assign that A.6.
"Orthographic differences" within a language corresponds to Patrik's A.2.
Patrik also mentions words that _sound_ the same (perhaps "there" and "their"). Perhaps we could assign that A.7, although we may not want to keep that definition in the long run.
I am intending to set up examples in several languages of each possible definition (several of which we won't want to call variant at all in future) in the wiki, in the hope that that would make things clearer. There probably won't be time for that before the deadline.
Looking at things the other way round, it may be good to add Patrik's definitions A.3. and A.4 to section 7.
Are you aware of any other possible definition of the word "variant"?
Regards, Chris.
== Faculty Information Support Officer for Arts & Humanities and Laws Arts & Humanities Faculty Office Andrew Huxley Building UCL, Gower St, London WC1E 6BT Tel 020 7679 1599 (int. 31599) http://www.ucl.ac.uk/isd/staff/fiso/ah
On 29 jun 2011, at 09.28, Dillon, Chris wrote:
Patrik also mentions words that _sound_ the same (perhaps "there" and "their"). Perhaps we could assign that A.7, although we may not want to keep that definition in the long run.
I added it because two weeks ago I got a question about "sound", whereas we look at "looks". I of course said "not variants", but the question was an indication that I think it is important to say exactly that.
Patrik
Just adding 'URGENT' to the subject line.
All,
FWIW, I personally do not include "translations" as variants at all, but this kind of discussion is exactly what I was hoping to trigger:
A. Talk about different categories (across scripts, same or similar, rating system), and give examples, lots of examples.
B. Discuss which categories are variants and which are not.
And maybe I should not have sent this mail, as I am myself violating the process I suggested :-)
Regards, Patrik
On 24 jun 2011, at 14.19, Dillon, Chris wrote:
Dear all,
Some attempts at "underloading" the word "variant":
I would like to give an example of Nadya's "same word in two different languages": .travel could be translated as .reise ('trip' in German or Norwegian, also '(to) travel' in the latter). It could also be translated as .reisen 'to travel' in German. This and similar cases are why this approach could never work. The same English word can be translated several ways even into one language.
I would also like to pick up Urdu diacritics. I think the situation may be similar in Arabic and Farsi. In Urdu, apart from the Qur'an and texts for foreigners, often texts are written without indicating short vowels (a, i, u). Short a looks like an acute accent. Short i is an acute accent but is written below the letter. Short u is another accent above the letter. For me this is actually rather like spelling e.g. colour and color, or better cafe and café but I can see why at least one of the Arabic registries is registering the basic form and then the three forms with diacritics to avoid confusion.
For me Cyrillic cases where the same word e.g. komercant may be written with (pre-Revolution spelling) and without the hard sign are also spelling.
I am now thinking I should have filed this under Andrew's rather than Nadya's e-mail, but I'm worried about losing the e-mail so I shall send it.
Regards,
Chris. == Faculty Information Support Officer for Arts & Humanities and Laws Arts & Humanities Faculty Office Andrew Huxley Building UCL, Gower St, London WC1E 6BT Tel 020 7679 1599 (int. 31599) http://www.ucl.ac.uk/isd/staff/fiso/ah ________________________________ From: vip-bounces@icann.org [vip-bounces@icann.org] on behalf of Nadya Morozova [nad.morozova@gmail.com] Sent: 23 June 2011 11:46 To: Patrik Fältström Cc: vip@icann.org Subject: Re: [vip] Suggested meta-questions to think about
Hello all,
Please accept my apologies if I’ll be reiterating what has been discussed, and furiously argued, over the previous sessions that I missed. Having read this thread starting from Patrik’s post, here are some thoughts; I hope these can help.
There are a number of very broad issues being discussed here, and it may make sense to try and ring-fence those that this work group can address within reasonable time. I agree with James Seng that A1 and A2 should be of priority.
I hate to come back to the definitions here, but it’s important to agree on what we’re trying to regulate. For example, Patrik’s post mentions case A4, same word in two different languages. From a linguistic point of view, it is not possible as each language is a standalone system, so cross-language similarities should probably be kept out of scope. It’d be interesting to see a rare case where two TLD applications claim the same string but in different languages.
Also, Daniel says that in Cyrillic, variants are word-based rather than character-based and gives an example of E and Ё in one word. I’m not sure I follow the example and logic and tend to disagree, although the exception of “обед” given later on makes a point. I’ll be happy to have a separate discussion with the Cyrillic group to clarify this, but as a linguist and native Russian speaker, I do not see a problem with Ё and E forming variant domain names. There is always a character layer, pure spelling with no pronunciation issues, and that’s what we need to focus on, as that’s what makes up an FQDN.
So, taking on board Siavash’s advice, I’ve made up a short list of working definitions for the purpose of this discussion, just to make myself clear. For me, an atomic unit here is a specific character within a specific language, and the variations this character produces when forming a (domain) name. Then “variant” can be a string of characters that is similar and interchangeable with another string; all “variant” strings form a “bundle”, an atomic domain unit that can be treated as one – cf. the SC & TC treatments in ccTLD registries. If two strings are similar but one cannot be mistaken for another, they are not variants. I don’t know what to call similar strings as in a language they are just “different words”, and no-one defines the degree of differentiation. I’ll use a random word like “pancakes” to mean unique strings that are similar but not interchangeable. Pancake cases may be useful where two words differ only in diacritics.
So, from my standpoint, there are several layers here:
1. Ring-fencing character variants within different scripts, with sub-groups for specific languages where needed (for example, where the same character is used in different languages differently – cf., Arabic Alif and Cyrillic Yer (Hard sign); explained below). Any pancakes need to be identified and not mixed with variants.
2. Determining policies to define all variants of a specific character forming a bundle, its Unicode representations, font implications within a language, and any cross-language specifics.
3. Where possible, forming recommendations on technical implementations of those policies within the DNS or at higher levels.
Ok, so I have my own terms and my own plan of action. Starting with point one and looking at the practical experiences presented at the Wednesday session, here are my initial thoughts. This is part one of a series of rants, and I plan to continue with French and more thoughts on Cyrillic in a separate email.
I don’t speak Arabic and mostly base my assumptions on the Internets – presentations, wiki, etc. Please accept my apologies if I’m wrong, I’ll gladly stand corrected. From what I see, most “variants” in Arabic scripts stem from the optional tashkeel diacritics modifying consonant letters to show which vowels to read them with. Tashkeel are optional and vary in different scripts, thus it is impossible to distinguish between words written with and without diacritics. That’s why ccTLD registries in the region treat them as variants and block the possible options, once a variant is written. To me, this sounds reasonable although policy work could help determine how these variants are managed, and what can be done to simplify and improve management of shadow-domains.
Perhaps, there’s a special case for the Arabic Hamza, a glottal stop separating two syllables, which can be represented as a diacritic or use a carrier. If Hamza is required and cannot be omitted, then should words without it be treated as variants of the word with Hamza?
By the way, in Russian, there’s a similar glottal stop situation with the old character Yer or Hard Sign, ъ, often replaced by an apostrophe in modern Russian. No other language using Cyrillic alphabet has this character except Bulgarian, where it denotes a specific sound. For Russian IDNs, should the spelling with no Yer be a variant of the spelling with it, and vice versa?
There are a number of other characters in Russian that are somehow “special”, including the mentioned Ё or characters that in some fonts may be confusingly similar to other letters. In some cases, it is not reasonable to treat these similarities as variants; instead, the confusion can be avoided by prohibiting registration of names that can be confusingly similar to a canonical string that has already been registered.
Perhaps, Vladimir Shadrunov from the .tel Registry could share Telnic’s experiences in defining language policies for Russian and other supported IDN languages in .tel.
Kind regards, Nadya Morozova
2011/6/20 Patrik Fältström <patrik@frobbit.se>
Hi, I am sending this as an interested individual, and not as SSAC Chair...
I have a few times this weekend already tried to explain my view on "variants", and after doing that in a chat, I felt it start to (for me) make sense, so I wanted to share with you.
We have, I think, a problem divided in two different questions. And unfortunately many people think of the solution only in the form of "answers to the second question". Let me try to explain.
First, whether something is a variant or not (note: word is undefined), is actually a grayscale from "yes" to "no". There are various shades of gray there. For example:
A.1. Two characters in Unicode really are to be treated as being equivalent. I presume one could say that the Hangul SC/TC issues fall in that category.
A.2. Two different spellings of the same word in the same script and same language, like color/colour.
A.3. Same word in the same language in two different scripts (bulgarian)
A.4. Same word in two different languages
And then there are many A.1.1, A.1.2, A.2.1 etc, and I did even hear today people say "two variants are two different accepted spellings of the same word that _sound_ the same". I do not even know where to put that.
But one thing I think should be done because of that, and could be done, is for people to list all the different "variants" they can come up with...
Then one draws the line: what is a variant and what is not? Is the line drawn at A.1.1232 or A.2.56?
Ok, given we have some agreement on what is a variant and not, we have to discuss what implications it has. I here also see a number of different questions to be answered. For example:
B.1. Should an application with more TLDs than one be counted as one application if the TLDs in the application are variants of each other? And if so, should there be only one fee per application?
B.2. Should two different variants be able to be managed by two different registries or not, and if not, what should happen with the variants? One primary and others like the bundling tactics in some TLDs (i.e. choice between "yes delegation" or "just block for other to register")?
And then there might be a technical question in there...
C.1. Given two domain names are variants of each other, is there something that can be done in the DNS from a technical point of view to express that, or can we only do delegations?
The really tricky question is of course where to draw the line between variants and non-variants. I think the line, from a technical point of view AND given the implications for the second set of questions, should be as conservative as possible for the new TLD approval process.
Default answer: If someone wants two domain names, just send in two applications.
Exception: If you desperately need both and not only one of the domain names, you will get both treated as one application.
Then ICANN asks IETF formally "can you please let us know if it is possible to have some kind of solution to _technically_ link two TLDs with each other, in a safe and stable way". Via a letter to IAB.
Until and unless IETF gives such a solution, ICANN only has the following two alternatives for the ones that do get two variants approved:
1. Get both delegated
2. Get one delegated and the other blocked
Then MAYBE there will be a third option:
3. Get both with some alias solution
But these are things which are implications given a definition on what "variants" are, and that discussion is in the future -- although I am pretty sure some parties really would like to have certain solutions to the problem...
Patrik
Dmitry,
Thanks for the clarification. I didn't mean the old trailing hard sign, was talking about the modern use of the hard sign in the middle of words and that it's sometimes replaced by an apostrophe, esp. in informal writing or advertising. Agreed, the apostrophe is not officially in the alphabet so any "colloquial" use can be ignored.
Best regards, Nadya
2011/6/24 Dmitry Kohmanyuk <dk@nic.net.ua>
Greetings vip@ readers,
On Jun 23, 2011, at 1:46 PM, Nadya Morozova wrote:
By the way, in Russian, there’s a similar glottal stop situation with the old character Yer or Hard Sign, ъ, often replaced by an apostrophe in modern Russian. No other language using Cyrillic alphabet has this character except Bulgarian, where it denotes a specific sound. For Russian IDNs, should the spelling with no Yer be a variant of the spelling with it, and vice versa? There are a number of other characters in Russian that are somehow “special”, including the mentioned Ё or characters that in some fonts may be confusingly similar to other letters. In some cases, it is not reasonable to treat these similarities as variants; instead, the confusion can be avoided prohibiting registration of names that can be confusingly similar to a canonical string that has already been registered.
Excuse me Nadya, but as a native Russian (and Ukrainian) speaker I beg to differ - the apostrophe is not a character of the Russian alphabet (although it is one for Ukrainian), and spelling with a "trailing hard sign" has not been used in the Soviet Union, and now Russia, since 1917. The hard sign is indeed used as a letter in several Russian words as a glottal stop (unlike in Bulgarian, where it is a vowel character.)
Should we also consider variants of orthography in century-old use - or minor (poetic) uses? (example for Russian: черный / чорный).
A side note on Ukrainian while I am on it - the apostrophe is not a "real" Unicode character and does not exist in uppercase format; perhaps it can be handled similarly to German eszett (ß) but this is not yet supported (at least, in IDN registrations for UA we don't allow it.)
Perhaps, Vladimir Shadrunov from the .tel Registry could share Telnic’s experiences in defining language policies for Russian and other supported IDN languages in .tel.
This would be very helpful, indeed.
Quoting Nadya:
Agreed, the apostrophe is not officially in the alphabet so any "colloquial" use can be ignored.
The apostrophe is disallowed by the IDNABIS protocol and cannot appear in an IDN in any regard, whatsoever. If there is some writing system to which it is essential, its unavailability nonetheless provides a good illustration of the simple fact that the DNS was not designed to be a vehicle for the full orthographic detail of the languages that labels are intended to reflect. /Cary
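One way to see why the apostrophe falls outside what a label can carry is the traditional letter-digit-hyphen (LDH) convention for ASCII host name labels. The sketch below is only an illustration of that convention, not an implementation of the IDNA protocol; the function name `is_ldh_label` is hypothetical.

```python
import re

# LDH convention for an ASCII label: letters, digits and hyphen only,
# 1 to 63 characters, with no hyphen in the first or last position.
_LDH = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$", re.IGNORECASE)

def is_ldh_label(label: str) -> bool:
    """Return True if `label` follows the letter-digit-hyphen convention."""
    return _LDH.fullmatch(label) is not None
```

Under this check a label such as "oneil" passes while "o'neil" does not, whatever the orthography of the underlying name requires.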
Are we sure we really need to draw the lines as to what a variant is and what a variant isn't?
Why can't we just say that a variant is whatever the registry wants to be a variant? The registry only needs to define a unique way of finding out whether String1 is indeed a variant of String2 or not.
For example, on the Top Level a registry could declare String1 and String2 to be variants, and if .String1 and .String2 get delegated pursuant to the then-current ICANN TLD program, then so be it. On the Second Level the registry might wish to declare that, for example:
- for Chinese script the variants for traditional characters are the corresponding simplified characters
- for Cyrillic script Е is a variant for Ё with no other variant relationships between characters
etc.
The rest is up to the linguistic community. In fact, the work is already under way. There are RFCs for Chinese and for Cyrillic script that define at least some guidelines regarding similarity and equivalence of characters. The registry may decide to use these RFCs or even to use something else.
What we need to consider in my opinion are the properties of these relationships. For example, are we fine that variant relationships may be asymmetric? I.e. Е is a variant for Ё, but not vice versa.
If you think I am wrong please don't hesitate to say.
Best regards, Vladimir Shadrunov
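The asymmetry question can be made concrete with a small sketch. Everything in it is an assumption for illustration: the table `VARIANTS` and the names `variants_of` and `is_variant` are hypothetical and are not taken from any registry policy or from the RFCs mentioned. The table is read as "the characters on the right may stand in for the character on the left", so the relationship is directed.

```python
from itertools import product

YO = "\u0451"  # Cyrillic small letter io (ё)
IE = "\u0435"  # Cyrillic small letter ie (е)

# Hypothetical registry table. Directed on purpose:
# IE may replace YO, but YO may not replace IE.
VARIANTS = {YO: {IE}}

def variants_of(label: str) -> set[str]:
    """Return every string the table declares to be a variant of `label`."""
    choices = [{ch} | VARIANTS.get(ch, set()) for ch in label]
    return {"".join(combo) for combo in product(*choices)} - {label}

def is_variant(candidate: str, registered: str) -> bool:
    """True if `candidate` is a declared variant of `registered`."""
    return candidate in variants_of(registered)
```

With this table, a name spelled with Е counts as a variant of the same name registered with Ё, while the reverse check fails. A registry wanting a symmetric relationship would simply add the reverse entry to the table.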
Quoting Vladimir:
Why can't we just say that a variant is whatever the registry wants to be a variant? The registry only needs to define a unique way of finding out whether String1 is indeed a variant of String2 or not.
I suspect that we are going to need to spend a bit of time sorting out the relationship between this study and the ICANN policy development process. To be sure, the former is intended to inform the latter, but our contribution is the objective consideration and description of issues attaching to "variance" as we end up defining that concept. Policy itself is, however, not made here.
Notwithstanding, a look at policies already in effect -- many of which are well entrenched -- might lend some useful focus to our effort. Here is a snippet from the narrative adjunct to the Latin script table provided jointly by .SE (the host organization for the Latin study) and .MUSEUM:
"The repertoire supports numerous languages written using the Latin alphabet and is intended to permit the representation of names derived from European languages, using their native orthographies to the fullest extent possible. There is, however, neither a requirement nor an expectation that a label in a domain name will correspond to a proper name or dictionary word in any language, and many labels deliberately do not have any such attributes. There is therefore no basis for determining the extent to which any word-based restrictions or other language-specific orthographic conventions can be applied here and, in consequence, all registration policies are script based. Any permissible character may appear at any point in a string, with the exception of digits and the hyphen, which may not be in the initial or final positions in a label. [The positional constraint on digits has since been rescinded.] The holder of an IDN is responsible for the orthographic rigor of any proper words or names used as labels. Each representation of a label in an alternative orthography requires separate registration.
For example, the prospective holder of the label 'lättöl' is free to register the correlate 'lattol', without either form imposing any restriction on the availability of the other, or on any further variants using the more than twenty diacritically marked forms of the base 'a' in the Unicode chart, or the similar number of marked forms of 'o'. This also applies to marked or ligated characters that can alternately be represented as digraphs. It is again up to the prospective name holder to make an individual determination as to whether or not there is an equivalence between an umlauted 'ä', and an 'ae' digraph or an 'æ' ligature, or if the 'ä' can acceptably also be indicated with an 'a'. Even if lexicographic rules might be contemplated for reducing the inherent ambiguity, their automated implementation would easily be stymied by reasonable differences between the representations of both proper names and dictionary words: 'encyclopaedia' and 'encyclopædia' could be treated as identical, but 'mueller' and 'müller' cannot, and 'öresund' and 'øresund' can be argued either way." /Cary
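The difficulty of automating such rules can be illustrated with a short sketch. It is not the .SE or .MUSEUM policy; it is one naive rule, assumed here for illustration, that decomposes a label with Unicode NFD and drops the combining marks. The function name `strip_marks` is hypothetical.

```python
import unicodedata

def strip_marks(label: str) -> str:
    """Decompose the label (NFD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for word in ("l\u00e4tt\u00f6l", "m\u00fcller", "\u00f8resund", "encyclop\u00e6dia"):
    print(word, "->", strip_marks(word))
```

The rule maps 'lättöl' to 'lattol' and 'müller' to 'muller', but it cannot produce 'mueller', and it leaves 'øresund' and 'encyclopædia' untouched because 'ø' and 'æ' have no Unicode decomposition. Any further equivalence would have to come from a table that someone chooses, which is the point the quoted text makes.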
On Fri, Jun 24, 2011 at 08:31:40PM +0800, Vladimir Shadrunov wrote:
Are we sure we really need to draw the lines as to what a variant is and what a variant isn't?
Why can't we just say that a variant is whatever the registry wants to be a variant? The registry only needs to define a unique way of finding out whether String1 is indeed a variant of String2 or not.
Without disagreeing with Cary's remarks elsewhere in this thread, I'd like to take a different line. Suppose we assume the above: the registry only needs to define a unique way of finding out whether string1 is indeed a variant of string2.
Now, ICANN is in effect the registry (or at least the registry policy-holder) for the root zone: it is the technical co-ordination body that publishes the policies for the root zone. Those policies are developed in various ways inside the ICANN processes; how is not that important for our task. (Please note that this is not a claim of ICANN authority over the root zone, to impinge on the authority of sovereigns or the independence of the root operators, or any other such controversial claim.)
So, if we assume (1) that this is just a matter of registry policy, and (2) that ICANN needs to have such a policy for the root zone, then it is clear why this group needs to investigate (and try to develop) a general rule for what a variant is and what a variant isn't: ICANN needs an empirical foundation for the development of such a policy (according to the usual ICANN policy development procedures). Alternatively, perhaps we will find that "variant" doesn't mean one thing, but actually means different things -- v1, v2, v3, and so on.
I believe that any ICANN policy could not be "do whatever the applicant says", because that would just push the problem one step deeper. Suppose two applicants requested conflicting things? Which applicant should be preferred? What exactly to do is a policy question and beyond our remit, as far as I understand it. But surely our remit is to tease out the principles by which such a policy could be made.
Now, our answer might well be, "It's something that has to be decided case by case," or even, "A general description of 'variant', whatever it is, is impossible." If that be our conclusion, then we need to report as much back so that the policy makers can work with that input. But I believe we can do better, by working according to our plan to study the issues under various scripts, and then to draw some (perhaps tentative) conclusions from that.
One other thing:
On the Second Level the registry might wish to declare that, for example:
- for Chinese script, the variants for traditional characters are the corresponding simplified characters
- for Cyrillic script, Е is a variant for Ё, with no other variant relationships between characters
It need not be the case that a variant policy at the root level (i.e. for TLDs) would extend all the way down the tree. It is likely going to be true, however, that for the simple reasons of user experience, rules that get adopted at the top level will end up being used elsewhere in the tree. (We can draw an analogy with the label "www", which is used very widely for web servers even though there is no reason it must be. Why? Because everyone expects it to work like that.) Speaking only for myself, the reason I supported talking about other parts of the tree than just the top level is that I think existing practices elsewhere in the DNS can inform the things we think are reasonable in the top level. Best, Andrew -- Andrew Sullivan ajs@anvilwalrusden.com
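The idea that a registry "only needs to define a unique way of finding out whether string1 is indeed a variant of string2" can be illustrated with a small sketch. Everything here is an assumption made for illustration: the table name, the function names, and the choice to model a character-based policy as a fold to one canonical code point. The single table entry mirrors the Cyrillic Е/Ё example quoted above; it is not taken from any real registry's table.

```python
# Hypothetical character-level variant table for one registry.
# Each listed code point folds to one canonical representative.
CYRILLIC_VARIANTS = {
    "\u0451": "\u0435",  # CYRILLIC SMALL LETTER IO folds to SMALL LETTER IE
}


def canonical(label, table):
    """Lowercase the label, then fold each character through the table."""
    return "".join(table.get(ch, ch) for ch in label.lower())


def is_variant(label1, label2, table):
    """Two distinct labels are variants if they share a canonical form."""
    a, b = label1.lower(), label2.lower()
    return a != b and canonical(a, table) == canonical(b, table)


if __name__ == "__main__":
    yolka_with_io = "\u0451\u043b\u043a\u0430"  # spelled with the dotted letter
    yolka_with_ie = "\u0435\u043b\u043a\u0430"  # spelled with the undotted letter
    print(is_variant(yolka_with_io, yolka_with_ie, CYRILLIC_VARIANTS))
```

A table like this gives a deterministic answer for any pair of strings, which is all the quoted proposal asks for. It says nothing about which pairs ought to be in the table, and that is the question the rest of this message is concerned with.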
On Jun 24, 2011, at 18:15 , Andrew Sullivan wrote:
Now, ICANN is in effect the registry (or at least the registry policy-holder) for the root zone: it is the technical co-ordination body that publishes the policies for the root zone. Those policies are developed in various ways inside the ICANN processes; how is not that important for our task. (Please note that this is not a claim of ICANN authority over the root zone, to impinge on the authority of sovereigns or the independence of the root operators, or any other such controversial claim.)
ICANN is just preparing to enter the "registry business" with the introduction of the new gTLDs. This makes ICANN's decisions and implementation as to what is a variant and what is not comparable with that of any other registry. The DNS is designed in such a way that uniform technical rules apply to any label at any level in the hierarchy. Therefore, the community will benefit most from our work if we come up with an agreed and understood "standard" of what variants are, what they are not, and how they are (or will be) implemented in today's world of non-ASCII DNS. I do understand that ICANN (staff, board) needs fast answers to just a few questions in order to proceed with the new gTLD launch, but there are other processes/parties that can benefit from this work, such as the ccNSO IDN PDP, as well as each and every DNS registry from now on.
From what we have so far, there seems to be strong opinion, that variants should be considered character based and script specific. I will again voice my opinion, that variants are label (word) based and language specific. Or even go as far to suggest sometimes they are community or region specific.
Some languages might be simplified enough to use character/script variants -- look at how "good" Latin got translated into ASCII. But my belief is that today's computers and software are not so primitive that we should make such sacrifices for the sake of simplicity. Daniel Kalchev Register.BG
Quoting Daniel:
ICANN is just preparing to enter the "registry business" with the introduction of the new gTLDs. This makes ICANN's decisions and implementation as to what is a variant and what is not comparable with that of any other registry.
ICANN introduced the first new TLDs a decade ago. If the impending round means ICANN is just preparing to enter the registry business, why did the previous two rounds not take them into it?
The DNS is designed in such a way that uniform technical rules apply to any label at any level in the hierarchy. Therefore, the community will benefit most from our work if we come up with an agreed and understood "standard" of what variants are, what they are not, and how they are (or will be) implemented in today's world of non-ASCII DNS.
I do understand that ICANN (staff, board) needs fast answers to just a few questions in order to proceed with the new gTLD launch, but there are other processes/parties that can benefit from this work, such as the ccNSO IDN PDP, as well as each and every DNS registry from now on.
I recognize the significance of your placing quotes around the word "standard". I also understand the potential utility of a generally applicable statement of suggested registry practices, broader than the document(s) ICANN has charged our study group with preparing. ICANN already maintains such a document under the designation, "ICANN Guidelines for the Implementation of Internationalized Domain Names". Our work is as much intended to inform the maintainers of that text as it is any other facet of ICANN's IDN activity. If we perceive a need for releasing an extended statement of our own external to ICANN, we need a separate venue for it. Our present charge does not go beyond ICANN's immediate needs, and reframing our mandate is out of scope. Establishing a platform for any further activity we might wish to conduct is probably best deferred until we see how far we get with the initial task we have undertaken. And just to have it said -- the DNS standard (here deliberately without quotes although I'm not certain it's entirely the right word) remains as firmly ASCII oriented as it has ever been. The 'A' in IDNA means something very important and if any members of our group are uncertain about what that is, we would probably benefit from a brief tutorial review.
From what we have so far, there seems to be strong opinion, that variants should be considered character based and script specific. I will again voice my opinion, that variants are label (word) based and language specific. Or even go as far to suggest sometimes they are community or region specific.
Some languages might be simplified enough to use character/script variants -- look at how "good" Latin got translated into ASCII. But my belief is that today's computers and software are not so primitive that we should make such sacrifices for the sake of simplicity.
Actually, Latin is quite comfortably accommodated by ASCII. It's the other languages that are written with the same script that may have a worse time of it. The closer consideration of that is, I am assuming, precisely what should be discussed in the Latin study and I will post my own further remarks about it to the dedicated list. /Cary
Hi Daniel, Thanks for your response. Since ICANN has engaged me as a resident technical guy, I'm going to respond with that (somewhat geeky) hat on. On Sat, Jun 25, 2011 at 06:47:34AM +0300, Daniel Kalchev wrote:
ICANN is just preparing to enter the "registry business" with the introduction of the new gTLDs.
I want to distinguish between "being a registry" and "the registry business". In the DNS world, where I come from, everyone who operates a zone and performs delegations from it is "a registry". It's possible to be a registry for some.strange.domain.name.example, even though you'll probably never make any money at it. That ICANN is charging amounts of money for entries in the root zone and so on is important from a business/legal/political point of view, but it does not directly impinge on the techno-policy implications of that decision. In my (personal) view, we are charged with providing guidance about that techno-policy only. I have my own views about the wisdom of expanding the TLD space; but given that it is happening, what are the issues that need to be addressed? We are tasked, as I understand it, with addressing one of the latter types of issues. If I am misunderstanding, I'm sure others will correct me.
The DNS is designed in such a way, that uniform technical rules apply to any label at any level in the hierarchy.
That is only sort of true, and it is not much true at all for registration policies. For instance, many people do not realise that there is no strict technical restriction at all of what eight bit octet you can put in your zone. DNS labels are made of octets. If you want, in your zone, to put any series of bits you like in there, you can do so. This means that you could just plop UTF-8 directly into the zone; and some people have done this. But, the RFCs (STD 13) also say that it would be better to stick to the hostname rules ("letter, digit, hyphen"). So, in the TLD space, we have mostly stuck to this, for maximum interoperability on the Internet. The closer you are to the root, the more conservative you need to be, since things will break otherwise. But there are plenty of labels lower in the tree that don't follow that. No TLD of which I am aware, for instance, permits "underscore labels" (_example.example.com). But they're not only legal, they're important for the functioning of things like Apple's Bonjour, DKIM, and so on.
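The distinction drawn above, between what the DNS protocol accepts in a label and what the hostname rule recommends, can be made concrete with a short sketch. The function names and the regular expression are assumptions written for this illustration, encoding the "letter, digit, hyphen" convention as it is commonly stated; they are not taken from any registry's software.

```python
import re

# Hostname ("letter, digit, hyphen") convention as commonly stated:
# 1 to 63 ASCII characters, alphanumeric at both ends, hyphens inside only.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label):
    """True if the label follows the hostname convention."""
    return LDH_LABEL.fullmatch(label) is not None


def is_valid_dns_label(raw):
    """The protocol only bounds a non-root label's length: 1 to 63 octets."""
    return 1 <= len(raw) <= 63


if __name__ == "__main__":
    candidates = ["example", "_sip", "b\u00fccher", "xn--bcher-kva"]
    for text in candidates:
        octets = text.encode("utf-8")
        print(text, is_valid_dns_label(octets), is_ldh_label(text))
```

Under these assumptions an underscore label and a raw UTF-8 label both pass the protocol-level check and both fail the hostname check, while the ASCII-encoded form passes both.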
From what we have so far, there seems to be strong opinion, that variants should be considered character based and script specific. I will again voice my opinion, that variants are label (word) based and language specific. Or even go as far to suggest sometimes they are community or region specific.
For the sake of argument, let me grant that variants are label based and language specific. (I've on purpose left aside "word" because it's a disaster: "ns1" is a perfectly viable DNS label, but it is not a word in any language as far as I know. For that matter, "com" is a perfectly viable label, but it's not a word in English, even though people keep insisting it is. And as soon as we start talking about words, we have the problems of neologisms and trademarks -- neither of which are in any dictionary except in very unusual cases -- as well as the problem of which dictionary to pick.) Now, suppose you receive a DNS registration request for "monaco.example". Does it have a variant? Well, if you're Italian targeting a German market, the answer is, "Yes". We don't have an "intention bit" that comes with attempted DNS registrations -- never mind DNS lookups, on which I haven't touched in the above example. This makes the situation terribly complex, and it is why, in the absence of a complete proposal for how a label-wide variant system would work, I have been arguing we shouldn't go down that path. If, however, you have a proposal for how a complete label-based variant system might work, I would be very keen to review it; and I would enthusiastically contribute to its development if it seemed viable. Until I see such a proposal, however, I remain very sceptical that it is possible except in the most constrained environments. And since gTLDs are by definition not constrained environments, I am worried about any suggestion that we ought to work label by label. Best regards, Andrew -- Andrew Sullivan ajs@anvilwalrusden.com
Quoting Andrew:
.., many people do not realise that there is no strict technical restriction at all of what eight bit octet you can put in your zone. DNS labels are made of octets. If you want, in your zone, to put any series of bits you like in there, you can do so. This means that you could just plop UTF-8 directly into the zone; and some people have done this. But, the RFCs (STD 13) also say that it would be better to stick to the hostname rules ("letter, digit, hyphen"). So, in the TLD space, we have mostly stuck to this, for maximum interoperability on the Internet. The closer you are to the root, the more conservative you need to be, since things will break otherwise.
The root is not just constrained by the hostname rule. To ensure that a domain name cannot be confused with a numerical IP address, RFC 1123 requires that "at least the highest-level component label will be alphabetic". Strictly speaking, this bars the ASCII-encoded form of an IDN TLD label (the "A-label") from the root (since every A-label contains hyphens and many also include digits). The necessary clarification of what is meant by "alphabetic", explicitly to permit the inclusion of A-labels in the root, is provided in <http://tools.ietf.org/html/draft-liman-tld-names-04>. (Since that draft recently expired, I assume subsequent action has been taken to address the underlying issue. ??) /Cary
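The point about A-labels and a literal reading of "alphabetic" can be checked with a few lines. This is only a sketch: it uses the "idna" codec built into Python, which implements the older IDNA 2003 rules, and the helper names and the sample label are assumptions chosen for illustration.

```python
def to_a_label(u_label):
    """Encode a single label with Python's built-in "idna" codec."""
    return u_label.encode("idna").decode("ascii")


def is_alphabetic(label):
    """A literal reading of "alphabetic": ASCII letters and nothing else."""
    return label.isascii() and label.isalpha()


if __name__ == "__main__":
    a_label = to_a_label("b\u00fccher")
    print(a_label)                 # xn--bcher-kva
    print(is_alphabetic(a_label))  # False
    print(is_alphabetic("com"))    # True
```

The "xn--" prefix alone puts two hyphens into every A-label, so no A-label can pass the letters-only test, which is why the clarification mentioned above was needed.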
Hi Andrew, I am glad to have provoked this discussion, because I believe many of the study group's members will benefit from more clarification as to what are the specific goals and restrictions in our work. Before going any further, I need to state that my comments intentionally try to avoid any examples, as my intention is to provoke other study group members to check if these concerns apply to their scripts/languages without influencing/constraining this with specific cases. We seem to agree on what a registry is. I too have a mostly technical background and 20 years of running one. I also tend to agree with you on the fact that only a subset of permitted octets is used in DNS. But I still fail to see why a DNS registry at any level would want to not comply with the lowest common denominator (the hostname character limits). Same applies to variants. But see at the end. Your comments on label based variants are completely correct. We might say that in some languages (but not others) using certain scripts, some characters may have been ASCII-ized, usually by removing additional character elements (good example is the Russian E with dots and without). This convention however is strictly language specific. Another language, that uses the same script might not use the same character ASCII-fication, but instead use different variant character(s). Can we declare in such cases that variants are character based and script specific? Let's see what other members of the study groups see. What worries me is the short timeframe we have to complete our work. I have to agree here with Cary Karp, that we may have to postpone this work for another study/working group in the future. One possible way to complete our work more quickly is to make the assumption that 'variants' do not include any language specific word variants and that variants are all character and script based. We might restrict ourselves to only possible application of variants in an ICANN IDN TLD process. 
After all, at the root level the language is never known. Other DNS levels will remain different. This would mean, that a character variant in one script must be valid for any and all languages that use the same script. Otherwise, we ignore it. (and leave it for another study) Best Regards, Daniel
Hi, On Sun, Jun 26, 2011 at 11:21:43PM +0300, Daniel Kalchev wrote:
I am glad to have provoked this discussion, because I believe many of the study group's members will benefit from more clarification as to what are the specific goals and restrictions in our work.
Yes, I'm glad too. This is all very helpful.
I also tend to agree with you on the fact that only a subset of permitted octets is used in DNS. But I still fail to see why a DNS registry at any level would want to not comply with the lowest common denominator (the hostname character limits). Same applies to variants. But see at the end.
Well, any registry publishing (for instance) NAPTR records in support of ENUM will do it, because they always use "underscore labels" (e.g. _sip._udp.example.com). The underscore is not one of the hostname characters. So this is a common use.
We might say that in some languages (but not others) using certain scripts, some characters may have been ASCII-ized, usually by removing additional character elements (good example is the Russian E with dots and without). This convention however is strictly language specific. Another language, that uses the same script might not use the same character ASCII-fication, but instead use different variant character(s). Can we declare in such cases that variants are character based and script specific? Let's see what other members of the study groups see.
Variants are, at the very least, script specific. The problem is that they're not _only_ script-specific, in that in at least some cases for a given language using a given script, a variant in that language is emphatically _not_ a variant for another language using the same script. I suppose if this problem were solved easily, none of us would be working on it!
We might restrict ourselves to only possible application of variants in an ICANN IDN TLD process. After all, at the root level the language is never known. Other DNS levels will remain different.
This would mean, that a character variant in one script must be valid for any and all languages that use the same script. Otherwise, we ignore it. (and leave it for another study) Best Regards, Daniel
That might be a pragmatic rule to adopt. What we'd be saying, in effect, is that if there were possible conflicts with others, then the candidate variant would be ineligible for use in the root (i.e. for top level names). But I have not worked out the implications of that for all the different cases, and it might be very serious. I think this is worth investigating, at least. Thanks for your continued efforts on this vexing topic. Best, A -- Andrew Sullivan ajs@anvilwalrusden.com
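The pragmatic rule discussed above (a character variant is usable in the root only if it holds for every language that uses the script) can be stated as a short sketch. The language tags "lang-a" and "lang-b", the data, and the function name are all hypothetical; no claim is made here about which real languages treat which characters as variants.

```python
# Hypothetical per-language variant pairs within a single script.
# Each pair is an unordered set of two code points.
IE_IO = frozenset({"\u0435", "\u0451"})

LANGUAGE_VARIANTS = {
    "lang-a": {IE_IO},  # this language treats the two letters as variants
    "lang-b": set(),    # this language keeps them distinct
}


def root_eligible(pair, per_language):
    """A pair qualifies for the root only if every language accepts it."""
    if not per_language:
        return False  # no data for the script: treat as not eligible
    return all(pair in pairs for pairs in per_language.values())


if __name__ == "__main__":
    print(root_eligible(IE_IO, LANGUAGE_VARIANTS))
```

With this data the pair is rejected, because one language using the script does not accept it. How many candidate variants would survive such a test across real scripts is exactly the implication that has not yet been worked out.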
participants (15)
- Andrew Sullivan
- Baher Esmat
- Cary Karp
- Daniel Kalchev
- Dillon, Chris
- Dmitry Kohmanyuk
- Dr.Sarmad Hussain
- Iliya Bazlyankov
- James Seng
- Manal Ismail
- Nadya Morozova
- Patrik Fältström
- Satyendra Kumar Pandey
- Siavash Shahshahani
- Vladimir Shadrunov