Dear Latin GP members,

Please find below the message from the IP. This is a follow-up from the phone call with the Latin GP, suggesting guidelines for Latin script variant analysis.

Regards,
Pitinan

From: Sarmad Hussain <sarmad.hussain@icann.org>
Date: Wednesday, February 5, 2020 at 16:43
To: Pitinan Kooarmornpatana <pitinan.koo@icann.org>
Subject: Fwd: Follow-up on Phone Call with Latin GP

Dear Pitinan,

Please share with Latin GP from IP.

Regards,
Sarmad

Considerations in designing variant sets

First, we generally say that more blocked variants are "good"; reducing the namespace a bit can make it safer, because it removes instances that can be exploited for malicious registrations. However, as we've recognized with cross-script variants, there's also a cost.

There are two costs. One is that large variant sets tend to link things via transitivity that do not pass the "smell test" for users. They appear unmotivated, and when they lead to a desirable label being blocked by some other label that on the face of it isn't a true look-alike, the LGR comes out looking bad. There was the case of the Cyrillic country code that people claimed looked too much like "br". The similarity was not that close and it was allowed on appeal. An LGR variant, by design, does not allow appeal, so the goal has to be to limit such variants to "clear-cut" cases (where one would expect an appeal to fail, if it had been possible to lodge one).

The other cost is that there is a tendency to impose in-language variants: except for well-defined exceptions, we can assume that any two characters that are used in contrast in a language can (and will) be distinguished by its users (even if they look confusingly similar to people unfamiliar with them). We have been making similar arguments when it comes to in-script and cross-script variants. Preventing simultaneous registrations of words that are seen as clearly different by native users is not ideal. We need to treat that on the "cost" side of the ledger.
This holds even in cases where the distinctions are subtle or where two spellings lead to the same result. For example, for ASCII TLDs, both ".unicorn" and ".unicom" are available (subject to further manual review), as are ".color" and ".colour". When it comes to diacritical marks, these precedents need to be borne in mind.

However, diacritical marks themselves are not always easy to tell apart at small sizes (and/or when an underline interferes that is drawn without a "skip-ink" feature). That also needs to be recognized, because it may well be that they provide an especially convenient target for spoofing.

When it comes to possible confusions, the variant process is intended to deal with cases where one character, if placed inside a label, makes the label look like it is spelled with another character, usually by substituting something that is "out of context". It is often said that labels do not have a linguistic context, but if a label matches the spelling of a word in the user's language (or a wordlike entity like a brand name that the user thinks they recognize), then we know that users do not necessarily read such labels letter by letter, but recognize them as a whole. If the altered "whole" looks acceptable enough, it will be treated as equivalent, even if we assume our "reasonably observant user".

A user in a language that only has a macron above as a diacritic may be unlikely to accept most of the other diacritics as a substitute, but in the right context, a g with dot above, perhaps, might not be noticed when standing in for a g with cedilla (a small comma-like shape above). If, however, the language context in which a user thinks they are recognizing a label includes both a dot above and some other mark, these two are less likely to be conflated, even if they do not occur on the same base character.

So, what we have are a number of principles, some that define benefits, some that define costs for adding or not adding variants.
As is the case with so many other principles, any two individual principles may be in tension with each other. The goal is therefore to judge how much weight to give each one, so that they are collectively best satisfied, even if that means that some won't be.

This also goes with the principle of language neutrality: ideally, we don't want to either burden or treat preferentially any given language to the exclusion of others. However, taking other principles to their logical extreme may sometimes impose too much of a cost. Or, in reverse, some balancing may lead to such an uneven treatment inside a single language that we are better off with more consistency. (This situation isn't new to the Latin script; it has been a constant across the Root Zone LGR, and, again, it is a matter of judgment to get it right.)

For the Latin script:

- If two diacritics are never used in the same language, then there should be very limited cost in making those two a variant pair (unless doing so creates a large variant set with transitivity issues). This is true, in particular, if they occur on the same base character: G with dot above and G with cedilla might be an example; there's little downside and potential benefit in making them variants.
- If the language with the G cedilla also has a dot above some other letter, that language would not benefit from the variant, but the other language still might. If both have a dot above on some letters, but the other language has some other mark above some letters, the case would be even weaker: neither set of users would be very likely to accept a substitute of the other language's marks.
- If some diacritics are practically only ever used in one language (and look quite distinct), then it makes little sense to make them variants of each other. Native users can be expected to discriminate them, and therefore a variant would have substantial cost and little benefit.
- If some diacritics are used contrastively in some language(s) on the same base character, the cost of making them a variant is potentially quite high (blocking many labels that users consider distinct). It might be best to limit variants in that case to diacritics that are particularly prone to being misidentified in other contexts.
- If marks are reasonably distinct in printed presentation (as typical for user interfaces), causing in-language variants between them by transitivity should probably be avoided if possible, even if there's some hypothetical scenario in which certain users might confuse them. An example is tilde vs. o-diaeresis. Even if there's a scenario where these can be substituted in Swedish, that scenario is perhaps not the most obvious, and if it were to lead to some other language having two common characters made variants, it would not pass the cost/benefit analysis.
- Variant sets with cross-script members need to be well-scrutinized for issues of transitivity.
- Overlapped variants cause permutations and need to be curtailed sharply; otherwise the attendant cost, imposed especially in cross-script cases, will violate some of the overall principles against complexity and unexpected behavior.

Hopefully, this summary is a useful guideline to help in reasoning about these variant sets. As always, it's not a recipe or algorithm, but might be an aid in making judgements.
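To see how transitivity can link characters that nobody proposed as a pair, here is a minimal Python sketch; it is not part of the IP's message. It assumes variant relations are given as unordered pairs of single code points and closes them with a small union-find. The function names `variant_sets` and `implied_pairs` are hypothetical, and the second pair below (g with cedilla / g with caron) is invented purely for illustration; only the first pair is the example mentioned in the message.

```python
from collections import defaultdict
from itertools import combinations


def variant_sets(pairs):
    """Group code points into variant sets (transitive closure of the pairs)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for cp in list(parent):
        groups[find(cp)].add(cp)
    return list(groups.values())


def implied_pairs(pairs):
    """Return pairs that end up in one variant set without being declared."""
    declared = {frozenset(p) for p in pairs}
    implied = []
    for group in variant_sets(pairs):
        for a, b in combinations(sorted(group), 2):
            if frozenset((a, b)) not in declared:
                implied.append((a, b))
    return implied


# Illustrative input only, not the GP's proposed pairs.
pairs = [
    ("\u0121", "\u0123"),  # g with dot above ~ g with cedilla (example from the message)
    ("\u0123", "\u01e7"),  # g with cedilla ~ g with caron (hypothetical)
]
print(implied_pairs(pairs))  # pairs created only by transitivity
```

With these two declared pairs, the only implied pair is g with dot above / g with caron, which is the kind of link that would then need its own "smell test".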
Dear colleagues,

If I am reading this correctly, we will need to go through a lot of our proposed variant pairs (and possibly also some of the edge cases where we were considering not making a pair variants) and determine whether they are both used within a language or not. Is that how the rest of you are reading it?

Bill Jouris
Inside Products
bill.jouris@insidethestack.com
831-659-8360
925-855-9512 (direct)

In reply to the message from Pitinan Kooarmornpatana <pitinan.koo@icann.org> of Wednesday, February 5, 2020, 01:56:51 AM PST.

_______________________________________________
Latingp mailing list
Latingp@icann.org
https://mm.icann.org/mailman/listinfo/latingp

_______________________________________________
By submitting your personal data, you consent to the processing of your personal data for purposes of subscribing to this mailing list in accordance with the ICANN Privacy Policy (https://www.icann.org/privacy/policy) and the website Terms of Service (https://www.icann.org/privacy/tos). You can visit the Mailman link above to change your membership status or configuration, including unsubscribing, setting digest-style delivery or disabling delivery altogether (e.g., for a vacation), and so on.
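If the check Bill describes (whether both members of a pair are used within one language) were to be done mechanically, it might look something like the following Python sketch. This is only an assumption about how one could organize the work, not an agreed procedure. The name `languages_with_both` and the `REPERTOIRES` table are hypothetical; the table is a small, partial sample of diacritic letters per language and is not the GP's repertoire data.

```python
# Partial, illustrative repertoires (diacritic letters only); not GP data.
REPERTOIRES = {
    "Maltese": {"\u010b", "\u0121", "\u0127", "\u017c"},  # c-dot, g-dot, h-stroke, z-dot
    "Latvian": {"\u0101", "\u010d", "\u0113", "\u0123", "\u012b",
                "\u0137", "\u013c", "\u0146", "\u0161", "\u016b", "\u017e"},
    "Estonian": {"\u00e4", "\u00f5", "\u00f6", "\u00fc", "\u0161", "\u017e"},
    "Swedish": {"\u00e5", "\u00e4", "\u00f6"},
}


def languages_with_both(a, b, repertoires=REPERTOIRES):
    """List the languages whose repertoire contains both code points."""
    return sorted(
        lang for lang, chars in repertoires.items() if a in chars and b in chars
    )


# g with dot above vs. g with cedilla: no language in this sample uses both.
print(languages_with_both("\u0121", "\u0123"))
# o with tilde vs. o with diaeresis: both occur in the Estonian sample.
print(languages_with_both("\u00f5", "\u00f6"))
```

An empty result only says that no language in the sample uses both characters; it would still take judgment, as the IP's message stresses, to decide whether a pair is worth making a variant.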
Dear Bill,

I'm not understanding it that way. Let's discuss the feedback at one of our next meetings and see what conclusions we need to draw from it. Dennis also wanted to review this new feedback.

Best,
Meikal

In reply to the message from Bill Jouris <bill.jouris@insidethestack.com> of 6 Feb 2020, 02:45 +0100.
Hi Meikal,

I definitely agree that we need to review it, both to be sure we all have the same understanding of what the IP is trying to communicate to us and to see what additional tasks arise from it.

Regards,
Bill

In reply to the message from Meikal Mumin <meikal@mumin.de> of Thu, Feb 13, 2020 at 9:01 AM.
{}#yiv3519786125 filtered {}#yiv3519786125 filtered {}#yiv3519786125 filtered {}#yiv3519786125 filtered {}#yiv3519786125 filtered {}#yiv3519786125 filtered {}#yiv3519786125 filtered {}#yiv3519786125 filtered {}#yiv3519786125 filtered {}#yiv3519786125 ol {margin-bottom:0in;}#yiv3519786125 ul {margin-bottom:0in;}#yiv3519786125 #yiv3519786125 Dear Latin GP members, Please find below the message from the IP. This is a follow-up from the Phone call with Latin GP, suggesting guidelines for Latin script variant analysis. Regards, Pitinan From: Sarmad Hussain <sarmad.hussain@icann.org> Date: Wednesday, February 5, 2020 at 16:43 To: Pitinan Kooarmornpatana <pitinan.koo@icann.org> Subject: Fwd: Follow-up on Phone Call with Latin GP Dear Pitinan, Please share with Latin GP from IP. Regards, Sarmad Considerations in designing variant sets First, we generally say that more blocked variants are "good"; reducing the namespace a bit can make it safer, because it removes instances that can be exploited for malicious registrations. However, as we've recognized with cross-script variants, there's also a cost. There are two costs, one is that large variant sets tend to link things via transitivity that do not pass the "smell test" for users. They appear unmotivated, and when they lead to a desirable label being blocked by some other label that on the face of it isn't a true look-alike, the LGR comes out looking bad. There was the case of the Cyrillic country code that people claimed looked too much like "br". The similarity was not that close and it was allowed on appeal. An LGR variant, by design, does not allow appeal, so the goal has to be to limit such variants to "clear-cut" cases (where one would expect an appeal to fail, if it had been possible to lodge one). 
The other cost is that there is a tendency to impose in-language variants: except for well-defined exceptions we can assume that any two characters that are used in contrast in a language can (and will) be distinguished by its users (even if they look confusingly similar to people unfamiliar with them). We have been making similar arguments when it comes to in-script and cross-script variants. Preventing simultaneous registrations of words that are seen as clearly different by native users is not ideal. We need to treat that on the "cost" side of the ledger. Even in cases where the distinctions are subtle or where two spellings lead to the same result. For example, for ASCII TLD's, both .unicorn and .unicom are available (subject to further manual review) as are ".color" and ".colour". When it comes to diacritical marks, these precedents need to be borne in mind. However, diacritical marks themselves are not always easy to tell apart at small sizes (and/or when an underline interferes that is drawn without a "skip-ink" feature"). That also needs to be recognized, because it may well be that they provide an especially convenient target for spoofing. When it comes to possible confusions, the variant process is intended to deal with cases where one character, if placed inside a label, makes the label look like it is spelled with another character. Usually by substituting something that is "out of context". It is often said that labels do not have a linguistic context, but if a label matches the spelling of a word in the user's language (or a wordlike entity like a brand name that the user thinks they recognize) then we know that users do not necessarily read such labels letter by letter, but recognize them as a whole. If the altered "whole" looks acceptable enough, it will be treated as equivalent - even if we assume our "reasonably observant user". 
A user in a language that only has a macron above as a diacritic may be unlikely to accept most of the other diacritics as a substitute, but in the right context, a g with dot above, perhaps, might not be noticed when standing in for a g with cedilla (a small comma like shape above). If, however, the language context in which a user thinks they are recognizing a label includes both a dot above and some other mark, these two are less likely conflated --- even if they do not occur on the same base character. So, what we have are a number of principles, some that define benefits, some that define costs for adding or not adding variants. Like the case with so many other principles, any two individual principles may be in tension with each other. The goal is therefore to judge how much weight to give each one, so that they are collectively best satisfied, even if that means that some won't be. This also goes with the principle of language neutrality: ideally, we don't want to either burden or treat preferentially any given language to the exclusion of other. However, sometimes, taking other principles to their logical extreme, may impose too much of a cost. Or, in reverse, some balancing may lead to such an uneven treatment inside a single language that we are better off with more consistency. (This situation isn't new to the Latin script, it has been a constant across the Root Zone LGR -- and, again, it is a matter of judgment to get it right). For the Latin script: - If two diacritics are never used in the same language, then there should be very limited cost on making those two a variant pair - unless doing so creates a large variant set with transitivity issues). This is true, in particular, if they occur on the same base character: G with dot above and G with cedilla might be an example; there's little downside and potential benefit of making them variants. 
- If the language with the G cedilla also has a dot above some other letter, that language would not benefit from the variant, but the other language still might. If both have a dot above on some letters, but the other language has some other mark above some letters, the case would be even weaker: neither set of users would be very likely to accept a substitute of the other language's marks.
- If some diacritics are practically only ever used in one language (and look quite distinct), then it makes little sense to make them variants of each other. Native users can be expected to discriminate them, and therefore a variant would have substantial cost and little benefit.
- If some diacritics are used contrastively in some language(s) on the same base character, the cost of making them a variant is potentially quite high (blocking many labels that users consider distinct). It might be best to limit variants in that case to diacritics that are particularly prone to being misidentified in other contexts.
- If marks are reasonably distinct in printed presentation (as is typical for user interfaces), causing in-language variants between them by transitivity should probably be avoided (if possible), even if there is some hypothetical scenario in which certain users might confuse them. An example is tilde vs. o-diaeresis. Even if there's a scenario where these can be substituted in Swedish, that scenario is perhaps not the most obvious, and if it were to lead to some other language having two common characters made variants, it would not pass the cost/benefit analysis.
- Variant sets with cross-script members need to be well scrutinized for issues of transitivity.
- Overlapped variants cause permutations and need to be curtailed sharply; otherwise the attendant cost, especially in cross-script cases, will violate some of the overall principles against complexity and unexpected behavior.
Hopefully, this summary is a useful guideline to help reasoning about these variant sets.
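As a rough illustration of the co-occurrence, transitivity and permutation concerns in the list above, here is a minimal Python sketch. It is not part of any LGR tooling. The repertoire table, the placeholder language names (lang_a and so on), the use of g with caron as a third set member, and all function names are assumptions made up for this example. The label count considers only single-code-point substitutions, which is a simplification of how overlapped variants actually behave.

```python
# Illustrative code points (the first two are the example pair from the text).
G_DOT = "\u0121"        # g with dot above
G_CEDILLA = "\u0123"    # g with cedilla
G_CARON = "\u01e7"      # g with caron (hypothetical third member)
O_TILDE = "\u00f5"      # o with tilde
O_DIAERESIS = "\u00f6"  # o with diaeresis

# Hypothetical per-language repertoires; placeholders, not real data.
REPERTOIRES = {
    "lang_a": {G_DOT},
    "lang_b": {G_CEDILLA},
    "lang_c": {O_TILDE, O_DIAERESIS},
}


def cooccur(a, b, repertoires):
    """Return the languages that use both a and b.

    An empty result suggests a variant pair has limited in-language cost;
    a non-empty result flags possible contrastive use.
    """
    return sorted(name for name, chars in repertoires.items()
                  if a in chars and b in chars)


def variant_sets(pairs):
    """Group code points into variant sets by transitive closure (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def count_variant_labels(label, sets):
    """Count labels reachable by per-character substitution, original included."""
    size = {c: len(s) for s in sets for c in s}
    total = 1
    for c in label:
        total *= size.get(c, 1)
    return total


# Two declared pairs sharing one member merge into a single set of three,
# so g-dot and g-caron become variants without ever being declared as a pair.
sets = variant_sets([(G_DOT, G_CEDILLA), (G_CEDILLA, G_CARON)])
print(len(sets), len(sets[0]))                           # 1 3
print(cooccur(G_DOT, G_CEDILLA, REPERTOIRES))            # []
print(cooccur(O_TILDE, O_DIAERESIS, REPERTOIRES))        # ['lang_c']
print(count_variant_labels(G_DOT + "o" + G_DOT + "o", sets))  # 9
```

The last line shows the multiplicative growth: a label with two positions that each belong to a three-member set expands to 3 x 3 = 9 labels, which is one reason larger sets, and sets merged by transitivity, raise the cost side of the ledger.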
As always, it's not a recipe or algorithm, but might be an aid in making judgements.
_______________________________________________
Latingp mailing list
Latingp@icann.org
https://mm.icann.org/mailman/listinfo/latingp
_______________________________________________
By submitting your personal data, you consent to the processing of your personal data for purposes of subscribing to this mailing list accordance with the ICANN Privacy Policy (https://www.icann.org/privacy/policy) and the website Terms of Service (https://www.icann.org/privacy/tos). You can visit the Mailman link above to change your membership status or configuration, including unsubscribing, setting digest-style delivery or disabling delivery altogether (e.g., for a vacation), and so on.
participants (3): Bill Jouris, Meikal Mumin, Pitinan Kooarmornpatana