Quoting Vladimir:
Why can't we just say that a variant is whatever the registry wants to be a variant? The registry only needs to define a unique way of finding out whether String1 is indeed a variant of String2 or not.
I suspect that we are going to need to spend a bit of time sorting out the relationship between this study and the ICANN policy development process. To be sure, the former is intended to inform the latter, but our contribution is the objective consideration and description of issues attaching to "variance" as we end up defining that concept. Policy itself is, however, not made here.

Nonetheless, a look at policies already in effect -- many of which are well entrenched -- might lend some useful focus to our effort. Here is a snippet from the narrative adjunct to the Latin script table provided jointly by .SE (the host organization for the Latin study) and .MUSEUM:

"The repertoire supports numerous languages written using the Latin alphabet and is intended to permit the representation of names derived from European languages, using their native orthographies to the fullest extent possible. There is, however, neither a requirement nor an expectation that a label in a domain name will correspond to a proper name or dictionary word in any language, and many labels deliberately do not have any such attributes. There is therefore no basis for determining the extent to which any word-based restrictions or other language-specific orthographic conventions can be applied here and, in consequence, all registration policies are script based. Any permissible character may appear at any point in a string, with the exception of digits and the hyphen, which may not be in the initial or final positions in a label. [The positional constraint on digits has since been rescinded.] The holder of an IDN is responsible for the orthographic rigor of any proper words or names used as labels. Each representation of a label in an alternative orthography requires separate registration.
For example, the prospective holder of the label 'lättöl' is free to register the correlate 'lattol', without either form imposing any restriction on the availability of the other, or on any further variants using the more than twenty diacritically marked forms of the base 'a' in the Unicode chart, or the similar number of marked forms of 'o'. This also applies to marked or ligated characters that can alternately be represented as digraphs. It is again up to the prospective name holder to make an individual determination as to whether or not there is an equivalence between an umlauted 'ä', and an 'ae' digraph or an 'æ' ligature, or if the 'ä' can acceptably also be indicated with an 'a'. Even if lexicographic rules might be contemplated for reducing the inherent ambiguity, their automated implementation would easily be stymied by reasonable differences between the representations of both proper names and dictionary words: 'encyclopaedia' and 'encyclopædia' could be treated as identical, but 'mueller' and 'müller' cannot, and 'öresund' and 'øresund' can be argued either way."

/Cary
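
To make the ambiguity in the quoted passage concrete, here is a minimal Python sketch of two folding rules that a registry might contemplate. Everything in it is an assumption for illustration only: the function names, the `DIGRAPHS` table, and the idea of comparing folded labels are hypothetical and are not taken from the .SE/.MUSEUM table or from any registry's actual policy.

```python
import unicodedata

# Hypothetical expansion table, for illustration only.
DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue", "æ": "ae", "ø": "oe"}

def strip_marks(label: str) -> str:
    """Fold by dropping combining marks, so 'ä' becomes 'a'.

    Characters with no canonical decomposition, such as 'æ' and 'ø',
    pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def expand_digraphs(label: str) -> str:
    """Fold by replacing marked or ligated characters with digraphs."""
    composed = unicodedata.normalize("NFC", label)
    return "".join(DIGRAPHS.get(ch, ch) for ch in composed)

def is_variant(a: str, b: str, fold) -> bool:
    """Treat two labels as variants if they fold to the same string."""
    return fold(a) == fold(b)

pairs = [
    ("lättöl", "lattol"),
    ("müller", "mueller"),
    ("encyclopædia", "encyclopaedia"),
    ("öresund", "øresund"),
]

for a, b in pairs:
    print(a, b,
          "strip_marks:", is_variant(a, b, strip_marks),
          "expand_digraphs:", is_variant(a, b, expand_digraphs))
```

The two rules disagree on every pair: mark stripping equates only 'lättöl' with 'lattol', while digraph expansion equates the other three pairs and separates that one. Neither rule can tell that 'mueller' and 'müller' may be different names, which is the kind of judgment the quoted policy leaves to the name holder.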