[nc-idn] Resuming work
NC IDN TF Colleagues,
We have to resume the discussions and work of the IDN Task Force.
My apologies for being away for a long time (part of which was
spent with the ICANN IDN group).
Last December we ended up with a much better understanding of IDN
technical aspects, thanks to presentations from various experts
and the Mpeg3 recording.
A. Today, domain name code specifications limit the permissible
code points to a restricted subset of 38 signs: the letters a-z
(upper and lower case alike, 26 signs), the digits 0-9 and
the hyphen-minus "-" (the so-called "LDH" set), plus the
label-separating period (with additional rules, such as no
hyphen-minus at the beginning or at the end of a label).
In other words, although an octet permits 256 values, 0x00 to 0xFF,
only 38 of them are used in domain names.
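The rule in (A) can be sketched in a few lines of Python. This is
only an illustration of the LDH restriction as described above, not
a complete validator: the function names are hypothetical, and
length limits on labels and names are deliberately left out.

```python
import string

# The LDH set: letters (case-insensitive), digits, hyphen-minus.
LDH_CHARS = set(string.ascii_letters + string.digits + "-")

# 26 letters + 10 digits + hyphen-minus + label-separating period.
SIGN_COUNT = len(string.ascii_lowercase) + len(string.digits) + 2


def is_ldh_label(label):
    """True if the label uses only LDH signs and has no edge hyphen."""
    if not label:
        return False
    if label[0] == "-" or label[-1] == "-":
        return False
    return all(ch in LDH_CHARS for ch in label)


def is_ldh_name(name):
    """True if every period-separated label of the name is LDH."""
    return all(is_ldh_label(label) for label in name.split("."))


if __name__ == "__main__":
    print(SIGN_COUNT)                    # number of signs in use
    print(is_ldh_name("www.dnso.org"))   # plain LDH name
    print(is_ldh_name("-bad.org"))       # leading hyphen-minus
```

Anything outside this set, for instance an accented or Cyrillic
letter, fails the check under policy (A).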
B. Tomorrow, domain name code specifications will extend that
limit of 38 to a significant part of the Unicode code points.
Those of the Basic Multilingual Plane are on 2 octets each,
which gives an upper limit of 65536 possible code points for
that plane alone.
Today ICANN policy is (A). It will move to (B).
The IETF has devised possible algorithms for the usage of Unicode
code points and for their encoding, in such a way as to preserve
the continuity, interoperability and stability of the Internet.
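As an illustration of such an encoding, here is a minimal sketch
using the "punycode" codec shipped with Python, which implements
the algorithm as later published (the drafts listed in the
references may differ in detail). The "xn--" prefix and the helper
names are assumptions made for this example, and the [NAMEPREP]
preparation step is skipped entirely.

```python
# Assumed ACE prefix; hypothetical helper names.
ACE_PREFIX = "xn--"


def to_ace(label):
    """Encode one label into an LDH-only form if it is not ASCII."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(label):
    """Reverse to_ace() for labels carrying the assumed prefix."""
    if label.startswith(ACE_PREFIX):
        body = label[len(ACE_PREFIX):]
        return body.encode("ascii").decode("punycode")
    return label


if __name__ == "__main__":
    # "bucher" with u-umlaut, written with an escape to stay ASCII
    print(to_ace("b\u00fccher"))
```

The encoded form contains only LDH signs, so existing DNS software
keeps working, while applications convert to and from Unicode.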
At this stage the IETF must determine which Unicode signs could
be acceptable in international domain names (and so could be added
to the current "LDH" set), and which could not. There are two ways
to approach this, by exclusion or by inclusion: either determine
what is NOT acceptable, or determine what IS acceptable.
This is the most important general policy matter concerning domain
names.
The IETF's IDN work is based on the work of the Unicode Consortium
[UNICODE]; see references below. The Unicode tables have room for
64K (65536) signs in the Basic Multilingual Plane alone. These are
not only language-related code points, but also many others such
as box drawings, geometric shapes, dingbats, mathematical and
technical operators, currency symbols, musical symbols, and
similar, plus punctuation characters and spacing characters,
a lot of them related to specific languages.
Use of these classes of characters will increase the risks of user
confusion and will create vast opportunities for spoofed names
which would not otherwise exist. The general understanding seems
to be that many of those code points should not be accepted in
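The classes of signs listed above, and the difference between
deciding by inclusion and by exclusion, can be sketched with the
Unicode "general category" property. Both rules below are
hypothetical illustrations, not proposals from the IETF drafts, and
Python's unicodedata module reflects a later Unicode version than
the 3.1.0 cited in the references.

```python
import unicodedata


def acceptable_by_inclusion(ch):
    """Hypothetical rule: accept only letters, marks, decimal
    digits and the hyphen-minus; everything else is refused."""
    cat = unicodedata.category(ch)
    return ch == "-" or cat[0] in ("L", "M") or cat == "Nd"


def acceptable_by_exclusion(ch):
    """Hypothetical rule: refuse symbols, punctuation, separators
    and control/unassigned code points; everything else passes."""
    cat = unicodedata.category(ch)
    return ch == "-" or cat[0] not in ("S", "P", "Z", "C")


SAMPLES = [
    "\u0421",  # Cyrillic capital letter
    "\u2500",  # box drawing
    "\u25a0",  # geometric shape
    "\u2211",  # mathematical operator
    "\u20ac",  # currency symbol
    "\u3000",  # spacing character
    "\u00bd",  # fraction one half: the two rules disagree here
]

if __name__ == "__main__":
    for ch in SAMPLES:
        print("U+%04X %-3s incl=%s excl=%s" % (
            ord(ch), unicodedata.category(ch),
            acceptable_by_inclusion(ch), acceptable_by_exclusion(ch)))
```

The two approaches agree on the obvious cases and differ on
whatever neither list mentions: inclusion refuses it by default,
exclusion lets it through.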
1. Assume that the IETF work will reduce the permissible code
points for IDN to language-related code points, which still leaves
several tens of thousands of code points.
Let us take a glimpse at both the end-user and the intellectual
property perspectives with an example.
The word "COBET" reads as it is if one assumes it is in the Latin
alphabet, but spells "soviet" if one assumes it is Cyrillic.
The Unicode code point for Cyrillic "C", 0x0421, is different
from the code point for Latin "C", 0x0043, but the two are
identical on printed paper, business cards or a screen.
Consequently, with Unicode code points it becomes impossible
to communicate a name to anybody without knowing which language
is _printed_, or, even worse, which letter or sign is printed
in which language.
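The "COBET" example can be checked directly. The sketch below
writes the Cyrillic word with escapes so that the two strings are
distinguishable in the source; the variable and function names are
chosen for this illustration only.

```python
import unicodedata

LATIN = "COBET"
# The Cyrillic word, code point by code point.
CYRILLIC = "\u0421\u041e\u0412\u0415\u0422"


def code_points(word):
    """Return the code points of a word as 0xNNNN strings."""
    return ["0x%04X" % ord(ch) for ch in word]


if __name__ == "__main__":
    print(LATIN == CYRILLIC)  # same look, different content
    for lat, cyr in zip(LATIN, CYRILLIC):
        print("0x%04X %-26s | 0x%04X %s" % (
            ord(lat), unicodedata.name(lat),
            ord(cyr), unicodedata.name(cyr)))
```

The two strings print alike in most fonts, yet they compare as
unequal and no code point is shared between them.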
In the famous TOYS[R]US, the R in brackets is the Cyrillic
code point 0x042f, pronounced "ya", which also happens to be
the letter R as seen in a mirror, pronounced "are". With the
exception of that letter [R], any other letter in TOYS[R]US may
be read either as a Latin or as a Cyrillic code point: different
spellings, different code points, identical printing on paper
or screen.
In the example of a word of 6 code points, each with the same
printing but 2 different contents, there are 2**6 = 64 possible
combinations. This is the number of times a 6-letter word
would have to be registered to preserve its whole intellectual
property rights in 2 alphabets, Latin and Cyrillic.
It is also the maximal number of tries an end-user would have
to make to get to a website, if she or he got only a printed
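The 2**6 count can be reproduced by enumerating the mixed-script
spellings. The lookalike table below is a small, partial selection
made for this sketch, and "COMETA" is a hypothetical 6-letter word
chosen only because each of its letters has a Cyrillic lookalike.

```python
from itertools import product

# Partial, illustrative table: Latin capital -> Cyrillic lookalike.
LOOKALIKE = {
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
}


def variants(word):
    """All spellings of word that print alike, mixing both scripts."""
    choices = [(ch, LOOKALIKE[ch]) if ch in LOOKALIKE else (ch,)
               for ch in word]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    print(len(variants("COMETA")))  # 6 letters, 2 choices each
    print(len(variants("COBET")))   # 5 letters, 2 choices each
```

Each additional lookalike letter doubles the count, and each
additional script with lookalikes raises the base above 2.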
I do not have the competence to extend this example to other
alphabets or code points. However, as far as I understand,
the problem of Chinese code points has some similarity.
2. Assume that despite the above difficulties, or because the
above difficulties can be dealt with, the new policy with
regard to tens of thousands of new possible code points
(characters) in domain names is adopted.
At the technical level, this new policy for domain names
is reinforced by the IETF standards being developed, whose
interoperability has been tested.
At the policy level, such a new policy for domain names may
be contractually reinforced by ICANN only for those TLDs
for which ICANN is authoritative.
Several questions arise:
1. Can the NC IDN TF provide some policy advice with regard
to the Unicode code points for IDN use?
2. Can the NC IDN TF provide some policy advice concerning
intellectual property and consumers? And Whois?
3. What are the possible outcomes of IDN implementation with
regard to new TLDs to come?
With regard to existing gTLDs and ccTLDs?
Your comments on the dissertation above are sought.
(see http://www.dnso.org/dnso/ncidnindex.html; if anything
is missing, please let me know)
* The Internationalized Domain Names IETF WG
James Seng <firstname.lastname@example.org>
Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
The IDN IETF WG Web site http://www.i-d-n.net/
* IETF drafts related to IDN:
1. [AMC-ACE-M] Adam Costello (4 Sep 2001)
The choice of the AMC-ACE encoding received significant support
within the Internet industry. It was subsequently renamed PUNYCODE:
2. [PUNYCODE] Adam Costello (10 Jan 2002)
3. [IDNA] P. Faltstrom, "Internationalizing Domain Names in
Applications" (11 Jan 2002)
4. [NAMEPREP] Paul Hoffman and Marc Blanchet, "Stringprep Profile for
Internationalized Host Names" (17 Jan 2002)
5. [TC/SC] XiaoDong LEE, Hsu Nai-Wen, Deng Xiang, Erin Chen, Zhang Hong,
Sun Guonian, "Traditional and Simplified Chinese Conversion"
(16 Nov 2001)
* IETF RFCs:
1. [RFC3066] H. Alvestrand, "Tags for Identification of Languages"
* Unicode Consortium:
1. [UNICODE] The Unicode Standard, Version 3.1.0: The Unicode
* ICANN Board IDN Committee: