DNSO Archives: [nc-idn]


To: nc-idn@dnso.org
Subject: [nc-idn] Resuming work
From: Elisabeth Porteneuve <Elisabeth.Porteneuve@cetp.ipsl.fr>
Date: Thu, 31 Jan 2002 22:07:25 +0100 (MET)
Sender: owner-nc-idn@dnso.org


NC IDN TF Colleagues,

We have to resume the discussions and work of the IDN Task Force.
My apologies for being away for a long time (part of which was
spent with the ICANN IDN group).

Last December we ended up with a much better understanding of IDN
technical aspects, thanks to presentations from various experts.
See http://www.dnso.org/dnso/notes/20011203.NCIDN-minutes.html
and the Mpeg3 recording.


In summary:
A. Today the domain name code specifications limit the permissible
   code points to a restricted subset of 38 signs: the letters a-z
   (upper and lower case alike, 26 signs), the digits 0-9,
   the hyphen-minus "-" (the so-called "LDH" set), plus the
   label-separating period (with additional rules, such as no hyphen
   at the beginning or at the end of a label).
   In other words, while an octet permits 256 values, 0x00 to 0xFF,
   only 38 of them are used in domain names.
B. Tomorrow the domain name code specifications will extend that
   limit of 38 to a significant part of the Unicode code points,
   which are on 2 octets each; the upper limit is therefore 65536
   possible code points.
Today ICANN policy is (A). It will move to (B). 
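
The rule in (A) can be sketched in a few lines of Python. This is only
an illustration: the names LDH_LABEL, ALLOWED_SIGNS and is_ldh_name are
invented here, and the 63-character cap on a label comes from the DNS
specifications rather than from the summary above.

```python
import re

# One label: letters, digits and hyphen only, with no hyphen at either end.
# The 63-character cap is an assumption taken from the DNS specifications.
LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

# The 38 signs counted in (A): 26 letters (case-insensitive), 10 digits,
# the hyphen-minus and the label-separating period.
ALLOWED_SIGNS = set("abcdefghijklmnopqrstuvwxyz0123456789-.")


def is_ldh_name(name: str) -> bool:
    """Return True if every period-separated label follows the LDH rule."""
    return all(LDH_LABEL.fullmatch(label) for label in name.split("."))


print(len(ALLOWED_SIGNS))           # 38
print(is_ldh_name("dnso.org"))      # True
print(is_ldh_name("-dnso.org"))     # False: hyphen at the start of a label
print(is_ldh_name("caf\u00e9.org")) # False: U+00E9 is outside LDH
```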


The IETF has devised possible algorithms for the use of Unicode
code points and for their encoding in such a way as to preserve
the continuity, interoperability and stability of the Internet.
At this stage the IETF must determine which Unicode signs could
be acceptable in international domain names (that is, which could
be added to the current "LDH" set) and which could not. There are
two ways to approach this, by exclusion or by inclusion: either
determine what is NOT acceptable or determine what IS acceptable.

This is the most important general policy matter concerning domain
names.
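
As a rough illustration of the encoding side mentioned above, the
sketch below uses Python's built-in "punycode" codec. That codec
follows the encoding as later published in RFC 3492, so its output may
differ from the January 2002 draft listed in the bibliography. The
helper names are invented for this example, and no prefix is added to
the encoded label.

```python
import string

# The letters, digits and hyphen that make up today's "LDH" set.
LDH = set(string.ascii_letters + string.digits + "-")


def to_ace_label(label: str) -> str:
    """Encode one Unicode label into an ASCII-only string (no prefix)."""
    return label.encode("punycode").decode("ascii")


def from_ace_label(encoded: str) -> str:
    """Reverse of to_ace_label."""
    return encoded.encode("ascii").decode("punycode")


encoded = to_ace_label("b\u00fccher")  # "bucher" with u-umlaut
print(encoded)                          # bcher-kva
print(set(encoded) <= LDH)              # True: only LDH signs remain
print(from_ace_label(encoded) == "b\u00fccher")  # True: round trip
```

The point of the sketch is only that the encoded form stays within the
existing LDH set, which is how continuity with (A) is preserved.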

The IETF's IDN work is based on the Unicode Consortium's work
[UNICODE]; see the references below. The Unicode tables contain
64K (65536) signs, which are not only language-related code points
but also many others, such as box drawings, geometric shapes,
dingbats, mathematical and technical operators, currency symbols,
musical symbols, and similar, plus punctuation and spacing
characters, many of them related to specific languages.

Use of these classes of characters will increase the risks of user 
confusion and will create vast opportunities for spoofed names 
which would not otherwise exist. The general understanding seems
to be that many of those code points should not be accepted in 
domain names.
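
To make the "by inclusion" approach concrete, here is a small Python
sketch that classifies code points by their Unicode general category.
The policy in is_acceptable (letters, combining marks, decimal digits
and the hyphen) is purely hypothetical and is not the IETF's rule; it
only shows how symbols, currency signs and spacing characters could be
filtered out.

```python
import unicodedata


def is_acceptable(ch: str) -> bool:
    """Hypothetical inclusion policy, for illustration only."""
    if ch == "-":
        return True
    category = unicodedata.category(ch)
    # L* = letters, M* = combining marks, Nd = decimal digits.
    return category[0] in ("L", "M") or category == "Nd"


samples = [
    "a",       # Latin letter
    "\u0421",  # Cyrillic capital letter ES
    "\u4e2d",  # CJK ideograph
    "\u2500",  # box drawing
    "\u2708",  # dingbat
    "\u20ac",  # currency symbol
    "\u2211",  # mathematical operator
    "\u00a0",  # spacing character
]

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"[{unicodedata.category(ch)}] acceptable={is_acceptable(ch)}")
```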

1. Assume that the IETF's work will reduce the permissible code
   points for IDN to language-related code points, which is still
   several tens of thousands of code points.

   Let us look at both the end-user and the intellectual property
   perspectives with an example.

   The word "COBET" reads as it is if one assumes it is in the Latin
   alphabet, but spells "soviet" if one assumes it is Cyrillic.
   The Unicode code point for Cyrillic "C", 0x0421, is different
   from the code point for Latin "C", 0x0043, but the two are
   identical on printed paper, business cards or a screen.
   Consequently, the use of Unicode code points makes it impossible
   to communicate a name to anybody without knowing which language
   is _printed_ or, even worse, which letter or sign is printed in
   which language.

   In the famous TOYS[R]US the R in brackets is the Cyrillic
   code point 0x042f, pronounced "ya", which also happens to look
   like the letter R (pronounced "are") seen in a mirror. With the
   exception of that letter [R], every other letter in TOYS[R]US
   may be read either as a Latin or as a Cyrillic code point:
   different spellings, different code points, identical printing
   on paper or screen.
   For a word of 6 code points, each with the same printed form
   but 2 different code points, there are 2**6 = 64 possible
   combinations. That is the number of times a 6-letter word
   would have to be registered to preserve its full intellectual
   property rights in 2 alphabets, Latin and Cyrillic.
   It is also the maximum number of tries an end-user would have
   to make to reach a website if she or he had only printed
   information.
   I am not competent to extend this example to other alphabets
   or code points. However, as far as I understand, the problem
   with Chinese code points has some similarity.

2. Assume that, despite the above difficulties, or because the
   above difficulties can be dealt with, the new policy with
   regard to tens of thousands of new possible code points
   (characters) in domain names is adopted.

   At the technical level this new policy for domain names
   is reinforced by the IETF standards being developed, whose
   interoperability has been tested.

   At the policy level such a new policy for domain names may
   be contractually enforced by ICANN only on those TLDs
   for which ICANN is authoritative.
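
The "COBET" example and the 2**n count from point 1 can be checked
with a short Python sketch. The LOOKALIKES table is a hypothetical,
deliberately partial mapping from Latin capitals to Cyrillic capitals
that print alike; it covers only the letters of this one word.

```python
from itertools import product

# Hypothetical, partial table: Latin capital -> look-alike Cyrillic capital.
LOOKALIKES = {
    "C": "\u0421",  # CYRILLIC CAPITAL LETTER ES
    "O": "\u041e",  # CYRILLIC CAPITAL LETTER O
    "B": "\u0412",  # CYRILLIC CAPITAL LETTER VE
    "E": "\u0415",  # CYRILLIC CAPITAL LETTER IE
    "T": "\u0422",  # CYRILLIC CAPITAL LETTER TE
}


def variants(word: str) -> list[str]:
    """All strings that print like `word` using Latin or Cyrillic letters."""
    choices = [(ch, LOOKALIKES[ch]) if ch in LOOKALIKES else (ch,)
               for ch in word]
    return ["".join(combo) for combo in product(*choices)]


latin = "COBET"
cyrillic = "".join(LOOKALIKES[ch] for ch in latin)

print([f"U+{ord(ch):04X}" for ch in latin])
print([f"U+{ord(ch):04X}" for ch in cyrillic])
print(latin == cyrillic)      # False: same print, different code points
print(len(variants(latin)))   # 32, that is 2**5 for a 5-letter word
```

With a sixth ambiguous letter the count doubles to the 2**6 = 64
combinations mentioned in point 1.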

Several questions arise:

1. Can the NC IDN TF provide some policy advice with regard
   to the Unicode code points for IDN use?
2. Can the NC IDN TF provide some policy advice concerning
   intellectual property and consumers? And whois?
3. What are the possible outcomes of IDN implementation with
   regard to new TLDs to come?
   With regard to existing gTLDs and ccTLDs?

Your comments on the discussion above are sought.

Elisabeth Porteneuve
--

IDN bibliography: 
(see http://www.dnso.org/dnso/ncidnindex.html; if anything
is missing, please let me know)

*  The Internationalized Domain Names IETF WG 
   http://www.ietf.org/html.charters/idn-charter.html 
   Chair(s): 
   James Seng <jseng@pobox.org.sg> 
   Marc Blanchet <Marc.Blanchet@viagenie.qc.ca> 
   The IDN IETF WG Web site http://www.i-d-n.net/ 

*  IETF drafts related to IDN: 
   1. [AMC-ACE-Z] Adam Costello (4 Sep 2001) 
      http://www.ietf.org/internet-drafts/draft-ietf-idn-amc-ace-z-01.txt 
      The AMC-ACE encoding gained significant support within the
      Internet industry. Its name subsequently became PUNYCODE:
   2. [PUNYCODE] Adam Costello (10 Jan 2002) 
      http://www.ietf.org/internet-drafts/draft-ietf-idn-punycode-00.txt 
   3. [IDNA] P. Faltstrom, "Internationalizing Domain Names in
      Applications" (11 Jan 2002) 
      http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-06.txt 
   4. [NAMEPREP] Paul Hoffman and Marc Blanchet, "Stringprep Profile for
      Internationalized Host Names" (17 Jan 2002) 
      http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-07.txt 
   5. [TC/SC] XiaoDong LEE, Hsu Nai-Wen, Deng Xiang, Erin Chen, Zhang Hong,
      Sun Guonian, "Traditional and Simplified Chinese Conversion"
      (16 Nov 2001) 
      http://www.ietf.org/internet-drafts/draft-ietf-idn-tsconv-02.txt 

*  IETF RFCs: 
   1. [RFC3066] H. Alvestrand, "Tags for Identification of Languages" 
      http://www.ietf.org/rfc/rfc3066.txt 

*  Unicode Consortium: 
   1. [UNICODE] The Unicode Standard, Version 3.1.0: The Unicode
      Consortium. http://www.unicode.org/charts 

*  ICANN Board IDN Committee: 
   http://www.icann.org/committees/idn/