draft-ietf-ltru-matching-01.txt   draft-ietf-ltru-matching-02.txt 
Network Working Group A. Phillips, Ed. Network Working Group A. Phillips, Ed.
Internet-Draft Quest Software Internet-Draft Quest Software
Expires: December 1, 2005 M. Davis, Ed. Expires: December 12, 2005 M. Davis, Ed.
IBM IBM
May 30, 2005 June 10, 2005
Matching Language Identifiers Matching Language Identifiers
draft-ietf-ltru-matching-01 draft-ietf-ltru-matching-02
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on December 1, 2005. This Internet-Draft will expire on December 12, 2005.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2005). Copyright (C) The Internet Society (2005).
Abstract Abstract
This document describes different mechanisms for comparing and This document describes different mechanisms for comparing and
matching the tags for the identification of languages defined by [RFC matching the tags for the identification of languages defined by [RFC
3066bis] [1]. Possible algorithms for language negotiation and 3066bis] [1]. Possible algorithms for language negotiation and
content selection are described. This document obsoletes portions of content selection are described. This document obsoletes portions of
[RFC 3066] [19]. [RFC 3066] [19].
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. The Language Range . . . . . . . . . . . . . . . . . . . . . . 4 2. The Language Range . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Basic Language Range . . . . . . . . . . . . . . . . . . . 4 2.1 Basic Language Range . . . . . . . . . . . . . . . . . . . 4
2.1.1 Matching . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Matching . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Lookup . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Lookup . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Extended Language Range . . . . . . . . . . . . . . . . . 6 2.2 Extended Language Range . . . . . . . . . . . . . . . . . 6
2.2.1 Extended Range Matching . . . . . . . . . . . . . . . 7 2.2.1 Extended Range Matching . . . . . . . . . . . . . . . 7
2.2.2 Extended Range Lookup . . . . . . . . . . . . . . . . 8 2.2.2 Extended Range Lookup . . . . . . . . . . . . . . . . 8
2.2.3 Scored Matching . . . . . . . . . . . . . . . . . . . 9 2.2.3 Scored Matching . . . . . . . . . . . . . . . . . . . 9
2.3 Meaning of Language Tags and Ranges . . . . . . . . . . . 10 2.3 Meaning of Language Tags and Ranges . . . . . . . . . . . 10
2.4 Choosing Between Alternate Matching Schemes . . . . . . . 11 2.4 Choosing Between Alternate Matching Schemes . . . . . . . 11
2.5 Considerations for Private Use Subtags . . . . . . . . . . 11 2.5 Considerations for Private Use Subtags . . . . . . . . . . 12
2.6 Length Considerations in Matching . . . . . . . . . . . . 12 2.6 Length Considerations in Matching . . . . . . . . . . . . 12
3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14 3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14
4. Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4. Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5. Security Considerations . . . . . . . . . . . . . . . . . . . 16 5. Security Considerations . . . . . . . . . . . . . . . . . . . 16
6. Character Set Considerations . . . . . . . . . . . . . . . . . 17 6. Character Set Considerations . . . . . . . . . . . . . . . . . 17
7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18
7.1 Normative References . . . . . . . . . . . . . . . . . . . 18 7.1 Normative References . . . . . . . . . . . . . . . . . . . 18
7.2 Informative References . . . . . . . . . . . . . . . . . . 19 7.2 Informative References . . . . . . . . . . . . . . . . . . 19
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 19 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 19
A. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 20 A. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 20
skipping to change at page 4, line 35 skipping to change at page 4, line 35
sequence of subtags. A basic language range can be represented by a sequence of subtags. A basic language range can be represented by a
'language-range' tag, by using the definition from HTTP/1.1 [10] : 'language-range' tag, by using the definition from HTTP/1.1 [10] :
language-range = language-tag / "*" language-range = language-tag / "*"
That is, a language-range has the same syntax as a language-tag or is That is, a language-range has the same syntax as a language-tag or is
the single character "*". This definition of language-range implies the single character "*". This definition of language-range implies
that there is a semantic relationship between tags that share the that there is a semantic relationship between tags that share the
same prefix. same prefix.
In particular, the set of language tags that match a specific In particular, the set of language tags that match a specific
language-range may not all be mutually intelligible. The use of a language-range might not all be mutually intelligible. The use of a
prefix when matching tags to language ranges does not imply that prefix when matching tags to language ranges does not imply that
language tags are assigned to languages in such a way that it is language tags are assigned to languages in such a way that it is
always true that if a user understands a language with a certain tag, always true that if a user understands a language with a certain tag,
then this user will also understand all languages with tags for which then this user will also understand all languages with tags for which
this tag is a prefix. The prefix rule simply allows the use of this tag is a prefix. The prefix rule simply allows the use of
prefix tags if this is the case. prefix tags if this is the case.
When working with tags and ranges you should also note the following: When working with tags and ranges you SHOULD also note the following:
1. Private-use and Extension subtags are normally orthogonal to 1. Private-use and Extension subtags are normally orthogonal to
language tag fallback. Implementations should ignore language tag fallback. Implementations SHOULD ignore
unrecognized private-use and extension subtags when performing unrecognized private-use and extension subtags when performing
language tag fallback. Since these subtags are always at the end language tag fallback. Since these subtags are always at the end
of the sequence of subtags, they don't normally interfere with of the sequence of subtags, they don't normally interfere with
the use of prefixes for matching in the schemes described below. the use of prefixes for matching in the schemes described below.
2. Implementations that choose not to interpret one or more private- 2. Implementations that choose not to interpret one or more private-
use or extension subtags should not remove or modify these use or extension subtags SHOULD NOT remove or modify these
extensions in content that they are processing. When a language extensions in content that they are processing. When a language
tag instance is to be used in a specific, known protocol, and is tag instance is to be used in a specific, known protocol, and is
not being passed through to other protocols, language tags may be not being passed through to other protocols, language tags MAY be
filtered to remove subtags and extensions that are not supported filtered to remove subtags and extensions that are not supported
by that protocol. This should be done with caution, since it is by that protocol. Such filtering SHOULD be avoided, if possible,
removing information that may be relevant if services on the since it removes information that might be relevant if services
other end of the protocol would make use of that information. on the other end of the protocol would make use of that
information.
3. Some applications of language tags may want or need to consider 3. Some applications of language tags might want or need to consider
extensions and private-use subtags when matching tags. If extensions and private-use subtags when matching tags. If
extensions and private-use subtags are included in a matching or extensions and private-use subtags are included in a matching or
filtering process that utilizes one of the schemes described filtering process that utilizes one of the schemes described
in this document, then the implementation should canonicalize the in this document, then the implementation SHOULD canonicalize the
language tags and/or ranges before performing the matching. Note language tags and/or ranges before performing the matching. Note
that language tag processors that claim to be "well-formed" that language tag processors that claim to be "well-formed"
processors as defined in [1] generally fall into this category. processors as defined in [1] generally fall into this category.
There are two matching schemes that are commonly associated with There are two matching schemes that are commonly associated with
basic language ranges: matching and lookup. basic language ranges: matching and lookup.
2.1.1 Matching 2.1.1 Matching
Language tag matching is used to select all content that matches a Language tag matching is used to select all content that matches a
skipping to change at page 5, line 45 skipping to change at page 5, line 46
a web page in a particular language, it might use language tag a web page in a particular language, it might use language tag
matching to select the content to which the style is applied. matching to select the content to which the style is applied.
A language-range matches a language-tag if it exactly equals the tag, A language-range matches a language-tag if it exactly equals the tag,
or if it exactly equals a prefix of the tag such that the first or if it exactly equals a prefix of the tag such that the first
character following the prefix is "-". (That is, the language-range character following the prefix is "-". (That is, the language-range
"en-de" matches the language tag "en-DE-boont", but not the language "en-de" matches the language tag "en-DE-boont", but not the language
tag "en-Deva".) tag "en-Deva".)
The special range "*" matches any tag. A protocol which uses The special range "*" matches any tag. A protocol which uses
language ranges may specify additional rules about the semantics of language ranges MAY specify additional rules about the semantics of
"*"; for instance, HTTP/1.1 specifies that the range "*" matches only "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
languages not matched by any other range within an "Accept-Language:" languages not matched by any other range within an "Accept-Language:"
header. header.
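The prefix rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not text from either draft: the function name basic_match is hypothetical, and comparison is case-insensitive, as the "en-de" example implies.

```python
def basic_match(language_range: str, tag: str) -> bool:
    """Return True if a basic language range matches a language tag.

    Hypothetical helper: the range matches when it equals the tag, or
    when it equals a prefix of the tag that is followed by "-".
    Comparison is case-insensitive.
    """
    if language_range == "*":
        # The special range "*" matches any tag.
        return True
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")
```

With this sketch, basic_match("en-de", "en-DE-boont") is true, while basic_match("en-de", "en-Deva") is false because the character following the prefix is not "-". Protocol-specific semantics for "*", such as the HTTP/1.1 rule, are not modeled here.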
2.1.2 Lookup 2.1.2 Lookup
Content lookup is used to select the single information item that Content lookup is used to select the single information item that
best matches the language range for a given request. In lookup, the best matches the language range for a given request. In lookup, the
language range represents the most specific tag which is an language range represents the most specific tag which is an
acceptable match and only the closest matching item is returned. acceptable match and only the closest matching item is returned.
skipping to change at page 6, line 36 skipping to change at page 6, line 40
This scheme allows some flexibility in finding content. It also This scheme allows some flexibility in finding content. It also
typically provides better results when data is not available at a typically provides better results when data is not available at a
specific level of tag granularity or is sparsely populated (than if specific level of tag granularity or is sparsely populated (than if
the default language for the system or content were used). the default language for the system or content were used).
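One possible reading of lookup is sketched below in Python. The details are assumptions and are not taken from the text shown here: the sketch assumes that fallback proceeds by removing subtags from the end of the range one at a time, and the names lookup, available, and default are hypothetical.

```python
def lookup(language_range, available, default=None):
    """Return the single closest item for a basic language range.

    Assumption: fallback works by progressively truncating the range
    from the end until an available tag equals it. Comparison is
    case-insensitive; the original spelling of the available tag is
    returned.
    """
    by_lower = {tag.lower(): tag for tag in available}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()  # drop the last subtag and try the shorter range
    # Nothing matched at any level of granularity.
    return default
```

For example, lookup("zh-Hant-CN", ["zh", "zh-Hant", "en"]) would return "zh-Hant", the closest available item, rather than every tag sharing the prefix.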
2.2 Extended Language Range 2.2 Extended Language Range
Prefix matching using a Basic Language Range, as described above, is Prefix matching using a Basic Language Range, as described above, is
not always the most appropriate way to access the information not always the most appropriate way to access the information
contained in language tags when selecting or filtering content. Some contained in language tags when selecting or filtering content. Some
applications may wish to define a more granular matching scheme and applications might wish to define a more granular matching scheme and
such a matching scheme requires the ability to specify the various such a matching scheme requires the ability to specify the various
attributes of a language tag in the language range. An extended attributes of a language tag in the language range. An extended
language range can be represented by the following ABNF: language range can be represented by the following ABNF:
extended-language-range = grandfathered / privateuse / range extended-language-range = grandfathered / privateuse / range
range = ( lang [ "-" script ] [ "-" region ] *( "-" variant ) range = ( lang [ "-" script ] [ "-" region ] *( "-" variant )
[ "-" privateuse ] ) [ "-" privateuse ] )
lang = ( 2*8ALPHA *[ "-" extlang ] ) / "*" lang = ( 2*8ALPHA *[ "-" extlang ] ) / "*"
extlang = 3ALPHA / "*" extlang = 3ALPHA / "*"
script = 4ALPHA / "*" script = 4ALPHA / "*"
region = 2ALPHA / 3DIGIT / "*" region = 2ALPHA / 3DIGIT / "*"
variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*" variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*"
privateuse = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) ) privateuse = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) )
grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) ) grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) )
skipping to change at page 7, line 14 skipping to change at page 7, line 27
In an extended language range, the identifier takes the form of a In an extended language range, the identifier takes the form of a
series of subtags which must consist of well-formed subtags or the series of subtags which must consist of well-formed subtags or the
special subtag "*". For example, the language range "en-*-US" special subtag "*". For example, the language range "en-*-US"
specifies a primary language of 'en', followed by any script subtag, specifies a primary language of 'en', followed by any script subtag,
followed by the region subtag 'US'. followed by the region subtag 'US'.
A field not present in the middle of an extended language range MAY A field not present in the middle of an extended language range MAY
be treated as if the field contained a "*". For example, the range be treated as if the field contained a "*". For example, the range
"en-US" MAY be considered to be equivalent to the range "en-*-US". "en-US" MAY be considered to be equivalent to the range "en-*-US".
There are several matching algorithms or schemes which may be applied There are several matching algorithms or schemes which can be applied
when matching extended language ranges to language tags. when matching extended language ranges to language tags.
2.2.1 Extended Range Matching 2.2.1 Extended Range Matching
In extended range matching, the subtags in a language tag are In extended range matching, the subtags in a language tag are
compared to the corresponding subtags in the extended language range. compared to the corresponding subtags in the extended language range.
A subtag is considered to match if it exactly matches the A subtag is considered to match if it exactly matches the
corresponding subtag in the range or the range contains a subtag with corresponding subtag in the range or the range contains a subtag with
the value "*" (which matches all subtags, including the empty the value "*" (which matches all subtags, including the empty
subtag). Extended Range Matching is an extension of basic matching subtag). Extended Range Matching is an extension of basic matching
(Section 2.1.1): the language range represents the least specific tag (Section 2.1.1): the language range represents the least specific tag
which is an acceptable match. which is an acceptable match.
By default all extensions and their subtags are ignored for extended By default all extensions and their subtags are ignored for extended
language range matching. language range matching.
Private use subtags may be specified in the language range and MUST Private use subtags MAY be specified in the language range and MUST
NOT be ignored when matching. NOT be ignored when matching.
Subtags not specified, including those at the end of the language Subtags not specified, including those at the end of the language
range, are assigned the value "*". This makes each range into a range, are assigned the value "*". This makes each range into a
prefix much like that used in basic language range matching. For prefix much like that used in basic language range matching. For
example, the extended language range "zh-*-CN" matches all of the example, the extended language range "zh-*-CN" matches all of the
following tags because the unspecified variant field is expanded to following tags because the unspecified variant field is expanded to
"*": "*":
zh-Hant-CN zh-Hant-CN
skipping to change at page 8, line 19 skipping to change at page 8, line 29
subtag is considered to match if it exactly matches the corresponding subtag is considered to match if it exactly matches the corresponding
subtag in the range or the range contains a subtag with the value "*" subtag in the range or the range contains a subtag with the value "*"
(which matches all subtags, including the empty subtag). Extended (which matches all subtags, including the empty subtag). Extended
language range lookup is an extension of basic lookup language range lookup is an extension of basic lookup
(Section 2.1.2): the language range represents the most specific tag (Section 2.1.2): the language range represents the most specific tag
which will form an acceptable match. which will form an acceptable match.
Subtags not specified are assigned the value "*" prior to performing Subtags not specified are assigned the value "*" prior to performing
tag matching. Unlike in extended range matching, however, fields at tag matching. Unlike in extended range matching, however, fields at
the end of the range MUST NOT be expanded in this manner. For the end of the range MUST NOT be expanded in this manner. For
example, "en-US" must not be considered to be the same as the range example, "en-US" MUST NOT be considered to be the same as the range
"en-US-*". This allows ranges to be specific. The "*" wildcard MUST "en-US-*". This allows ranges to be specific. The "*" wildcard MUST
be used at the end of the range to indicate that all tags with the be used at the end of the range to indicate that all tags with the
range as a prefix are allowable matches. That is, the range "zh-*" range as a prefix are allowable matches. That is, the range "zh-*"
matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh" matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
matches neither of those tags. matches neither of those tags.
The wildcard "*" at the end of a range SHOULD be considered to match The wildcard "*" at the end of a range SHOULD be considered to match
any private use subtag sequences (making extended language range any private use subtag sequences (making extended language range
lookup function exactly like extended range matching Section 2.2.1). lookup function exactly like extended range matching Section 2.2.1).
By default all extensions and their subtags SHOULD be ignored for By default all extensions and their subtags SHOULD be ignored for
extended language range lookup. Private use subtags may be specified extended language range lookup. Private use subtags MAY be specified
in the language range and MUST NOT be ignored when performing lookup. in the language range and MUST NOT be ignored when performing lookup.
The wildcard "*" at the end of a range SHOULD be considered to match The wildcard "*" at the end of a range SHOULD be considered to match
any private use subtag sequences in addition to variants. any private use subtag sequences in addition to variants.
For example, the range "*-US" matches all of the following tags: For example, the range "*-US" matches all of the following tags:
en-US en-US
en-Latn-US en-Latn-US
en-US-r-extends (extensions are ignored) en-US-r-extends (extensions are ignored)
fr-US fr-US
For example, the range "en-*-US" matches _none_ of the following For example, the range "en-*-US" matches _none_ of the following
tags: tags:
fr-US fr-US
en (missing region US) en (missing region US)
skipping to change at page 9, line 20 skipping to change at page 9, line 31
en-Latn en-Latn
en-Latn-US en-Latn-US
en-Latn-US-scouse en-Latn-US-scouse
en-US en-US
en-scouse en-scouse
It should be noted that the ability to be specific in extended range Note that the ability to be specific in extended range lookup can
lookup may make this matching scheme a more appropriate replacement make this matching scheme a more appropriate replacement for basic
for basic matching than the extended range matching scheme. matching than the extended range matching scheme.
2.2.3 Scored Matching 2.2.3 Scored Matching
In the "scored matching" scheme, the extended language range and the In the "scored matching" scheme, the extended language range and the
language tags are pre-normalized by mapping grandfathered and language tags are pre-normalized by mapping grandfathered and
obsolete tags into modern equivalents. obsolete tags into modern equivalents.
The language range and the language tags are normalized into The language range and the language tags are normalized into
quadruples of the form (language, script, country, variant), where quadruples of the form (language, script, country, variant), where
extended language is considered part of language and x-private-codes extended language is considered part of language and x-private-codes
are considered part of the language if they are initial and part of are considered part of the language if they are initial and part of
the variant if not initial. Missing components are set to "*". An the variant if not initial. Missing components are set to "*". An
"*" pattern becomes the quadruple ("*", "*", "*", "*"). "*" pattern becomes the quadruple ("*", "*", "*", "*").
Each language tag being matched or filtered is assigned a "quality Each language tag being matched or filtered is assigned a "quality
value" such that higher values indicate better matches and lower value" such that higher values indicate better matches and lower
values indicate worse ones. If the language matches, add 8 to the values indicate worse ones. If the language matches, add 8 to the
quality value. If the script matches, add 4 to the quality value. quality value. If the script matches, add 4 to the quality value.
If the region matches, add 2 to the quality value. If the variant If the region matches, add 2 to the quality value. If the variant
matches, add 1 to the quality value. Elements of the quadruples are matches, add 1 to the quality value. Elements of the quadruples are
considered to match if they are the same or if one of them is "*". considered to match if they are the same or if one of them is "*".
A value of 15 is a perfect match; 0 is no match at all. Different A value of 15 is a perfect match; 0 is no match at all. Different
values may be more or less appropriate for different applications and values could be more or less appropriate for different applications
implementations should probably allow users to choose the most and implementations SHOULD probably allow users to choose the most
appropriate selection value. appropriate selection value.
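The scoring rule can be illustrated with a short Python sketch. It assumes the range and tag have already been normalized into (language, script, region, variant) quadruples as described above. The function name quality_value is hypothetical, and the case-insensitive comparison is an assumption of this sketch.

```python
# Weights for (language, script, region, variant), as given in the text.
WEIGHTS = (8, 4, 2, 1)

def quality_value(range_quad, tag_quad):
    """Score a tag quadruple against a range quadruple.

    Elements match if they are the same (case-insensitively, an
    assumption of this sketch) or if either one is "*". Each matching
    element adds its weight to the quality value.
    """
    total = 0
    for weight, r, t in zip(WEIGHTS, range_quad, tag_quad):
        if r == "*" or t == "*" or r.lower() == t.lower():
            total += weight
    return total
```

Under this sketch, the range ("en", "*", "US", "*") scores 15 against ("en", "Latn", "US", "*"). It scores 7 against ("fr", "*", "US", "*"), because the wildcard script and variant fields and the region still contribute even though the language differs. An application that requires the language to match would need a threshold or an additional check.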
2.3 Meaning of Language Tags and Ranges 2.3 Meaning of Language Tags and Ranges
A language tag defines a language as spoken (or written, signed or A language tag defines a language as spoken (or written, signed or
otherwise signaled) by human beings for communication of information otherwise signaled) by human beings for communication of information
to other human beings. to other human beings.
If a language tag B contains language tag A as a prefix, then B is If a language tag B contains language tag A as a prefix, then B is
typically "narrower" or "more specific" than A. For example, "zh- typically "narrower" or "more specific" than A. For example, "zh-
Hant-TW" is more specific than "zh-Hant". Hant-TW" is more specific than "zh-Hant".
This relationship is not guaranteed in all cases: specifically, This relationship is not guaranteed in all cases: specifically,
languages that begin with the same sequence of subtags are NOT languages that begin with the same sequence of subtags are NOT
guaranteed to be mutually intelligible, although they may be. guaranteed to be mutually intelligible, although they might be.
For example, the tag "az" shares a prefix with both "az-Latn" For example, the tag "az" shares a prefix with both "az-Latn"
(Azerbaijani written using the Latin script) and "az-Cyrl" (Azerbaijani written using the Latin script) and "az-Cyrl"
(Azerbaijani written using the Cyrillic script). A person fluent in (Azerbaijani written using the Cyrillic script). A person fluent in
one script may not be able to read the other, even though the text one script might not be able to read the other, even though the text
might be otherwise identical. Content tagged as "az" most probably might be otherwise identical. Content tagged as "az" most probably
is written in just one script and thus might not be intelligible to a is written in just one script and thus might not be intelligible to a
reader familiar with the other script. reader familiar with the other script.
Variant subtags in particular seem to represent specific divisions in Variant subtags in particular seem to represent specific divisions in
mutual understanding, since they often encode dialects or other mutual understanding, since they often encode dialects or other
idiosyncratic variations within a language. idiosyncratic variations within a language.
The relationship between the language tag and the information it The relationship between the language tag and the information it
relates to is defined by the standard describing the context in which relates to is defined by the standard describing the context in which
it appears. Accordingly, this section can only give possible it appears. Accordingly, this section can only give possible
examples of its usage. examples of its usage.
o For a single information object, the associated language tags o For a single information object, the associated language tags
might be interpreted as the set of languages that is required for might be interpreted as the set of languages that are necessary
a complete comprehension of the complete object. Example: Plain for a complete comprehension of the complete object. Example:
text documents. Plain text documents.
o For an aggregation of information objects, the associated language o For an aggregation of information objects, the associated language
tags could be taken as the set of languages used inside components tags could be taken as the set of languages used inside components
of that aggregation. Examples: Document stores and libraries. of that aggregation. Examples: Document stores and libraries.
o For information objects whose purpose is to provide alternatives, o For information objects whose purpose is to provide alternatives,
the associated language tags could be regarded as a hint that the the associated language tags could be regarded as a hint that the
content is provided in several languages, and that one has to content is provided in several languages, and that one has to
inspect each of the alternatives in order to find its language or inspect each of the alternatives in order to find its language or
languages. In this case, the presence of multiple tags might not languages. In this case, the presence of multiple tags might not
skipping to change at page 11, line 29 skipping to change at page 11, line 38
Implementations MAY choose to implement different styles of matching Implementations MAY choose to implement different styles of matching
for different kinds of processing. For example, an implementation for different kinds of processing. For example, an implementation
could treat an absent script subtag as a "wildcard" field; thus could treat an absent script subtag as a "wildcard" field; thus
"az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not "az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not
"az" (this is extended range lookup). If one item is to be chosen, "az" (this is extended range lookup). If one item is to be chosen,
the implementation could pick among those matches based on other the implementation could pick among those matches based on other
information, such as the most likely script used in the language/ information, such as the most likely script used in the language/
region in question or the script used by other content selected. region in question or the script used by other content selected.
Because the primary language subtag cannot be absent in a language Because the primary language subtag cannot be absent in a language
tag, the 'UND' subtag may sometimes be used as a 'wildcard' in basic tag, the 'UND' subtag is sometimes used as a 'wildcard' in basic
matching. For example, in a query where you want to select all matching. For example, in a query where you want to select all
language tags that contain 'Latn' as the script code and 'AZ' as the language tags that contain 'Latn' as the script code and 'AZ' as the
region code, you could use the range "und-Latn-AZ". This requires an region code, you could use the range "und-Latn-AZ". This requires an
implementation to examine the actual values of the subtags, though. implementation to examine the actual values of the subtags, though.
The matching schemes described elsewhere in this document do not The matching schemes described elsewhere in this document are
require implementations to examine the values supplied and, except designed such that implementations do not have to examine the values
for scored matching, they do not require access to the Language or subtags supplied and, except for scored matching, they do not need
Subtag Registry nor the use of valid subtags in language tags or access to the Language Subtag Registry nor the use of valid subtags
ranges. This has great benefit for speed and simplicity of in language tags or ranges. This has great benefit for speed and
implementation. simplicity of implementation.
Implementations may also wish to use semantic information external to Implementations might also wish to use semantic information external
the language tags when performing fallback. For example, the primary to the language tags when performing fallback. For example, the
language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal Norwegian) primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
might both be usefully matched to the more general subtag 'no' Norwegian) might both be usefully matched to the more general subtag
(Norwegian). Or an application might infer that content labeled 'no' (Norwegian). Or an application might infer that content labeled
"zh-CN" is more likely to match the range "zh-Hans" than equivalent "zh-CN" is more likely to match the range "zh-Hans" than equivalent
content labeled "zh-TW". content labeled "zh-TW".
2.5 Considerations for Private Use Subtags 2.5 Considerations for Private Use Subtags
Private-use subtags require private agreement between the parties Private-use subtags require private agreement between the parties
that intend to use or exchange language tags that use them and great that intend to use or exchange language tags that use them and great
caution should be used in employing them in content or protocols caution SHOULD be used in employing them in content or protocols
intended for general use. Private-use subtags are simply useless for intended for general use. Private-use subtags are simply useless for
information exchange without prior arrangement. information exchange without prior arrangement.
The value and semantic meaning of private-use tags and of the subtags The value and semantic meaning of private-use tags and of the subtags
used within such a language tag are not defined. Matching private used within such a language tag are not defined. Matching private
use tags using language ranges or extended language ranges may result use tags using language ranges or extended language ranges can result
in unpredictable content being returned. in unpredictable content being returned.
2.6 Length Considerations in Matching 2.6 Length Considerations in Matching
Although there is no upper bound on the number of subtags in a RFC 3066 [19] did not provide an upper limit on the size of language
language tag and it is possible to envision quite long and complex tags or ranges. RFC 3066 did define the semantics of particular
subtag sequences, in practice these are rare because of the various subtags in such a way that most language tags or ranges consisted of
considerations discussed in Section 2.1.1 of [1]. language and region subtags with a combined total length of up to six
characters. Larger tags and ranges (in terms of both subtags and
characters) did exist, however.
A matching implementation MAY choose not to support the storage or [1] also does not impose a fixed upper limit on the number of subtags
matching of language tags and ranges which exceed a specified length. in a language tag or range (and thus an upper bound on the size of
Any such limitation SHOULD be clearly documented, and such either). The syntax in that document suggests that, depending on the
documentation SHOULD include the disposition of any longer tags or specific language or range of languages, more subtags (and thus
ranges (for example, whether an error value is generated or the characters) are sometimes necessary as a result. Length
language tag is truncated). If truncation is permitted it must not considerations and their impact on the selection and processing of
permit a subtag to be divided, since this changes the semantics of tags are described in Section 2.1.1 of that document.
the tag or range being matched and may result in false positives or
negatives. Implementations that restrict storage should consider
removing extensions before matching. A protocol that allows tags or
ranges to be truncated at an arbitrary limit, without giving any
indication of what that limit is, has the potential for causing harm
by changing the meaning of values in substantial ways.
In practice, tags and ranges are limited to a sequence of four A matching implementation MAY choose to limit the length of the
subtags, and thus a maximum length of 26 characters (excluding any language tags or ranges used in matching. Any such limitation SHOULD
extensions or private use sequences). This is because subtags are be clearly documented, and such documentation SHOULD include the
limited to a length of eight characters and the extlang, script, and disposition of any longer tags or ranges (for example, whether an
region subtags are additionally limited to even fewer characters. In error value is generated or the language tag or range is truncated).
addition, the Language Subtag Registry provides guidance on the use If truncation is permitted it MUST NOT permit a subtag to be divided,
of subtags (via fields such as Suppress-Script and Recommended- since this changes the semantics of the subtag being matched and can
Prefix) which further limit useful combination of subtags in a result in false positives or negatives.
language tag or range.
Longer tags are possible. The longest practical tags (excluding Implementations that restrict storage SHOULD consider the impact of
extensions) could have a length of up to 58 characters, as shown tag or range truncation on the resulting matches. For example,
below. Implementations MUST be able to handle matching tags of this removing the "*" from the end of an extended language range (see
length. Support for tags and ranges of up to 64 characters is Section 2.2) can greatly modify the set of returned matches. A
RECOMMENDED. Implementations MAY support longer tags, including protocol that allows tags or ranges to be truncated at an arbitrary
matching extensive sets of private use or extension subtags. limit, without giving any indication of what that limit is, has the
potential for causing harm by changing the meaning of values in
substantial ways.
Here is how the 58-character length of the longest practical tag In practice, most tags do not require additional subtags or
(excluding extensions) is derived: substantially more characters. Additional subtags sometimes add
useful distinguishing information, but extraneous subtags interfere
with the meaning, understanding, and especially matching of language
tags. Since language tags or ranges MAY be truncated by an
application or protocol that limits storage, when choosing language
tags or ranges users and applications SHOULD avoid adding subtags
that add no distinguishing value. In particular, users and
implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
fields in the registry (defined in Section 3.6 of [1]): these fields
provide guidance on when specific additional subtags SHOULD (and
SHOULD NOT) be used.
language = 3 Implementations MUST support a limit of at least 33 characters. This
extlang1 = 4 (currently undefined) limit includes at least one subtag of each non-extension, non-private
extlang2 = 4 (unlikely) use type. When choosing a buffer limit, a length of at least 42
script = 5 characters is strongly RECOMMENDED.
region = 4 (UN M.49)
variant = 9
variant = 9 (unlikely)
private use 1 = 11
private use 2 = 9
total = 58 characters
Figure 4: Derivation of the Longest Tag The practical limit on tags or ranges derived solely from registered
values is 42 characters. Implementations MUST be able to handle tags
and ranges of this length. Support for tags and ranges of at least
62 characters in length is RECOMMENDED. Implementations MAY support
longer values, including matching extensive sets of private use or
extension subtags.
Applications or protocols which have to truncate a tag MUST do so by
progressively removing subtags along with their preceding "-" from
the right side of the language tag until the tag is short enough for
the given buffer. If the resulting tag ends with a single-character
subtag, that subtag and its preceding "-" MUST also be removed. For
example:
Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1
1. zh-Hant-CN-variant1-a-extend1-x-wadegile
2. zh-Hant-CN-variant1-a-extend1
3. zh-Hant-CN-variant1
4. zh-Hant-CN
5. zh-Hant
6. zh
Figure 4: Example of Tag Truncation
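The truncation rule added in the right-hand column can be sketched as below. This is an illustration under stated assumptions, not a normative implementation: the function name and the `limit` parameter (maximum length in characters) are hypothetical, and raising an error when even the primary subtag does not fit is one possible disposition chosen for this sketch.

```python
def truncate_tag(tag, limit):
    """Shorten `tag` to at most `limit` characters by removing whole
    subtags (and their preceding "-") from the right."""
    subtags = tag.split("-")
    while len("-".join(subtags)) > limit and len(subtags) > 1:
        subtags.pop()
        # A tag must not end in a single-character subtag (an extension
        # or private-use singleton), so remove that as well.
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    result = "-".join(subtags)
    if len(result) > limit:
        # Assumption: a subtag is never divided, so report failure instead.
        raise ValueError("limit is shorter than the primary language subtag")
    return result
```

Applied to the tag in Figure 4, a limit of 40 characters yields step 1, a limit of 39 yields step 2, and a limit of 2 yields "zh".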
3. IANA Considerations 3. IANA Considerations
This document presents no new or existing considerations for IANA. This document presents no new or existing considerations for IANA.
4. Changes 4. Changes
This is the first version of this document. This is the first version of this document.
The following changes were put into this document since draft-00: The following changes were put into this document since draft-00:
skipping to change at page 16, line 5 skipping to change at page 15, line 29
'variant' paragraph and some tidying of the text. (A.Phillips) 'variant' paragraph and some tidying of the text. (A.Phillips)
Fixed a minor glitch in the ABNF caused by taking the output of Fixed a minor glitch in the ABNF caused by taking the output of
Bill Fenner's parser and not looking too closely at it (M. Patton) Bill Fenner's parser and not looking too closely at it (M. Patton)
Fixed some minor reference problems. (M.Patton) Fixed some minor reference problems. (M.Patton)
Added Section 2.6 on length considerations in matching. Added Section 2.6 on length considerations in matching.
(R.Presuhn) (R.Presuhn)
Copied various materials from the length considerations section of
the registry draft to keep the two documents in sync.
(A.Phillips)
5. Security Considerations 5. Security Considerations
The only security issue that has been raised with language tags since The only security issue that has been raised with language tags since
the publication of RFC 1766, which stated that "Security issues are the publication of RFC 1766, which stated that "Security issues are
believed to be irrelevant to this memo", is a concern with language believed to be irrelevant to this memo", is a concern with language
ranges used in content negotiation - that they may be used to infer ranges used in content negotiation - that they might be used to infer
the nationality of the sender, and thus identify potential targets the nationality of the sender, and thus identify potential targets
for surveillance. for surveillance.
This is a special case of the general problem that anything you send This is a special case of the general problem that anything you send
is visible to the receiving party. It is useful to be aware that is visible to the receiving party. It is useful to be aware that
such concerns can exist in some cases. such concerns can exist in some cases.
The evaluation of the exact magnitude of the threat, and any possible The evaluation of the exact magnitude of the threat, and any possible
countermeasures, is left to each application protocol. countermeasures, is left to each application protocol.
skipping to change at page 17, line 19 skipping to change at page 17, line 19
tags. These characters are present in most character sets, so tags. These characters are present in most character sets, so
presentation of language tags should not have any character set presentation of language tags should not have any character set
issues. issues.
Rendering of characters based on the content of a language tag is not Rendering of characters based on the content of a language tag is not
addressed in this memo. Historically, some languages have relied on addressed in this memo. Historically, some languages have relied on
the use of specific character sets or other information in order to the use of specific character sets or other information in order to
infer how a specific character should be rendered (notably this infer how a specific character should be rendered (notably this
applies to language and culture specific variations of Han ideographs applies to language and culture specific variations of Han ideographs
as used in Japanese, Chinese, and Korean). When language tags are as used in Japanese, Chinese, and Korean). When language tags are
applied to spans of text, rendering engines may use that information applied to spans of text, rendering engines sometimes use that
in deciding which font to use in the absence of other information, information in deciding which font to use in the absence of other
particularly where languages with distinct writing traditions use the information, particularly where languages with distinct writing
same characters. traditions use the same characters.
7. References 7. References
7.1 Normative References 7.1 Normative References
[1] Phillips, A., Ed. and M. Davis, Ed., "Tags for the [1] Phillips, A., Ed. and M. Davis, Ed., "Tags for the
Identification of Languages (Internet-Draft)", February 2005, < Identification of Languages (Internet-Draft)", June 2005, <http
http://www.ietf.org/internet-drafts/ ://www.ietf.org/internet-drafts/
draft-ietf-ltru-registry-01.txt>. draft-ietf-ltru-registry-03.txt>.
[2] Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO 10021 [2] Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO 10021
and RFC 822", RFC 1327, May 1992. and RFC 822", RFC 1327, May 1992.
[3] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail [3] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail
Extensions) Part One: Mechanisms for Specifying and Describing Extensions) Part One: Mechanisms for Specifying and Describing
the Format of Internet Message Bodies", RFC 1521, the Format of Internet Message Bodies", RFC 1521,
September 1993. September 1993.
[4] Hovey, R. and S. Bradner, "The Organizations Involved in the [4] Hovey, R. and S. Bradner, "The Organizations Involved in the
