draft-ietf-ltru-matching-05.txt   draft-ietf-ltru-matching-06.txt 
Network Working Group A. Phillips, Ed. Network Working Group A. Phillips, Ed.
Internet-Draft Quest Software Internet-Draft Quest Software
Obsoletes: 3066 (if approved) M. Davis, Ed. Obsoletes: 3066 (if approved) M. Davis, Ed.
Expires: April 10, 2006 IBM Expires: May 20, 2006 IBM
October 7, 2005 November 16, 2005
Matching Tags for the Identification of Languages Matching Tags for the Identification of Languages
draft-ietf-ltru-matching-05 draft-ietf-ltru-matching-06
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 10, 2006. This Internet-Draft will expire on May 20, 2006.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2005). Copyright (C) The Internet Society (2005).
Abstract Abstract
This document describes different mechanisms for comparing, matching, This document describes different mechanisms for comparing, matching,
and evaluating language tags. Possible algorithms for language and evaluating language tags. Possible algorithms for language
negotiation and content selection are described. negotiation and content selection are described. This document, in
combination with RFC 3066bis (replace "3066bis" with the RFC number
assigned to draft-ietf-ltru-registry-14), replaces RFC 3066, which
replaced RFC 1766.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. The Language Range . . . . . . . . . . . . . . . . . . . . . . 4 2. The Language Range . . . . . . . . . . . . . . . . . . . . . . 4
2.1. Lists of Language Ranges . . . . . . . . . . . . . . . . . 4 2.1. Lists of Language Ranges . . . . . . . . . . . . . . . . . 4
2.2. Basic Language Range . . . . . . . . . . . . . . . . . . . 4 2.2. Basic Language Range . . . . . . . . . . . . . . . . . . . 4
2.2.1. Matching . . . . . . . . . . . . . . . . . . . . . . . 5 2.3. Extended Language Range . . . . . . . . . . . . . . . . . 5
2.2.2. Lookup . . . . . . . . . . . . . . . . . . . . . . . . 6 3. Types of Matching . . . . . . . . . . . . . . . . . . . . . . 8
2.3. Extended Language Range . . . . . . . . . . . . . . . . . 7 3.1. Choosing a Type of Matching . . . . . . . . . . . . . . . 8
2.3.1. Extended Range Matching . . . . . . . . . . . . . . . 9 3.2. Filtering . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2. Extended Range Lookup . . . . . . . . . . . . . . . . 10 3.2.1. Filtering with Basic Language Ranges . . . . . . . . . 10
2.3.3. Distance Metric Scheme . . . . . . . . . . . . . . . . 11 3.2.2. Filtering with Extended Language Ranges . . . . . . . 10
2.4. Meaning of Language Tags and Ranges . . . . . . . . . . . 13 3.2.3. Distance Metric Filtering . . . . . . . . . . . . . . 11
2.5. Choosing Between Alternate Matching Schemes . . . . . . . 14 3.3. Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6. Considerations for Private Use Subtags . . . . . . . . . . 15 4. Other Considerations . . . . . . . . . . . . . . . . . . . . . 16
2.7. Length Considerations in Matching . . . . . . . . . . . . 16 4.1. Meaning of Language Tags and Ranges . . . . . . . . . . . 16
3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 18 4.2. Considerations for Private Use Subtags . . . . . . . . . . 17
4. Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.3. Length Considerations in Matching . . . . . . . . . . . . 17
5. Security Considerations . . . . . . . . . . . . . . . . . . . 20 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 20
6. Character Set Considerations . . . . . . . . . . . . . . . . . 21 6. Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22 7. Security Considerations . . . . . . . . . . . . . . . . . . . 22
7.1. Normative References . . . . . . . . . . . . . . . . . . . 22 8. Character Set Considerations . . . . . . . . . . . . . . . . . 23
7.2. Informative References . . . . . . . . . . . . . . . . . . 23 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Appendix A. Acknowledgements . . . . . . . . . . . . . . . . . . 24 9.1. Normative References . . . . . . . . . . . . . . . . . . . 24
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 25 9.2. Informative References . . . . . . . . . . . . . . . . . . 24
Intellectual Property and Copyright Statements . . . . . . . . . . 26 Appendix A. Acknowledgements . . . . . . . . . . . . . . . . . . 25
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 26
Intellectual Property and Copyright Statements . . . . . . . . . . 27
1. Introduction 1. Introduction
Human beings on our planet have, past and present, used a number of Human beings on our planet have, past and present, used a number of
languages. There are many reasons why one would want to identify the languages. There are many reasons why one would want to identify the
language used when presenting or requesting information. language used when presenting or requesting information.
Information about a user's language preferences commonly needs to be Information about a user's language preferences commonly needs to be
identified so that appropriate processing can be applied. For identified so that appropriate processing can be applied. For
example, the user's language preferences in a browser can be used to example, the user's language preferences in a browser can be used to
select web pages appropriately. A choice of language preference can select web pages appropriately. Language preferences can also be
also be used to select among tools (such as dictionaries) to assist used to select among tools (such as dictionaries) to assist in the
in the processing or understanding of content in different languages. processing or understanding of content in different languages.
Given a set of language identifiers, such as those defined in [draft- Given a set of language identifiers, such as those defined in
registry], various mechanisms can be envisioned for performing [RFC3066bis], various mechanisms can be envisioned for performing
language negotiation and tag matching. The suitability of a language negotiation and tag matching. Applications, protocols, or
particular mechanism to a particular application depends on the needs specifications will have varying needs and requirements that will
of that application. affect the choice of a suitable mechanism. Protocols and
specifications SHOULD clearly indicate the particular mechanism used
in selecting or matching language tags.
This document defines several mechanisms for matching and filtering This document defines several mechanisms for matching, selecting, or
natural language content identified using Language Tags [draft- filtering content whose natural language is identified using Language
registry]. It also defines the syntax (called a "language range") Tags [RFC3066bis], as well as the syntax (called a "language range")
associated with each of these mechanisms for specifying user language associated with each of these mechanisms for specifying the user's
preferences. language preferences.
This document, in combination with [RFC3066bis] (replace "3066bis"
globally in this document with the RFC number assigned to
draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
[RFC1766].
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
2. The Language Range 2. The Language Range
Language Tags [draft-registry] are used to identify the language of Language Tags [RFC3066bis] are used to identify the language of some
some information item or content. Applications that use language information item or content. Applications or protocols that use
tags are often faced with the problem of identifying sets of content language tags are often faced with the problem of identifying sets of
that share certain language attributes. For example, HTTP 1.1 content that share certain language attributes. For example, HTTP
[RFC2616] describes language ranges in its discussion of the Accept- 1.1 [RFC2616] describes language ranges in its discussion of the
Language header (Section 14.4), which is used for selecting content Accept-Language header (Section 14.4), which is used for selecting
from servers based on the language of that content. content from servers based on the language of that content.
When selecting content according to its language, it is useful to When selecting content according to its language, it is useful to
have a mechanism for identifying sets of language tags that share have a mechanism for identifying sets of language tags that share
specific attributes. This allows users to select or filter content specific attributes. This allows users to select or filter content
based on specific requirements. Such an identifier is called a based on specific requirements. Such an identifier is called a
"Language Range". "Language Range".
2.1. Lists of Language Ranges 2.1. Lists of Language Ranges
When users specify a language preference they often need to specify a When users specify a language preference they often need to specify a
skipping to change at page 4, line 44 skipping to change at page 4, line 44
"Accept-Language" header defined in RFC 2616 [RFC2616] (see Section "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
14.4) and RFC 3282 [RFC3282]. The various matching operations 14.4) and RFC 3282 [RFC3282]. The various matching operations
described in this document include considerations for using a described in this document include considerations for using a
language priority list. language priority list.
2.2. Basic Language Range 2.2. Basic Language Range
A "Basic Language Range" identifies the set of content whose language A "Basic Language Range" identifies the set of content whose language
tags begin with the same sequence of subtags. A basic language range tags begin with the same sequence of subtags. A basic language range
is identified by its 'language-range' tag, by adapting the is identified by its 'language-range' tag, by adapting the
ABNF [RFC2234bis] from HTTP/1.1 [RFC2616]: ABNF [RFC4234] from HTTP/1.1 [RFC2616]:
language-range = language-tag / "*" language-range = language-tag / "*"
language-tag = 1*8[alphanum] *["-" 1*8alphanum] language-tag = 1*8[alphanum] *["-" 1*8alphanum]
alphanum = ALPHA / DIGIT alphanum = ALPHA / DIGIT
That is, a language-range has the same syntax as a language-tag or is That is, a language-range has the same syntax as a language-tag or is
the single character "*". Basic Language Ranges imply that there is the single character "*". Basic Language Ranges imply that there is
a semantic relationship between language tags that share the same a semantic relationship between language tags that share the same
prefix. While this is often the case, it is not always true and prefix. While this is often the case, it is not always true and
users should note that the set of language tags that match a specific users should note that the set of language tags that match a specific
language-range may not be mutually intelligible. language-range may not be mutually intelligible.
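As an illustration only, the basic language-range syntax above can be checked with a regular expression. This is a minimal sketch that assumes the ABNF is intended to mean "one to eight alphanumerics per subtag, subtags separated by hyphens, or the single character '*'"; the function name is hypothetical and does not come from either draft.

```python
import re

# Assumed reading of the ABNF: 1 to 8 alphanumerics per subtag,
# hyphen-separated, or the lone wildcard "*".
_BASIC_RANGE = re.compile(r"\*|[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_basic_language_range(text: str) -> bool:
    """Return True if text is syntactically a basic language range."""
    return _BASIC_RANGE.fullmatch(text) is not None
```

The check is purely syntactic: it does not consult the registry and says nothing about whether the subtags are valid or meaningful.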
Basic language ranges were originally described in [RFC3066] and HTTP Basic language ranges were originally described in [RFC3066] and HTTP
1.1 [RFC2616] (where they are referred to as simply a "language 1.1 [RFC2616] (where they are referred to as simply a "language
range"). range").
Users SHOULD avoid subtags that add no distinguishing value to a Users SHOULD avoid subtags that add no distinguishing value to a
language range. For example, script subtags SHOULD NOT be used to language range. For example, script subtags SHOULD NOT be used to
form a language range with language subtags which have a matching form a language range with language subtags which have a matching
Suppress-Script field in their registry record. Thus the language Suppress-Script field in their registry record. Thus the language
range "en-Latn" is probably inappropriate for most applications range "en-Latn" is probably inappropriate in most cases (because the
(because the vast majority of English documents are written in the Latin vast majority of English documents are written in the Latin script and
script and thus the 'en' language subtag has a Suppress-Script field thus the 'en' language subtag has a Suppress-Script field for 'Latn'
for 'Latn' in the registry). in the registry).
Language tags and thus language ranges are to be treated as case Language tags and thus language ranges are to be treated as case
insensitive: there exist conventions for the capitalization of some insensitive: there exist conventions for the capitalization of some
of the subtags, but these MUST NOT be taken to carry meaning. of the subtags, but these MUST NOT be taken to carry meaning.
Matching of language tags to language ranges MUST be done in a case Matching of language tags to language ranges MUST be done in a case
insensitive manner. insensitive manner.
When working with tags and ranges, note that extensions and most When working with tags and ranges, note that extensions and most
private use subtags are generally orthogonal to language tag fallback private use subtags are generally orthogonal to language tag fallback
and users SHOULD avoid using these subtags in language ranges, since and users SHOULD avoid using these subtags in language ranges, since
they will often interfere with the selection of available language they will often interfere with the selection of available language
content. Since these subtags are always at the end of the sequence content. Since these subtags are always at the end of the sequence
of subtags, they don't normally interfere with the use of prefixes of subtags, they don't normally interfere with the use of prefixes
for matching in the schemes described below. for matching in the schemes described below.
There are two matching schemes that are commonly associated with Note that when working with basic language ranges, no attempt is made
basic language ranges: matching and lookup. to process the semantics of the tags or ranges in any way. The
language tag and language range are compared in a case insensitive
Note that neither matching nor lookup using basic language ranges manner using basic string processing. Thus the choice of subtags in
attempt to process the semantics of the tags or ranges in any way. both the language tag and language range may affect the results
The language tag and language range are compared in a case produced.
insensitive manner using basic string processing. The choice of
subtags in both the language tag and language range may affect the
results produced.
2.2.1. Matching
Language tag matching is used to select all content that matches a
given prefix. In matching, the language range represents the least
specific tag which is an acceptable match and every piece of content
that matches is returned. If the language priority list contains
more than one range, the matches returned are typically ordered in
descending level of preference.
For example, if an application is applying a style to all content in
a document in a particular language, it might use language tag
matching to select the content to which the style is applied.
A language-range matches a language-tag if it exactly equals the tag,
or if it exactly equals a prefix of the tag such that the first
character following the prefix is "-". (That is, the language-range
"de-de" matches the language tag "de-DE-1996", but not the language
tag "de-Deva".)
The special range "*" matches any tag. A protocol which uses
language ranges MAY specify additional rules about the semantics of
"*"; for instance, HTTP/1.1 specifies that the range "*" matches only
languages not matched by any other range within an "Accept-Language"
header.
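The prefix rule above could be sketched as follows. This is an illustrative sketch, not a normative implementation: the function names are hypothetical, and the protocol-specific handling of "*" (such as the HTTP/1.1 rule) is deliberately left out, so "*" simply matches every tag.

```python
def basic_match(language_range: str, language_tag: str) -> bool:
    """Case-insensitive prefix match of a basic language range to a tag."""
    rng = language_range.lower()
    tag = language_tag.lower()
    if rng == "*":
        return True  # the special range matches any tag
    # Exact match, or a prefix immediately followed by "-".
    return tag == rng or tag.startswith(rng + "-")


def filter_tags(priority_list, tags):
    """Return matching tags, ordered by the priority of the range matched."""
    matched = []
    for rng in priority_list:
        for tag in tags:
            if basic_match(rng, tag) and tag not in matched:
                matched.append(tag)
    return matched
```

For example, `basic_match("de-de", "de-DE-1996")` is true while `basic_match("de-de", "de-Deva")` is false, because the character after the prefix "de-de" in "de-Deva" is not a hyphen.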
2.2.2. Lookup
Content lookup is used to select the single information item that
best matches the language priority list for a given request. In
lookup, each language range in the language priority list represents
the most specific tag which is an acceptable match; only the closest
matching item according to the user's priority is returned.
For example, if an application inserts some dynamic content into a
document, returning an empty string if there is no exact match is not
an option. Instead, the application "falls back" until it finds a
suitable piece of content to insert.
When performing lookup, the language range is progressively truncated
from the end until a matching piece of content is located. For
example, starting with the range "zh-Hant-CN-x-wadegile", the lookup
would progressively search for content as shown below:
Range to match: zh-Hant-CN-x-wadegile
1. zh-Hant-CN-x-wadegile
2. zh-Hant-CN
3. zh-Hant
4. zh
5. (default content or the empty tag)
Figure 2: Default Fallback Pattern Example
This scheme allows some flexibility in finding content. It also
typically provides better results when data is not available at a
specific level of tag granularity or is sparsely populated (than if
the default language for the system or content were used).
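The progressive truncation can be sketched as below. This is a hedged illustration: the function names and the dictionary of available content are hypothetical, and the step that drops a dangling single-character subtag (such as "x") is an assumption inferred from the fallback pattern in the figure above, where "zh-Hant-CN-x-wadegile" falls back directly to "zh-Hant-CN".

```python
def fallback_chain(language_range: str) -> list:
    """List the ranges tried for one language range, most specific first."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()  # truncate from the end
        # Assumption: a trailing single-character subtag (e.g. "x")
        # is removed together with the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(priority_list, available, default=None):
    """Return the single best item for a language priority list."""
    by_lower = {tag.lower(): item for tag, item in available.items()}
    for language_range in priority_list:
        for candidate in fallback_chain(language_range):
            key = candidate.lower()  # comparison is case insensitive
            if key in by_lower:
                return by_lower[key]
    return default  # default content when nothing matched
```

Each range in the priority list is exhausted before the next one is considered, and the default is returned only after every range has been tried.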
When performing lookup using a language priority list, the 2.3. Extended Language Range
progressive search MUST proceed to consider each language range
before finding the default content or empty tag. For example, for
the list "fr-FR; zh-Hant" would search for content as follows:
1. fr-FR
2. fr
3. zh-Hant // next language
4. zh
5. (default content or the empty tag)
Figure 3: Lookup Using a Language Priority List A Basic Language Range does not always provide the most appropriate
way to specify a user's preferences. Sometimes it is beneficial to
define a more granular matching scheme that takes advantage of the
internal structure of language tags, by allowing the user to specify,
for example, the value of a specific field in a language tag or to
indicate which values are of interest in filtering or selecting the
content.
2.3. Extended Language Range In an extended language range, the identifier takes the form of a
series of subtags which must consist of well-formed subtags or the
special subtag "*". For example, the language range "en-*-US"
specifies a primary language of 'en', followed by any script subtag,
followed by the region subtag 'US'.
Prefix matching using a Basic Language Range, as described above, is An extended language range can be represented by the following ABNF:
not always the most appropriate way to access the information
contained in language tags when selecting or filtering content. Some
applications might wish to define a more granular matching scheme and
such a matching scheme requires the ability to specify the various
attributes of a language tag in the language range. An extended
language range can be represented by the following ABNF:
extended-language-range = range ; a range extended-language-range = range ; a range
/ privateuse ; private use tag / privateuse ; private use tag
/ grandfathered ; grandfathered registrations / grandfathered ; grandfathered registrations
range = (language range = (language
["-" script] ["-" script]
["-" region] ["-" region]
*("-" variant) *("-" variant)
*("-" extension) *("-" extension)
["-" privateuse]) ["-" privateuse])
skipping to change at page 8, line 19 skipping to change at page 7, line 4
/ "*" ; ... or wildcard / "*" ; ... or wildcard
extension = singleton *("-" (2*8alphanum)) [ "-*" ] extension = singleton *("-" (2*8alphanum)) [ "-*" ]
; extension subtags ; extension subtags
; wildcard can only appear ; wildcard can only appear
; at the end ; at the end
singleton = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT singleton = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9" ; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9"
; Single letters: x/X is reserved for private use ; Single letters: x/X is reserved for private use
privateuse = ("x"/"X") 1*("-" (1*8alphanum)) privateuse = ("x"/"X") 1*("-" (1*8alphanum))
grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum)) grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
; grandfathered registration ; grandfathered registration
; Note: i is the only singleton ; Note: i is the only singleton
; that starts a grandfathered tag ; that starts a grandfathered tag
alphanum = (ALPHA / DIGIT) ; letters and numbers alphanum = (ALPHA / DIGIT) ; letters and numbers
In an extended language range, the identifier takes the form of a
series of subtags which must consist of well-formed subtags or the
special subtag "*". For example, the language range "en-*-US"
specifies a primary language of 'en', followed by any script subtag,
followed by the region subtag 'US'.
A field not present in the middle of an extended language range MAY A field not present in the middle of an extended language range MAY
be treated as if the field contained a "*". For example, the range be treated as if the field contained a "*". For example, the range
"en-US" MAY be considered to be equivalent to the range "en-*-US". "en-US" MAY be considered to be equivalent to the range "en-*-US".
This also means that multiple wildcards can be collapsed (so that This also means that multiple wildcards can be collapsed (so that
"en-*-*-US" is equivalent to "en-*-US"). "en-*-*-US" is equivalent to "en-*-US").
When working with tags and ranges users SHOULD note the following: When working with tags and ranges users SHOULD note the following:
1. Private-use and Extension subtags are normally orthogonal to 1. Private-use and Extension subtags are normally orthogonal to
language tag fallback. Implementations SHOULD ignore language tag fallback. Implementations or specifications that
use a lookup (Section 3.3) matching scheme SHOULD ignore
unrecognized private-use and extension subtags when performing unrecognized private-use and extension subtags when performing
language tag fallback. Since these subtags are always at the end language tag fallback. Since these subtags are always at the end
of the sequence of subtags, they don't normally interfere with of the sequence of subtags, they don't normally interfere with
the use of prefixes for matching in the schemes described below. the use of prefixes for matching in the schemes described below.
2. Implementations that choose not to interpret one or more private- 2. Applications, specifications, or protocols that choose not to
use or extension subtags SHOULD NOT remove or modify these interpret one or more private-use or extension subtags SHOULD NOT
extensions in content that they are processing. When a language remove or modify these extensions in content that they are
tag instance is to be used in a specific, known protocol, and is processing. When a language tag instance is to be used in a
not being passed through to other protocols, language tags MAY be specific, known protocol, and is not being passed through to
filtered to remove subtags and extensions that are not supported other protocols, language tags MAY be filtered to remove subtags
by that protocol. Such filtering SHOULD be avoided, if possible, and extensions that are not supported by that protocol. Such
since it removes information that might be relevant if services filtering SHOULD be avoided, if possible, since it removes
on the other end of the protocol would make use of that information that might be relevant if services on the other end
information. of the protocol would make use of that information.
3. Some applications of language tags might want or need to consider 3. Some applications of language tags might want or need to consider
extensions and private-use subtags when matching tags. If extensions and private-use subtags when matching tags. If
extensions and private-use subtags are included in a matching or extensions and private-use subtags are included in a matching or
filtering process that utilizes one of the schemes described filtering process that utilizes one of the schemes described
in this document, then the implementation SHOULD canonicalize the in this document, then the implementation SHOULD canonicalize the
language tags and/or ranges before performing the matching. Note language tags and/or ranges before performing the matching. Note
that language tag processors that claim to be "well-formed" that language tag processors that claim to be "well-formed"
processors as defined in [draft-registry] generally fall into processors as defined in [RFC3066bis] generally fall into this
this category. category.
There are several matching algorithms or schemes which can be applied There are several matching algorithms or schemes which can be applied
when matching extended language ranges to language tags. when matching extended language ranges to language tags.
2.3.1. Extended Range Matching 3. Types of Matching
In extended range matching, each extended language range in the Matching language ranges to language tags can be done in a number of
language priority list is considered in turn, according to priority. different ways. This section describes the different types of
The subtags in each extended language range are compared to the matching scheme, as well as the considerations for choosing between
corresponding subtags in the language tag being examined. The subtag them.
from the range is considered to match if it exactly matches the
corresponding subtag in the tag or the range's subtag has the value
"*" (which matches all subtags, including the empty subtag).
Extended Range Matching is an extension of basic matching
(Section 2.2.1): the language range represents the least specific tag
which is an acceptable match.
Private use subtags MAY be specified in the language range and MUST There are two basic types of matching scheme: those that produce an
NOT be ignored when matching. open-ended set of content (called "filtering") and those that produce
a single information item for a given request (called "lookup").
Subtags not specified, including those at the end of the language A key difference between these two types of matching scheme is that
range, are assigned the value "*". This makes each range into a the language range for filtering operations is always the _least_
prefix much like that used in basic language range matching. For specific tag one will accept as a match, while for lookup operations
example, the extended language range "zh-*-CN" matches all of the the language range is always the _most_ specific tag.
following tags because the unspecified variant field is expanded to
"*":
zh-Hant-CN 3.1. Choosing a Type of Matching
zh-CN
zh-Hans-CN Applications, protocols, and specifications are faced with the
decision of what type of matching to use. Sometimes, different
styles of matching might be suited for different kinds of processing
within a particular application or protocol.
zh-CN-x-wadegile Filtering can be used to produce a set of results (such as a
collection of documents). For example, if using a search engine, one
might use filtering to limit the results to documents written in
French. It can also be used when deciding whether to perform some
processing that is language sensitive on some content. For example,
a process might cause paragraphs whose language tag matched the
language range "nl" to be displayed in italics within a document.
zh-Latn-CN-boont This document describes three types of filtering:
zh-cmn-Hans-CN-x-wadegile 1. Basic Filtering (Section 3.2.1) is used to match content using
basic language ranges (Section 2.2). It is compatible with
implementations that do not produce extended language ranges.
2.3.2. Extended Range Lookup 2. Extended Range Filtering (Section 3.2.2) is used to match content
using extended language ranges (Section 2.3). Newer implementations
SHOULD use this form of filtering in preference to basic
filtering.
In extended range lookup, each extended language range in the 3. Scored Filtering (Section 3.2.3) produces an ordered set of
language priority list is considered in turn. The subtags in each content using either basic or extended language ranges. It
extended language range are compared to the corresponding subtags in should be used when the quality of the match within a specific
the language tag being examined. A subtag is considered to match if language range is important, as when presenting a list of
it exactly matches the corresponding subtag in the tag or the range's documents resulting from a search.
subtag has the value "*" (which matches all subtags, including the
empty subtag). Extended language range lookup is an extension of
basic lookup (Section 2.2.2): each language range represents the most
specific tag which will form an acceptable match. If no match is
found, the default content or content with the empty language tag is
usually returned (or the search can be considered to have failed).
Subtags not specified are assigned the value "*" prior to performing Lookup (Section 3.3) is used when each request MUST produce exactly
tag matching. Unlike in extended range matching, however, fields at one piece of content. For example, a Web server might use the
the end of the range MUST NOT be expanded in this manner. For Accept-Language HTTP header to choose which language to return a
example, "en-US" MUST NOT be considered to be the same as the range custom 404 page in: since it can return only one page, it must choose
"en-US-*". This allows ranges to be specific. The "*" wildcard MUST a single item and it must return some item, even if no content
be used at the end of the range to indicate that all tags with the matches the language ranges supplied by the user.
range as a prefix are allowable matches. That is, the range "zh-*"
matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
matches neither of those tags.
The wildcard "*" at the end of a range SHOULD be considered to match Most types of matching in this document are designed so that
any private use subtag sequences (making extended language range implementations do not have to examine the values of the subtags
lookup function exactly like extended range matching Section 2.3.1). supplied and, except for scored filtering, they do not need access to
the Language Subtag Registry nor do they require the use of valid
subtags in either language tags or language ranges. This has great
benefit for speed and simplicity of implementation.
By default all extensions and their subtags SHOULD be ignored for Implementations might also wish to use semantic information external
extended language range lookup. Private use subtags MAY be specified to the language tags when performing fallback. For example, the
in the language range and MUST NOT be ignored when performing lookup. primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
The wildcard "*" at the end of a range SHOULD be considered to match Norwegian) might both be usefully matched to the more general subtag
any private use subtag sequences in addition to variants. 'no' (Norwegian). Or an implementation might infer that content
labeled "zh-CN" is more likely to match the range "zh-Hans" than
equivalent content labeled "zh-TW".
For example, the range "*-US" matches all of the following tags: 3.2. Filtering
en-US Filtering is used to select the set of content that matches a given
en-Latn-US prefix. It is called "filtering" because this set of content may
contain no items at all or it may return an arbitrary number of
matching items--as many as match the language range used to specify
the items, thus filtering out the non-matching content.
en-US-r-extends (extensions are ignored) In filtering, the language range represents the _least_ specific tag
which is an acceptable match. That is, all of the language tags in
the set of filtered content will have an equal or greater number of
subtags than the language range. For example, if the language range
is "de-CH", one might see matching content with the tag "de-CH-1996"
but one will never see a match with the tag "de".
fr-US If the language priority list (see Section 2.1) contains more than
one range, the content returned is typically ordered in descending
level of preference.
For example, the range "en-*-US" matches _none_ of the following Some examples where filtering might be appropriate include:
tags:
fr-US o Applying a style to sections of a document in a particular
language range.
en (missing region US) o Displaying the set of documents containing a particular set of
keywords written in a specific language.
en-Latn (missing region US) o Selecting all email items written in a specific range of languages.
en-Latn-US-scouse (variant field is present) Filtering can produce either an ordered or an unordered set of results.
For example, applying formatting to a document based on the language
of specific pieces of content does not require the content to be
ordered. It is sufficient to know whether a specific piece of
content matches or does not match. A search application, on the
other hand, probably would put the results into a priority order.
For example, the range "en-*" matches all of the following tags: If an ordered set is desired, as described above, then the
application or protocol needs to determine the relative "quality" of
the match between different language tags and the language range.
en-Latn This measurement is called a "distance metric". A distance metric
assigns a numeric value to the comparison of each language tag to a
language range and represents the 'distance' between the two. A
distance of zero means that they are identical, a small distance
indicates that they are very similar, and a large distance indicates
that they are very different. Using a distance metric,
implementations can, for example, allow users to select a threshold
distance for a match to be "successful" while filtering or they can use
the numeric value to order the results.
en-Latn-US 3.2.1. Filtering with Basic Language Ranges
en-Latn-US-scouse When filtering using a basic language range, the language range
matches a language tag if it exactly equals the tag, or if it exactly
equals a prefix of the tag such that the first character following
the prefix is "-". (That is, the language-range "de-de" matches the
language tag "de-DE-1996", but not the language tag "de-Deva".)
en-US The special range "*" matches any tag. A protocol which uses
language ranges MAY specify additional rules about the semantics of
"*"; for instance, HTTP/1.1 specifies that the range "*" matches only
languages not matched by any other range within an "Accept-Language"
header.
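The prefix rule for basic language ranges can be sketched in Python as follows. This is an illustrative sketch rather than normative text: the function names are hypothetical, the comparison is assumed to be case-insensitive (as the "de-de" example suggests), and protocol-specific semantics for "*" (such as those of HTTP/1.1) are not modeled.

```python
def basic_filter_matches(language_range, tag):
    """Return True if a basic language range matches a language tag."""
    # The special range "*" matches any tag.
    if language_range == "*":
        return True
    rng = language_range.lower()
    candidate = tag.lower()
    # Match on equality, or on a prefix that is followed by "-".
    return candidate == rng or candidate.startswith(rng + "-")


def basic_filter(language_range, tags):
    """Return the tags (in their original order) matched by the range."""
    return [t for t in tags if basic_filter_matches(language_range, t)]
```

For instance, basic_filter("de-de", ["de-DE-1996", "de-Deva", "de"]) keeps only "de-DE-1996", because neither "de-Deva" nor "de" has "de-de" followed by "-" as a prefix.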
en-scouse 3.2.2. Filtering with Extended Language Ranges
Note that the ability to be specific in extended range lookup can In the Extended Range Matching scheme, each extended language range
make this matching scheme a more appropriate replacement for basic in the language priority list is considered in turn, according to
matching than the extended range matching scheme. priority. The subtags in each extended language range are compared
to the corresponding subtags in the language tag being examined. The
subtag from the range is considered to match if it exactly matches
the corresponding subtag in the tag or the range's subtag has the
value "*" (which matches all subtags, including the empty subtag).
Extended Range Matching is an extension of basic matching
(Section 3.2.1): the language range represents the least specific tag
which is an acceptable match.
2.3.3. Distance Metric Scheme Private use subtags MAY be specified in the language range and MUST
NOT be ignored when matching.
Both Basic and Extended Language Ranges produce simple boolean Subtags not specified, including those at the end of the language
matches. Some applications may benefit by providing an array of range, are assigned the value "*". This makes each range into a
results with different levels of matching, for example, sorting prefix much like that used in basic language range matching. For
results based on the overall "quality" of the match. example, the extended language range "zh-*-CN" matches all of the
following tags because the unspecified variant field is expanded to
"*":
This type of matching is sometimes called a "distance metric". A zh-Hant-CN
distance metric assigns a pair of language tags a numeric value
representing the 'distance' between the two. A distance of zero
means that they are identical, a small distance indicates that they
are very similar, and a large distance indicates that they are very
different. Using a distance metric, implementations can, for
example, allow users to select a threshold distance for a match to be
successful or a filter to be applied.
The first step in the process is to normalize the extended language zh-CN
range and the language tags to be matched to it by canonicalizing
them, mapping grandfathered and obsolete tags into modern zh-Hans-CN
equivalents.
zh-CN-x-wadegile
zh-Latn-CN-boont
zh-cmn-Hans-CN-x-private
3.2.3. Distance Metric Filtering
Both basic and extended language range filtering produce simple
boolean matches. Sometimes it may be beneficial to provide an array
of results with different levels of matching, for example, sorting
results based on the overall "quality" of the match. Distance metric
filtering provides a way to generate these quality values.
First both the extended language range and the language tags to be
matched to it must be canonicalized by mapping grandfathered and
obsolete tags into modern equivalents.
The language range and the language tags are then transformed into The language range and the language tags are then transformed into
quintuples of elements of the form (language, script, country, quintuples of elements of the form (language, script, country,
variant, extension). Any extended language subtags are considered variant, extension). Any extended language subtags are considered
part of the language element; private use subtag sequences are part of the language element; private use subtag sequences are
considered part of the language element if in the initial position in considered part of the language element if in the initial position in
the tag and part of the variant element if not. Language subtags the tag and part of the variant element if not. Language subtags
'und', 'mul', and the script subtag 'Zyyy' are converted to "*". 'und', 'mul', and the script subtag 'Zyyy' are converted to "*".
Missing components in the language-tag are set to "*"; thus a "*" Missing components in the language-tag are set to "*"; thus a "*"
skipping to change at page 12, line 43 skipping to change at page 12, line 26
x-foo ("x-foo","*","*","*","*") x-foo ("x-foo","*","*","*","*")
en-x-foo ("en","*","*","x-foo","*") en-x-foo ("en","*","*","x-foo","*")
i-default ("i-default","*","*","*","*") i-default ("i-default","*","*","*","*")
sl-Latn-IT-rozaj ("sl","Latn","IT","rozaj","*") sl-Latn-IT-rozaj ("sl","Latn","IT","rozaj","*")
zh-r-wadegile ("zh","*","*","*","r-wadegile") // hypothetical zh-r-wadegile ("zh","*","*","*","r-wadegile") // hypothetical
Each language-range/language-tag pair being matched or filtered is Each language-range/language-tag pair being compared is assigned a
assigned a distance value, whereby small values indicate better distance value, whereby small values indicate better matches and
matches and large values indicate worse ones. The distance between large values indicate worse ones. The distance between the pair is
the pair is the sum of the distances for each of the corresponding the sum of the distances for each of the corresponding elements of
elements of the quintuple. If the elements are identical or one is the quintuple. If the elements are identical or one is '*', then the
'*', then the distance value between them is zero. Otherwise, it is distance value between them is zero. Otherwise, it is given by the
given by the following table: following table:
256 language mismatch 256 language mismatch
128 script mismatch 128 script mismatch
32 region mismatch 32 region mismatch
4 variant mismatch 4 variant mismatch
1 extension mismatch 1 extension mismatch
A value of 0 is a perfect match; 421 is no match at all. Different A value of 0 is a perfect match; 421 is no match at all. Different
threshold values might be appropriate for different applications and threshold values might be appropriate for different applications or
implementations will probably allow users to choose the most protocols. Implementations will usually allow users to choose the
appropriate selection value, ranking the selections based on score. most appropriate selection value, ranking the matched items based on
score.
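The element-by-element sum described above can be sketched in Python as follows. This is a sketch under stated assumptions: it takes quintuples that have already been built (the transformation from tags to quintuples is not shown), it compares elements case-insensitively, and the names WEIGHTS, distance, and rank are hypothetical. It applies only the rule as stated (identical elements, or a "*" on either side, contribute zero).

```python
# Weights for (language, script, region, variant, extension) mismatches.
WEIGHTS = (256, 128, 32, 4, 1)


def distance(range_quintuple, tag_quintuple):
    """Sum the per-element distances between two quintuples."""
    total = 0
    for weight, r, t in zip(WEIGHTS, range_quintuple, tag_quintuple):
        r, t = r.lower(), t.lower()
        # Identical elements, or a wildcard on either side, cost nothing.
        if r != t and r != "*" and t != "*":
            total += weight
    return total


def rank(range_quintuple, tag_quintuples, threshold):
    """Keep quintuples within the threshold, closest match first."""
    scored = [(distance(range_quintuple, q), q) for q in tag_quintuples]
    kept = [pair for pair in scored if pair[0] <= threshold]
    return [q for _, q in sorted(kept, key=lambda pair: pair[0])]
```

A caller would choose the threshold to suit the application or protocol; a threshold of 0 accepts only perfect matches, while 421 accepts everything.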
Examples of various tags' distances from the range "en-US": Examples of various tags' distances from the range "en-US":
"fr" 256 (language mismatch, region match) "fr" 256 (language mismatch, region match)
"en-GB" 384 (language, region mismatch) "en-GB" 384 (language, region mismatch)
"en-Latn-US" 0 (all fields match) "en-Latn-US" 0 (all fields match)
"en-Brai" 32 (region mismatch) "en-Brai" 32 (region mismatch)
"en-US-x-foo" 4 (variant mismatch: range is the empty string) "en-US-x-foo" 4 (variant mismatch: range is the empty string)
"en-US-r-wadegile" 1 (extension mismatch: range is the empty string) "en-US-r-wadegile" 1 (extension mismatch: range is the empty string)
Implementations or protocols sometimes might wish to use more
sophisticated weights that depend on the values of the corresponding
elements. For example, depending on the domain, an implementation
might give a small distance to the difference between the language
subtag 'no' and the closely related language subtags 'nb' or 'nn'; or
between the script subtags 'Kata' and 'Hira'; or between the region
subtags 'US' and 'UM'.
Implementations may want to use more sophisticated weights that 3.3. Lookup
depend on the values of the corresponding elements. For example,
depending on the domain, an implementation might give a small distance
to the difference between the language subtag 'no' and the closely
related language subtags 'nb' or 'nn'; or between the script subtags
'Kata' and 'Hira'; or between the region subtags 'US' and 'UM'.
2.4. Meaning of Language Tags and Ranges Lookup is used to select the single information item that best
matches the language priority list for a given request. In lookup,
each language range in the language priority list represents the
_most_ specific tag which is an acceptable match; only the closest
matching item according to the user's priority is returned. For
example, if the language range is "de-CH", one might expect to
receive an information item with the tag "de" but never one with the
tag "de-CH-1996". Usually if no content matches the request, a
"default" item is returned.
A language tag defines a language as spoken (or written, signed or For example, if an application inserts some dynamic content into a
otherwise signaled) by human beings for communication of information document, returning an empty string if there is no exact match is not
to other human beings. an option. Instead, the application "falls back" until it finds a
suitable piece of content to insert. Other examples of lookup might
include:
o Selection of a template containing the text for an automated email
response.
o Selection of a graphic containing text for inclusion in a
particular Web page.
o Selection of a string of text for inclusion in an error log.
In the Lookup scheme, the language range is progressively truncated
from the end until a matching piece of content is located. For
example, starting with the range "zh-Hant-CN-x-private", the lookup
would progressively search for content as shown below:
Range to match: zh-Hant-CN-x-private
1. zh-Hant-CN-x-private
2. zh-Hant-CN
3. zh-Hant
4. zh
5. (default content or the empty tag)
Figure 5: Example of a Lookup Fallback Pattern
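The progressive truncation shown in Figure 5 can be sketched in Python as follows. The function name fallback_chain is hypothetical. The sketch assumes, based on the figure, that a single-character subtag (such as the "x" introducing a private use sequence) is removed together with the subtag that followed it. The final step (default content or the empty tag) is left to the caller.

```python
def fallback_chain(language_range):
    """List the ranges to try, from most specific to least specific."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        # Truncate from the end: drop the last subtag.
        subtags.pop()
        # Assumption from Figure 5: never leave a dangling
        # single-character subtag at the end of the range.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain
```

Under these assumptions, fallback_chain("zh-Hant-CN-x-private") yields the first four steps of Figure 5 in order.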
This scheme allows some flexibility in finding content. When data is
not available at a specific level of tag granularity or is sparsely
populated, it also typically provides better results than using the
default language for the system or content.
The language range "*" matches any language tag. In the lookup
scheme, this language range does not convey enough information to
determine which content is most appropriate. If this language range
is the only one in the language priority list, it matches the default
content. If this language range is followed by other language
ranges, it should be skipped.
When performing lookup using a language priority list, the
progressive search MUST proceed to consider each language range
before finding the default content or empty tag. The default content
might be content with no language tag (or with an empty value, as
with xml:lang in the XML specification), or it might be a particular
language designated for that bit of content.
One common way to provide for default content is to allow a specific
language range to be set as the default for a specific type of
request. This language range is then treated as if it were appended
to the end of the language priority list, rather than after each item
in the language priority list.
For example, if a particular user's language priority list were
"fr-FR; zh-Hant" and the program doing the matching had a default
language range of "ja-JP", the program would search for content as
follows:
1. fr-FR
2. fr
3. zh-Hant // next language
4. zh
5. (return default content)
a. ja-JP
b. ja
c. (empty tag or other default content)
Figure 6: Lookup Using a Language Priority List
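The search order of Figure 6 can be sketched in Python as follows. The names search_order and lookup are hypothetical, as is the representation of content as a dictionary keyed by language tag with the empty tag ("") holding the untagged default item. The sketch assumes case-insensitive comparison, skips the range "*" (which then falls through to the default content), and treats the default language range as if it were appended to the end of the language priority list.

```python
def fallback_chain(language_range):
    """List the ranges to try, from most specific to least specific."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # Do not leave a dangling single-character subtag.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def search_order(priority_list, default_range=None):
    """Expand a language priority list into the full search order."""
    ranges = [r for r in priority_list if r != "*"]
    if default_range is not None:
        ranges.append(default_range)
    return [c for r in ranges for c in fallback_chain(r)]


def lookup(priority_list, content, default_range=None):
    """Return the single best item, or the untagged default (or None)."""
    available = {tag.lower(): item for tag, item in content.items()}
    for candidate in search_order(priority_list, default_range):
        if candidate.lower() in available:
            return available[candidate.lower()]
    return available.get("")
```

Every range in the priority list is exhausted before the default range is consulted, which is the behavior the MUST above requires.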
In some cases, the language priority list might contain one or more
extended language ranges (as, for example, when the same language
priority list is used as input for both lookup and filtering
operations). Wildcard values in an extended language range are
supposed to match any value that occurs in that position in a
language tag. Since only one item can be returned for any given
lookup request, the wildcards must be processed in a predictable
manner (or the same request might produce widely varying results).
Thus, for each range in the language priority list, the following
rules must be applied to produce a basic language range for use in
the fallback mechanism:
1. If the first subtag in the extended language range is a "*" then
the entire range is converted to "*".
2. For each subsequent subtag, if the value is a "*" then that
subtag and its preceding hyphen are removed.
For example:
*-US becomes *
en-*-US becomes en-US
en-Latn-* becomes en-Latn
Figure 7: Transformation of Extended Language Ranges
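The two transformation rules can be sketched in Python as follows. The function name to_basic_range is hypothetical, and the sketch assumes the input is a syntactically well-formed extended language range.

```python
def to_basic_range(extended_range):
    """Convert an extended language range to a basic one for lookup."""
    subtags = extended_range.split("-")
    # Rule 1: a leading wildcard turns the whole range into "*".
    if subtags[0] == "*":
        return "*"
    # Rule 2: drop every later wildcard along with its hyphen.
    return "-".join(s for s in subtags if s != "*")
```

Applied to the language priority list "*-US; fr-*-FR; zh-Hant", this produces "*", "fr-FR", and "zh-Hant", which are then processed by the ordinary fallback mechanism.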
For the language priority list "*-US; fr-*-FR; zh-Hant", the fallback
pattern would be:
1. * (skipped)
2. fr-FR
3. fr
4. zh-Hant
5. zh
6. (default content)
Figure 8: Extended Language Range Fallback Example
4. Other Considerations
When working with language ranges and matching schemes, there are
some additional points that may influence the choice of either.
4.1. Meaning of Language Tags and Ranges
Selecting content using language ranges requires some understanding
by users of what they are selecting. A language tag or range
identifies a language as spoken (or written, signed or otherwise
signaled) by human beings for communication of information to other
human beings.
If a language tag B contains language tag A as a prefix, then B is If a language tag B contains language tag A as a prefix, then B is
typically "narrower" or "more specific" than A. For example, "zh- typically "narrower" or "more specific" than A. For example, "zh-
Hant-TW" is more specific than "zh-Hant". Hant-TW" is more specific than "zh-Hant".
This relationship is not guaranteed in all cases: specifically, This relationship is not guaranteed in all cases: specifically,
languages that begin with the same sequence of subtags are NOT languages that begin with the same sequence of subtags are NOT
guaranteed to be mutually intelligible, although they might be. guaranteed to be mutually intelligible, although they might be.
For example, the tag "az" shares a prefix with both "az-Latn" For example, the tag "az" shares a prefix with both "az-Latn"
(Azerbaijani written using the Latin script) and "az-Cyrl" (Azerbaijani written using the Latin script) and "az-Arab"
(Azerbaijani written using the Cyrillic script). A person fluent in (Azerbaijani written using the Arabic script). A person fluent in
one script might not be able to read the other, even though the text one script might not be able to read the other, even though the text
might be otherwise identical. Content tagged as "az" most probably might be otherwise identical. Content tagged as "az" most probably
is written in just one script and thus might not be intelligible to a is written in just one script and thus might not be intelligible to a
reader familiar with the other script. reader familiar with the other script.
Variant subtags in particular seem to represent specific divisions in Variant subtags in particular seem to represent specific divisions in
mutual understanding, since they often encode dialects or other mutual understanding, since they often encode dialects or other
idiosyncratic variations within a language. idiosyncratic variations within a language.
The relationship between the language tag and the information it The relationship between the language tag and the information it
relates to is defined by the standard describing the context in which relates to is defined by the standard describing the context in which
it appears. Accordingly, this section can only give possible it appears. Accordingly, this section can only give possible
examples of its usage. examples of its usage:
o For a single information object, the associated language tags o For a single information object, the associated language tags
might be interpreted as the set of languages that are necessary might be interpreted as the set of languages that are necessary
for a complete comprehension of the complete object. Example: for a complete comprehension of the complete object. Example:
Plain text documents. Plain text documents.
o For an aggregation of information objects, the associated language o For an aggregation of information objects, the associated language
tags could be taken as the set of languages used inside components tags could be taken as the set of languages used inside components
of that aggregation. Examples: Document stores and libraries. of that aggregation. Examples: Document stores and libraries.
skipping to change at page 14, line 44 skipping to change at page 17, line 26
structure (including the whole document itself). For example, one structure (including the whole document itself). For example, one
could write <span lang="FR">C'est la vie.</span> inside a could write <span lang="FR">C'est la vie.</span> inside a
Norwegian document; the Norwegian-speaking user could then access Norwegian document; the Norwegian-speaking user could then access
a French-Norwegian dictionary to find out what the marked section a French-Norwegian dictionary to find out what the marked section
meant. If the user were listening to that document through a meant. If the user were listening to that document through a
speech synthesis interface, this information could be used to signal speech synthesis interface, this information could be used to signal
the synthesizer to appropriately apply French text-to-speech the synthesizer to appropriately apply French text-to-speech
pronunciation rules to that span of text, instead of misapplying pronunciation rules to that span of text, instead of misapplying
the Norwegian rules. the Norwegian rules.
2.5. Choosing Between Alternate Matching Schemes 4.2. Considerations for Private Use Subtags
Implementers are faced with the decision of what form of matching to
use in a specific application. An application can choose to
implement different styles of matching for different kinds of
processing.
The most basic choice is between schemes that produce an open-ended
set of content (a "matching" application) and those that usually
produce a single information item (a "lookup" application). Note
that lookup applications can produce multiple items, but usually only
a single item for any given piece of content, and they can be used to
order content (the later in the overall fallback that the content
appears to match, the more distant the match).
Matching applications can produce an ordered or unordered set of
results. For example, applying formatting to a document based on the
language of specific pieces of content does not require the content
to be ordered. It is sufficient to know whether a specific piece of
content matches or does not match. A search application, on the
other hand, probably would put the results into a priority order.
If a single item is to be chosen, it may sometimes be useful to apply
additional information, such as the most likely script used in the
language or region in question or the script used by other content
selected, in order to make a more "informed" choice.
The matching schemes in this document are designed so that
implementations do not have to examine the values of the subtags
supplied and, except for scored matching, they do not need access to
the Language Subtag Registry nor do they require the use of valid
subtags in language tags or ranges. This has great benefit for speed
and simplicity of implementation.
Implementations might also wish to use semantic information external
to the language tags when performing fallback. For example, the
primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
Norwegian) might both be usefully matched to the more general subtag
'no' (Norwegian). Or an application might infer that content labeled
"zh-CN" is more likely to match the range "zh-Hans" than equivalent
content labeled "zh-TW".
2.6. Considerations for Private Use Subtags
Private-use subtags require private agreement between the parties Private-use subtags require private agreement between the parties
that intend to use or exchange language tags that use them and great that intend to use or exchange language tags that use them and great
caution SHOULD be used in employing them in content or protocols caution SHOULD be used in employing them in content or protocols
intended for general use. Private-use subtags are simply useless for intended for general use. Private-use subtags are simply useless for
information exchange without prior arrangement. information exchange without prior arrangement.
The value and semantic meaning of private-use tags and of the subtags The value and semantic meaning of private-use tags and of the subtags
used within such a language tag are not defined. Matching private used within such a language tag are not defined. Matching private
use tags using language ranges or extended language ranges can result use tags using language ranges or extended language ranges can result
in unpredictable content being returned. in unpredictable content being returned.
2.7. Length Considerations in Matching 4.3. Length Considerations in Matching
RFC 3066 [RFC3066] did not provide an upper limit on the size of RFC 3066 [RFC3066] did not provide an upper limit on the size of
language tags or ranges. RFC 3066 did define the semantics of language tags or ranges. RFC 3066 did define the semantics of
particular subtags in such a way that most language tags or ranges particular subtags in such a way that most language tags or ranges
consisted of language and region subtags with a combined total length consisted of language and region subtags with a combined total length
of up to six characters. Larger tags and ranges (in terms of both of up to six characters. Larger tags and ranges (in terms of both
subtags and characters) did exist, however. subtags and characters) did exist, however.
[draft-registry] also does not impose a fixed upper limit on the [RFC3066bis] also does not impose a fixed upper limit on the number
number of subtags in a language tag or range (and thus an upper bound of subtags in a language tag or range (and thus an upper bound on the
on the size of either). The syntax in that document suggests that, size of either). The syntax in that document suggests that,
depending on the specific language or range of languages, more depending on the specific language or range of languages, more
subtags (and thus characters) are sometimes necessary as a result. subtags (and thus characters) are sometimes necessary as a result.
Length considerations and their impact on the selection and Length considerations and their impact on the selection and
processing of tags are described in Section 2.1.1 of that document. processing of tags are described in Section 2.1.1 of that document.
A matching implementation MAY choose to limit the length of the An application or protocol MAY choose to limit the length of the
language tags or ranges used in matching. Any such limitation SHOULD language tags or ranges used in matching. Any such limitation SHOULD
be clearly documented, and such documentation SHOULD include the be clearly documented, and such documentation SHOULD include the
disposition of any longer tags or ranges (for example, whether an disposition of any longer tags or ranges (for example, whether an
error value is generated or the language tag or range is truncated). error value is generated or the language tag or range is truncated).
If truncation is permitted it MUST NOT permit a subtag to be divided, If truncation is permitted it MUST NOT permit a subtag to be divided,
since this changes the semantics of the subtag being matched and can since this changes the semantics of the subtag being matched and can
result in false positives or negatives. result in false positives or negatives.
Implementations that restrict storage SHOULD consider the impact of Applications or protocols that restrict storage SHOULD consider the
tag or range truncation on the resulting matches. For example, impact of tag or range truncation on the resulting matches. For
removing the "*" from the end of an extended language range (see example, removing the "*" from the end of an extended language range
Section 2.3) can greatly modify the set of returned matches. A (see Section 2.3) can greatly modify the set of returned matches. A
protocol that allows tags or ranges to be truncated at an arbitrary protocol that allows tags or ranges to be truncated at an arbitrary
limit, without giving any indication of what that limit is, has the limit, without giving any indication of what that limit is, has the
potential for causing harm by changing the meaning of values in potential for causing harm by changing the meaning of values in
substantial ways. substantial ways.
In practice, most tags do not require additional subtags or In practice, most tags do not require additional subtags or
substantially more characters. Additional subtags sometimes add substantially more characters. Additional subtags sometimes add
useful distinguishing information, but extraneous subtags interfere useful distinguishing information, but extraneous subtags interfere
with the meaning, understanding, and especially matching of language with the meaning, understanding, and especially matching of language
tags. Since language tags or ranges MAY be truncated by an tags. Since language tags or ranges MAY be truncated by an
application or protocol that limits storage, when choosing language application or protocol that limits storage, when choosing language
tags or ranges users and applications SHOULD avoid adding subtags tags or ranges users and applications SHOULD avoid adding subtags
that add no distinguishing value. In particular, users and that add no distinguishing value. In particular, users and
implementations SHOULD follow the 'Prefix' and 'Suppress-Script' implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
fields in the registry (defined in Section 3.6 of [draft-registry]): fields in the registry (defined in Section 3.6 of [RFC3066bis]):
these fields provide guidance on when specific additional subtags these fields provide guidance on when specific additional subtags
SHOULD (and SHOULD NOT) be used. SHOULD (and SHOULD NOT) be used.
Implementations MUST support a limit of at least 33 characters. This Implementations MUST support a limit of at least 33 characters. This
limit includes at least one subtag of each non-extension, non-private limit includes at least one subtag of each non-extension, non-private
use type. When choosing a buffer limit, a length of at least 42 use type. When choosing a buffer limit, a length of at least 42
characters is strongly RECOMMENDED. characters is strongly RECOMMENDED.
The practical limit on tags or ranges derived solely from registered The practical limit on tags or ranges derived solely from registered
values is 42 characters. Implementations MUST be able to handle tags values is 42 characters. Implementations MUST be able to handle tags
skipping to change at page 17, line 24 skipping to change at page 19, line 9
longer values, including matching extensive sets of private use or longer values, including matching extensive sets of private use or
extension subtags. extension subtags.
Applications or protocols which have to truncate a tag MUST do so by Applications or protocols which have to truncate a tag MUST do so by
progressively removing subtags along with their preceding "-" from progressively removing subtags along with their preceding "-" from
the right side of the language tag until the tag is short enough for the right side of the language tag until the tag is short enough for
the given buffer. If the resulting tag ends with a single-character the given buffer. If the resulting tag ends with a single-character
subtag, that subtag and its preceding "-" MUST also be removed. For subtag, that subtag and its preceding "-" MUST also be removed. For
example: example:
Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1 Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
1. zh-Hant-CN-variant1-a-extend1-x-wadegile 1. zh-Latn-CN-variant1-a-extend1-x-wadegile
2. zh-Hant-CN-variant1-a-extend1 2. zh-Latn-CN-variant1-a-extend1
3. zh-Hant-CN-variant1 3. zh-Latn-CN-variant1
4. zh-Hant-CN 4. zh-Latn-CN
5. zh-Hant 5. zh-Latn
6. zh 6. zh
Figure 7: Example of Tag Truncation Figure 9: Example of Tag Truncation
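The truncation rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of either draft: the function name truncate_tag and the max_len parameter are hypothetical, and the choice to return the primary subtag unchanged when even it does not fit is an assumption, since the text does not cover that case.

```python
def truncate_tag(tag: str, max_len: int) -> str:
    """Shorten a language tag to at most max_len characters.

    Subtags are removed from the right, each with its preceding "-".
    If the result ends in a single-character subtag (an extension or
    private use singleton), that subtag is removed as well.
    Assumption: the first subtag is never removed, even if too long.
    """
    subtags = tag.split("-")
    while len(subtags) > 1 and len("-".join(subtags)) > max_len:
        subtags.pop()  # drop the rightmost subtag
        # A trailing singleton such as "a" or "x" must not be left behind.
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    return "-".join(subtags)


example = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
print(truncate_tag(example, 30))  # zh-Latn-CN-variant1-a-extend1
```

With a 30-character buffer the sketch stops at step 2 of the figure: removing "wadegile" leaves a trailing "x", which is removed in the same pass.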
3. IANA Considerations 5. IANA Considerations
This document presents no new or existing considerations for IANA. This document presents no new or existing considerations for IANA.
4. Changes 6. Changes
This is the first version of this document. This is the first version of this document.
The following changes were put into this document since draft-03: The following changes were put into this document since draft-05:
Modified the ABNF to match changes in [draft-registry] Modified the ABNF to match changes in [RFC3066bis] (K.Karlsson)
(K.Karlsson)
Matched the references and reference formats to [draft-registry] Matched the references and reference formats to [RFC3066bis]
(K.Karlsson) (K.Karlsson)
Various edits, additions, and emendations to deal with changes in Various edits, additions, and emendations to deal with changes in
the Last Call of draft-registry as well as cleaning up the text. the Last Call of draft-registry as well as cleaning up the text.
5. Security Considerations Changed from 'defined' to 'identifies' in Section 4.1. (M.Gunn)
Reorganized the text and broke it into sections (M.Duerst)
Modified occurrences of the word "application" to refer to
"applications or protocols" or otherwise be specific (E. van der
Poel)
Removed "Extended Language Range Lookup", merging it with other
text on lookup to form a single scheme. (M.Davis)
Fixed or removed obsolete or dangling references (Ed.)
Added an introduction to section 4 and added one sentence to make
it flow better to the start of section 4.1. (Ed.)
7. Security Considerations
Language ranges used in content negotiation might be used to infer Language ranges used in content negotiation might be used to infer
the nationality of the sender, and thus identify potential targets the nationality of the sender, and thus identify potential targets
for surveillance. In addition, unique or highly unusual language for surveillance. In addition, unique or highly unusual language
ranges or combinations of language ranges might be used to track ranges or combinations of language ranges might be used to track
a specific individual's activities. a specific individual's activities.
This is a special case of the general problem that anything you send This is a special case of the general problem that anything you send
is visible to the receiving party. It is useful to be aware that is visible to the receiving party. It is useful to be aware that
such concerns can exist in some cases. such concerns can exist in some cases.
The evaluation of the exact magnitude of the threat, and any possible The evaluation of the exact magnitude of the threat, and any possible
countermeasures, is left to each application protocol. countermeasures, is left to each application or protocol.
6. Character Set Considerations 8. Character Set Considerations
The syntax of language tags and language ranges permits only the The syntax of language tags and language ranges permits only the
characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D). These characters characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D). These characters
are present in most character sets, so presentation of language tags are present in most character sets, so presentation of language tags
should not present any character set issues. should not present any character set issues.
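As a rough illustration of the character repertoire stated above, the following Python sketch tests whether a string uses only A-Z, a-z, 0-9, and HYPHEN-MINUS. The function name uses_only_tag_characters is hypothetical. The check covers the character set only; it does not validate subtag lengths, ordering, or any other part of the tag syntax.

```python
import re

# Only the characters named in the paragraph above: A-Z, a-z, 0-9, "-".
_TAG_CHARACTERS = re.compile(r"[A-Za-z0-9-]+")


def uses_only_tag_characters(value: str) -> bool:
    """Return True if value is non-empty and drawn from the permitted set."""
    return _TAG_CHARACTERS.fullmatch(value) is not None


print(uses_only_tag_characters("zh-Latn-CN"))  # True
print(uses_only_tag_characters("en_US"))       # False (underscore)
```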
7. References 9. References
7.1. Normative References
[ID.ietf-ltru-initial]
Ewell, D., Ed., "Language Tags Initial Registry (work in
progress)", August 2005, <http://www.ietf.org/
internet-drafts/draft-ietf-ltru-initial-04.txt>.
[RFC1327] Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO
10021 and RFC 822", RFC 1327, May 1992.
[RFC1521] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet
Mail Extensions) Part One: Mechanisms for Specifying and
Describing the Format of Internet Message Bodies",
RFC 1521, September 1993.
[RFC2028] Hovey, R. and S. Bradner, "The Organizations Involved in 9.1. Normative References
the IETF Standards Process", BCP 11, RFC 2028,
October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2231] Freed, N. and K. Moore, "MIME Parameter Value and Encoded
Word Extensions: Character Sets, Languages, and
Continuations", RFC 2231, November 1997.
[RFC2234bis]
Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", draft-crocker-abnf-rfc2234bis-00
(work in progress), March 2005.
[RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396,
August 1998.
[RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an
IANA Considerations Section in RFCs", BCP 26, RFC 2434,
October 1998.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[RFC2860] Carpenter, B., Baker, F., and M. Roberts, "Memorandum of [RFC3066bis]
Understanding Concerning the Technical Work of the
Internet Assigned Numbers Authority", RFC 2860, June 2000.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, November 2003.
[draft-registry]
Phillips, A., Ed. and M. Davis, Ed., "Tags for the Phillips, A., Ed. and M. Davis, Ed., "Tags for the
Identification of Languages (work in progress)", Identification of Languages", October 2005, <http://
August 2005, <http://www.ietf.org/internet-drafts/ www.ietf.org/internet-drafts/
draft-ietf-ltru-registry-12.txt>. draft-ietf-ltru-registry-14.txt>.
7.2. Informative References
[ISO15924]
"ISO 15924:2004. Information and documentation -- Codes
for the representation of names of scripts", January 2004.
[ISO3166-1]
"ISO 3166-1:1997. Codes for the representation of names of
countries and their subdivisions -- Part 1: Country
codes", 1997.
[ISO639-1] [RFC4234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
"ISO 639-1:2002. Codes for the representation of names of Specifications: ABNF", RFC 4234, October 2005.
languages -- Part 1: Alpha-2 code", 2002.
[ISO639-2] 9.2. Informative References
"ISO 639-2:1998. Codes for the representation of names of
languages -- Part 2: Alpha-3 code, first edition", 1998.
[RFC1766] Alvestrand, H., "Tags for the Identification of [RFC1766] Alvestrand, H., "Tags for the Identification of
Languages", RFC 1766, March 1995. Languages", RFC 1766, March 1995.
[RFC3066] Alvestrand, H., "Tags for the Identification of [RFC3066] Alvestrand, H., "Tags for the Identification of
Languages", BCP 47, RFC 3066, January 2001. Languages", BCP 47, RFC 3066, January 2001.
[RFC3282] Alvestrand, H., "Content Language Headers", RFC 3282, [RFC3282] Alvestrand, H., "Content Language Headers", RFC 3282,
May 2002. May 2002.
[RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet:
Timestamps", RFC 3339, July 2002.
[UN_M.49] Statistics Division, United Nations, "Standard Country or
Area Codes for Statistical Use", UN Standard Country or
Area Codes for Statistical Use, Revision 4 (United Nations
publication, Sales No. 98.XVII.9, June 1999.
Appendix A. Acknowledgements Appendix A. Acknowledgements
Any list of contributors is bound to be incomplete; please regard the Any list of contributors is bound to be incomplete; please regard the
following as only a selection from the group of people who have following as only a selection from the group of people who have
contributed to make this document what it is today. contributed to make this document what it is today.
The contributors to [draft-registry], [RFC3066] and [RFC1766], each The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
of which is a precursor to this document, made enormous contributions which is a precursor to this document, made enormous contributions
directly or indirectly to this document and are generally responsible directly or indirectly to this document and are generally responsible
for the success of language tags. for the success of language tags.
The following people (in alphabetical order by family name) The following people (in alphabetical order by family name)
contributed to this document: contributed to this document:
Jeremy Carroll, John Cowan, Frank Ellermann, Doug Ewell, Kent Jeremy Carroll, John Cowan, Martin Duerst, Frank Ellermann, Doug
Karlsson, Ira McDonald, M. Patton, Randy Presuhn and many, many Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M. Patton, Randy
others. Presuhn, Eric van der Poel, and many, many others.
Very special thanks must go to Harald Tveit Alvestrand, who Very special thanks must go to Harald Tveit Alvestrand, who
originated RFCs 1766 and 3066, and without whom this document would originated RFCs 1766 and 3066, and without whom this document would
not have been possible. not have been possible.
For this particular document, John Cowan originated the scheme For this particular document, John Cowan originated the scheme
described in Section 2.3.3. Mark Davis originated the scheme described in Section 3.2.3. Mark Davis originated the scheme
described in Section 2.2.2. described in Section 3.3.
Authors' Addresses Authors' Addresses
Addison Phillips (editor) Addison Phillips (editor)
Quest Software Quest Software
Email: addison dot phillips at quest dot com Email: addison dot phillips at quest dot com
Mark Davis (editor) Mark Davis (editor)
IBM IBM
 End of changes. 91 change blocks. 
417 lines changed or deleted 450 lines changed or added
