diff --git a/index.html b/index.html
index bf9dedb..1fb6abb 100644
--- a/index.html
+++ b/index.html
@@ -50,6 +50,27 @@
+
Tags for identifying the natural language of content or the international preferences of users are one of the fundamental building blocks of the Web. The language tags found in Web and Internet formats and protocols are defined by [[BCP47]]. Consistent use of language tags provides applications the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select an appropriate font for displaying text or a Web page designer might style text differently in one language than in another.
+Tags for identifying the natural language of content or the locale of users are one of the fundamental building blocks of the Web. The language tags found in Web and Internet formats and protocols are defined by [[BCP47]]. Consistent use of language tags provides applications the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select an appropriate font for displaying text or a Web page designer might style text differently in one language than in another.
Many of the core standards for the Web include support for language tags; these include the xml:lang attribute in [[XML10]], the lang and hreflang attributes in [[HTML]], the language property in [[XSL10]], the :lang pseudo-class in CSS [[CSS3-SELECTORS]], and many others, including SVG, TTML, and SSML.
Natural Language (or, in this document, just language). The spoken, written, or signed communications used by human beings.
+ +Natural Language (or, in this document, just language). The spoken, written, or signed communications used by human beings.
There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [[BCP47]]. "BCP" nomenclature refers to the current set of IETF RFCs that form the "best current practice".
-Language tag. A string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [[BCP47]] language tag. These language tags consist of one or more subtags.
+ +Language tag. A string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [[BCP47]] language tag. These language tags consist of one or more subtags.
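As a minimal sketch of how a language tag breaks down into subtags, the snippet below uses the `Intl.Locale` object from [[ECMA-402]]. The helper name `describeTag` and the sample tags are illustrative assumptions, not part of this document. Note that `Intl.Locale` accepts Unicode locale identifiers, so a small number of valid [[BCP47]] tags may be rejected.

```javascript
// Hypothetical helper: split a language tag into some of its subtags.
// Intl.Locale throws a RangeError if the tag is not structurally valid.
function describeTag(tag) {
  const locale = new Intl.Locale(tag);
  return {
    canonical: locale.toString(), // tag with conventional casing applied
    language: locale.language,    // primary language subtag
    script: locale.script,        // undefined when no script subtag is present
    region: locale.region,        // undefined when no region subtag is present
  };
}

// Sample tags chosen for illustration only.
console.log(describeTag("zh-Hant-TW"));
console.log(describeTag("en"));
```

Subtags that are absent from the tag are reported as `undefined` rather than being guessed.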
Specifications for the Web that require language identification MUST refer to [[BCP47]].
Specifications SHOULD NOT refer to specific component RFCs of [[BCP47]].
-[[BCP47]] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [[RFC5646]], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [[RFC4647]], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.
-Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary.
+[[BCP47]] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [[RFC5646]], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [[RFC4647]], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.
+ +Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary.
While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [[RFC4646]], referring to the BCP will not incur additional compliance risk to most implementations.
@@ -193,8 +217,6 @@
For example, JavaScript internationalization [[ECMA-402]] and [[CLDR]] provide a "best fit" algorithm which can be tailored by implementers.
- -This section defines basic terminology related to internationalization and localization.
-Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.
-Language tags can also be used to identify international preferences associated with a given piece of content or user because these preferences are linked to the natural language, regional association, or culture of the end user. Such preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing defaults for items such as the presentation of a calendar, or common units of measurement; selecting between 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, an identifier for these preferences is usually called a locale. The extensions to [[BCP47]] that define Unicode locales [[CLDR]] provide the basis for internationalization APIs on the Web, notably the JavaScript language [[ECMASCRIPT]] uses Unicode locales as the basis for the APIs found in [[ECMA-402]].
- -International Preferences. A user's particular set of language and formatting preferences and associated cultural conventions. Software can use these preferences to correctly process or present information exchanged with that user.
-
-Many kinds of international preference may be offered
-on the Web in order for a content or a service to be considered usable
-and acceptable by users around the world. Some of these preferences
-might include:
-
Internationalization. The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated i18n because there are eighteen letters between the "I" and the "N" in the English word.
Localization. The tailoring of a given software component to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as l10n because there are ten letters between the "L" and the "N" in the English word. When a particular set of content and preferences corresponding to a specific locale is operationally available, then the software is said to be localized.
Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process or present information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions. Some of these preferences might include: +
The work to enable software to have language and culturally sensitive processing is called internationalization. It is primarily the work of the software developers, since it involves using the correct APIs and libraries and making appropriate modifications to the source code.
-Internationalization. The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated i18n because there are eighteen letters between the "I" and the "N" in the English word.
Localizable Resource. The portion of the software's source code that can be translated or adjusted to suit local language or cultural requirements. This term is usually shortened to "resource" or "resources", which is confusingly similar to the more generic term resource used commonly on the Web.
+ +Internationalization includes separating the portion of the software that requires translation or adaptation from the portion of the software that is common to all language or regional versions. These separated portions of the software are the localizable resources and are often stored in special files ("resource files") in specific locations in the source tree. The most common localizable resources are user interface strings and messages.
+ +The work to create specific language versions (including versions tailored for specific regions) is called localization. This includes translation of the localizable resources created during internationalization.
-Localization. The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as l10n because there are ten letters between the "L" and the "N" in the English word. When a particular set of content and preferences corresponding to a specific set of international preferences is operationally available, then the system is said to be localized.
International Preferences. A user's particular language, formatting, regional preferences or associated cultural preferences.
-Locale. An identifier (such as a language tag) for a set of international preferences. Usually this identifier indicates the preferred language of the user and possibly includes other information, such as a geographic region (such as a country). A locale is passed in APIs or set in the operating environment to obtain culturally-affected behavior within a system or process.
+When constructing an internationalized piece of software, the developer has to be aware of and account for the various ways in which a user might wish for display or functionality to be adapted to suit local needs. Each such individual setting might be considered to be its own international preference. Groups of preferences might be necessary in order for the content or software to be considered usable and acceptable by users around the world.
+ + + -Locale-aware (or Enabled). A system that can respond to changes in the locale with culturally and language-specific behavior or content. Generally, systems that are internationalized can support a wide range of locales in order to meet the international preferences of many kinds of users.
-Language tags can provide information about the language, script, region, and various specially-registered variants using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. Thus a German language user might want to choose between the sort ordering used in a dictionary versus that used in a phone book.
+Locale. A particular set of international preferences for a specific language, including regional or other tailorings.
-Historically, locales were associated with and specific to the programming language or operating environment of the user. These application-specific identifiers often could be inferred from or converted into language tags. Some examples of locale models include Java's java.util.Locale, POSIX (with identifiers such as de_CH@utf8), Oracle databases (AMERICAN_AMERICA.AL32UTF8), or Microsoft's LCIDs (which used numeric codes such as 0x0409). The relationship between several of these models, the underlying standards such as ISO639 or ISO3166, and early language tags (such as [[RFC1766]]) was entirely intentional. Implementations often mapped (and continue to map) language tags from an existing protocol, such as HTTP's Accept-Language header, to proprietary or platform-specific locale models.
There can be hundreds or thousands of different individual international preferences required to internationalize and localize a bit of software. However, most users share their preferences with other speakers of the same language within a given region. While specific overrides are sometimes useful, in most cases software represents the pre-tailored complete set of international preferences using a single, all-encompassing setting called a locale.
-Since the adoption of the current [[BCP47]] identifier syntax, a number of locale models have adopted BCP47 directly or provided adaptation or mappings between proprietary models and language tags. Notably, the development and adoption of the open-source repository of locale data known as [[CLDR]] has led to wider general adoption of language tags as locale identifiers.
+The locale is applied to processes such as selecting content in the most appropriate language; presenting numbers, dates, or times shown next to or within such content; sorting lists linguistically; providing defaults for items such as the presentation of a calendar, or common units of measurement; selecting between 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually.
-Common Locale Data Repository (or [[CLDR]]). The Common Locale Data Repository is a Unicode Consortium project that defines, collects, and curates sets of data needed to enable locales in systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.
+Locale Identifier. An identifier (often a language tag) for a locale. Usually this identifier indicates the preferred language of the user. It can also include other information, such as a geographic region (for example, a country). A locale is passed in APIs or set in the operating environment to obtain culturally-affected behavior within a system or process.
+ +Each locale is associated with a locale identifier. Some APIs allow the identifier to be passed directly. Others require the user to create a locale data object (such as JavaScript's Intl.Locale or Java's java.util.Locale) that is passed to APIs. Often there is a global environment setting that contains the current program or thread-specific locale or locale identifier.
Historically, locale identifiers were associated with (and specific to) the programming language or operating environment of the user. These application-specific identifiers might contain a language tag (although some did not). They often could be inferred from or converted into language tags. Some examples of locale models include Java's java.util.Locale, POSIX (with identifiers such as de_CH@utf8), Oracle databases (AMERICAN_AMERICA.AL32UTF8), or Microsoft's LCIDs (which used numeric codes such as 0x0409). The relationship between several of these models, the underlying standards such as ISO639 or ISO3166, and early language tags (such as [[RFC1766]]) was entirely intentional. Implementations often mapped (and continue to map) language tags from an existing protocol, such as HTTP's Accept-Language header, to proprietary or platform-specific locale models.
Language tags can provide information about the language, script, region, and various specially-registered variants using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. Thus a German language user might want to choose between the sort ordering used in a dictionary versus that used in a phone book.
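The German sorting case can be sketched with `Intl.Collator` from [[ECMA-402]], using the Unicode extension key `co` to request the phone book collation. The sample names and the helper `sortWith` are assumptions made for illustration. The resulting order depends on the collation data shipped with the runtime, so no expected output is shown.

```javascript
// Sample data (assumed for illustration).
const names = ["Müller", "Mueller", "Muffler", "Mutter"];

// Hypothetical helper: return a sorted copy using the collation requested by the tag.
function sortWith(tag, items) {
  const collator = new Intl.Collator(tag);
  return [...items].sort(collator.compare);
}

// Default German collation versus the phone book collation requested
// through the "-u-co-phonebk" Unicode extension.
const dictionaryOrder = sortWith("de", names);
const phonebookOrder = sortWith("de-u-co-phonebk", names);

console.log(dictionaryOrder);
console.log(phonebookOrder);

// Shows which collation the runtime actually selected; an implementation
// without phone book data may fall back to its default.
console.log(new Intl.Collator("de-u-co-phonebk").resolvedOptions().collation);
```

The language, script, and region subtags are the same in both calls; only the extension differs, which is why the preference cannot be inferred from those subtags alone.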
+ +Since the adoption of the current [[BCP47]] identifier syntax, a number of locale models have adopted BCP47 directly or provided adaptation or mappings between proprietary models and language tags. Notably, the development and adoption of the open-source repository of locale data known as [[CLDR]] has led to wider general adoption of language tags as locale identifiers. The extensions to [[BCP47]] that define Unicode locales [[CLDR]] provide the basis for internationalization APIs on the Web. For example, the JavaScript language [[ECMASCRIPT]] uses Unicode locales as the basis for the APIs found in [[ECMA-402]].
+ +Locale-aware (or Enabled). Any system or component of a system (such as a library or API) that can respond to changes in the locale with cultural- or language-specific behavior or content. Generally, software that is internationalized can support a wide range of locales in order to meet the international preferences of many kinds of users.
+ + + + + + + +Common Locale Data Repository (or [[CLDR]]). The Common Locale Data Repository is a Unicode Consortium project that defines, collects, and curates sets of data needed to enable locales in systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.
Unicode Locale Identifier or Unicode Locale. A language tag that follows the additional rules and restrictions on subtag choice defined in UTR#35 [[LDML]]. Any valid Unicode locale identifier is also a valid [[BCP47]] language tag, but a few valid language tags are not also valid Unicode locale identifiers.
@@ -492,7 +590,7 @@
Language negotiation. The process of matching a user's international preferences to available locales, localized resources, content, or processing.
+Language negotiation. The process of matching a user's locale to the available locales, localized resources, content, or processing.
Locale fallback. The process of searching for translated content, locale data, or other resources by "falling back" from more-specific resources to more-general ones following a deterministic pattern.
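One common deterministic pattern is to truncate the tag from the end, one subtag at a time, in the style of the lookup scheme in [[RFC4647]]. The sketch below is an illustration under that assumption; the function names `fallbackChain` and `lookup`, the default tag, and the sample inputs are hypothetical.

```javascript
// Build the list of progressively more general tags for a request.
function fallbackChain(tag) {
  const subtags = tag.split("-");
  const chain = [];
  while (subtags.length > 0) {
    chain.push(subtags.join("-"));
    subtags.pop();
    // A single-character subtag (an extension singleton or "x") cannot
    // end a tag, so remove it together with the subtag that followed it.
    while (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
      subtags.pop();
    }
  }
  return chain;
}

// Return the first available tag found along the chain, or the default.
// Comparison is case-insensitive, as language tags are.
function lookup(requested, available, defaultTag) {
  const have = new Map(available.map((t) => [t.toLowerCase(), t]));
  for (const candidate of fallbackChain(requested)) {
    const hit = have.get(candidate.toLowerCase());
    if (hit !== undefined) return hit;
  }
  return defaultTag;
}

console.log(fallbackChain("zh-Hant-TW")); // [ 'zh-Hant-TW', 'zh-Hant', 'zh' ]
console.log(lookup("fr-CA", ["en", "fr", "de"], "en")); // fr
```

A language negotiation step would typically run a lookup like this for each entry in the user's ordered preference list and stop at the first match.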
@@ -510,7 +608,59 @@
Users expect form fields and other data inputs to use a presentation for non-linguistic fields that is consistent with the document or application where the values appear. Users usually expect their input to match the document's context rather than that of the user-agent or operating environment, and they expect input validation, prompting, and controls to be consistent with the content as well. This gives content authors the ability to create a wholly localized customer experience and is generally in keeping with customer expectations.
There are two common uses for language tags in document formats, protocols, and specifications. In some cases, language tags are used to provide metadata about intended audience for collections of content, such as at the record or document level. In other cases, language tags are used to identify the language of specific bits of text in order to facilitate text processing.
+ +Metadata that describes the language of the intended audience is about the document as a whole. Such metadata may be used for searching, serving the right language version, classification, etc. Where there are language changes in a document, information about the language of the intended audience is not specific enough to support text-processing, that is to say, in a way that would be needed for the application of text-to-speech, styling, automatic font assignment, etc.
+ +The language of the intended audience does not include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.
+ +On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a Web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another.
+ +There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.
+ +Metadata about the language of the intended audience is usually best declared outside the document, such as in the HTTP Content-Language header.
+When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text (such as voice browsers, spell checkers, or style processors) can process the text in a language-appropriate manner. So we are, by necessity, talking about associating a single language with a specific range of text.
+ +This specificity distinguishes the declaration of the language for text-processing from that of the language of the intended audience.
+ +The language for text-processing is usually best declared using attributes on elements, including setting a document-wide default.
+ + + +