In the world of multilingual
computing, locales are a necessary evil. But what exactly is a locale?
Let's start with a few formal definitions. The Sun Global Application
Developer Corner site defines locale as "a specific geographical,
political, or cultural region. It is usually identified by a combination
of language and country, for example, en_US represents the locale US
English." The Microsoft Global Software Development site defines
locale as "a set of rules and data specific to a given language
and geographic area. These rules and data include information on character
classification; date and time formatting; numeric, currency, weight
and measure conventions; and sorting rules." The IBM International
Components for Unicode (ICU) glossary defines locale as "a set
of conventions affected or determined by human language and customs,
as defined within a particular geopolitical region. These conventions
include (but are not necessarily limited to) the written language, formats
for dates, numbers and currency, sorting orders, etc."
Essentially, locales are sets of rules which are usually based on a geographic
location and which are used to localize software settings for users in regions
around the world.
Debates on several e-mail discussion lists in recent months prompted the creation
of a new list, devoted entirely to the subject of locales. But why does this particular
topic warrant a list all its own? This article explains why.
Why It's an Issue
As illustrated in the Sun definition, locale specifications currently take the
form of a language/country combination. Therein lies the problem. A fixed connection
between the two is inherently problematic given the diversity of the world's
use of language. Graham Rhind, founder of GRC Database Information, summarizes
the problem by saying, "The name of a country is not the same as a language,
which is not the same as a currency, which is not the same as an address format,
which is not the same as a name format, and so on."
David Possin, Globalization Engineer for Heartlab, Inc., concurs: "We need
to break the tie between language and locale, and create real-world minimal defaults
for the locale definition. While IBM, Sun, and Microsoft use predefined language_country locales for their operating systems, that does not mean they are right."
The bottom line is that existing approaches for locale handling are too limited
and inflexible. Once a locale is specified, down comes a hard and fast set of
characteristics which may or may not match actual user needs. Michael Kaplan,
president of Trigeminal Software and author of the book Internationalization with
Visual Basic, says, "The challenge with managing locales is that (1) every
user has a specific idea of what he or she wants. (2) Individual users do not
all agree about what they want, and so they end up with what they don't want.
(3) Unhappy users do not understand the user interface well enough to fix the
problem. Even if they know the user interface well, it may not be flexible enough
to allow the degree of customization they desire."
Let's take a look at a real-world example of how this works. A global e-commerce
site such as Amazon.com is localized in many languages, can handle different currencies
and shipping methods, and can provide offers customized according to the user's
locale setting. For example, a customer with an en_US locale would see text in
English, view best-selling titles for the United States and see prices in US dollars.
But many people in the United States may prefer to use a language other than English.
These users might want to view the site in their native language, see the best-seller
list for their native country, and still want pricing in dollars and shipping
information appropriate to a US address.
Suppose this situation applies to a Mexican-American user. A locale of
es_US is invalid because Spanish is not an official language of the United States,
and it also implies the wrong Spanish variant (Spain's, not Mexico's).
A locale of mx_US is also invalid because mx is not a language code. A locale of
es_MX would provide the correct Spanish variant, but the wrong region.
This is just one example of the potential complexities regarding locale-related
options. But are these really locale issues, or do they cross over into some sort
of user preference no-man's land?
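The dilemma in the example above can be sketched in a few lines of Python. This is only an illustration, not how any real platform resolves locales: the function names, the attribute names and the "wanted" dictionary are all hypothetical. The sketch assumes that the single country field of a language_country identifier has to stand in for both the language variant and the region, which is the coupling the example describes.

```python
def split_locale(tag):
    """Split a language_country identifier such as 'en_US' into its two parts."""
    language, _, country = tag.partition("_")
    return language.lower(), country.upper()


# What the hypothetical user in the example wants, attribute by attribute.
WANTED = {"language": "es", "language_variant": "MX", "region": "US"}


def unmet_needs(tag, wanted=WANTED):
    """Return the wanted attributes that a single language_country tag gets wrong."""
    language, country = split_locale(tag)
    # Assumption: the one country field serves as both language variant and region.
    provided = {
        "language": language,
        "language_variant": country,
        "region": country,
    }
    return sorted(key for key, value in wanted.items() if provided[key] != value)


for candidate in ("es_US", "mx_US", "es_MX"):
    print(candidate, "misses", unmet_needs(candidate))
```

Under these assumptions, no candidate returns an empty list: each identifier satisfies some of the user's wishes only by giving up another.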
Who Cares?
An interesting variety of groups and individuals are deeply interested in dealing
with locale-related issues. Library of Congress software developers must deal
with a staggering range of language and script issues. SIL International, producer
of the Ethnologue (a comprehensive collection of information about the languages
of the world), has a huge need for locale-related support, particularly due to
its work with minority languages. Government agencies in multilingual regions
must produce information which can be read by their various constituents. Commercial
software developers create products that are used around the world. Companies
from many industries want to sell products in other countries over the Web and
need to offer information to potential customers in the most usable way. Each
of these groups has a slightly different view on what is important, which makes
the challenge of creating a single workable locale model even bigger.
What Should Be Included?
Now that we've established that better methods for dealing with locales are
needed, let's take a closer look at what a locale is and isn't. Or perhaps
more accurately, what it should and shouldn't be.
Mark Davis, Chief Globalization Architect for IBM and President of the Unicode
Consortium, comments, "What people want is some structured way to indicate
and/or communicate a raft of information about a user's preferences. That
would presumably include the traditional features of a locale, such as how dates
are formatted, but may also include currency, time zone, preferred character set,
smoker/non-smoker, vegetarian or not, music preference, religion, party affiliation,
favorite charity, window or aisle seat and so on."
Given the nearly infinite degree to which preference categorization could occur,
just how far should a locale model go? Which of the myriad variables for localization
belong in a locale and which belong in some other structure? Should some variables
be associated with the locale, others with the data, and still others with the
application?
Language, country, and region identification. While language identification is
a critical piece in the locale management puzzle, a common concern is that language
codes need to be improved to reflect dialects and local rules. A number of languages
are not yet included in the ISO 639 language code standard. While new codes will
undoubtedly be added and while four-character codes are also being evaluated,
the standard still offers a fairly restricted language specification approach.
To complicate matters, many people believe that the concept of regions would be
extremely useful in many situations, perhaps more so even than country codes.
Countries within a particular area of the world often have elements in common
with relatively small variations. Examples of regions are North America, Central
America, South America, Western Europe, Central Asia, Western Asia, South Asia,
the Pacific, North Africa and so on. But regional classifications can get tricky
because definitions can be geographic, economic, political and so forth, with
each classification potentially introducing variations in currency, time zone
or other variables.
Clearly, it's a complicated subject.
User preference attributes. Davis' comment raises an obvious question: how far should
a "locale" preference definition go? What attributes should be included?
Some probable inclusions in a locale-related model are:
Language
Dialect
Script
Personal name format
Address format
Date/time format
Calendar
Time zone
Data casing (upper/lower case)
Telephone number format
Currency format
Measurement format
Numeric format
Collation/sorting
Graham Rhind comments in relation to locale attributes: "I cannot think of
a single good reason why any of these unique characteristics should be inextricably
linked to any other in any way or standard whatsoever, nor why they should be
made unchangeable."
Given this view, a number of issues must be addressed. For example, should every
attribute be modifiable? How many attributes are needed for a minimal default
locale description? Should every attribute be required to have a value, and should
a null value be allowed? Is a fallback mechanism required for handling missing or invalid attributes?
Should a minimal set of default attributes be required?
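One possible answer to the fallback question can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a proposal from the discussion list: the attribute names, the function name and the particular default values are all hypothetical. It treats a null (None) value as "not set" and falls back to a small set of defaults.

```python
# Hypothetical minimal defaults; the actual values would be up to a standard.
MINIMAL_DEFAULTS = {
    "language": "en",
    "date_format": "YYYY-MM-DD",
    "currency": "USD",
}


def resolve(attribute, user_prefs, defaults=MINIMAL_DEFAULTS):
    """Return the user's value for an attribute, else the default for it."""
    value = user_prefs.get(attribute)
    if value is not None:
        # The user set this attribute explicitly, so it wins.
        return value
    if attribute in defaults:
        # Missing or null: fall back to the minimal default.
        return defaults[attribute]
    raise KeyError(f"no value or default for {attribute!r}")


prefs = {"language": "es", "currency": None}
print(resolve("language", prefs))
print(resolve("currency", prefs))
```

Each attribute is resolved on its own, so changing the language has no effect on the currency or the date format.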
In addition to answering these questions, relationship subcategories may need
to be defined. For example, there may not be any relationship between time zone
and language, but dialect is definitely a subcategory of language and may not
merit a separate category.
When one is generating a locale framework, the issues to be addressed are really
quite complex.
Suggested Approaches
As the on-line locale discussion progresses, a few approaches have been offered
as starting points. Carl W. Brown, president of X.Net, Inc., discusses single-
and multiple-parameter approaches. He describes a single-parameter locale specification
which supplies more information than the existing methods. For example:
es-mx_US.iso-8859-1#America/Los_Angeles
Since many applications would require a lot of change to track multiple parameters
rather than a single locale specification, some people feel that this is the best
approach.
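To show what an application would have to do with such a string, here is a small Python sketch that takes the single-parameter example apart. The grammar is an assumption inferred from that one example (language, optional variant after a hyphen, region after an underscore, character set after a period, time zone after a number sign); the article does not define a formal syntax, and the field names are hypothetical.

```python
import re

# Assumed layout: language[-variant]_REGION.charset#time_zone
SINGLE_SPEC = re.compile(
    r"^(?P<language>[a-z]+)(?:-(?P<variant>[a-z]+))?"
    r"_(?P<region>[A-Z]+)"
    r"\.(?P<charset>[^#]+)"
    r"#(?P<time_zone>.+)$"
)


def parse_single(spec):
    """Split a single-parameter locale string into named fields."""
    match = SINGLE_SPEC.match(spec)
    if match is None:
        raise ValueError(f"unrecognized locale specification: {spec!r}")
    return match.groupdict()


print(parse_single("es-mx_US.iso-8859-1#America/Los_Angeles"))
```

The application still passes around one string, but the language variant and the region no longer have to share a single field.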
Brown also describes using separate parameters which could be treated independently
to avoid problems like the ones resulting from the current language_country coupling.
An example is
lang_es-mx#loc_US#tz_America/Los_Angeles#char_iso-8859-1
or
lang[es-mx]#loc[US]#tz[America/Los_Angeles]#char[iso-8859-1]
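The two notations for separate parameters carry the same information, which a short Python sketch can demonstrate. The parsing rules here are assumptions based only on the two example strings: parameters are separated by a number sign, and each one is either key_value (split at the first underscore) or key[value]. The function name is hypothetical.

```python
def parse_multi(spec):
    """Parse a '#'-separated list of key_value or key[value] parameters."""
    params = {}
    for part in spec.split("#"):
        if part.endswith("]") and "[" in part:
            # Bracket form, e.g. tz[America/Los_Angeles]
            key, _, value = part[:-1].partition("[")
        else:
            # Underscore form: split at the first underscore only, so that
            # values such as America/Los_Angeles keep their own underscore.
            key, _, value = part.partition("_")
        if not key or not value:
            raise ValueError(f"malformed parameter: {part!r}")
        params[key] = value
    return params


underscore_form = "lang_es-mx#loc_US#tz_America/Los_Angeles#char_iso-8859-1"
bracket_form = "lang[es-mx]#loc[US]#tz[America/Los_Angeles]#char[iso-8859-1]"
print(parse_multi(underscore_form) == parse_multi(bracket_form))
```

Because each parameter is keyed by name, any one of them can be changed or omitted without touching the others.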
Possin suggests that locale parameters be full-text-based rather than limited
to codes of a few characters, as is currently the case. His view
is that translations of many full-text parameters already exist in operating systems,
programming languages, databases and so on, and they simply need to be standardized.
For example,
*lang#German#Bavarian# would be equivalent to *lang#Deutsch#Bayrisch#
*region#Germany#Bavaria# would be equivalent to *region#Deutschland#Bayern#
*tz#Middle_European# would be equivalent to *tz#mitteleuropäisch#
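The equivalences above amount to a lookup from localized names to one canonical spelling. The Python sketch below illustrates the idea with a hand-written table containing only the names from these three examples; the table, the choice of English as the canonical form and the function name are all assumptions made for illustration. A real implementation would presumably draw on the translations that Possin notes already exist in operating systems and databases.

```python
# Hypothetical table: localized name (case-folded) -> canonical English name.
CANONICAL = {
    "deutsch": "German",
    "bayrisch": "Bavarian",
    "deutschland": "Germany",
    "bayern": "Bavaria",
    "mitteleuropäisch": "Middle_European",
}


def canonicalize(spec):
    """Rewrite a full-text parameter such as '*lang#Deutsch#Bayrisch#'
    using canonical names, leaving unknown fields unchanged."""
    fields = spec.strip("*#").split("#")
    mapped = [CANONICAL.get(field.casefold(), field) for field in fields]
    return "*" + "#".join(mapped) + "#"


print(canonicalize("*lang#Deutsch#Bayrisch#"))
print(canonicalize("*region#Deutschland#Bayern#"))
print(canonicalize("*tz#mitteleuropäisch#"))
```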
Another suggestion runs along an entirely different line. A fear had been expressed
that even if a standard were created, there would be no "industry-strength
implementation" to provide a delivery mechanism for it. To deal with this
situation, one person commented that he hoped to eventually see a Web "preference
server" which would define a "preferences content model." This
approach, however, relies on an implementer to develop a solution, rather than
having a cross-industry standardized method for locale handling.
Conclusion
Building a better locale mousetrap is a complex problem. Existing platforms, operating
systems, programming languages and markup languages all handle locales in slightly
different ways. Microsoft methods are different from Linux methods, and HTML handling
is different from XML handling. Because of this, even when a sound framework is developed,
widespread implementation will still be a bit down the road.
As Barry Caplan, founder of www.i18n.com, states, "one of the strengths of
Unicode over the years has been the early recognition that the scheme decided
upon would need to be extensible without breaking. I think at least as much thought
needs to be given up front to the locale issue if we expect people to adopt it
as widely as Unicode."
The issue of locales was discussed at the World Wide Web Consortium (W3C) Internationalization Workshop held in February 2002. Many participants felt that locale is an area that would benefit from W3C attention; however, as of the writing of this article,
it is not yet known whether the consortium will choose to take it on. If this
eventually does happen, it would go a long way in helping to ensure widespread
adoption of an agreed-upon locale framework.
In the meantime, we'll all just keep trying to figure out what to call it,
what to include in it, and how to deal with all the other pesky problems of delivering
information electronically to diverse populations around the world. 