(Where xxxxxx represents the encoding your HTML page is in.)
Terminology relating to character sets and encodings is confusing. The two things are not the same, but the terms are often used interchangeably. Unicode is not an encoding method. It is a character set, and UTF-8 is the encoding method most commonly used with it. Similarly, GB 2312 is a character set, and ISO-2022-CN is an encoding method that can carry it. Big5, by contrast, names both a Traditional Chinese character set and the encoding normally used with it.
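The difference between a character set and an encoding can be seen in a few lines of Python. This is a minimal sketch that is not part of the original article; it assumes only the codecs that ship with Python's standard library. A single character has one Unicode code point, but each encoding method turns it into a different sequence of bytes.

```python
# One character, one Unicode code point, several possible byte encodings.
char = "中"

code_point = ord(char)                   # position in the Unicode character set
utf8_bytes = char.encode("utf-8")        # Unicode encoded as UTF-8
utf16_bytes = char.encode("utf-16-be")   # Unicode encoded as UTF-16 (big-endian)
big5_bytes = char.encode("big5")         # Python's big5 codec, a native encoding

print(f"code point: U+{code_point:04X}")
for label, data in [("utf-8", utf8_bytes), ("utf-16-be", utf16_bytes), ("big5", big5_bytes)]:
    print(f"{label:10} {data.hex()}")
```

The code point stays the same in every case; only the bytes on the wire change, which is why the browser must be told which encoding a page uses.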
There are two basic approaches to encoding. Native encoding refers to the individual standards for a specific language or set of languages, such as ISO-2022-CN or Big5 for Chinese and KOI8-R for Russian. UTF-8 encoding is used with Unicode. The choice of encoding scheme is very important. Both approaches have advantages and disadvantages, and your selection should be based on the browsers and versions that you plan to support for your Web site.
Browser support. Web browsers such as Netscape Communicator 4.0 and Microsoft Internet Explorer 4.0 are built on Unicode internally.
Internet Explorer 5 has advanced language support. If your target users use it, you can specify UTF-8 for all Web pages and not have to worry about language-specific code pages.
Other browsers offer lesser levels of Unicode support. You may have to specify individual charsets for each target language (in other words, use native encodings).
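The choice described above can be sketched as a small helper that picks a charset label and builds the corresponding META element. This is an illustrative sketch only: the `NATIVE_CHARSETS` table, the function names and the `browser_handles_utf8` flag are hypothetical, and how you detect a browser's capabilities is left out entirely.

```python
# Hypothetical mapping from a language tag to a native charset label.
NATIVE_CHARSETS = {
    "zh-TW": "big5",
    "zh-CN": "gb2312",
    "ja": "shift_jis",
    "ru": "koi8-r",
}


def charset_for(language: str, browser_handles_utf8: bool) -> str:
    """Use UTF-8 when the target browser copes with it, else a native encoding."""
    if browser_handles_utf8:
        return "utf-8"
    # Fall back to ISO-8859-1 for languages missing from the table (an assumption).
    return NATIVE_CHARSETS.get(language, "iso-8859-1")


def meta_tag(charset: str) -> str:
    """Build the META element that declares the page's encoding."""
    return f'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset={charset}">'


print(meta_tag(charset_for("ru", browser_handles_utf8=False)))
print(meta_tag(charset_for("ru", browser_handles_utf8=True)))
```

Whatever label is chosen, the page itself must actually be saved in that encoding; the META element only declares it.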
Fonts. The third piece in the character display puzzle is fonts. If you specify fonts that don’t exist on your users’ systems, it won’t matter whether their browser supports the encoding method you choose: they still will not be able to view the text. There are a variety of potential solutions to this problem.
One way to make sure users have the required fonts is to distribute a full set once so that they can be installed on user systems. The downside to this approach is file size. Most Unicode fonts are several megabytes in size.
Glyph servers are proxy servers that substitute in-line bitmaps for non-ASCII characters in the current page. The glyph server retrieves a document and then parses the HTML to replace non-displayable characters with an <IMG> element. Each <IMG> element points to a bitmap image of the glyph. The client eventually receives the edited HTML along with all the new images. The resulting display is fairly accurate, but retrieval time is long, and the text can no longer be treated as text since it is now stored graphically.
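The substitution step a glyph server performs might look roughly like the following. This is a deliberately naive sketch, not a description of any real glyph server: the `GLYPH_BASE` path and the file naming scheme are hypothetical, and the regular expression does not distinguish text from tag attributes the way a real HTML parser would.

```python
import re

# Hypothetical URL prefix where pre-rendered glyph bitmaps would be stored.
GLYPH_BASE = "/glyphs"


def replace_non_ascii(html: str) -> str:
    """Swap every non-ASCII character for an IMG element pointing at its bitmap."""

    def to_img(match: re.Match) -> str:
        code_point = ord(match.group(0))
        return f'<IMG SRC="{GLYPH_BASE}/{code_point:04X}.gif" ALT="U+{code_point:04X}">'

    # Anything outside the 7-bit ASCII range is treated as non-displayable.
    return re.sub(r"[^\x00-\x7F]", to_img, html)


print(replace_non_ascii("<P>caf\u00e9</P>"))
```

The output makes the trade-off visible: every replaced character becomes a separate image request, and the original text can no longer be searched or copied.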
Embedding fonts allows you to send fonts with individual Web pages. Unfortunately, different browsers supply varying levels of support for embedded fonts. Some browsers control font display themselves, while others rely more heavily on the operating system’s font display handling.
Microsoft provides embedding information and also provides a font embedding SDK for downloading at www.microsoft.com/typography/web/default.htm.
The World Wide Web Consortium (W3C) has developed a font acquisition approach for dealing with fonts through Cascading Style Sheets (CSS). Their solution gives user agents four ways to select fonts for HTML elements. A font listed in a style sheet can be matched exactly to a font installed on the system; matched to a similar font if the specified font does not exist on the system; downloaded if a match can’t be made and a URL for downloading is included; or created or synthesized as needed, based on the font’s description in the style sheet.
The CSS Level 2 Specification (www.w3.org/TR/REC-CSS2/fonts.html) describes this font matching process in detail.
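The four options can be pictured as a fallback chain. The sketch below follows the order in which this article lists them, which is not necessarily the order the specification prescribes, and it is not how any particular browser implements font matching. The function name, its parameters and the sample data are all hypothetical.

```python
def select_font(requested, installed, similar, download_urls):
    """Return (strategy, result) for a requested font name.

    installed     -- set of font names available on the system
    similar       -- dict mapping a font name to an acceptable substitute
    download_urls -- dict mapping a font name to a URL given in the style sheet
    """
    if requested in installed:
        return ("exact match", requested)
    substitute = similar.get(requested)
    if substitute in installed:
        return ("similar font", substitute)
    if requested in download_urls:
        return ("download", download_urls[requested])
    # Nothing else worked: build a stand-in from the style sheet's description.
    return ("synthesize", requested)


installed = {"Arial", "Times New Roman"}
similar = {"Helvetica": "Arial"}
download_urls = {"ExampleSans": "http://www.example.com/fonts/examplesans"}

for name in ["Arial", "Helvetica", "ExampleSans", "UnknownFont"]:
    print(name, "->", select_font(name, installed, similar, download_urls))
```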
URLs
URLs in the United States are composed of words that convey concepts rather than merely being alphanumeric identifiers. Unfortunately, URLs are limited to a subset of about 60 ASCII characters. English-speaking users are content because they can create and read understandable URLs with ease. But that’s where the ease of use ends. Speakers of other Latin-based languages must do without extended characters (for example, á, ü and ç), while people who read non-Latin-based writing systems such as Chinese or Arabic are forced to deal with incomprehensible URLs.
The Internet Society’s URI Generic Syntax (RFC2396) document (ftp://ftp.isi.edu/in-notes/rfc2396.txt) provides more information about proper formatting of URLs.
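To see what the restriction means in practice, the following sketch escapes a path containing extended characters using Python's standard `urllib.parse` module. The path itself is made up for illustration. Note that `quote` assumes UTF-8 when converting characters to bytes; RFC 2396 itself does not say which character encoding the escaped bytes represent.

```python
from urllib.parse import quote, unquote

# A made-up path containing characters outside the permitted ASCII subset.
path = "/café/menü"

escaped = quote(path)        # non-ASCII characters become %XX escapes (UTF-8 bytes)
restored = unquote(escaped)  # decoding the escapes gives back the original text

print(escaped)
print(restored)
```

The escaped form is valid in a URL but no longer readable, which is exactly the usability problem described above.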
If you choose to set up a unique URL for each language, you should consider registering country-specific domains, for example, .fr for France and .it for Italy. idNames International Web Address Services (www.idnames.com/) can register all of your international URLs for you, or you can register each one individually by working with the registrars for specific domains.
You can find country codes and specific registrars on Web sites provided by the Internet Assigned Numbers Authority (www.iana.org/) and by the Internet Corporation for Assigned Names and Numbers (www.icann.org/).
Testing
A critical element for delivering localized Web sites is testing them to make sure they look good and work properly for the intended users. Several different levels of testing should be used for localized sites.
Linguistic review testing checks for language problems.
User interface validation testing checks for visual problems such as text truncation, overlap or graphics issues.
Functional testing ensures that no functionality was broken during the localization process.
Interoperability testing ensures that the site works as expected with targeted platforms, operating systems, browsers (and versions), applications (and versions), equipment and so on. It focuses on the interchange between the localized components and other pieces.
Given the variation of language support provided by various browsers, interoperability testing is a critical element for ensuring that target users can read your localized Web sites.
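One small check that can be automated is whether a page's bytes actually decode under the charset the page declares. The sketch below is an assumption-laden illustration rather than a complete test: the function names are hypothetical, it looks only for a `charset=` label near the top of the file, and a clean decode does not prove that the text will display correctly in any given browser.

```python
import re


def declared_charset(raw: bytes):
    """Return the charset label found near the top of the page, or None."""
    # The label itself is ASCII, so a lenient ASCII decode is enough to find it.
    head = raw[:1024].decode("ascii", errors="replace")
    match = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else None


def decodes_cleanly(raw: bytes) -> bool:
    """True if the page declares a charset and its bytes decode under it."""
    charset = declared_charset(raw)
    if charset is None:
        return False
    try:
        raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Bytes do not match the label, or Python has no codec by that name.
        return False
    return True


page = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"><P>中文</P>'
print(decodes_cleanly(page.encode("utf-8")))  # label and bytes agree
print(decodes_cleanly(page.encode("big5")))   # label says utf-8, bytes are Big5
```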
An agreement must be reached between the localization company and the customer about who will perform each type of testing. As part of this agreement, the localization company should detail the types of activities to be performed for each level so that the customer understands what it is getting and, more importantly, what it is not getting.
These are some of the issues you will need to address when developing a multilingual Web site. Before beginning this type of project, however, you should research each of these issues to get the detailed information you need for your specific site. 
Suzanne Topping is owner of Localization Unlimited and can be reached at stopping@rochester.rr.com
This article is reprinted from #30 Volume 11 Issue 2 of MultiLingual Computing & Technology published by MultiLingual Computing, Inc., 319 North First Ave., Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.