MultiLingual Computing, Inc., Magazine
Featured Article
Thursday, September 2, 2010


A Primer for Building Multilingual Web Sites

HTML translation is only the first of many issues to consider in Web localization


SUZANNE TOPPING


You have undoubtedly read statistics about the percentage of Web users from outside the United States. The Internet boom and the new global economy have created an environment where increasing numbers of companies want to create multilingual Web sites. Unfortunately, many of these companies think that Web localization equals HTML translation.

This article defines some aspects of localization that you need to consider in addition to translation of HTML.

E-commerce Issues

If you will conduct sales on the Web site, you must consider a number of factors. For example, how will payments be made? Will you list prices in local currencies or provide a currency converter? How will products be delivered? What are the legal issues and regulatory requirements for various countries? How will products be supported in other regions? What are the export restrictions for your product type?

These are just a few examples of the types of concerns that you’ll need to address when selling products through your Web site. The best way to get the answers to these questions is to turn to experts. A variety of organizations can advise you about exporting issues or can help you develop international markets, and a number of companies provide financial services for dealing with overseas customers.

Internationalization

As with any item to be localized, internationalization is critical. Web sites can be created so that localization is either a dream or a nightmare.

A fundamental aspect of internationalization is the concept of separation of content and code. Content is the information that users are looking for on Web sites. Code is the set of directions for how to handle the content and perform functions. When content is intermixed with code, there is a pretty good chance that something will get broken during translation. Separating the content makes it easier for translators to focus on only the text that needs translation. This not only makes translation simpler, but also protects the code.

There are two basic structural approaches for dealing with content in Web sites: dynamic and static.

Dynamic Web sites store content in a database so that it is completely separate from the code. They tend to be more difficult to create originally, but are easier to maintain and update. A variety of commercial products is now available for creating and maintaining these multilingual databases.

Static sites are structured so that content is stored with the HTML rather than in a database. There are, however, still methods for separating the content from the code in static sites. For example, you can create HTML templates to define a page’s layout and then call in the content from an “include file.” This is done by using the #include statement:


<!--#include virtual="/CommonContent/ContactInfo.inc" -->

The include file here is ContactInfo.inc.

Using include files is particularly helpful when dealing with text that appears in multiple places throughout the site. The files get translated only once and then are called in as needed for each page.
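
As a rough illustration of the idea, the Python sketch below expands #include directives in a template with content chosen per language. The INCLUDE_FILES table, the expand_includes function and the sample strings are hypothetical stand-ins invented for this example; on a real site the Web server performs the substitution from files on disk.

```python
import re

# Hypothetical stand-in for translated include files, keyed by language
# code and then by the virtual path used in the template.
INCLUDE_FILES = {
    "en": {"/CommonContent/ContactInfo.inc": "Contact us at our main office."},
    "fr": {"/CommonContent/ContactInfo.inc": "Contactez notre bureau principal."},
}

# Matches the directive with or without a space after "<!--".
INCLUDE_DIRECTIVE = re.compile(r'<!--\s*#include\s+virtual="([^"]+)"\s*-->')


def expand_includes(template, language):
    """Replace each #include directive with the content for `language`."""
    files = INCLUDE_FILES[language]
    return INCLUDE_DIRECTIVE.sub(lambda match: files[match.group(1)], template)


# The layout (code) is written once; only the include content varies.
TEMPLATE = '<p><!--#include virtual="/CommonContent/ContactInfo.inc" --></p>'

if __name__ == "__main__":
    for lang in ("en", "fr"):
        print(lang, expand_includes(TEMPLATE, lang))
```

The template never changes between languages, so translators only ever touch the include content.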

When working with scripts, you can create sections labeled “To Be Translated” so that translators don’t have to search through the whole file when trying to figure out what to translate.
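
One possible way to mark such a section is sketched below in Python. The marker wording, the sample script and the translatable_lines function are assumptions made for this example, not a standard convention; the point is only that clearly delimited sections can be pulled out mechanically.

```python
from typing import List

# Hypothetical marker comments; any consistent wording would do.
BEGIN_MARKER = "// TO BE TRANSLATED: begin"
END_MARKER = "// TO BE TRANSLATED: end"

# A sample page script with its translatable strings grouped together.
SCRIPT_SOURCE = """\
function greet() {
  // TO BE TRANSLATED: begin
  var welcomeText = "Welcome to our site";
  var helpText = "Click here for help";
  // TO BE TRANSLATED: end
  document.write(welcomeText);
}
"""


def translatable_lines(source: str) -> List[str]:
    """Return the lines found between the begin and end markers."""
    inside = False
    found = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == BEGIN_MARKER:
            inside = True
        elif stripped == END_MARKER:
            inside = False
        elif inside:
            found.append(stripped)
    return found


if __name__ == "__main__":
    for line in translatable_lines(SCRIPT_SOURCE):
        print(line)
```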

Another good principle when developing Web sites is to include comments in your code. This helps the people dealing with files for translation to understand what is going on with the code.

Other Internationalization Issues

All of the standard software internationalization concepts apply to Web sites as well. For example, you’ll need to ensure that time, date, currency and other locale-specific formats are handled automatically through Natural Language Support (NLS) data provided by the user’s system or are at least modifiable at the time of localization. Win32 NLS APIs can be used to access this data and to provide functions such as formatting and sorting.
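
The sketch below shows, in Python, what "modifiable at the time of localization" can look like when formats are kept in data rather than in code. The LOCALE_FORMATS table is a hand-written, hypothetical simplification covering two locales; a production site would take these conventions from the system's locale data rather than hard-coding them.

```python
from datetime import date

# Hypothetical, simplified locale table for illustration only.
LOCALE_FORMATS = {
    "en-US": {"date": "{m:02d}/{d:02d}/{y}", "group": ",", "decimal": ".",
              "money": "${amount}"},
    "de-DE": {"date": "{d:02d}.{m:02d}.{y}", "group": ".", "decimal": ",",
              "money": "{amount} €"},
}


def format_date(day, locale_code):
    """Format a date using the day/month order of the given locale."""
    pattern = LOCALE_FORMATS[locale_code]["date"]
    return pattern.format(d=day.day, m=day.month, y=day.year)


def format_money(value, locale_code):
    """Format an amount with the locale's separators and currency pattern."""
    fmt = LOCALE_FORMATS[locale_code]
    # Start from a neutral "1,234.50" form, then swap in locale separators.
    whole, fraction = f"{value:,.2f}".split(".")
    whole = whole.replace(",", fmt["group"])
    return fmt["money"].format(amount=whole + fmt["decimal"] + fraction)


if __name__ == "__main__":
    for code in LOCALE_FORMATS:
        print(code, format_date(date(2000, 3, 15), code), format_money(1234.5, code))
```

Adding a locale then means adding a table entry, not editing the formatting code.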

One of the earliest decisions you’ll need to make is how users will access the language that they need. Language selection can be automatic or manual.

Automatic selection. Language selection can be automated through the use of Transparent Content Negotiation (TCN), which was developed as an HTTP extension by the Internet Engineering Task Force. The Internet Society’s document Transparent Content Negotiation in HTTP (RFC2295) (ftp://ftp.isi.edu/in-notes/rfc2295.txt) describes content negotiation in detail.

With TCN, each user must have a TCN-enabled browser and must configure it with a language preference. Each time the browser sends a request to the server, it specifies the language preference by including an Accept-Language field in the HTTP request header.

The server must also be configured for content negotiation so that when it reads the header, it can respond with a list of available language options which the browser can review against its preference settings.
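
To make the header concrete, here is a minimal Python sketch of how a server might read an Accept-Language value and pick a language. It is a simplified server-side choice, not the full RFC 2295 exchange described above, and the function names, the sample header and the set of available languages are assumptions made for this example.

```python
def parse_accept_language(header):
    """Return (language tag, quality) pairs, most preferred first."""
    preferences = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # a tag without "q=" has the default quality of 1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not wanted"
        preferences.append((tag.strip().lower(), quality))
    # The sort is stable, so equally weighted tags keep the user's order.
    preferences.sort(key=lambda pair: pair[1], reverse=True)
    return preferences


def choose_language(header, available, default):
    """Pick the first acceptable language the site actually offers."""
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # fall back from "fr-ca" to "fr"
        if primary in available:
            return primary
    return default


if __name__ == "__main__":
    site_languages = {"en", "fr", "it"}
    print(choose_language("fr-CA, fr;q=0.8, en;q=0.5", site_languages, "en"))
```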

Manual selection. If you don’t want to deal with the complexities of TCN, you can leave it to users to select languages. One approach to dealing with the language selection issue is to have a separate URL for each language version of a site. (URLs are discussed in greater detail later in this article.)

If maintaining multiple URLs is not the approach you want to take, you will need to find a way to show users their language options and let them make a selection. The best method for indicating available languages, however, is hotly debated.

Using flags as language indicators may initially seem like a good idea, but you may find that the approach creates more problems than the graphic value is worth. Flags can be offensive for a number of reasons, such as political sensitivity (for example, in Taiwan) or geographical inaccuracy (for example, with French-speaking Canadians).

The method that is likely to create the least confusion is to list each language in its own language (for example, Français, Italiano, Español). In order to prevent character display problems, you may want to store the language strings as images.

Character Display

Getting characters to display properly around the world is a major challenge. Three main components that contribute to correct display are character encoding, browser support and fonts.

Character encoding is a complex subject which has been the topic of previous articles in MultiLingual Computing & Technology.

Each HTML page should define the encoding method in the <HEAD> section. For example, the entry should read:


<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=xxxxxx">

(Where xxxxxx represents the encoding your HTML page is in.)
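
When checking localized pages, it can help to confirm that each one actually carries this declaration. The Python sketch below is one way to read it back; the declared_charset function, its regular expression and the sample page are assumptions for this example, and the pattern only handles the attribute order shown above.

```python
import re

# Looks for the Content-Type META element and captures the charset value.
META_CHARSET = re.compile(
    r'<meta\s+http-equiv=["\']content-type["\']\s+'
    r'content=["\'][^"\']*charset=([\w.:-]+)',
    re.IGNORECASE,
)


def declared_charset(html):
    """Return the charset named in the page's META element, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None


SAMPLE_PAGE = (
    "<HTML><HEAD>"
    '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">'
    "</HEAD><BODY>Bonjour</BODY></HTML>"
)

if __name__ == "__main__":
    print(declared_charset(SAMPLE_PAGE))
```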

Terminology relating to character sets and encoding is confusing. The two things are not the same, but the terms are often used inaccurately. Unicode is not an encoding method. It is a character set, and UTF-8 is the encoding method which is commonly used with it. GB 2312 is likewise a character set, and EUC-CN and ISO-2022-CN are encoding methods which can be used for it.
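
The distinction is easier to see with actual bytes. The Python sketch below takes one Chinese character and serializes it with several codecs from Python's standard library: two encodings of Unicode (UTF-8 and UTF-16) and, as a further example chosen for this sketch, two encodings of the GB 2312 character set (EUC-CN, which Python names "gb2312", and HZ). The encoded_forms helper is a name invented here.

```python
CHARACTER = "中"  # Unicode code point U+4E2D

# Codec names as used by Python's standard library.
CODECS = ("utf-8", "utf-16-be", "gb2312", "hz", "big5")


def encoded_forms(text):
    """Return the byte sequence each codec produces for the same text."""
    return {name: text.encode(name) for name in CODECS}


if __name__ == "__main__":
    print(f"code point: U+{ord(CHARACTER):04X}")
    for name, data in encoded_forms(CHARACTER).items():
        print(f"{name:10} {data.hex()}")
```

The character is the same in every row; only the byte representation changes, which is why a page must say which encoding it uses.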

There are two basic approaches for encoding. Native encoding refers to the individual standards for a specific language or set of languages, such as GB 2312 for Simplified Chinese, Big5 for Traditional Chinese or KOI8-R for Russian. UTF-8 encoding is used with Unicode. The choice of which encoding scheme to use is very important. Both approaches have advantages and disadvantages, and your selection should be based on the browsers and versions that you plan to support for your Web site.

Browser support. Web browsers such as Netscape Communicator 4.0 and Microsoft Internet Explorer 4.0 use Unicode as their internal basis for handling text.

Internet Explorer 5 has advanced language support. If your target users use it, you can specify UTF-8 for all Web pages and not have to worry about language-specific code pages.

Other browsers offer lesser levels of Unicode support. You may have to specify individual charsets for each target language (in other words, use native encodings).

Fonts. The third piece in the character display puzzle is fonts. If you define fonts that don’t exist on your users’ systems, it won’t matter whether the encoding method you choose is supported by their browser or not. They still will not be able to view the text. There are a variety of potential solutions to this problem.

One way to make sure users have the required fonts is to distribute a full set once so that they can be installed on user systems. The downside to this approach is file size. Most Unicode fonts are several megabytes in size.

Glyph servers are proxy servers that substitute in-line bitmaps for non-ASCII characters in the current page. The glyph server retrieves a document and then parses the HTML to replace non-displayable characters with an <IMG> element. Each <IMG> element points to a bitmap image of the glyph. The client eventually receives the edited HTML along with all the new images. The resulting display is fairly accurate, but retrieval time is long, and the text can no longer be treated as text since it is now stored graphically.
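
The substitution step can be sketched in a few lines of Python. This is a deliberately simplified illustration: it works on already-decoded text rather than parsing full HTML, and the substitute_glyph_images function, the /glyphs path and the file naming scheme are all hypothetical.

```python
def substitute_glyph_images(text, image_base="/glyphs"):
    """Replace each non-ASCII character with an <IMG> element for its glyph."""
    parts = []
    for character in text:
        code_point = ord(character)
        if code_point < 128:
            parts.append(character)  # plain ASCII passes through untouched
        else:
            # Hypothetical naming scheme: one bitmap per code point.
            parts.append(
                f'<IMG SRC="{image_base}/{code_point:04X}.gif" '
                f'ALT="U+{code_point:04X}">'
            )
    return "".join(parts)


if __name__ == "__main__":
    print(substitute_glyph_images("Café"))
```

Even this tiny example shows the cost noted above: one accented letter becomes a separate image request, and the word can no longer be searched or copied as text.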

Embedding fonts allows you to send fonts with individual Web pages. Unfortunately, different browsers supply varying levels of support for embedded fonts. Some browsers control font display themselves, while others rely more heavily on the operating system’s font display handling.

Microsoft provides embedding information and also provides a font embedding SDK for downloading at www.microsoft.com/typography/web/default.htm.

The World Wide Web Consortium (W3C) has developed a font acquisition approach for dealing with fonts through Cascading Style Sheets (CSS). Their solution gives user agents four ways to select fonts for HTML elements. Fonts listed in style sheets can be matched exactly to fonts installed on the system or to a similar font if the specified font does not exist on the system; fonts can be downloaded if a match can’t be made and if a URL for downloading is included; or fonts can be created or synthesized as needed, based on the font’s description in the style sheet.

The CSS Level 2 Specification (www.w3.org/TR/REC-CSS2/fonts.html) describes this font matching process in detail.
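
The four options can be pictured as a fallback chain. The Python sketch below follows the order in which they are listed in this article; it is not the matching algorithm from the specification, and the select_font function and its arguments, font names and URL are hypothetical.

```python
def select_font(requested, installed, similar, download_urls):
    """Return (strategy, detail) for a font named in a style sheet."""
    if requested in installed:
        return ("exact match", requested)
    # Try fonts considered close enough to stand in for the requested one.
    for candidate in similar.get(requested, []):
        if candidate in installed:
            return ("similar font", candidate)
    if requested in download_urls:
        return ("download", download_urls[requested])
    # Last resort: build a substitute from the font's description.
    return ("synthesize", requested)


if __name__ == "__main__":
    installed_fonts = {"Arial", "Times New Roman"}
    similar_fonts = {"Helvetica": ["Arial"]}
    urls = {"Excelsior": "http://www.example.com/fonts/excelsior"}
    for name in ("Arial", "Helvetica", "Excelsior", "Unknown Sans"):
        print(name, select_font(name, installed_fonts, similar_fonts, urls))
```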

URLs

URLs in the United States are composed of words which convey concepts rather than merely being alphanumeric identifiers. Unfortunately, URLs are limited to a subset of about 60 ASCII characters. English-speaking users are content because they can create and read understandable URLs with ease. But that’s where the ease of use ends. Speakers of other Latin-based languages must do without extended characters (for example, á, ü and ç), while people who read non-Latin-based writing systems such as Chinese or Arabic are forced to deal with incomprehensible URLs.
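
A short Python sketch shows what happens when such characters are forced into the ASCII subset through percent-escaping. It assumes the characters are first converted to UTF-8 bytes, which is a choice made for this example rather than something the URL syntax itself dictates; ascii_only_path is a name invented here.

```python
from urllib.parse import quote


def ascii_only_path(path):
    """Escape every character outside the URL-safe ASCII subset."""
    # quote() converts the text to UTF-8 bytes by default and writes each
    # byte that is not allowed as a %XX escape; "/" is left as a separator.
    return quote(path, safe="/")


if __name__ == "__main__":
    for path in ("/products/coffee", "/produits/café", "/产品"):
        print(path, "->", ascii_only_path(path))
```

The English path survives unchanged, the French one gains an escape sequence, and the Chinese one becomes unreadable.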

The Internet Society’s URI Generic Syntax (RFC2396) document (ftp://ftp.isi.edu/in-notes/rfc2396.txt) provides more information about proper formatting of URLs.

If you choose to set up a unique URL for each language, you should consider registering country-specific domains, for example, .fr for France and .it for Italy. idNames International Web Address Services (www.idnames.com/) can register all of your international URLs for you, or you can register each one individually by working with the registrars for specific domains.

You can find country codes and specific registrars on Web sites provided by the Internet Assigned Numbers Authority (www.iana.org/) and by the Internet Corporation for Assigned Names and Numbers (www.icann.org/).

Testing

A critical element for delivering localized Web sites is testing them to make sure they look good and work properly for the intended users. Several different levels of testing should be used for localized sites.

Linguistic review testing checks for language problems.

User interface validation testing checks for visual problems such as text truncation or overlap, graphics issues or other visual problems.

Functional testing ensures that no functionality was broken during the localization process.

Interoperability testing ensures that the site works as expected with targeted platforms, operating systems, browsers (and versions), applications (and versions), equipment and so on. It focuses on the interchange between the localized components and other pieces.

Given the variation of language support provided by various browsers, interoperability testing is a critical element for ensuring that target users can read your localized Web sites.
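
Some of the user interface checks described above can be partly automated before a human reviewer looks at the pages. The Python sketch below flags translated strings that exceed a length budget; the string keys, the limits and the find_truncation_risks function are hypothetical, and character count is only a rough proxy for rendered width.

```python
def find_truncation_risks(translations, max_lengths):
    """Return (key, actual length, limit) for each string over its limit."""
    problems = []
    for key, text in translations.items():
        limit = max_lengths.get(key)
        if limit is not None and len(text) > limit:
            problems.append((key, len(text), limit))
    return problems


if __name__ == "__main__":
    french = {"button.search": "Rechercher", "button.ok": "OK"}
    limits = {"button.search": 8, "button.ok": 4}
    for key, length, limit in find_truncation_risks(french, limits):
        print(f"{key}: {length} characters, limit is {limit}")
```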

An agreement must be reached between the localization company and the customer about who will perform each type of testing. As part of this agreement, the localization company should detail the types of activities to be performed for each level so that the customer understands what it is getting and, more importantly, what it is not getting.

These are some of the issues you will need to address when developing a multilingual Web site. Before beginning this type of project, however, you should research each of these issues to get the detailed information you need for your specific site.




Suzanne Topping is owner of Localization Unlimited and can be reached at stopping@rochester.rr.com


This article is reprinted from #30, Volume 11, Issue 2 of MultiLingual Computing & Technology, published by MultiLingual Computing, Inc., 319 North First Ave., Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.

March, 2000


 
     

 


webmaster@multilingual.com ©1998-2010, Copyright MultiLingual Computing, Inc. No duplication or reproduction without expressed written permission.