Multilingual site development, Part II: the lang-attribute

A while back I started writing about this topic. Now, reading through the RDF-IG mailing list archives, I came across some discussion on how to use the xml:lang-attribute in RDF documents. HTML documents have a similar attribute, lang, that can be used.

In many cases, the language of an HTML document is set in the opening html-tag, as generally the whole document will be in one language. In some cases, though, parts of a document are in other languages, and identifying those languages would be semantically correct. However, I seriously doubt that most authors do so. One of the problems that authors face when using the lang-attribute (in either or both forms) is deciding when to use it.
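As a minimal sketch of both cases, the hypothetical page below declares English on the html element and marks a single Finnish word inline. The page content is invented for illustration. In XHTML served as XML, the xml:lang-attribute would be given alongside lang.

```html
<!DOCTYPE html>
<!-- Hypothetical page: the root element declares the default language. -->
<html lang="en">
<head>
<meta charset="utf-8">
<title>Hypothetical example page</title>
</head>
<body>
<!-- The span overrides the document language for one word only. -->
<p>The Finnish for "thank you" is <span lang="fi">kiitos</span>.</p>
</body>
</html>
```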

The RDF-IG discussion raised several points that cause problems when even thinking about using the attribute. Should authors, for example, mark up the word internet in Finnish documents as a word in English? This, of course, is at heart a linguistic matter: when does a foreign word become a word in another language? As a strictly technical matter, one also has to decide how to represent the value of the lang-attribute. While the syntax of the value is specified by RFC 3066, its semantic meaning is still unclear. The example used on the RDF-IG mailing list was en-IT: does it mean English as used in an Italian context (e.g. using an Italian dialect when spoken by a speech reader) or English as used by native speakers in Italy?

Whatever the technical difficulties are in using the attribute, it does have some substantial benefits. Most people probably know that you can limit the pages that Google uses as search results by the language used in the page. Specifying that a snippet in a document is in a different language allows the snippet to be correctly identified, so that the page can be returned as a search result for that language as well.

For example, we specify all of the pages in the Life of Jalo as being in English. But any visitor to the site will see that the caption of each picture is placed on the page in two languages, Finnish and English. The Finnish caption is in a paragraph that has the lang-attribute set to fi. Now searching Google with the word Jalo with only Finnish documents as the search scope will return about 113 000 hits as opposed to 263 000.
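The caption markup described above might look roughly like the fragment below. The caption text is made up for illustration and is not copied from the site; the surrounding page is assumed to declare lang="en" on its html element.

```html
<!-- Hypothetical caption pair; only the Finnish paragraph needs its own lang. -->
<p lang="fi">Jalo nukkuu</p>
<p>Jalo is sleeping</p>
```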

There are other benefits as well, but these relate to the styling of different elements in browsers that support the necessary CSS. You can use attribute selectors to give elements in different languages different styles, with rules such as

p[lang|=en] { background: white; }
p[lang|=fi] { background: blue; }

Or you can do even more advanced magic with the rules from an example in the CSS 2.1 spec on how to use the lang()-pseudo-class.

html:lang(fr-ca) { quotes: '« ' ' »' }
html:lang(de) { quotes: '»' '«' '\2039' '\203A' }
:lang(fr) > Q { quotes: '« ' ' »' }
:lang(de) > Q { quotes: '»' '«' '\2039' '\203A' }

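The difference between the two selectors is partly a matter of browser support, but they are not fully interchangeable. According to the CSS 2.1 specification, the attribute selector matches only elements that carry the lang-attribute themselves, whereas the lang()-pseudo-class also matches elements that inherit their language from an ancestor.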
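To compare the two selectors side by side, here is a small sketch of a hypothetical page. As I read CSS 2.1, the attribute selector looks only at an element's own lang-attribute, while the lang()-pseudo-class also takes an inherited language into account, so the two rules can match different paragraphs. The colours and text are arbitrary choices for illustration.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Selector comparison (hypothetical page)</title>
<style>
  /* Attribute selector: matches only paragraphs with their own lang. */
  p[lang|=fi] { background: yellow; }
  /* Pseudo-class: also matches paragraphs whose language is inherited. */
  p:lang(fi) { font-style: italic; }
</style>
</head>
<body>
<div lang="fi">
  <!-- Inherits Finnish from the div: italic only. -->
  <p>Kiitos</p>
  <!-- Carries its own lang attribute: italic and yellow. -->
  <p lang="fi">Kiitos</p>
</div>
<!-- English, inherited from the html element: neither rule applies. -->
<p>Thank you</p>
</body>
</html>
```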

But whatever the difficulties are in using the lang-attribute, its use is strongly recommended, all the more so as more and more tools become available that can react to different attributes of the markup. Think, for example, of a Firefox extension that would allow you to select a piece of text in a different language and try to translate it automatically for you. And of course, we mustn’t forget the effect that language information can have on accessibility.
