XML4PharmaWiki

— Jozef Aerts 2008/12/31 11:36

Internationalization in ODM

Author: Jozef Aerts, XML4Pharma

Applicable to: ODM version 1.3

Introduction

ODM version 1.3 adds a good number of features that support internationalization (i18n) and localization (L10n) in multi-language studies. In the study design part, labels and descriptions can be added in different languages at all levels of the study design. This makes it possible, for example, to generate eCRFs on the fly in each of the languages that the study uses, and to build an internationalized EDC or CDM system.

This “Use Case” document describes how internationalization can be implemented in ODM 1.3.

The “TranslatedText” element

The “TranslatedText” element plays a central role in the internationalization features of ODM 1.3. It is typically a child element of another element (see further), and it can be repeated to give the same text in several languages.

A first very simple example:

One sees four instances of the TranslatedText element, giving the text in four languages: English, French, German and Korean ¹⁾.
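As a minimal sketch of such a four-language example, the Python fragment below parses a hypothetical ItemDef and collects its translations by language tag. The OID, the name and the translated texts are assumptions chosen for illustration, and the ODM namespace (http://www.cdisc.org/ns/odm/v1.3 in real ODM 1.3 files) is left out to keep the snippet short.

```python
import xml.etree.ElementTree as ET

# Attribute key that ElementTree uses for xml:lang (the "xml" prefix is
# always bound to this namespace).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical snippet: OID, Name and translations are illustrative only.
SNIPPET = """
<ItemDef OID="IT.WEIGHT" Name="Weight" DataType="float">
  <Description>
    <TranslatedText xml:lang="en">Weight</TranslatedText>
    <TranslatedText xml:lang="fr">Poids</TranslatedText>
    <TranslatedText xml:lang="de">Gewicht</TranslatedText>
    <TranslatedText xml:lang="ko">체중</TranslatedText>
  </Description>
</ItemDef>
"""


def translations(parent):
    """Map each language tag to the text of its TranslatedText element."""
    return {t.get(XML_LANG): t.text for t in parent.iter("TranslatedText")}


item = ET.fromstring(SNIPPET)
texts = translations(item)
print(texts["de"])
```

A system that generates eCRFs could build such a dictionary once per element and then pick the entry for the language of the form.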

The “xml:lang” attribute describes the language the text is given in. It uses a two-letter code (ISO 639-1) or a three-letter code (ISO 639-2); the structure of the complete tag is defined in RFC 4646 (see Literature section). In some cases, one may add extra characters to specify the regional dialect or the script used. The value of the “xml:lang” attribute is usually designated as the “language tag”. Examples are:

[table to come as image]

Mostly, one uses two-letter codes, but the use of three-letter codes is not prohibited.

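It is important to note that the language code is conventionally written in lower case. RFC 4646 itself treats language tags as case-insensitive, but XML is case-sensitive and many applications compare attribute values as plain strings, so tags such as “FR” and “Fr” should be avoided!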

It must be remarked that the use of the “xml:lang” attribute is not mandatory. If it is absent, the TranslatedText element gives the text for the “default” language. What the “default” language is, is of course a matter of agreement between sender and receiver of the ODM file, so it is not automatically English. So one can have:

This can happen when the study was originally designed to run only in France (with French agreed as the default) and it was later decided to also run the study in Germany.
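The France and Germany situation can be sketched as follows. The question text and the helper name text_for are assumptions; the point is only that a receiver looks for the requested language tag first and falls back on the element without “xml:lang”.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical snippet: French is the agreed default (no xml:lang),
# German was added later.
SNIPPET = """
<Question>
  <TranslatedText>Date de naissance</TranslatedText>
  <TranslatedText xml:lang="de">Geburtsdatum</TranslatedText>
</Question>
"""


def text_for(parent, lang):
    """Return the text tagged with lang, else the untagged default, else None."""
    default = None
    for t in parent.findall("TranslatedText"):
        tag = t.get(XML_LANG)
        if tag == lang:
            return t.text
        if tag is None:
            default = t.text
    return default


question = ET.fromstring(SNIPPET)
print(text_for(question, "de"))  # German text
print(text_for(question, "it"))  # no Italian text, so the default is used
```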

Another case where the xml:lang attribute can be omitted is when the text is not “translatable”, i.e. when the text content is the same anyway for every language in the study. For example in a codelist, used in the question “number of packages of cigarettes smoked per day”.

There are three choices here: “<1”, “1-2” and “>2”. Also note the use of the entities &lt; and &gt; (for “<” and “>”).
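When such a codelist is generated by software, the escaping is best left to a library. The sketch below, with a hypothetical codelist OID and name, writes the three choices with the standard Python escape function and reads them back to show that a parser returns the original characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

choices = ["<1", "1-2", ">2"]

# Build the CodeListItem elements; escape() turns "<" into "&lt;" and
# ">" into "&gt;". The OID and Name below are illustrative assumptions.
items = "".join(
    f'<CodeListItem CodedValue="{i}"><Decode>'
    f"<TranslatedText>{escape(c)}</TranslatedText>"
    f"</Decode></CodeListItem>"
    for i, c in enumerate(choices, start=1)
)
xml_text = (
    '<CodeList OID="CL.CIGPACKS" Name="Packages per day" DataType="integer">'
    f"{items}</CodeList>"
)

# Parsing converts the entities back into the plain characters.
parsed = [t.text for t in ET.fromstring(xml_text).iter("TranslatedText")]
print(xml_text)
print(parsed)
```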

In case one has non-translatable choices in a codelist, and when using ODM 1.3, it is often better to use the “EnumeratedItem” elements:

Now that we understand how to use language tags for the value of the “xml:lang” attribute, let us look at where the “TranslatedText” element is used in ODM 1.3.

“TranslatedText” may be a subelement of the following ODM 1.3 elements:

Description
Decode
Question
Symbol
ErrorMessage

A. The Description element

The Description element can be a child element of the following ODM elements:

StudyEventDef
FormDef
ItemGroupDef
ItemDef
ConditionDef
MethodDef

In StudyEventDef, FormDef and ItemGroupDef, the “Description” element allows the designer of the study to describe, in different languages, what the visit (StudyEvent), form or subform (Form, ItemGroup) is about. Often (but not at all required by the ODM specification) ²⁾, this description is used as a label on a paper CRF or eCRF. For example, when the following snippet is used in the ODM study design, and eCRFs are (automatically) created from this ODM in German and in Korean, the eCRFs may look like:

The German eCRF:

The Korean eCRF:

For the ItemDef element, the Description element is also usually used to give a description of the question. The question itself, as it needs to appear on the form, should however be given in the “Question” element (see further) and not in the “Description” element. The text given in the “Description” element can however be used as an extra label for the question.

For ConditionDef and MethodDef, the “Description” element is used to give a human-readable description of the condition or method, in parallel with the “FormalExpression” element, which is used to give a machine-readable representation of the condition or method. So the “FormalExpression” element will usually contain a snippet of computer code (source code or compiled code), whereas the “Description” element contains its human-readable counterpart. For example:

As described in the ODM 1.3 specification, when a receiving system (e.g. an EDC system) is not able to interpret the machine-readable expression, it should fall back on the information in the “TranslatedText” elements when generating the paper CRF or eCRF. In that case, the text of the “TranslatedText” element that matches the language of the CRF should be used on the form to describe the condition or method.
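The fallback from the formal expression to the translated description can be sketched as below. The condition, its OID, the “XPath” context value, the expression and the set of supported contexts are all assumptions made for this illustration; a real receiver would apply its own rules.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical condition: OID, Context value and expression are illustrative.
SNIPPET = """
<ConditionDef OID="COND.FEMALE" Name="Female subjects only">
  <Description>
    <TranslatedText xml:lang="en">Only for women</TranslatedText>
    <TranslatedText xml:lang="de">Nur für Frauen</TranslatedText>
  </Description>
  <FormalExpression Context="XPath">SEX = 'F'</FormalExpression>
</ConditionDef>
"""

# Contexts this imaginary receiving system is able to execute.
SUPPORTED_CONTEXTS = {"Python"}


def usable_form(condition, lang):
    """Return ("formal", code) if an expression is supported,
    otherwise ("text", description in the requested language or None)."""
    for expr in condition.findall("FormalExpression"):
        if expr.get("Context") in SUPPORTED_CONTEXTS:
            return "formal", expr.text
    for t in condition.findall("Description/TranslatedText"):
        if t.get(XML_LANG) == lang:
            return "text", t.text
    return "text", None


condition = ET.fromstring(SNIPPET)
print(usable_form(condition, "de"))
```

Here the receiver cannot execute the “XPath” expression, so the German description is what would be shown on a German form.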

B. The Decode element

The “TranslatedText” element is also used as a child element of the “Decode” element, which itself is a child element of “CodeListItem” within “CodeList”. An example snippet is given below.

In this case, the TranslatedText element enables the user of the (e)CRF to see the decoded value of a codelist (i.e. a list containing the possible answers) in his own language. For example, if the question is about the sex of the subject, and the French eCRF is used, the user may see the following on the screen:

When submitting to the server, the value stored in the database will however be one of the coded values, so “M” or “F”.
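The separation between what is displayed and what is stored can be sketched as follows. The codelist OID and the translated texts are assumptions; the function returns pairs of coded value and displayed text, so that a form can show the text and submit the code.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical codelist for the sex of the subject.
SNIPPET = """
<CodeList OID="CL.SEX" Name="Sex" DataType="text">
  <CodeListItem CodedValue="M">
    <Decode>
      <TranslatedText xml:lang="en">Male</TranslatedText>
      <TranslatedText xml:lang="fr">Masculin</TranslatedText>
    </Decode>
  </CodeListItem>
  <CodeListItem CodedValue="F">
    <Decode>
      <TranslatedText xml:lang="en">Female</TranslatedText>
      <TranslatedText xml:lang="fr">Féminin</TranslatedText>
    </Decode>
  </CodeListItem>
</CodeList>
"""


def choices(codelist, lang):
    """Return (coded value, displayed text) pairs for one language."""
    result = []
    for item in codelist.findall("CodeListItem"):
        for t in item.findall("Decode/TranslatedText"):
            if t.get(XML_LANG) == lang:
                result.append((item.get("CodedValue"), t.text))
    return result


codelist = ET.fromstring(SNIPPET)
for coded, shown in choices(codelist, "fr"):
    print(f"{shown} (stored as {coded})")
```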

C. The Question element

The use of “TranslatedText” in the “Question” element (child element of “ItemDef”) is very similar to the previous use, except that this time it represents the question that appears on the form. So the ODM snippet that represents the question about the sex of the subject (see image above) is:

D. The Symbol element

The “Symbol” element is used in ODM 1.3 within the “MeasurementUnit” element. The latter defines the possible units of measure that can be used in different regions of the world. A typical example is the weight of a subject, which is typically measured in pounds in the English-speaking parts of the world, whereas in most other countries, the weight is measured in kilograms.

The MeasurementUnit element itself is referenced from the “ItemDef” element, to indicate which possible units of measurement can be used for the item, and from the “ItemData” element (used when transferring or storing clinical data, i.e. collected data), to indicate which unit of measurement was used when the data was collected. For example:

As the unit of measurement must be displayed to the user in his own language, the information for the display needs to be retrieved from the “TranslatedText” element. A typical example is the measurement unit “Pounds” (MU_POUNDS in the above example), which will be displayed to French-speaking users as “Livres” and to German-speaking users as “Pfund”.
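Resolving the reference from an item to the translated unit symbol could look like the sketch below. The study fragment is heavily simplified (a real ODM Study contains more mandatory elements), and apart from MU_POUNDS and its three translations all OIDs and names are assumptions.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified, hypothetical study fragment.
SNIPPET = """
<Study OID="ST.DEMO">
  <BasicDefinitions>
    <MeasurementUnit OID="MU_POUNDS" Name="Pounds">
      <Symbol>
        <TranslatedText xml:lang="en">Pounds</TranslatedText>
        <TranslatedText xml:lang="fr">Livres</TranslatedText>
        <TranslatedText xml:lang="de">Pfund</TranslatedText>
      </Symbol>
    </MeasurementUnit>
  </BasicDefinitions>
  <MetaDataVersion OID="MV.1" Name="Version 1">
    <ItemDef OID="IT.WEIGHT" Name="Weight" DataType="float">
      <MeasurementUnitRef MeasurementUnitOID="MU_POUNDS"/>
    </ItemDef>
  </MetaDataVersion>
</Study>
"""


def unit_labels(study, item_oid, lang):
    """Return the unit symbols of an item in the requested language."""
    units = {mu.get("OID"): mu for mu in study.iter("MeasurementUnit")}
    item = next(i for i in study.iter("ItemDef") if i.get("OID") == item_oid)
    labels = []
    for ref in item.findall("MeasurementUnitRef"):
        unit = units[ref.get("MeasurementUnitOID")]
        for t in unit.findall("Symbol/TranslatedText"):
            if t.get(XML_LANG) == lang:
                labels.append(t.text)
    return labels


study = ET.fromstring(SNIPPET)
print(unit_labels(study, "IT.WEIGHT", "fr"))
```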

E. The ErrorMessage element

Error messages can be used in queries and to display a message when an investigator enters an impossible or out-of-range value in an eCRF. “ErrorMessage” is used in the “RangeCheck” element. It allows the designer of the eCRF to define ranges within which the collected answer is expected. The constraint can be defined as being “hard” (the value should be rejected) or “soft” (a warning is produced).

For example, a range check that the height of the subject must be below 220 cm, with the error messages in different languages, is defined in ODM by:
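A hedged sketch of such a range check, and of how a receiving system might evaluate it, is given below. The item OID, the choice of a “Soft” check and the message texts are assumptions, and the evaluation only covers the six simple comparators (the list-based comparators IN and NOTIN are not handled).

```python
import operator
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical item with one range check: height must be below 220 cm.
SNIPPET = """
<ItemDef OID="IT.HEIGHT" Name="Height" DataType="integer">
  <RangeCheck Comparator="LT" SoftHard="Soft">
    <CheckValue>220</CheckValue>
    <ErrorMessage>
      <TranslatedText xml:lang="en">Height must be below 220 cm</TranslatedText>
      <TranslatedText xml:lang="de">Die Größe muss unter 220 cm liegen</TranslatedText>
    </ErrorMessage>
  </RangeCheck>
</ItemDef>
"""

COMPARATORS = {
    "LT": operator.lt, "LE": operator.le,
    "GT": operator.gt, "GE": operator.ge,
    "EQ": operator.eq, "NE": operator.ne,
}


def check(item, value, lang):
    """Return (SoftHard, message) for every range check the value violates."""
    problems = []
    for rc in item.findall("RangeCheck"):
        compare = COMPARATORS[rc.get("Comparator")]
        limit = float(rc.findtext("CheckValue"))
        if not compare(value, limit):
            message = next(
                (t.text for t in rc.findall("ErrorMessage/TranslatedText")
                 if t.get(XML_LANG) == lang),
                None,
            )
            problems.append((rc.get("SoftHard"), message))
    return problems


item = ET.fromstring(SNIPPET)
print(check(item, 230, "de"))
print(check(item, 180, "de"))
```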

Other elements that use language tags

Another ODM element using the “xml:lang” attribute is the “Presentation” element. It defines how the information about the study is presented to the user. As this element has been defined as “a placeholder for future development”, we will not go into further details here.

XML and Unicode

Unicode is a universal character encoding standard. It provides a unique number for every character, no matter what the language, no matter what the program, no matter what the platform. As XML can use Unicode, this means that XML is able to cover all written languages used in the entire world.

The way the unique number for each character is stored internally (encoding) can however differ, and this is usually where the problems start. Also, XML can support other encodings than Unicode, making the confusion even bigger.

Some Unicode character encodings are UTF-8, UTF-16 and UTF-32. When an XML file does not contain an “encoding” attribute in the XML declaration, an encoding of UTF-8, UTF-16 or UTF-32 is assumed, depending on the byte order mark (BOM, see e.g. Wikipedia). In most cases, however, UTF-8 is used.

Some other encodings that one finds in XML files are:

UTF-16
UTF-32
ISO-8859-1 (Latin-1)
Windows-1252
Shift-JIS (Japanese)

In an XML file, the encoding is defined by the “encoding” attribute of the XML declaration on the first line. As already stated, when this attribute is absent, the receiving system may expect the encoding to be UTF-8. In all other cases, the encoding must be provided, e.g.:

When developing software applications for working with ODM files, the different expected (and unexpected) encodings can make life complicated. As it is not possible to detect reliably from the contents of the XML file itself which encoding has been used, one must rely on the value of the “encoding” attribute (which can be incorrect!). So if different encodings are expected, it may be necessary to read the first line of the XML file to grab the value of the “encoding” attribute, and then use that encoding for reading in the bytes from the file or for parsing the ODM file. An excellent article about possible problems with encodings in server-based software is “Multibyte-character processing in J2EE” by Wang Yu (see Literature section).
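Reading the first line to find the encoding can be sketched as follows. The function name sniff_encoding and the sample content are assumptions. The sketch checks for a byte order mark first and then for the “encoding” attribute; the latter only works when the declaration itself is stored in an ASCII-compatible encoding. Many XML parsers perform this detection themselves when they are given the raw bytes, so this is only needed when an application decodes the bytes on its own.

```python
import codecs
import re

# Byte order marks, UTF-32 first so that it is not mistaken for UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches the encoding attribute of an XML declaration at the start of the data.
DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding from the BOM or the XML declaration (default UTF-8)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    match = DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"


# Hypothetical file content, stored as ISO-8859-1 (Latin-1).
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<TranslatedText xml:lang="fr">Féminin</TranslatedText>'
).encode("iso-8859-1")

encoding = sniff_encoding(raw)
text = raw.decode(encoding)
print(encoding)
print(text)
```

As the text warns, the declared value can be wrong, so a production system would still need to handle decoding errors.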

Literature

Language tags in HTML and XML: http://www.w3.org/International/articles/language-tags/
FAQ: Two-letter or three-letter language codes: http://www.w3.org/International/questions/qa-lang-2or3.en.php

Tags for Identifying Languages: http://www.rfc-editor.org/rfc/rfc4646.txt
IANA list of language codes: http://www.iana.org/assignments/language-subtag-registry
XML and Localization – FAQ – Encoding: http://www.opentag.com/xfaq_enc.htm
The Unicode home page: http://unicode.org/
Multibyte-character processing in J2EE: http://www.javaworld.com/javaworld/jw-04-2004/jw-0419-multibytes.html

Acknowledgements

Very special and sincere thanks to prof. Inyoung Choi, The Catholic University of Korea, College of Medicine, Department of Preventive Medicine, for providing the Korean translations and adding them to the ODM file.

¹⁾ footnote: Please see the Acknowledgements

²⁾ footnote: The CDISC ODM 1.3 specification says that the Description element is “A free-text description of the containing study metadata component”, so it does not suggest at all that it should be used as a label in an (e)CRF. Many vendors developed explicit vendor extension elements to store the labels that should appear on the (e)CRF.