This is part 4 of a multi-part series. Read part 3, part 2, or part 1.
What is a "translation serialization format"
"Translation serialization format" is my name for the several file formats that computer-assisted translation (CAT) tools use to convey localizable data from tool to tool.
What do we need
Django’s serialization framework allows for custom serialization formats. That meant we only needed to implement the specifics of the format, not all the logic involved. The goal was to find a format that was good for both serializing and translating. The file format needed to:
- Support lots of objects in one file. The exporter gathers up all the objects required for translation, and Django supports dumping all of them into one file (and pulling them back out), so the format needed to do the same.
- Support extra metadata. There is going to be Django-specific metadata for the records. The file format must allow for arbitrary metadata.
- Have tools that support it. If there were tools out there to help us and the translators, that would give us a big boost. If we used a translation service, we wanted a format popular enough that the service already used it.
We decided on XLIFF
The XML Localisation Interchange File Format (XLIFF) seemed to fit our needs.
- It is mature. It has been around a while and has been updated several times.
- Used in several CMSes already. While we weren’t going to use this to exchange data with different CMSes, it was nice to know that several others had made the same choice.
- Many 3rd-party translators support it. Every professional translation company and service we looked at accepted XLIFF.
- External tool support. There are several open source and commercial tools available to translate XLIFF files. One tool, Pootle, was very helpful. It is written in Django and provided us with a way to test our XLIFF files for compatibility.
- Extensible by nature. Part of XLIFF's initial design was for the serialization and translation of database tables and rows. Although XML isn't fun, it did allow for extensions.
We made it open source
We released Django XLIFF under the Apache 2.0 license. It has all the tests that Django's built-in serializers have.
Comparing Django XML to Django XLIFF
XLIFF is a bit more verbose than Django's standard XML output. Sample output looks like this (extra white space added for readability):
<?xml version="1.0" encoding="utf-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2"
version="1.2"
xmlns:d="https://docs.djangoproject.com/">
<file datatype="database"
source-language="en-us"
original="simpleapp.tag.2">
<body>
<group resname="simpleapp.tag.2" restype="row">
<trans-unit resname="data"
translate="no"
id="data"
restype="x-SlugField">
<source>tag2</source>
</trans-unit>
<trans-unit d:keytype="pk"
d:to="contenttypes.contenttype"
restype="x-ForeignKey"
d:rel="ManyToOneRel"
resname="content_type"
translate="no"
id="content_type">
<source>37</source>
</trans-unit>
<trans-unit resname="object_id"
translate="no"
id="object_id"
restype="x-PositiveIntegerField">
<source>200</source>
</trans-unit>
</group>
</body>
</file>
</xliff>
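To get a feel for how a tool might read this output, here is a minimal sketch that walks the sample with Python's standard `xml.etree.ElementTree`. The helper name `list_units` is my own for illustration and is not part of Django XLIFF.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"

# The sample output from above, condensed onto fewer lines.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2"
       xmlns:d="https://docs.djangoproject.com/">
  <file datatype="database" source-language="en-us" original="simpleapp.tag.2">
    <body>
      <group resname="simpleapp.tag.2" restype="row">
        <trans-unit resname="data" translate="no" id="data"
                    restype="x-SlugField">
          <source>tag2</source>
        </trans-unit>
        <trans-unit d:keytype="pk" d:to="contenttypes.contenttype"
                    restype="x-ForeignKey" d:rel="ManyToOneRel"
                    resname="content_type" translate="no" id="content_type">
          <source>37</source>
        </trans-unit>
        <trans-unit resname="object_id" translate="no" id="object_id"
                    restype="x-PositiveIntegerField">
          <source>200</source>
        </trans-unit>
      </group>
    </body>
  </file>
</xliff>"""


def list_units(xliff_bytes):
    """Return (object identity, field name, translate flag, value) tuples."""
    ns = {"x": XLIFF_NS}
    root = ET.fromstring(xliff_bytes)
    rows = []
    for group in root.iterfind(".//x:group", ns):
        for unit in group.iterfind("x:trans-unit", ns):
            rows.append((
                group.get("resname"),
                unit.get("resname"),
                unit.get("translate"),
                unit.findtext("x:source", namespaces=ns),
            ))
    return rows


# Parse from bytes so the encoding declaration in the prolog is honored.
units = list_units(SAMPLE.encode("utf-8"))
for row in units:
    print(row)
```

Every element lives in the XLIFF 1.2 namespace, so the lookups have to be namespace-qualified; a bare `find("group")` would return nothing.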
File Heading
Django XML:
<django-objects version="1.0">
Django XLIFF:
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2"
version="1.2"
xmlns:d="https://docs.djangoproject.com/">
The XLIFF header is technically one line, just like the <django-objects> tag. I've added a Django namespace (d) for a few Django-specific attributes used later on.
An Object
Django XML:
<object pk="1" model="simpleapp.article">
Django XLIFF:
<file datatype="database"
source-language="en-us"
original="simpleapp.article.1">
<body>
<group resname="simpleapp.article.1" restype="row">
The source-language attribute is assigned Django's LANGUAGE_CODE setting. I decided to incorporate the id or primary key into the naming and identification within the XLIFF file, so <file original=""> and <group resname=""> both use the app.model.pk dotted notation for identity.
Why are <file original=""> and <group resname=""> the same? Initially <file> was going to be equivalent to a database and use app.model, and the <group>s would be rows and use app.model.pk. However, several translation tools I used for testing got confused, so now every object is redundantly enclosed in <file><body><group> tags.
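The app.model.pk identity is easy to build and take apart. The following is a small sketch with hypothetical helper names (`make_identity` and `split_identity` are not taken from Django XLIFF); it assumes that app labels and model names never contain a dot, which holds for normal Django projects.

```python
def make_identity(app_label, model_name, pk):
    """Build the dotted identity used for <file original> and <group resname>."""
    return f"{app_label}.{model_name}.{pk}"


def split_identity(identity):
    """Split an identity back into (app_label, model_name, pk).

    Splitting at most twice keeps a primary key that itself contains
    dots in one piece. The pk comes back as a string.
    """
    app_label, model_name, pk = identity.split(".", 2)
    return app_label, model_name, pk


identity = make_identity("simpleapp", "article", 1)
print(identity)
print(split_identity(identity))
```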
A Field
Django XML:
<field to="simpleapp.author" name="author" rel="ManyToOneRel">2</field>
<field type="CharField" name="headline">Poker has no place on ESPN</field>
<field type="DateTimeField" name="pub_date">2006-06-16T11:00:00</field>
<field to="simpleapp.category" name="categories" rel="ManyToManyRel">
<object pk="3"></object>
<object pk="1"></object>
</field>
Django XLIFF:
<trans-unit restype="x-ForeignKey"
d:keytype="pk"
d:to="simpleapp.author"
d:rel="ManyToOneRel"
translate="no"
resname="author"
id="author">
<source>2</source>
</trans-unit>
<trans-unit restype="x-CharField"
maxwidth="50"
size-unit="char"
translate="yes"
resname="headline"
id="headline">
<source>Poker has no place on ESPN</source>
</trans-unit>
<trans-unit restype="x-DateTimeField"
translate="no"
resname="pub_date"
id="pub_date">
<source>2006-06-16T11:00:00</source>
</trans-unit>
<trans-unit restype="x-ManyToManyField"
d:keytype="pk"
d:to="simpleapp.category"
d:rel="ManyToManyRel"
translate="no"
resname="categories"
id="1.3">
<source>3</source>
</trans-unit>
<trans-unit restype="x-ManyToManyField"
d:keytype="pk"
d:to="simpleapp.category"
d:rel="ManyToManyRel"
translate="no"
resname="categories"
id="1.1">
<source>1</source>
</trans-unit>
Common attributes.
Django's default XML export doesn't contain any indication of the field's type. That metadata might be useful during translation, so I put it in the restype attribute with an x- prefix, the allowable way to extend that attribute.
The resname attribute contains the field's name. The id attribute is also the field's name, except for many-to-many fields. The ids need to be unique across all the siblings, so for many-to-many fields I combine the record's id and the related object's id (1.3 and 1.1 in the example above).
Most fields do not require translation, so only character fields and text fields have the translate attribute set to "yes". All other fields have it set to "no".
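As a rough sketch, the rules for the common attributes could be expressed like this. The function name and the exact set of translatable type names are assumptions of mine, chosen to match the samples above (note that the SlugField sample is marked translate="no"); Django XLIFF may decide this differently.

```python
# Assumption: only these exact field types are offered for translation.
TRANSLATABLE_TYPES = {"CharField", "TextField"}


def common_attributes(field_type, field_name, record_pk=None, related_pk=None):
    """Return the restype, resname, id and translate attributes for a field.

    Pass record_pk and related_pk for each value of a many-to-many field
    so that sibling ids stay unique.
    """
    if related_pk is not None:
        unit_id = f"{record_pk}.{related_pk}"
    else:
        unit_id = field_name
    return {
        "restype": f"x-{field_type}",
        "resname": field_name,
        "id": unit_id,
        "translate": "yes" if field_type in TRANSLATABLE_TYPES else "no",
    }


headline = common_attributes("CharField", "headline")
category = common_attributes("ManyToManyField", "categories",
                             record_pk=1, related_pk=3)
print(headline)
print(category)
```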
Character fields.
Character fields also include two additional attributes: maxwidth and size-unit. maxwidth is the length allowed for that character field (you don't want to make a translation that won't fit), and size-unit specifies that the width is in Unicode characters.
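A translated value can be checked against these attributes before it is imported. This is a minimal sketch with a hypothetical helper name (`translation_fits`); it relies on Python's `len()` counting Unicode code points, which is what I assume "char" means here.

```python
def translation_fits(unit_attrs, target_text):
    """Return True if target_text respects the unit's maxwidth.

    Units without a maxwidth, or with a size-unit other than "char",
    are not checked by this sketch.
    """
    if "maxwidth" not in unit_attrs or unit_attrs.get("size-unit") != "char":
        return True
    return len(target_text) <= int(unit_attrs["maxwidth"])


headline_attrs = {"maxwidth": "50", "size-unit": "char"}
print(translation_fits(headline_attrs, "El póquer no tiene lugar en ESPN"))
print(translation_fits(headline_attrs, "x" * 51))
```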
Relations.
Foreign keys and many-to-many fields require a few extra attributes. The default Django XML uses the rel and to attributes of the <field> tag to specify the type of relation and the related model, respectively. If the export is using natural keys, it includes extra tags for those.
We achieve this in XLIFF using a separate Django namespace (d) and adding attributes to the <trans-unit> tag. d:keytype specifies whether the value is the primary key (pk) or the natural key (natural). d:to and d:rel are equivalent to the to and rel attributes in the default Django XML.
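Reading the namespaced attributes back takes a little care, because ElementTree addresses them by full namespace URI rather than by the d prefix. Here is a sketch using the foreign key sample from above; `relation_info` is an illustrative name, not a function from Django XLIFF.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
DJANGO_NS = "https://docs.djangoproject.com/"

# A single trans-unit, with the namespaces declared on it so that it
# parses on its own outside the full <xliff> document.
UNIT = """<trans-unit xmlns="urn:oasis:names:tc:xliff:document:1.2"
    xmlns:d="https://docs.djangoproject.com/"
    restype="x-ForeignKey" d:keytype="pk" d:to="simpleapp.author"
    d:rel="ManyToOneRel" translate="no" resname="author" id="author">
  <source>2</source>
</trans-unit>"""


def relation_info(unit):
    """Collect the Django-specific relation attributes of a trans-unit."""
    def d(name):
        # ElementTree spells namespaced attributes as "{uri}localname".
        return unit.get(f"{{{DJANGO_NS}}}{name}")

    return {
        "keytype": d("keytype"),
        "to": d("to"),
        "rel": d("rel"),
        "value": unit.findtext(f"{{{XLIFF_NS}}}source"),
    }


info = relation_info(ET.fromstring(UNIT))
print(info)
```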
Natural keys are encoded in the <source> tag by concatenating multiple values with the NATURAL_KEY_SEPARATOR setting. It defaults to ; (a semicolon).
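The natural key encoding amounts to a join and a split. The sketch below uses a module-level constant in place of the real setting, and the helper names and the example key values are made up for illustration. It also assumes that no key value contains the separator, since nothing described here escapes it.

```python
# Stand-in for the NATURAL_KEY_SEPARATOR setting and its default.
NATURAL_KEY_SEPARATOR = ";"


def encode_natural_key(values, separator=NATURAL_KEY_SEPARATOR):
    """Join the parts of a natural key into one <source> string."""
    return separator.join(str(value) for value in values)


def decode_natural_key(text, separator=NATURAL_KEY_SEPARATOR):
    """Split a <source> string back into the parts of a natural key."""
    return text.split(separator)


encoded = encode_natural_key(("Douglas", "Adams"))
print(encoded)
print(decode_natural_key(encoded))
```

If your natural keys can contain semicolons, pick a separator that cannot appear in the data.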