What is a "translation serialization format"?
"Translation serialization format" is my name for the several file formats that computer-assisted translation (CAT) tools use to convey localizable data from tool to tool.
What do we need?
Django’s serialization framework allows for custom serialization formats. That meant we only needed to implement the specifics of the format, not all the logic involved. The goal was to find a format that was good for both serializing and translating. The file format needed to:
- Support lots of objects in one file. Since the exporter gathers up all the objects required for translation, and Django supports dumping all of those objects into one file (and pulling them back out), it would be nice for this format to do that as well.
- Support extra metadata. There is going to be Django-specific metadata for the records. The file format must allow for arbitrary metadata.
- Have tools that support it. If there were tools out there to help us and the translators, that would give us a big boost. If we used a translation service, we wanted a format that was popular enough that they used it.
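Django finds custom formats through its SERIALIZATION_MODULES setting, which maps a format name to the module that implements it. A minimal sketch of what registering an XLIFF format might look like follows. The module path `xliff.xliff_serializer` is an assumption for illustration, so check the package's own documentation for the real one.

```python
# settings.py (fragment)
# SERIALIZATION_MODULES is a standard Django setting. The module path on the
# right-hand side is an assumed name used only for illustration.
SERIALIZATION_MODULES = {
    "xliff": "xliff.xliff_serializer",
}
```

Once a format is registered this way, Django resolves its name like any built-in format, for example in `serializers.serialize("xliff", queryset)`.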
We decided on XLIFF
The XML Localisation Interchange File Format (XLIFF) seemed to fit our needs.
- It is mature. It has been around a while and has been updated several times.
- Used in several CMSes already. While we weren’t going to use this to exchange data with different CMSes, it was nice to know that several others had made the same choice.
- Many third-party translators support it. Looking at professional translation companies and services, XLIFF was included in all of them.
- External tool support. There are several open source and commercial tools available to translate XLIFF files. One tool, Pootle, was very helpful. It is written in Django and provided us with a way to test out our XLIFF files for compatibility.
- Extensible by nature. Part of XLIFF's initial design was for the serialization and translation of database tables and rows. Although XML isn't fun, it did allow for extensions.
We made it open source
We released Django XLIFF under the Apache 2.0 license. It has all the tests that Django's built-in serializers have.
Comparing Django XML to Django XLIFF
XLIFF is a bit more verbose than Django's standard XML output. Sample output looks like this (extra white space added for readability):
<?xml version="1.0" encoding="utf-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2" xmlns:d="https://docs.djangoproject.com/">
  <file datatype="database" source-language="en-us" original="simpleapp.tag.2">
    <body>
      <group resname="simpleapp.tag.2" restype="row">
        <trans-unit resname="data" translate="no" id="data" restype="x-SlugField">
          <source>tag2</source>
        </trans-unit>
        <trans-unit d:keytype="pk" d:to="contenttypes.contenttype" restype="x-ForeignKey" d:rel="ManyToOneRel" resname="content_type" translate="no" id="content_type">
          <source>37</source>
        </trans-unit>
        <trans-unit resname="object_id" translate="no" id="object_id" restype="x-PositiveIntegerField">
          <source>200</source>
        </trans-unit>
      </group>
    </body>
  </file>
</xliff>
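To see how a consumer might pull the records back out of output like the sample above, here is a rough sketch using only Python's standard library. It is not the package's deserializer; the function name `read_units` and the shape of the returned dictionary are made up for this example, and the sample is trimmed to two fields.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
DJANGO_NS = "https://docs.djangoproject.com/"

SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2"
       xmlns:d="https://docs.djangoproject.com/">
  <file datatype="database" source-language="en-us" original="simpleapp.tag.2">
    <body>
      <group resname="simpleapp.tag.2" restype="row">
        <trans-unit resname="data" translate="no" id="data" restype="x-SlugField">
          <source>tag2</source>
        </trans-unit>
        <trans-unit d:keytype="pk" d:to="contenttypes.contenttype"
                    restype="x-ForeignKey" d:rel="ManyToOneRel"
                    resname="content_type" translate="no" id="content_type">
          <source>37</source>
        </trans-unit>
      </group>
    </body>
  </file>
</xliff>"""


def read_units(xliff_bytes):
    """Return {group resname: {field name: details}} for every row group."""
    root = ET.fromstring(xliff_bytes)
    ns = {"x": XLIFF_NS}
    rows = {}
    for group in root.iterfind(".//x:group", ns):
        fields = {}
        for unit in group.iterfind("x:trans-unit", ns):
            fields[unit.get("resname")] = {
                "value": unit.findtext("x:source", namespaces=ns),
                "restype": unit.get("restype"),
                # Namespaced attributes are addressed as {namespace-uri}name.
                "to": unit.get(f"{{{DJANGO_NS}}}to"),
            }
        rows[group.get("resname")] = fields
    return rows


rows = read_units(SAMPLE.encode("utf-8"))
print(rows["simpleapp.tag.2"]["content_type"])
```

The elements live in the XLIFF default namespace, so every lookup needs the namespace mapping; the Django-specific attributes need the full `{uri}name` form.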
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2" xmlns:d="https://docs.djangoproject.com/">
The XLIFF header is technically one line, just like the <django-objects> tag. I've added a Django namespace (d) for a few Django-specific attributes used later on.
<object pk="1" model="simpleapp.article">
<file datatype="database" source-language="en-us" original="simpleapp.article.1"> <body> <group resname="simpleapp.article.1" restype="row">
The source-language attribute is assigned Django's LANGUAGE_CODE setting. I decided to incorporate the id or primary key into the naming and identification within the XLIFF file, thus <file original=""> and <group resname=""> both use the app.model.pk dotted notation for identity.
Why are <file original=""> and <group resname=""> the same? Initially <file> was going to be equivalent to a database and use app.model, and the <group>s would be rows and use app.model.pk. However, several translation tools I used for testing got confused, so now every object is redundantly enclosed in both a <file> and a <group>.
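The identity scheme can be sketched in a few lines. These helper names are hypothetical and not part of the package; they only illustrate how one dotted app.model.pk string ends up on both wrapper elements.

```python
def object_identity(app_label, model_name, pk):
    """Build the dotted app.model.pk identity used for one object."""
    return f"{app_label}.{model_name}.{pk}"


def split_identity(identity):
    """Split an identity back into its parts (the pk comes back as a string)."""
    # maxsplit=2 keeps any dots inside the primary key intact.
    app_label, model_name, pk = identity.split(".", 2)
    return app_label, model_name, pk


def wrapper_attributes(app_label, model_name, pk, language_code):
    """Attributes for the <file> and <group> elements wrapping one object."""
    identity = object_identity(app_label, model_name, pk)
    return {
        "file": {
            "datatype": "database",
            "source-language": language_code,
            "original": identity,
        },
        "group": {"resname": identity, "restype": "row"},
    }


attrs = wrapper_attributes("simpleapp", "article", 1, "en-us")
print(attrs["file"]["original"], attrs["group"]["resname"])
```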
<field to="simpleapp.author" name="author" rel="ManyToOneRel">2</field>
<field type="CharField" name="headline">Poker has no place on ESPN</field>
<field type="DateTimeField" name="pub_date">2006-06-16T11:00:00</field>
<field to="simpleapp.category" name="categories" rel="ManyToManyRel">
  <object pk="3"></object>
  <object pk="1"></object>
</field>
<trans-unit restype="x-ForeignKey" d:keytype="pk" d:to="simpleapp.author" d:rel="ManyToOneRel" translate="no" resname="author" id="author">
  <source>2</source>
</trans-unit>
<trans-unit restype="x-CharField" maxwidth="50" size-unit="char" translate="yes" resname="headline" id="headline">
  <source>Poker has no place on ESPN</source>
</trans-unit>
<trans-unit restype="x-DateTimeField" translate="no" resname="pub_date" id="pub_date">
  <source>2006-06-16T11:00:00</source>
</trans-unit>
<trans-unit restype="x-ManyToManyField" d:keytype="pk" d:to="simpleapp.category" d:rel="ManyToManyRel" translate="no" resname="categories" id="1.3">
  <source>3</source>
</trans-unit>
<trans-unit restype="x-ManyToManyField" d:keytype="pk" d:to="simpleapp.category" d:rel="ManyToManyRel" translate="no" resname="categories" id="1.1">
  <source>1</source>
</trans-unit>
Django's default XML export records a field's type only for non-relational fields; relations carry rel and to instead. That metadata might be useful during translation, so I put it on every unit in the restype attribute with an x- prefix, the allowable way to extend that attribute.
The resname attribute contains the field's name. The id attribute is also the field's name, except for many-to-many fields. The ids need to be unique across all the siblings, so for those I combine the record's id and the related object's id (1.3 and 1.1 in the example).
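As a small illustration of that id rule, here is a sketch with a hypothetical helper, `unit_id`, that is not taken from the package. Plain fields reuse the field name, while each many-to-many entry gets a dotted pair of primary keys so that sibling ids never collide.

```python
def unit_id(field_name, record_pk=None, related_pk=None):
    """Return the id attribute for one <trans-unit>.

    Plain fields use the field name. Many-to-many entries share a field
    name, so they use "<record pk>.<related pk>" instead.
    """
    if related_pk is None:
        return field_name
    return f"{record_pk}.{related_pk}"


# Article 1 is linked to categories 3 and 1, as in the sample output.
ids = [unit_id("headline")]
ids += [unit_id("categories", record_pk=1, related_pk=pk) for pk in (3, 1)]
print(ids)
```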
Most fields do not require translation, so only character fields and text fields have the translate attribute set to "yes". All other fields have it set to "no".
Character fields also include two additional attributes: maxwidth is the length allowed for that character field (you don't want to make a translation that won't fit), and size-unit specifies that the width is in Unicode characters.
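The translate, maxwidth and size-unit rules can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the function names are hypothetical, and matching on the exact type names CharField and TextField is my reading of the sample output (where a SlugField is marked translate="no").

```python
# Assumption: only these exact field type names count as translatable.
TRANSLATABLE_TYPES = {"CharField", "TextField"}


def unit_attributes(field_name, field_type, max_length=None):
    """Build the attribute dict for one plain field's <trans-unit>."""
    attrs = {
        "id": field_name,
        "resname": field_name,
        "restype": f"x-{field_type}",
        "translate": "yes" if field_type in TRANSLATABLE_TYPES else "no",
    }
    if field_type == "CharField" and max_length is not None:
        attrs["maxwidth"] = str(max_length)
        attrs["size-unit"] = "char"
    return attrs


def translation_fits(attrs, translated_text):
    """Check a translation against maxwidth when the unit is characters."""
    maxwidth = attrs.get("maxwidth")
    if maxwidth is None or attrs.get("size-unit") != "char":
        return True
    # len() counts Unicode code points, a close stand-in for "characters".
    return len(translated_text) <= int(maxwidth)


headline = unit_attributes("headline", "CharField", max_length=50)
pub_date = unit_attributes("pub_date", "DateTimeField")
print(headline["translate"], pub_date["translate"])
print(translation_fits(headline, "El póker no tiene lugar en ESPN"))
```

A check like `translation_fits` is the kind of validation a translation tool can run before a too-long string ever reaches the database.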
Foreign keys and many-to-many fields require a few extra attributes. The default Django XML uses the rel and to attributes of the <field> tag to specify the type of relation and the related model, respectively. And, if the export is using natural keys, it includes extra tags for those.
We achieve this in XLIFF using a separate Django namespace (d) and adding attributes to the <trans-unit> tag. d:keytype specifies if the value is using the primary key (pk) or the natural key. d:to and d:rel are equivalent to the to and rel attributes in the default Django XML.
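Here is a rough sketch of how a relation's unit with namespaced attributes could be produced with Python's standard library. The helper `relation_unit` is hypothetical and the package may build its XML differently; the attribute names and values come from the sample output.

```python
import xml.etree.ElementTree as ET

DJANGO_NS = "https://docs.djangoproject.com/"
# Ask ElementTree to write this namespace with the "d" prefix.
ET.register_namespace("d", DJANGO_NS)


def relation_unit(name, restype, to, rel, value, keytype="pk"):
    """Build a <trans-unit> for a foreign key or many-to-many value."""
    unit = ET.Element("trans-unit", {
        "id": name,
        "resname": name,
        "restype": restype,
        "translate": "no",
        # Namespaced attributes are written as {namespace-uri}name.
        f"{{{DJANGO_NS}}}keytype": keytype,
        f"{{{DJANGO_NS}}}to": to,
        f"{{{DJANGO_NS}}}rel": rel,
    })
    ET.SubElement(unit, "source").text = str(value)
    return unit


unit = relation_unit("author", "x-ForeignKey", "simpleapp.author", "ManyToOneRel", 2)
xml_text = ET.tostring(unit, encoding="unicode")
print(xml_text)
```

Keeping the Django-only details in their own namespace means a translation tool that only knows XLIFF can ignore them without rejecting the file.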
Natural keys are encoded in the <source> tag by concatenating multiple values with the NATURAL_KEY_SEPARATOR setting. It defaults to
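The join-and-split idea behind natural keys can be sketched like this. The function names are hypothetical, and the separator is passed in explicitly; the "|" used in the example is an arbitrary stand-in, not the setting's actual default.

```python
def encode_natural_key(values, separator):
    """Join a natural key's parts into one string for the <source> tag."""
    parts = [str(value) for value in values]
    for part in parts:
        # A part containing the separator could not be split back reliably.
        if separator in part:
            raise ValueError(f"{part!r} contains the separator {separator!r}")
    return separator.join(parts)


def decode_natural_key(text, separator):
    """Split the <source> text back into the natural key's parts."""
    return tuple(text.split(separator))


# "|" is an arbitrary stand-in for the configured separator.
encoded = encode_natural_key(["Douglas", "Adams"], "|")
print(encoded, decode_natural_key(encoded, "|"))
```

Whatever separator is configured, it has to be a string that never occurs inside the key values themselves, which is why the sketch refuses ambiguous input.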