The WFC Translation memory format (Wordfast Classic)
Latest revision as of 06:30, 6 November 2017
A Wordfast translation memory is a tab-delimited text file. It is the simplest of all formats: it can be opened with text editors such as Notepad, with Unicode-compliant word processors, and with Excel. Wordfast TMs can be regular ANSI (8-bit) text or Unicode UTF-16 (either little-endian or big-endian).
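As a minimal sketch of how a script might tell these encodings apart, the following Python checks the first bytes of the file for a UTF-16 byte-order mark. The function names are hypothetical, and two assumptions are made that the text above does not state: that UTF-16 TMs begin with a byte-order mark, and that "ANSI" can be decoded as Windows code page 1252.

```python
import codecs


def detect_tm_encoding(raw: bytes) -> str:
    """Guess the encoding of a TM file from its first bytes."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # Assumption: no byte-order mark means 8-bit ANSI text.
    # cp1252 is only a guess; the real code page depends on the machine.
    return "cp1252"


def read_tm_text(raw: bytes) -> str:
    """Decode the raw bytes of a TM file and drop a leading byte-order mark."""
    text = raw.decode(detect_tm_encoding(raw))
    return text.lstrip("\ufeff")
```

A caller would read the file in binary mode (`open(path, "rb").read()`) and pass the bytes to `read_tm_text`.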
A Translation Memory (TM) is a set of lines (paragraphs) of text. In a pure text file where the display does not wrap, lines are paragraphs. The very first line is a header, and all other lines are TUs (Translation Units), sometimes called "entries". Lines/Entries/TUs are sets of fields, a field being any text (even lack of text, which denotes an empty field) followed by a tabulator. In other words, the Wordfast TM format is Tab-delimited Text, which is arguably one of the oldest, most robust, most open, and easiest-to-manipulate data formats ever. In the header (the very first line in a TM), each field begins with a % (per cent) mark.
Fields making up a TU:
Field | Example | Format | Remark |
Date | 20041231~165410 | yyyymmdd~hhmmss. The example here means 31 December 2004, at 16:54:10, local time. See the note on the tilde (~) character further below. | Optional field: can be empty |
User ID (Attribute #1) | YAC | Initials of the TU's creator. | Optional field: can be empty |
Counter | 5 | A number between 0 and 9999 that records how many times this TU was proposed as a 100% match and accepted, that is, re-used as is. | Optional field: can be empty |
Source language | EN-US | TMX-compliant language code (but case-insensitive with WFC). It is made of a two-letter ISO language code and, optionally, a dash followed by a two-letter local variant. | Optional field: can be empty. Rule: field cannot be longer than 5 characters. |
Source segment | Red Riding Hood was walking in the woods. | The source segment. Maximum size: 8000 Unicode characters. | Should contain at least one character. |
Target language | FR-FR | Language code, TMX-compliant | Optional field: can be empty. Rule: field cannot be longer than 5 characters. |
Target segment | Le Petit Chaperon Rouge se promenait dans les bois. | The target segment. Maximum size: 8000 Unicode characters. | Optional field: can be empty |
Attribute #2 (optional) | EL | A mnemonic (maximum length: 64 characters; no space allowed) for user-defined attribute #1. See Wordfast's "Sample" attributes. | Optional field: can be empty, and its tabulator can be omitted |
Attribute #3 (optional) | PS | | Optional field: can be empty, and its tabulator can be omitted |
Attribute #4 (optional) | | | Optional field: can be empty, and its tabulator can be omitted |
Attribute #5 (optional) | | | Optional field: can be empty, and its tabulator can be omitted |
Here are the first two paragraphs (the TM's header and first Translation Unit) of a TM where the TU is defined as in the table above. Paragraphs are long, so they may wrap in your display - but there are only two paragraphs:
%20041231~160445 | %YAC, Yves A. Champollion | %TU=00000000 | %EN-US | %Wordfast TM v5.0 | %FR-FR | %87412764 | ||
20041231~165410 | YAC | 5 | EN-US | Red Riding Hood was walking in the woods. | FR-FR | Le Chaperon Rouge se promenait dans les bois. | EL | PS |
The header (first line in the TM) in the example above defines two attributes named Domain and Client. The first TU contains two attribute values: EL and PS. Both attribute names (unique per TM) and attribute values (multiple: one per TU) can be made of up to 64 characters. Acronyms are used in the example above (EL for Electronics and PS for a client), but longer descriptors can be used. Question and exclamation marks ( ! ¡ ? ¿ ) are forbidden in attribute names and values.
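To make the field layout concrete, here is a small Python sketch that splits one TU line on tabulators and names the fields in the order given in the table above. The `TranslationUnit` class and `parse_tu` function are hypothetical names chosen for illustration; this is not WFC's own parsing code, and it performs no validation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranslationUnit:
    date: str
    user_id: str
    counter: str
    source_lang: str
    source: str
    target_lang: str
    target: str
    attributes: List[str] = field(default_factory=list)


def parse_tu(line: str) -> TranslationUnit:
    """Split a tab-delimited TU line into named fields (no validation)."""
    fields = line.rstrip("\r\n").split("\t")
    # Pad short lines so the seven core fields always exist.
    fields += [""] * (7 - len(fields))
    attributes = fields[7:]
    # Drop trailing empty entries produced by a final tabulator.
    while attributes and attributes[-1] == "":
        attributes.pop()
    return TranslationUnit(*fields[:7], attributes=attributes)
```

Applied to the example TU above (with real tabulators between fields), `parse_tu` returns an object whose `source_lang` is `EN-US` and whose `attributes` list is `["EL", "PS"]`.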
When reading a TU, Wordfast errs on the side of optimism if the TU does not look correct or canonical. When, in a TU:
- the date is missing or wrong: if WFC is executing a loop that parses TUs, it takes the previous TU's date and increments it by one second; otherwise, WFC takes the local machine's current date and time;
- the user ID is empty: WFC assumes the TM header's user ID. If that is missing, WFC uses the user's identity as defined in MS Word. If that is also missing, WFC uses XX;
- a language code is missing or incorrect: WFC assumes the TM header's language code.
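The fallback rules above can be sketched in Python as follows. All function names are hypothetical, and the test for an "incorrect" language code (empty or longer than 5 characters) is an assumption borrowed from the field table, not a statement of how WFC actually decides.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_stamp(stamp: str) -> Optional[datetime]:
    """Parse 'yyyymmdd?hhmmss' (any separator); None if missing or invalid."""
    if len(stamp) != 15:
        return None
    try:
        return datetime.strptime(stamp[:8] + stamp[9:], "%Y%m%d%H%M%S")
    except ValueError:
        return None


def effective_date(stamp: str, previous: Optional[datetime], now: datetime) -> datetime:
    """Previous TU's date plus one second when in a loop, else the current time."""
    parsed = parse_stamp(stamp)
    if parsed is not None:
        return parsed
    if previous is not None:
        return previous + timedelta(seconds=1)
    return now


def effective_user(tu_user: str, header_user: str, word_user: str) -> str:
    """First non-empty of: TU user ID, header user ID, Word identity, 'XX'."""
    for candidate in (tu_user, header_user, word_user):
        if candidate:
            return candidate
    return "XX"


def effective_lang(code: str, header_code: str) -> str:
    """Fall back to the header's code (assumed test: empty or over 5 chars)."""
    return code if 0 < len(code) <= 5 else header_code
```

Passing `previous=None` models the case where WFC is not in a sequential parsing loop.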
Files
A translation memory (e.g. WfMemory) generates the following files:
WfMemory.Txt | This is the translation memory. Do not delete it unless you want to discard it. |
WfMemoryXXX.Itx | This is the TM's index. Deleting it is not a problem, since WFC re-creates it automatically when needed. |
WfMemory.Bak | This is a copy of the TM made before any Reorganisation, Merge or Sort operation. Deleting it is not a problem. |
WfMemory.Old | When a new BAK file is created and replaces an existing BAK file, WFC renames the existing WfMemory.Bak file to WfMemory.Old if the existing BAK file appears larger than the new BAK file that overwrites it. Deleting it is not a problem. |
If you need to archive a TM, or send it to a colleague, the only necessary file is the .TXT file. It is recommended to reorganise a TM before sending it to someone (using the WFC/Translation memory/TM/Reorganise button).
If a translation memory is lost, remember that (if you keep copies of your translated, segmented files) cleaning up the segmented files that produced the TM will recreate the corresponding TM with its translation units.
Fault detection (ignoring malformed TUs)
WFC decides that a TU is bad by counting the tabulators in a line of text. A line of text with fewer than 6 tabulators cannot form a valid TU. Another fault-detection rule used by WFC is that language codes should be no longer than 5 characters. When a language code of more than 5 characters is encountered during a TM reorganisation, it indicates that something is amiss with that particular TU, and the TU is assumed to be faulty. WFC does not halt on faulty TUs; it ignores them.
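The two checks described above are simple enough to sketch in a few lines of Python. The function names are hypothetical, and this is only an illustration of the stated rules, not WFC's actual code.

```python
from typing import Iterable, List


def is_faulty_tu(line: str) -> bool:
    """Apply the two stated checks: tabulator count and language code length."""
    body = line.rstrip("\r\n")
    if body.count("\t") < 6:
        return True
    fields = body.split("\t")
    # Fields 4 and 6 (indices 3 and 5) hold the source and target language codes.
    return len(fields[3]) > 5 or len(fields[5]) > 5


def keep_valid_tus(lines: Iterable[str]) -> List[str]:
    """Skip faulty TUs silently instead of stopping on them."""
    return [line for line in lines if not is_faulty_tu(line)]
```

Like WFC, `keep_valid_tus` does not raise an error on a malformed line; it simply leaves it out of the result.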
Remarks:
The date does not necessarily have a tilde (~) separating date and time. Any printable character can be used there, except a digit. WFC uses the tilde (~), the equal sign (=), and the star sign (*). The equal sign means the TU was "marked" (flagged) by WFC's data editor. This has no consequence on the TU's status: it remains fully valid. Although WFC always records the date and time when writing a TU, the date and time are optional and can be empty (or even hold an invalid date), in which case WFC simply assumes the computer's current date and time, or the previous TU's date incremented by one second if it is in a sequential loop. Dates and times are "local", taken from the local computer's clock.
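A short Python sketch of reading the date field with an arbitrary separator follows. The names are hypothetical, and the regular expression accepts any single non-digit character as separator, which is slightly broader than "any printable character except a digit".

```python
import re
from typing import Optional, Tuple

# Eight digits, one non-digit separator, six digits.
STAMP_RE = re.compile(r"^(\d{8})(\D)(\d{6})$")


def split_stamp(stamp: str) -> Optional[Tuple[str, str, str]]:
    """Return (date, separator, time), or None if the field does not fit."""
    match = STAMP_RE.match(stamp)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


def is_marked(stamp: str) -> bool:
    """True when the separator is '=', i.e. the TU was flagged in the data editor."""
    parts = split_stamp(stamp)
    return parts is not None and parts[1] == "="
```

An empty or otherwise non-matching date field yields `None`, leaving the caller free to apply the fallback described above.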
If any optional field is left empty, its trailing tabulator should be present. For a TU to be valid, there must be at least six tabulators, with the fifth field (the source segment, located between the fourth and the fifth tabulator) made of at least one printable character.
The date's first character (a digit from 0 to 9, usually a 2 if the TU was created in the current millennium) can be "x". It means that this TU is no longer valid: WFC marked it for future deletion. The first full reorganisation of the TM by WFC will erase this TU. Do not remove the "x", or replace it with a digit, unless you know what you are doing.
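For scripts that read a TM outside WFC, a minimal Python sketch of recognising such TUs could look like this. The function names are hypothetical; the code only reads the marker and never rewrites it.

```python
from typing import Iterable, List


def is_marked_for_deletion(line: str) -> bool:
    """True when the date field (the first field) starts with 'x'."""
    date_field = line.split("\t", 1)[0]
    return date_field.startswith("x")


def live_tus(lines: Iterable[str]) -> List[str]:
    """Keep only the TUs that are not marked for future deletion."""
    return [line for line in lines if not is_marked_for_deletion(line)]
```

The TM header is unaffected by this check, since each of its fields begins with a % mark.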