Skip to content
Harri Pitkänen edited this page Feb 12, 2016 · 11 revisions

Varissuo finite state transducer file format

Header

The file format has four variants: unweighted and weighted transducers with little endian and big endian byte order.

  • The little endian file starts with the following 8 byte sequence (magic number): 6E 3A 01 00 FA 51 03 00

  • Big endian file starts with 00 01 3A 6E 00 03 51 FA

  • The byte after the magic number specifies whether the transducer is unweighted (0x00) or weighted (0x01).

  • The following 7 bytes must be zeros. These are reserved for identifying the features of future format extensions.

Symbol list

  • After header there is a 16 bit unsigned integer specifying the number of symbols (including the epsilon symbol) in the transducer.
  • This is followed by a list of null terminated UTF-8 strings containing the string representations of each symbol. The order of symbols in this lists specifies the integer that is used to label the symbol in the rest of the transducer.
  • After last symbol file is padded with zeros up to a complete 8 byte boundary (16 byte boundary for weighted transducers).

Symbols are sorted so that epsilon symbol (0) is followed by flag diacritics, then normal characters (single Unicode codepoints) and finally multichar tags. String representation of flag diacritics must be in format @...@ and multichar tags must be in format [...]. Tags are not allowed in the input alphabet and no other types of multichar symbols are allowed at all (some exceptions will be made for representing special characters such as spaces).

Transition list

VFST format has no explicit representation for states. The transducer is built from two kinds of data cells.

Unweighted cell types (8 bytes)

  • Transition cell:

    • symIn: Input symbol, 2 bytes. 0xFFFF is reserved for signaling final state.
    • symOut: Output symbol, 2 bytes. 0xFFFF is reserved for future use.
    • targetState: Cell offset of target state, 3 bytes
    • moreTransitions: Number of additional outgoing transitions from this state, 1 byte.
  • Overflow cell:

    • moreTransitions: Number of additional outgoing transitions from this state, 4 bytes.
    • 4 bytes of padding (reserved for future use)

Weighed cell types (16 bytes)

  • Transition cell:

    • symIn: Input symbol, 4 bytes. 0xFFFFFFFF is reserved for signaling final state.
    • symOut: Output symbol, 4 bytes. 0xFFFFFFFF is reserved for future use.
    • targetState: Cell offset of target state, 4 bytes
    • weight: Transition weight, 2 bytes (signed integer)
    • moreTransitions: Number of additional outgoing transitions from this state, 1 byte.
    • 1 byte reserved for future use.
  • Overflow cell:

    • moreTransitions: Number of additional outgoing transitions from this state, 4 bytes.
    • 12 bytes of padding (reserved for future use)

The first cell in transition list is a Transition cell. It is the "state head" of the initial state in the transducer. In the normal case state head is not much different from other Transition cells. symIn and symOut contain the codes for input and output symbol for a transition leaving the state. targetState points to the "state head" of transition target state (the actual address of transition target is calculated as "state head of initial state" + targetState * sizeof(Transition)).

In case there are more than one transition leaving from a state, moreTransitions contains the number (1-254) of such additional transitions. If there are more than 254 additional transitions needed (this is rare in practice) moreTransitions is set to 0xFF and the state head is immediately followed by an Overflow cell from which the actual number of additional transitions can be read.

State head is then followed by the specified number of additional Transition cells (if any). Those will always have moreTransitions=0 (number of extra transitions was already given either in state head or in a Overflow cell following the state head).

Final states are slightly different: in those the state head has symIn set to 0xFFFF (or 0xFFFFFFFF) and all outgoing transitions are left to following cells. Even then the state head specifies moreTransitions as usual. For weighted transducers the final weight of a final state is specified in the weight field of the state head.

Order of transitions in transition lists

As of voikkovfstc version 4.0.2 the following sort order is used for the transitions according to input symbol:

  1. Pure epsilon transitions
  2. Transition signaling the final state
  3. Other symbols sorted according to their ordinal number Because this ordering was not implemented in the first stable releases of voikkovfstc the library does not rely on it for the unweighted transducers. For weighted transducers the order is used to avoid linear lookup (the same will be done for unweighted transducers at the next incompatible format version bump).

Since the sorting can only work in one direction slightly different lookup algorithms are needed for morphological analysis and generation (generation algorithm has not yet been implemented).

Implications

  • Unweighted format is compact but supports only relatively small transducers. 3 bytes for target state address limits the number of Transition and Overflow cells to about 16 million leading to around 128 MB upper limit for the total transducer size.
  • Weighted transducers may contain over 4 billion transitions and the upper limit for transducer size is around 64 GB.
  • Non-final states without outgoing transitions are impossible to represent in this format. This won't be a problem if we require that transducers must be made co-accessible before feeding them to VFST compiler.