Sample lexica

We have compiled two sample lexica for demonstration use. One contains Dublin street names, transcribed in RP English; the other contains Stockholm street names, transcribed in standard Swedish.

The samples were produced with no specific application in mind. This is not how STTS usually works, since we prefer to formulate, implement and follow phonetic and other guidelines as strictly as possible, and such guidelines are typically tied to a specific application domain. The samples were produced from fresh data, in the sense that STTS has not previously transcribed this data, either for internal use or for any customer.

Symbol set

We try to follow the SAMPA/SAMPROSA conventions as far as possible, but we use a space character, / /, as the phoneme delimiter. /$/ is used as the syllable delimiter, and each word is delimited by /#/.
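
As an illustration of how these delimiters nest, here is a minimal Python sketch that breaks one transcription into words, syllables and phonemes. The function name split_transcription and the example string are invented for this illustration and are not taken from the sample lexica. The sketch also assumes that /#/ may appear either between words or around them, so empty pieces are simply dropped.

```python
def split_transcription(transcription):
    """Split a transcription into words -> syllables -> phonemes.

    Assumes '#' delimits words, '$' delimits syllables and
    whitespace delimits phonemes. Empty pieces (for example from a
    leading or trailing '#') are discarded.
    """
    words = []
    for word in transcription.split("#"):
        # str.split() with no argument splits on any run of whitespace.
        syllables = [syllable.split() for syllable in word.split("$")]
        syllables = [phonemes for phonemes in syllables if phonemes]
        if syllables:
            words.append(syllables)
    return words


# Hypothetical transcription, made up for this example only.
example = "h E $ l @U # w 3: l d"
for word in split_transcription(example):
    print(word)
```

Each word comes back as a list of syllables, and each syllable as a list of phoneme symbols, so multi-character symbols such as "@U" stay intact.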

File format

The plain-text file format of the sample lexica (generated from an internal database) looks like this:

 <ORTHOGRAPHY>(<TAB><COMMENT>)?
 <TAB><TRANSCRIPTION>
  ...  

The orthography starts a new line, and is followed by an optional tab-separated comment. One or more transcriptions then follow on lines starting with a tab.

The first transcription following an entry should be considered the preferred pronunciation, followed by zero or more variants.
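
The format described above could be read with a few lines of Python. This is a sketch under stated assumptions rather than a reference implementation: the names parse_lexicon and preferred are hypothetical, the sample entry is invented and does not come from the lexica, and blank lines are assumed to carry no meaning.

```python
def parse_lexicon(lines):
    """Parse lexicon lines into a list of entry dictionaries.

    A line that starts with a tab is a transcription belonging to the
    most recent orthography line. Any other non-blank line starts a
    new entry: orthography, optionally followed by a tab and a comment.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if line.startswith("\t"):
            if not entries:
                raise ValueError("transcription before any orthography line")
            entries[-1]["transcriptions"].append(line[1:].strip())
        else:
            orthography, _, comment = line.partition("\t")
            entries.append({
                "orthography": orthography,
                "comment": comment or None,
                "transcriptions": [],
            })
    return entries


def preferred(entry):
    """Return the first listed transcription, or None if there is none."""
    transcriptions = entry["transcriptions"]
    return transcriptions[0] if transcriptions else None


# Invented sample entry, for illustration only.
sample = [
    "Main Street\ta comment\n",
    "\tm eI n # s t r i: t\n",
    "\tm E n # s t r i: t\n",
]
entries = parse_lexicon(sample)
print(entries[0]["orthography"], "->", preferred(entries[0]))
```

To read a real file, the same function can be given an open file object, since iterating over a text file yields its lines. The encoding of the sample files is not stated here, so it would need to be checked before opening them.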

If an entry consists of several words that all have multiple transcriptions, all possible combinations have been generated. For example, if an entry consists of two words, one of which has three pronunciations and the other has two, there will be six transcriptions of this item.
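
The combination step is a Cartesian product over the per-word variant lists. The following Python sketch shows the counting with placeholder labels instead of real transcriptions. The function name combine and the choice to join words with " # " are assumptions made for this example; the internal tool that generated the lexica may order or join the results differently.

```python
from itertools import product


def combine(word_variants, word_delimiter=" # "):
    """Return every combination of per-word transcription variants.

    word_variants is a list with one inner list per word, each inner
    list holding that word's transcriptions. If each inner list puts
    the preferred variant first, the first combination returned is
    built from the preferred variant of every word.
    """
    return [word_delimiter.join(combo) for combo in product(*word_variants)]


# Placeholder labels standing in for transcriptions:
# three variants of the first word, two of the second.
first_word = ["A1", "A2", "A3"]
second_word = ["B1", "B2"]

combinations = combine([first_word, second_word])
print(len(combinations))  # 3 * 2 = 6
for transcription in combinations:
    print(transcription)
```

The number of combinations is the product of the variant counts, so entries with many multi-variant words grow quickly.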