The Moby Pronunciator II contains 177,267 entries with corresponding pronunciations. Most of the entries describe a single word, but approximately 79,000 contain hyphenated or multiple-word phrases, names, or lexemes. The Project Gutenberg distribution also contains a copy of the
cmudict v0.3. The Pronunciator file contains lines of the format
word[/part-of-speech] pronunciation. Each line ends with the ASCII carriage return character (CR, '\r', 0x0D, 13 in decimal). The
word field can include apostrophes (e.g. "isn't"), hyphens (e.g. "able-bodied"), and multiple words separated by underscores. Non-English words are generally rendered, as stated in the documentation, without accents or other diacritical marks. However, in 36 entries, some non-ASCII accented characters remain, represented using
Mac OS Roman encoding.

The part-of-speech field is used to disambiguate 770 of the words that have differing pronunciations depending on their part of speech. For example, for the words spelled "close", the verb has the pronunciation /kloʊz/, whereas the adjective is /kloʊs/.

The parts of speech have been assigned the following codes:

The pronunciation follows the word and its optional part-of-speech field. Several special symbols are present:

The rest of the symbols are used to represent
IPA characters. The pronunciations are generally consistent with a General American dialect of English that exhibits the father-bother merger, the hurry-furry merger and the lot-cloth split, but does not exhibit the cot-caught merger or the wine-whine merger.

Each phoneme is represented by a sequence of one or more characters. Some of the sequences are delimited with a slash character "/", as shown in the following table, but note that one of the sequences is delimited by
two slash characters at either end:

To this collection are added a number of extra sequences representing phonemes found in several other languages. These are used to encode the non-English words, phrases and names that are included in the database. The following table contains these extra phonemes, but note that it is not clear to what extent some of them exist only because of encoding errors.

== Shakespeare ==