Two types of ancient texts are known to modern scholars: those that have survived only in later manuscripts but whose great age is undisputed (this applies to the bulk of the Chinese, Brahmi, Greek, Latin, Hebrew and Avestan traditions), and those known from original inscriptions, papyri and other manuscripts. Counting the words in each corpus presents significant methodological challenges. In principle, every single occurrence of a word in the text is counted separately, but in the case of parallel transmission of literary texts, only a single transmission is taken into account. Just as the Book of the Dead and the Coffin Texts are included only once in the number given for Egyptian, the Greek and Latin literary works should be counted according to one manuscript only. If, on the other hand, tombs, royal inscriptions or economic documents of certain ancient languages often show a more or less identical form, this is not treated as mere "parallel transmission". Attached prepositions are counted as separate words. The definite article in Hebrew, Aramaic and Greek is the exception and is not counted separately: since it has no equivalent in most languages, its frequency would significantly affect the comparability of the numbers.
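The counting conventions described above can be illustrated with a minimal sketch. It is not the procedure used by any cited author. The `Witness` record, the `+` notation for attached prefixes, the `literary` flag and the schematic English glosses used as tokens are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Witness:
    """One surviving copy of a text (hypothetical structure for this sketch)."""
    work_id: str           # assumed identifier shared by parallel copies of one work
    tokens: tuple          # pre-segmented tokens; "+" marks an attached prefix (assumed notation)
    literary: bool = True  # parallel literary transmissions are counted once

def count_words(witnesses, uncounted_prefixes=frozenset()):
    """Count word occurrences following the conventions described in the text."""
    seen_works = set()
    total = 0
    for w in witnesses:
        if w.literary:
            # Only a single transmission of a literary work is taken into account.
            if w.work_id in seen_works:
                continue
            seen_works.add(w.work_id)
        for token in w.tokens:
            *prefixes, _host = token.split("+")
            # The host word counts once; each attached prefix counts as a
            # separate word unless it is excluded (e.g. the definite article).
            total += 1 + sum(1 for p in prefixes if p not in uncounted_prefixes)
    return total

# Schematic glosses stand in for real tokens.
sample = [
    Witness("work-A", ("in+the+house", "great")),                  # 2 + 1 words
    Witness("work-A", ("in+the+house", "great")),                  # parallel copy, skipped
    Witness("tomb-1", ("offering", "formula"), literary=False),    # 2 words
    Witness("tomb-2", ("offering", "formula"), literary=False),    # 2 words, still counted
]
total = count_words(sample, uncounted_prefixes={"the"})
print(total)
```

In this sketch the two near-identical tomb inscriptions are both counted, while the second manuscript of the literary work is not. Passing an empty `uncounted_prefixes` set would count the article as a word of its own.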
Languages with known size estimates
South Asian
• Sanskrit (Vedic Sanskrit and Classical Sanskrit)
• Indus script (3,800 items, c. 20,000 characters)
• Brahmi script
• Old Tamil
• Early Indian epigraphy and Indian epic poetry
• Kharosthi
• Pali literature
• List of historic Indian texts
Mesoamerican
• Olmec hieroglyphs
• Maya script
East Asian
• Old Chinese
• Chinese classics
• The pre-Qin corpus: a collection of ancient Chinese texts written before the Qin dynasty (221 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought.
• The pre-Han corpus: a collection of ancient Chinese texts written before the Han dynasty (202 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought.
• See the Chinese Text Project
• Chinese bronze inscriptions, Oracle bone script, Seal script, Clerical script
Central Iranian languages
• Prior to 300 AD, the
Central Iranian languages are attested mainly in Sassanid stone inscriptions in two closely related idioms, Middle Persian (Pahlavi scripts) and Inscriptional Parthian. The Middle Persian corpus (mostly 3rd, but also 4th/5th centuries) comprises 5000 words, and the Parthian corpus (3rd century) 3000 words. To what extent some of the Manichaean Middle Persian literary texts may date back to the 3rd century is difficult to estimate; Mani is said to have personally written the Shabuhragan, totaling about 5000 words. In any case, combining Middle Persian and Parthian gives over 10,000 words.
Proto-Sinaitic
• The Proto-Sinaitic script is attested in no more than about 400 letters (the number of words is unknown since the script has not been fully interpreted). The approximately contemporaneous Proto-Canaanite inscriptions are probably attested to a similar extent (ibid.).
Anatolian
• Luwian cuneiform: approx. 3000 words
• Palaic language: a few hundred words
• Hieroglyphic Luwian
• Lycian alphabet (the best attested Anatolian successor language written in alphabetic script): about 5000 words
• Lydian alphabet: 109 inscriptions comprising about 1500 words
• Phrygian alphabet: the tomb inscriptions from the 2nd and 3rd centuries AD (approx. 1000 words) and the so-called "Old Phrygian" inscriptions (fewer than 300 words)
• Carian alphabets: texts, mainly from Egypt, containing around 600 words
Old Italic
• Umbrian language: attested essentially by the sacrificial instructions of the Iguvine Tables, with 5000 words
• Oscan language (ibid.): 2000 words
• Messapic language: probably a good 1000 words (the estimate is difficult because most texts in this barely understood language do not use word separators)
• Venetic language: a few hundred words
• Faliscan language: a few hundred words
• Cisalpine Celtic inscriptions: approximately 2000 words, to which are added a number of glosses by classical authors
Iberia
• Iberian scripts (the language is more rarely written in Greek or Latin script): approx. 2500 words
• Celtiberian script: Celtic-language texts from Spain written in Iberian, but also in Latin, script (approx. 1000 words)
Africa
• Geʽez script: comparatively few inscriptions, with a total of around 1,000 words, before 300 AD. More extensive texts are known following Christianization in the 4th century.
• Libyco-Berber alphabet: over 1,000 inscriptions from the Maghreb, dated to Roman times. Most texts do not use a word separator; Peust estimates that the total number of words could be around 5,000.
• Meroitic script (Ancient Nubian): about 900 texts are known, which Peust estimates may contain approximately 10,000 words, though with some uncertainty because the word separator is not used consistently in the Meroitic script.
Aegean
• The Cretan Linear A inscriptions, which have not yet been deciphered, survive in about 2500 texts containing a total of around 20,000 characters. The total number of words can hardly be determined; Peust tentatively put it in the same order of magnitude as in Meroitic.
• In addition to the Linear A texts, there are inscriptions in Cretan hieroglyphs of a few hundred characters, and texts written in the Greek alphabet, but not in Greek, with a few dozen words.
• Cypriot syllabary: used in the first millennium BC, mostly to record Greek texts. The relevant texts comprise around 100 to 200 words.
Micro corpora
There are a significant number of ancient micro-corpus languages. Estimating the total number of attested ancient languages may be as difficult as estimating their corpus size. For example, Greek and Latin sources hand down an enormous number of foreign-language glosses, the reliability of which is not always certain.
==Preservation and curation==