Zipf and Type-Token rules for the English, Spanish, Irish and Latin languages

Research output: Contribution to journal, Article, peer-reviewed

Abstract

The Zipf curves of log frequency against log rank for a large English corpus (500 million word tokens, 689,000 word types) and a large Spanish corpus (16 million word tokens, 139,000 word types) are shown to have the usual slope close to –1 for ranks below 5,000, but at higher ranks they turn to give a slope close to –2. This appears to be due mainly to foreign words and place names. Zipf curves for two highly inflected Indo-European languages, Irish and ancient Latin, are also given. Because these languages have more word types per lemma, their curves remain flatter than the English curve, maintaining a slope of –1 until turning points at about rank 30,000 for Irish and 10,000 for Latin. A formula that calculates the number of tokens given the number of types is derived in terms of the rank at the turning point: 5,000 for both English and Spanish, 30,000 for Irish and 10,000 for Latin.
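
The two-slope shape described in the abstract can be illustrated with a minimal Python sketch. It is not the article's method or formula: the function names (`loglog_slope`, `two_regime_freqs`), the synthetic frequency model and the parameter values are all assumptions made for illustration. The sketch builds an idealised rank-frequency list that falls as 1/rank up to a turning point and as 1/rank² beyond it, then estimates the log-log slope on each side by least squares.

```python
import math


def loglog_slope(freqs, lo, hi):
    """Least-squares slope of log(frequency) against log(rank).

    freqs is sorted from most to least frequent; lo and hi are
    1-indexed ranks, inclusive.
    """
    xs = [math.log(r) for r in range(lo, hi + 1)]
    ys = [math.log(freqs[r - 1]) for r in range(lo, hi + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def two_regime_freqs(n_types, turn, top_freq):
    """Hypothetical idealised curve (an assumption, not corpus data):
    f(r) = top_freq / r for r <= turn, top_freq * turn / r**2 beyond."""
    return [
        top_freq / r if r <= turn else top_freq * turn / r ** 2
        for r in range(1, n_types + 1)
    ]


# Illustrative values only; the turning point of 5,000 echoes the
# English and Spanish figure quoted in the abstract.
freqs = two_regime_freqs(n_types=50_000, turn=5_000, top_freq=1e6)
head = loglog_slope(freqs, 1, 5_000)
tail = loglog_slope(freqs, 5_001, 50_000)
print(f"slope below turning point: {head:.3f}")
print(f"slope above turning point: {tail:.3f}")
```

With real corpus counts, `freqs` would be the word-type frequencies sorted in descending order; the fitted slopes would then only be close to –1 and –2, since real frequencies are integers with many ties at the low-frequency end.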
Original language: English
Pages (from-to): 1-12
Number of pages: 12
Journal: Web Journal of Formal, Computational and Cognitive Linguistics
Volume: 1 (8)
Publication status: Published - Jan 2006
