The Zipf curves of log of frequency against log of rank for a large English corpus of 500 million word tokens, 689,000 word types and for a large Spanish corpus of 16 million word tokens, 139,000 word types are shown to have the usual slope close to –1 for rank less than 5,000, but then for a higher rank they turn to give a slope close to –2. This is apparently mainly due to foreign words and place names. Other Zipf curves for highlyinflected Indo-European languages, Irish and ancient Latin, are also given. Because of the larger number of word types per lemma, they remain flatter than the English curve maintaining a slope of –1 until turning points of about ranks 30,000 for Irish and 10,000 for Latin. A formula which calculates the number of tokens given the number of types is derived in terms of the rank at the turning point, 5,000 for both English and Spanish, 30,000 for Irish and 10,000 for Latin.
|Number of pages||12|
|Journal||Web Journal of Formal, Computational and Cognitive Linguistics|
|Publication status||Published - Jan 2006|