Extending Zipf’s law to n-grams for large corpora

Le Quan Ha, Philip Hanna, Ming Ji, F.J. Smith

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)
8 Downloads (Pure)

Abstract

Experiments show that, for a large corpus, Zipf’s law does not hold for all ranks of words: the frequencies fall below those predicted by Zipf’s law for ranks greater than about 5,000 word types in English and about 30,000 word types in the inflected languages Irish and Latin. Nor does it hold for syllables or words in the syllable-based languages Chinese and Vietnamese. However, when single words are combined with word n-grams in one list and put in rank order, the frequency of tokens in the combined list extends Zipf’s law with a slope close to -1 on a log-log plot in all five languages. Further experiments have demonstrated the validity of this extension of Zipf’s law to n-grams of letters, phonemes or binary bits in English. It is shown theoretically that probability theory alone can predict this behavior in randomly created n-grams of binary bits.
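
The combined-list construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`ngrams`, `combined_rank_frequency`, `loglog_slope`), the cutoff `max_n=3`, and the use of an ordinary least-squares fit are all assumptions made here for illustration. The sketch counts single tokens and n-grams in one table, sorts the frequencies into rank order, and estimates the slope of log frequency against log rank.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_rank_frequency(tokens, max_n=3):
    """Count all 1..max_n-grams in one table; return frequencies in rank order.

    max_n is an illustrative assumption, not a value taken from the paper.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two frequencies, all positive, already in rank order.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Calling `loglog_slope(combined_rank_frequency(tokens))` on a tokenized corpus gives the fitted slope; a value near -1 would correspond to the behavior the abstract reports. The tokens can be words, syllables, letters, phonemes or bits, since the functions only assume a sequence of hashable items.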
Original language: English
Pages (from-to): 101-113
Number of pages: 13
Journal: Artificial Intelligence Review
Volume: 32
Issue number: 1-4
Publication status: Published - Dec 2009

ASJC Scopus subject areas

  • Artificial Intelligence
