Tehran Monolingual Corpus

The Tehran Monolingual Corpus (TMC) is a large-scale Persian monolingual corpus. It is suited to language modeling and related research areas in natural language processing.

The corpus is extracted from the Hamshahri Corpus and the website of the ISNA news agency. The quality of the Hamshahri Corpus text was improved for language modeling purposes through a series of tokenization and spell-checking steps.

TMC comprises more than 250 million words. The number of unique words occurring at least twice in the corpus is about 300 thousand, which is relatively good for a highly inflected language like Persian.
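The vocabulary figure above counts only word types whose frequency is two or more. As a minimal sketch of how such a count can be computed, the following Python snippet applies a frequency threshold to a token list. The function name `vocabulary_size`, the `min_freq` parameter, and the transliterated toy sentence are illustrative assumptions and are not part of TMC or its tooling.

```python
from collections import Counter


def vocabulary_size(tokens, min_freq=2):
    """Count distinct word types that occur at least `min_freq` times."""
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq >= min_freq)


# Toy, transliterated sample split on whitespace (hypothetical, not TMC data).
sample = "ketab khub ast va ketab bad ast".split()

print(vocabulary_size(sample))              # types seen at least twice
print(vocabulary_size(sample, min_freq=1))  # all distinct types
```

Dropping words that occur only once is a common way to keep misspellings and other one-off noise out of a vocabulary estimate, which is presumably why the threshold is stated.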

TMC was created by the Natural Language Processing Lab of the University of Tehran. The corpus is free for research use after obtaining permission from the corpus aggregator.

External links

TMC description page

