Abstract
This paper presents a machine learning approach to sarcasm detection on Twitter in two languages
– English and Czech. Although there has been some research on sarcasm detection in
languages other than English (e.g., Dutch, Italian, and Brazilian Portuguese), our work is the
first attempt at sarcasm detection in the Czech language. We created a large Czech Twitter corpus
consisting of 7,000 manually labeled tweets and provide it to the community. We evaluate
two classifiers with various combinations of features on both the Czech and English datasets.
Furthermore, we address the challenges of rich Czech morphology by examining different preprocessing
techniques. Experiments show that our language-independent approach significantly outperforms
adapted state-of-the-art methods in English (F-measure 0.947) and also represents a strong
baseline for further research in Czech (F-measure 0.582).
Original language | English |
---|---|
Title of host publication | Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers |
Pages | 213-223 |
Number of pages | 11 |
Publication status | Published - Aug 2014 |
Event | COLING 2014, the 25th International Conference on Computational Linguistics, Dublin, Ireland. Duration: 23 Aug 2014 → 29 Aug 2014 |
Conference
Conference | COLING 2014, the 25th International Conference on Computational Linguistics |
---|---|
Country/Territory | Ireland |
City | Dublin |
Period | 23/08/2014 → 29/08/2014 |