Automatic acquisition of sense-tagged corpora


The knowledge acquisition bottleneck is perhaps the major impediment to solving the word sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is only sparsely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requirement that has so far been met only for a handful of words for testing purposes, as in the Senseval exercises.

Existing methods

Therefore, one of the most promising trends in WSD research is using the largest corpus ever available, the World Wide Web, to acquire lexical information automatically.[1] WSD has traditionally been understood as an intermediate language engineering technology that could improve applications such as information retrieval (IR). In this case, however, the reverse also holds: Web search engines implement simple and robust IR techniques that can be successfully exploited when mining the Web for information to be used in WSD.

The most direct way of using the Web (and other corpora) to enhance WSD performance is the automatic acquisition of sense-tagged corpora, the fundamental resource for feeding supervised WSD algorithms. Although this is far from commonplace in the WSD literature, a number of different and effective strategies for achieving this goal have already been proposed.
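One such strategy, the monosemous relatives approach of Mihalcea and Moldovan,[11] queries a search engine for unambiguous synonyms or close relatives of each sense and labels the retrieved snippets with that sense. The following is a minimal sketch only: the mini-lexicon and the `fetch_snippets` stub are illustrative stand-ins for WordNet and a real search engine.

```python
# Sketch of the "monosemous relatives" acquisition strategy.
# MONOSEMOUS_RELATIVES and fetch_snippets are toy stand-ins for
# WordNet and a Web search engine, respectively.

# Each sense of an ambiguous word lists monosemous relatives:
# phrases that occur with only one meaning, so any hit is unambiguous.
MONOSEMOUS_RELATIVES = {
    ("bass", "fish"): ["sea bass", "freshwater bass"],
    ("bass", "music"): ["bass guitar", "basso"],
}

def fetch_snippets(query):
    """Stand-in for a search-engine call; returns canned snippets."""
    canned = {
        "sea bass": ["They grilled the sea bass with lemon."],
        "freshwater bass": ["Anglers caught freshwater bass in the lake."],
        "bass guitar": ["She plays bass guitar in a jazz trio."],
        "basso": ["The basso sang the final aria."],
    }
    return canned.get(query, [])

def acquire_sense_tagged_corpus(relatives):
    """Query each monosemous relative and tag the hits with its sense."""
    corpus = []
    for (word, sense), queries in relatives.items():
        for q in queries:
            for snippet in fetch_snippets(q):
                # Replace the relative with the target word, so the
                # example reads like a normal occurrence of that word.
                text = snippet.replace(q, word)
                corpus.append((word, sense, text))
    return corpus

corpus = acquire_sense_tagged_corpus(MONOSEMOUS_RELATIVES)
for word, sense, text in corpus:
    print(sense, "|", text)
```

Each retrieved snippet becomes a training example for the sense whose relative matched it; no human annotation is involved, which is exactly what makes the approach attractive and its error-prone assignments a research issue.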

Summary

Optimistic results

Of the approaches reviewed, the automatic extraction of examples to train supervised learning algorithms has been, by far, the best explored way of mining the Web for word sense disambiguation. Some results are certainly encouraging:

  • In some experiments, the quality of Web data for WSD equals that of human-tagged examples. This is the case for the monosemous relatives plus bootstrapping with SemCor seeds technique[2] and for the examples taken from the ODP Web directories.[3] In the first case, however, SemCor-size example seeds are necessary (and only available for English), and the technique has only been tested on a very limited set of nouns; in the second case, coverage is quite limited, and it is not yet clear whether it can be extended without compromising the quality of the retrieved examples.
  • It has been shown[4] that a mainstream supervised learning technique trained exclusively on Web data can obtain better results than all of the unsupervised WSD systems that participated in Senseval-2.
  • Web examples made a significant contribution to the best Senseval-2 English all-words system.[5]

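To make the supervised setup above concrete, here is a minimal bag-of-words Naive Bayes word sense classifier of the kind such experiments train. The four training sentences are toy stand-ins for web-acquired sense-tagged examples, not data from the cited studies.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for web-acquired sense-tagged examples of "bass".
TRAIN = [
    ("fish",  "caught a large bass in the lake"),
    ("fish",  "the bass swam near the river bank"),
    ("music", "the bass line drives the song"),
    ("music", "he tuned his bass before the concert"),
]

def train_nb(examples):
    """Multinomial Naive Bayes with add-one smoothing over bags of words."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, text in examples:
        sense_counts[sense] += 1
        for w in text.split():
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def classify(context, sense_counts, word_counts, vocab):
    """Pick the sense maximizing log P(sense) + sum log P(word|sense)."""
    total = sum(sense_counts.values())
    best, best_lp = None, float("-inf")
    for sense, n in sense_counts.items():
        lp = math.log(n / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context.split():
            lp += math.log((word_counts[sense][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = sense, lp
    return best

model = train_nb(TRAIN)
print(classify("bass played in the song", *model))   # prints "music"
```

Swapping the toy `TRAIN` list for thousands of automatically retrieved examples per sense is precisely the experimental setting the results above describe; the learner itself is standard.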
Difficulties

There are, however, several open research issues related to the use of Web examples in WSD:

  • High precision in the retrieved examples (i.e., correct sense assignments for the examples) does not necessarily lead to good supervised WSD results (i.e., the examples are possibly not useful for training).[6]
  • The most complete evaluation of Web examples for supervised WSD[7] indicates that learning with Web data improves over unsupervised techniques, but the results are nevertheless far from those obtained with hand-tagged data, and do not even beat the most-frequent-sense baseline.
  • Results are not always reproducible; the same or similar techniques may lead to different results in different experiments. Compare, for instance, Mihalcea (2002[8]) with Agirre and Martínez (2004[9]), or Agirre and Martínez (2000[10]) with Mihalcea and Moldovan (1999[11]). Results with Web data seem to be very sensitive to small differences in the learning algorithm, to when the corpus was extracted (search engines change continuously), and to small heuristic decisions (e.g., differences in the filters used to discard part of the retrieved examples).
  • Results are strongly dependent on bias (i.e., on the relative frequencies of examples per word sense).[12] It is unclear whether this is simply a problem of Web data, an intrinsic problem of supervised learning techniques, or just a problem of how WSD systems are evaluated (indeed, testing on the rather small Senseval datasets may give undue weight to sense distributions that differ from those of the full Web as corpus).
  • In any case, Web data has an intrinsic bias, because queries to search engines directly constrain the context of the examples retrieved. There are approaches that alleviate this problem, such as using several different seeds/queries per sense[13] or assigning senses to Web directories and then scanning directories for examples;[14] but this problem is nevertheless far from being solved.
  • Once a Web corpus of examples has been built, it is not entirely clear whether it can be redistributed legally.
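The bias issue can be made concrete with a toy calculation (the sense distributions below are illustrative, not measured): a most-frequent-sense classifier inherits the prior of whatever corpus it was trained on, so a skew in Web-mined data translates directly into test error when the test set has a different skew.

```python
from collections import Counter

# Toy sense-tagged data; the distributions are illustrative only.
web_train = [("music",)] * 80 + [("fish",)] * 20   # web-mined prior: music-heavy
senseval_test = ["fish"] * 60 + ["music"] * 40     # test prior: fish-heavy

# Most-frequent-sense (MFS) baseline: always predict the sense
# that is most common in the training data.
mfs = Counter(s for (s,) in web_train).most_common(1)[0][0]
accuracy = sum(1 for gold in senseval_test if gold == mfs) / len(senseval_test)
print(mfs, accuracy)   # prints: music 0.4
```

With matched priors the same baseline would score 0.6 here; the 20-point swing comes entirely from the mismatch between training and test sense distributions, which is the effect Agirre and Martínez isolate.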

Future

Besides automatic acquisition of examples from the Web, there are some other WSD experiments that have profited from the Web:

  • The Web as a social network has been successfully used for cooperative annotation of a corpus (OMWE, Open Mind Word Expert project),[15] which has already been used in three Senseval-3 tasks (English, Romanian and Multilingual).
  • The Web has been used to enrich WordNet senses with domain information: topic signatures[16] and Web directories,[17] which have in turn been successfully used for WSD.
  • Also, some research has benefited from the semantic information that Wikipedia maintains on its disambiguation pages.[18][19]
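A disambiguation page enumerates the candidate meanings of a title as links, so it can serve as a ready-made sense inventory. A minimal sketch of extracting that inventory from raw wikitext (the trimmed page below is a hand-made example, and real pages would be fetched via the MediaWiki API):

```python
import re

# Hand-made toy wikitext of a disambiguation page; real pages are
# fetched via the MediaWiki API and are messier than this.
DISAMBIG_WIKITEXT = """\
'''Bass''' may refer to:
* [[Bass (fish)]], the common name of several fish species
* [[Bass guitar]], a stringed instrument
* [[Bass (voice type)]], the lowest male singing voice
"""

def candidate_senses(wikitext):
    """Collect linked article titles from disambiguation list items."""
    senses = []
    for line in wikitext.splitlines():
        if line.startswith("*"):
            # Grab the link target: text after [[ up to | or ]].
            m = re.search(r"\[\[([^\]|]+)", line)
            if m:
                senses.append(m.group(1))
    return senses

print(candidate_senses(DISAMBIG_WIKITEXT))
# prints: ['Bass (fish)', 'Bass guitar', 'Bass (voice type)']
```

Each extracted title names a sense, and the article behind it supplies context words for that sense, which is the raw material the cited Wikipedia-based WSD work builds on.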

Most research opportunities, however, remain largely unexplored. For instance, little is known about how to use lexical information extracted from the Web in knowledge-based WSD systems; and it is also hard to find systems that use Web-mined parallel corpora for WSD, even though efficient algorithms that use parallel corpora for WSD already exist.
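The parallel-corpus idea rests on the observation that an ambiguous word's translation often pins down its sense: Spanish renders the fish sense of English "bass" with a different word than the music sense. A sketch of projecting senses through word-aligned sentence pairs (the alignments and the translation-to-sense table are hand-made toys; a real system would derive both from a bilingual corpus and lexicon):

```python
# Project senses through translations in a word-aligned parallel corpus.
# Toy data: (English sentence, aligned translation of "bass") pairs;
# the translation->sense table stands in for a bilingual lexicon.
TRANSLATION_SENSE = {
    "lubina": "fish",   # Spanish word for the fish
    "bajo":   "music",  # Spanish word for the instrument
}

ALIGNED = [
    ("he caught a bass at dawn",      "lubina"),
    ("the bass rattled the speakers", "bajo"),
]

def project_senses(aligned, table):
    """Label each English occurrence with the sense its translation implies."""
    return [(sent, table[trans]) for sent, trans in aligned]

for sent, sense in project_senses(ALIGNED, TRANSLATION_SENSE):
    print(sense, "|", sent)
```

Every aligned pair thus yields a sense-tagged English example for free, which is why mining the Web for parallel text is a natural, if underexplored, complement to the monolingual acquisition strategies above.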

References

  1. Kilgarriff, A. & G. Grefenstette. 2003. Introduction to the special issue on the Web as corpus. Computational Linguistics, 29(3).
  2. Mihalcea, Rada. 2002b. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.
  3. Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.
  4. Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.
  5. Mihalcea, Rada. 2002a. Word sense disambiguation with pattern learning and automatic feature selection. Natural Language Engineering, 8(4): 348–358.
  6. Agirre, Eneko & David Martínez. 2000. Exploring automatic word sense disambiguation with decision lists and the Web. Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Annotation, Luxembourg, 11–19.
  7. Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.
  8. Mihalcea, Rada. 2002b. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.
  9. Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.
  10. Agirre, Eneko & David Martínez. 2000. Exploring automatic word sense disambiguation with decision lists and the Web. Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Annotation, Luxembourg, 11–19.
  11. Mihalcea, Rada & Dan Moldovan. 1999. An automatic method for generating sense tagged corpora. Proceedings of the American Association for Artificial Intelligence (AAAI), Orlando, U.S.A., 461–466.
  12. Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.
  13. Mihalcea, Rada. 2002b. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.
  14. Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.
  15. Chklovski, Tim & Rada Mihalcea. 2002. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of the ACL SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, U.S.A., 116–122.
  16. Agirre, Eneko, Olatz Ansa, Eduard H. Hovy & David Martínez. 2000. Enriching very large ontologies using the WWW. Proceedings of the Ontology Learning Workshop, European Conference on Artificial Intelligence (ECAI), Berlin, Germany.
  17. Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.
  18. Turdakov, Denis & Pavel Velikhov. 2008. Semantic relatedness metric for Wikipedia concepts based on link analysis and its application to word sense disambiguation. Proceedings of SYRCoDIS.
  19. Turdakov, Denis. 2009. Word sense disambiguation of Wikipedia terms based on a hidden Markov model. Proceedings of the XI All-Russian Scientific Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections". (In Russian.)