Abstract: | As the volume of text data grows rapidly, the need to improve the quality of algorithms for text-related tasks is evident. Traditional models for representing documents, such as the standard vector space model (VSM), often neglect the semantic relatedness between words: they rely on exact keyword matching to measure the similarity or relatedness between segments of text. In critical tasks such as text classification and retrieval, which have been studied intensively over the past decades, this assumption of exact keyword matching is often the reason for poor performance. This thesis explores the potential of incorporating semantic relatedness between documents in several text-related applications, such as text classification, retrieval, and paraphrase recognition. Several aspects have been taken into account, including natural language processing techniques and the use of a word thesaurus, namely WordNet, in an effort to cover as many options as possible in the workflow, from analyzing and preprocessing documents to successfully embedding the semantic information in a machine-readable manner in those tasks. The results show that lexical semantic similarity can be used efficiently in the studied tasks and that it can boost their performance, opening up possibilities for more efficient algorithms in text applications. This thesis is part of research project number 03ΕΔ850/8.3.1, implemented within the framework of the Greek Reinforcement Programme of Human Research Manpower (PENED) and co-financed by Greek national and European Union funds (25% from the Greek Ministry of Development, General Secretariat of Research and Technology, and 75% from the E.U., European Social Fund).
|
---|