Abstract : | Supervised and unsupervised learning have been the focus of critical research in the areas of machine learning and artificial intelligence. In the literature, these two streams flow independently of each other, despite their close conceptual and practical connections. This dissertation demonstrates that unsupervised learning algorithms, specifically clustering, can provide valuable information about the data and help in the creation of high-accuracy text classifiers. In clustering, the aim is to extract a kind of "structure" from a given sample of objects. The reasoning is that if some structure exists in the objects, it is possible to take advantage of this information and find a short description of the data, exploiting the dependence or association between index terms and documents. This concise representation of the whole dataset can then be incorporated into the existing data representation. Using such prior knowledge about the nature of the dataset helps in building a more efficient classifier for it. This approach does not capture all the intricacies of text; however, in some domains the technique substantially improves text classification accuracy. In this vein, a study of the interaction between supervised and unsupervised learning has been carried out. We have studied and implemented models that apply clustering in multiple ways, in conjunction with classification, to construct robust text classifiers. Extensive experimentation has shown the effectiveness of using clustering to boost text classification performance. Additionally, preliminary experiments on some of the most important applications of text classification, such as Spam Mail Filtering, Spam Detection in Social Bookmarking Systems, and Sentence Boundary Disambiguation, have shown promising improvements from the proposed models.
|
---|