In text categorization, core terms of an input document are unlikely to be selected as classification features if they do not occur in the training document set. In addition, synonymous terms that express the same concept are usually treated as different features. This study aims to improve text categorization performance by integrating synonyms into a single feature and by replacing input terms that do not occur in the training document set with the most similar term in the training documents, as determined using Wikipedia. For the selection of classification features, experiments were performed in various settings that combine three conditions: the use of category information for non-training terms, the part of Wikipedia used for measuring term-term similarity, and the type of similarity measure. The categorization performance of a kNN classifier improved by 0.35 to 1.85% in F1 value in all experimental settings when each non-training term was replaced by the training term with the highest similarity above the threshold value. Although the improvement is not as large as expected, several semantic as well as structural devices of Wikipedia could be used for selecting more effective classification features.
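The replacement step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function name `replace_unseen_terms`, the toy `TOY_LINKS` data, and the Jaccard overlap used as a stand-in similarity are all hypothetical, and the abstract does not say what happens to a non-training term whose best similarity falls below the threshold (here it is simply kept unchanged).

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical toy data: each term maps to a set of related Wikipedia-style links.
# A real system would derive these from Wikipedia itself.
TOY_LINKS: Dict[str, Set[str]] = {
    "automobile": {"vehicle", "engine", "road"},
    "car": {"vehicle", "engine", "road", "wheel"},
    "bank": {"money", "loan"},
    "zebra": {"africa", "stripe"},
}


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in similarity (assumption): Jaccard overlap of the toy link sets."""
    links_a = TOY_LINKS.get(a, set())
    links_b = TOY_LINKS.get(b, set())
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0


def replace_unseen_terms(
    doc_terms: Iterable[str],
    training_vocab: Set[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Replace each non-training term with its most similar training term,
    but only when that similarity is above the threshold."""
    candidates = sorted(training_vocab)  # sorted for deterministic tie-breaking
    result: List[str] = []
    for term in doc_terms:
        if term in training_vocab or not candidates:
            result.append(term)
            continue
        best = max(candidates, key=lambda t: similarity(term, t))
        if similarity(term, best) > threshold:
            result.append(best)
        else:
            # Assumption: terms with no sufficiently similar match are kept as-is.
            result.append(term)
    return result


if __name__ == "__main__":
    terms = ["automobile", "bank", "zebra"]
    print(replace_unseen_terms(terms, {"car", "bank"}, jaccard_similarity, 0.5))
```

In this toy run, "automobile" is mapped to the training term "car" because their overlap exceeds the threshold, while "zebra" has no sufficiently similar training term and is left as it is.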