Empirical Study on Improving the Performance of Text Categorization Considering the Relationships between Feature Selection Criteria and Weighting Methods

Journal of the Korean Society for Library and Information Science, (P)1225-598X; (E)2982-6292
2005, v.39 no.2, pp.123-146

Abstract

This study aims to find consistent strategies for feature selection and feature weighting that can improve both the effectiveness and the efficiency of a kNN text classifier. Feature selection criteria and feature weighting methods are as important as the classification algorithm itself in achieving good performance in text categorization systems. Most previous studies chose conflicting strategies for feature selection criteria and weighting methods. In this study, the performance of several feature selection criteria is measured with respect to the storage space required for inverted index records and the classification time. Classification experiments are conducted to examine the performance of IDF as a feature selection criterion and the performance of conventional feature selection criteria, e.g. mutual information, as feature weighting methods. The results suggest that using measures that prefer low-frequency features both as the feature selection criterion and as the feature weighting method can increase classification speed by up to three to five times without losing classification accuracy.
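The idea of using one low-frequency-preferring measure for both steps can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names, the cutoff of 10 features, and the similarity-weighted vote are all assumptions made for illustration. IDF is used first to select features and then again, as tf × IDF, to weight them before kNN classification with cosine similarity.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Return IDF = log(N / df) for every term in the tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def select_features(idf, k):
    """Keep the k terms with the highest IDF (ties broken alphabetically)."""
    ranked = sorted(idf, key=lambda t: (-idf[t], t))
    return set(ranked[:k])


def vectorize(doc, idf, features):
    """Weight each selected term by tf * IDF; unselected terms are dropped."""
    tf = Counter(t for t in doc if t in features)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query_vec, train_vecs, labels, k=3):
    """Similarity-weighted vote among the k most similar training documents."""
    sims = sorted(
        ((cosine(query_vec, vec), label) for vec, label in zip(train_vecs, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in sims[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus; "report" occurs in every document, so its IDF is 0.
train_docs = [
    ["stock", "market", "price", "report"],
    ["stock", "trade", "profit", "report"],
    ["match", "goal", "team", "report"],
    ["team", "player", "score", "report"],
]
labels = ["finance", "finance", "sports", "sports"]

idf = idf_scores(train_docs)
features = select_features(idf, k=10)  # 11 distinct terms; the most frequent one is dropped
train_vecs = [vectorize(doc, idf, features) for doc in train_docs]

query_vec = vectorize(["stock", "price", "report"], idf, features)
prediction = knn_classify(query_vec, train_vecs, labels, k=3)
print(prediction)
```

Dropping high-frequency terms shortens the vectors that have to be compared at classification time, which is the kind of saving in index size and classification time that the abstract refers to. On a real collection, a minimum document-frequency threshold would likely be needed so that the selection is not dominated by terms occurring only once.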

keywords
Text Categorization, Automatic Classification, Feature Selection, Feature Weighting Methods, kNN Classifier
