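This study sought ways to improve the performance of the kNN classifier by adopting a consistent strategy for feature selection and feature weighting in automatic document classification. Together with the classification algorithm, the feature selection method and the feature weighting method are important factors that determine classification performance. Previous studies have used conflicting strategies when choosing these two methods. This study evaluated feature selection results by classification performance relative to index file storage space and execution time, and obtained results that differ from those of previous studies. Selecting features with criteria that favor low-frequency features, such as mutual information, or even with inverse document frequency, proved desirable for both the effectiveness and the efficiency of the kNN classifier. Using a low-frequency-favoring measure consistently for both feature selection and feature weighting increased the processing speed of the kNN classifier by roughly three to five times without degrading classification performance.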
This study aims to find consistent strategies for feature selection and feature weighting that can improve the effectiveness and efficiency of the kNN text classifier. Feature selection criteria and feature weighting methods are as important as the classification algorithm in achieving good performance in text categorization systems. Most earlier studies chose conflicting strategies for feature selection and feature weighting. In this study, the performance of several feature selection criteria is measured with respect to the storage space required for inverted index records and the classification time. The classification experiments examine the performance of IDF as a feature selection criterion and the performance of conventional feature selection criteria, e.g. mutual information, as feature weighting methods. The results suggest that using measures that prefer low-frequency features both as the feature selection criterion and as the feature weighting method can increase classification speed by a factor of three to five without losing classification accuracy.
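The idea of using one low-frequency-favoring measure for both steps can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_df` cutoff, the cosine similarity, the similarity-weighted vote, and the toy documents are all assumptions made for illustration, and the inverted index used in the study is omitted. IDF ranks the terms for selection, and the same IDF values then weight the surviving terms.

```python
import math
from collections import Counter


def idf_scores(docs, min_df=2):
    """Return {term: idf} for terms found in at least min_df documents.

    min_df is an assumed cutoff that keeps one-off terms from dominating.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(n / c) for t, c in df.items() if c >= min_df}


def select_features(idf, k):
    """Keep the k terms with the highest IDF (lowest document frequency)."""
    ranked = sorted(idf.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def vectorize(doc, selected):
    """Weight each selected term by tf * idf, reusing the selection measure."""
    tf = Counter(t for t in doc if t in selected)
    return {t: f * selected[t] for t, f in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(query, train_vecs, labels, selected, k=3):
    """Label a query by a similarity-weighted vote of its k nearest neighbors."""
    q = vectorize(query, selected)
    scored = sorted(
        ((cosine(q, v), y) for v, y in zip(train_vecs, labels)),
        key=lambda pair: -pair[0],
    )[:k]
    votes = Counter()
    for sim, label in scored:
        votes[label] += sim
    return votes.most_common(1)[0][0]


# Toy corpus (hypothetical): "the" occurs in every document, so its IDF is 0.
docs = [
    ["stock", "market", "price", "the"],
    ["stock", "trade", "market", "the"],
    ["goal", "match", "team", "the"],
    ["team", "goal", "league", "the"],
]
labels = ["finance", "finance", "sports", "sports"]

selected = select_features(idf_scores(docs), k=4)
train_vecs = [vectorize(d, selected) for d in docs]
prediction = knn_classify(["stock", "price", "market"], train_vecs, labels, selected)
```

Because only the selected terms are stored in each vector, a smaller feature set directly shrinks the data that every similarity computation has to touch, which is the kind of storage and time saving the abstract refers to.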