A Study on Feature Selection for kNN Classifier using Document Frequency and Collection Frequency

Journal of Korean Library and Information Science Society, (P)2466-2542
2013, v.44 no.1, pp.27-47
https://doi.org/10.16981/kliss.44.1.201303.27

Abstract

This study investigated the classification performance of a kNN classifier under feature selection methods based on document frequency (DF) and collection frequency (CF). Experiments on the HKIB-20000 collection yielded three findings. First, when high-frequency terms were selected and low-frequency terms were removed, the CF criterion achieved better classification performance than the DF criterion. Second, neither the DF nor the CF methods performed well when low-frequency terms were selected first. Finally, combining the CF and DF criteria did not improve classification performance over using either criterion alone.
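
As an illustration of the two criteria, the sketch below uses the usual definitions: DF is the number of documents that contain a term, and CF is the total number of occurrences of the term across the whole collection. It is a minimal sketch, not the procedure used in the study. The function names `frequency_tables` and `select_features`, the `min_count` and `top_k` parameters, and the toy documents are all assumptions made for this example.

```python
from collections import Counter

def frequency_tables(docs):
    """Return (df, cf) Counters for a list of tokenized documents."""
    df, cf = Counter(), Counter()
    for tokens in docs:
        cf.update(tokens)        # every occurrence counts toward CF
        df.update(set(tokens))   # each document counts once toward DF
    return df, cf

def select_features(freq, min_count=1, top_k=None):
    """Drop terms below min_count, then keep the top_k most frequent terms."""
    kept = [(term, count) for term, count in freq.items() if count >= min_count]
    # Sort by descending frequency; ties are broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    if top_k is not None:
        kept = kept[:top_k]
    return [term for term, _ in kept]

# Hypothetical toy collection (not HKIB-20000 data).
docs = [
    ["library", "index", "index", "index"],
    ["library", "catalog"],
    ["library", "catalog", "search"],
]

df, cf = frequency_tables(docs)
df_features = select_features(df, top_k=2)
cf_features = select_features(cf, top_k=2)
print(df_features)
print(cf_features)
```

In this toy collection the term "index" appears three times in a single document, so it ranks high under CF but low under DF. The two criteria can therefore select different feature sets from the same data.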

Keywords
Automatic classification, Feature selection, kNN classifier, Document frequency, Collection frequency

