Comparative Study of Feature Selection Methods for Korean Web Documents Clustering

Journal of the Korean Society for Library and Information Science, (P)1225-598X; (E)2982-6292
2005, v.39 no.1, pp.45-58

Abstract

This paper compares feature selection methods for clustering Korean web documents. First, we examined how term features and co-links of web documents affect clustering performance. We clustered web documents using native term features, co-links, and both combined, and compared the resulting clusters with the categories originally assigned to the documents. Next, we selected term features for each category from training documents using the chi-square statistic (χ²), Information Gain (IG), and Mutual Information (MI), and applied these features to a separate set of experimental documents. In addition, we propose a new method, Max Feature Selection, which selects the terms that have the maximum count for a category in each experimental document, assigns each term its χ² (or MI or IG) value in place of its term frequency in the document, and then clusters the documents. In the results, χ² performed better than IG or MI, but the difference was slight. When Max Feature Selection was applied, however, clustering performance improved notably. Max Feature Selection is a simple but effective means of feature space reduction and performs strongly for Korean web document clustering.
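
As an illustration of the three term-scoring measures named in the abstract, the sketch below computes the chi-square statistic (written X2 above), Mutual Information, and Information Gain for a single term and category from a 2x2 table of document counts. It follows the common textbook definitions and is not taken from the paper: the function name `term_category_scores`, the count parameters `a`, `b`, `c`, `d`, and the two-class form of Information Gain (category versus all other documents) are assumptions made for this example.

```python
import math


def term_category_scores(a, b, c, d):
    """Score one (term, category) pair from document counts (hypothetical helper).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d

    # Chi-square statistic for the 2x2 table; 0.0 when a margin is empty.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi2 = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    # Pointwise mutual information between term presence and the category.
    mi = math.log2(a * n / ((a + c) * (a + b))) if a else float("-inf")

    # Information gain, assumed here in its two-class form: the expected
    # mutual information summed over the four cells of the table.
    def cell(joint, row_total, col_total):
        if joint == 0:
            return 0.0
        return (joint / n) * math.log2(joint * n / (row_total * col_total))

    ig = (cell(a, a + b, a + c) + cell(b, a + b, b + d)
          + cell(c, c + d, a + c) + cell(d, c + d, b + d))

    return {"chi2": chi2, "mi": mi, "ig": ig}


# Example: a term concentrated in one category out of 100 documents.
scores = term_category_scores(a=40, b=10, c=10, d=40)
```

Ranking the terms of a category by one of these scores and keeping the highest-scoring ones is one way to obtain a reduced feature set. How the paper combines such scores with Max Feature Selection is described only at the level of the abstract, so that step is not sketched here.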

Keywords

Clustering, Feature Selection Methods, Korean Web Documents, Max Feature Selection

