A Study on the Reclassification of Author Keywords for Automatic Assignment of Descriptors

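This study explored the possibility of automatically assigning descriptors (controlled keywords) by reclassifying the author keywords (uncontrolled keywords) provided in the search services of major domestic scholarly databases. First, we ran experiments comparing the characteristics of major machine learning classifiers and selected the optimal classifier and parameters for reclassification. Next, we trained on the author keywords assigned to domestic journal articles in the field of reading and reclassified those same articles according to what was learned, thereby assigning additional keywords. We then examined whether the documents newly added by this reclassification showed a vocabulary control effect, that is, whether articles on the same topic were collocated as they are by descriptors, which are controlled keywords. The results confirmed that reclassifying author keywords can achieve the effect of automatically assigning descriptors.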

Keywords: automatic classification, text categorization, reclassification, vocabulary control, descriptors, author keywords

Abstract

This study investigated the possibility of automatic descriptor assignment through the reclassification of author keywords in domestic scholarly databases. In the first stage, we compared the characteristics of machine learning classifiers to select the optimal classifier and parameters for reclassification. In the next stage, a classifier was trained on the author keywords assigned to selected journal articles in the field of reading, and author keywords were then automatically added to another set of relevant articles. We examined whether the reclassified author keywords had a vocabulary control effect, collocating documents on the same topic just as descriptors do. The results showed that author keyword reclassification can serve as a means of automatic descriptor assignment.
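The reclassification idea described above can be illustrated with a minimal sketch. The abstract does not name the classifier or parameters that were selected, so a simple centroid-based (Rocchio-style) classifier with cosine similarity is used here purely as a stand-in. The function names, the toy articles, and the 0.6 similarity threshold are all illustrative assumptions, not the study's actual setup.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Unit-length term-frequency vector for a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(articles):
    """Learn one prototype vector per author keyword.

    The prototype is the sum of the vectors of all articles carrying that
    keyword; dividing by the count is unnecessary because cosine similarity
    ignores scale.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for text, keywords in articles:
        vec = tf_vector(text)
        for kw in keywords:
            for term, weight in vec.items():
                sums[kw][term] += weight
    return {kw: dict(vec) for kw, vec in sums.items()}


def reclassify(articles, threshold=0.6):
    """Reclassify the training articles and add keywords that score highly.

    The threshold is an assumed value chosen for this toy example.
    """
    centroids = train_centroids(articles)
    result = []
    for text, keywords in articles:
        vec = tf_vector(text)
        added = {kw for kw, c in centroids.items()
                 if kw not in keywords and cosine(vec, c) >= threshold}
        result.append(set(keywords) | added)
    return result


def collocate(keyword_sets):
    """Group article indices by keyword to inspect the collocation effect."""
    index = defaultdict(list)
    for i, kws in enumerate(keyword_sets):
        for kw in sorted(kws):
            index[kw].append(i)
    return dict(index)


# Hypothetical toy articles: (text, author keywords).
articles = [
    ("reading instruction for children in school libraries", {"reading education"}),
    ("reading instruction programs for school children", {"reading education", "school library"}),
    ("school libraries and reading programs for children", {"school library"}),
    ("bibliotherapy sessions with depressed adults", {"bibliotherapy"}),
]

reassigned = reclassify(articles)
groups = collocate(reassigned)
```

In this toy run the first three articles end up sharing both keywords, while the unrelated fourth article keeps only its own. That is the kind of collocation a descriptor would provide, although real data would need proper tokenization, term weighting, and a tuned threshold.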


Journal of the Korean Society for Information Management (정보관리학회지), Vol. 29, No. 2