An Analytical Study on Automatic Classification of Domestic Journal Articles Based on Machine Learning

Abstract

This study examined the factors that affect the performance of machine-learning-based automatic classification on a collection of domestic (Korean) journal articles in the field of library and information science (LIS). Specifically, for the task of automatically assigning subject categories to articles published in 「Journal of the Korean Society for Information Management」, I investigated the characteristics of the key factors (term weighting schemes, training set size, classification algorithms, and label assignment methods) through a range of experiments. The results show that it is effective to apply each factor appropriately according to the classification environment and the characteristics of the document collection, and that a fairly good level of performance can be obtained with a simpler model. In addition, multi-label classification, which assigns one or more categories to a given article, matches the real-world setting for classifying domestic journal articles. Taking this setting into account, I proposed an optimal classification model that uses a simple, fast classification algorithm and a small training set.

keywords: automatic classification, text categorization, performance factors, journal articles, Rocchio, SVM (Support Vector Machine), NB (Naïve Bayes), single-label classification, multi-label classification, machine learning
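
The abstract does not spell out the proposed model, so the following is only a minimal sketch of the general idea it describes: a simple, fast centroid (Rocchio-style) classifier with term weighting, trained on a small set and used for multi-label assignment. Everything concrete here is an assumption made for illustration: the toy documents, the label names `IR` and `LIB`, the log-TF x IDF weighting, the similarity threshold of 0.3, and the use of positive examples only when building centroids (a simplification of the full Rocchio formula, which can also subtract negative examples).

```python
import math
from collections import Counter, defaultdict


def normalise(vec):
    """Scale a sparse term->weight vector to unit (L2) length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def fit_idf(docs):
    """Inverse document frequency for every term seen in the training set."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def weigh(tokens, idf):
    """Log-TF x IDF weighting; terms unseen in training are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return normalise({t: (1.0 + math.log(c)) * idf[t] for t, c in tf.items()})


def train_centroids(vectors, label_sets):
    """One centroid per label: the normalised sum of its positive examples."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, labels in zip(vectors, label_sets):
        for label in labels:
            for t, w in vec.items():
                sums[label][t] += w
    return {label: normalise(total) for label, total in sums.items()}


def predict(tokens, idf, centroids, threshold=0.3):
    """Assign every label whose centroid is similar enough (multi-label)."""
    vec = weigh(tokens, idf)
    # Both vectors are unit length, so the dot product is the cosine similarity.
    scores = {label: sum(w * c.get(t, 0.0) for t, w in vec.items())
              for label, c in centroids.items()}
    labels = {label for label, s in scores.items() if s >= threshold}
    if not labels:  # nothing passed the threshold: fall back to the best label
        labels = {max(scores, key=scores.get)}
    return labels, scores


# Hypothetical toy training set (already tokenised); the last article has two labels.
train_docs = [
    ["information", "retrieval", "query", "ranking"],
    ["retrieval", "index", "query", "search"],
    ["library", "service", "user", "survey"],
    ["library", "user", "service", "satisfaction"],
    ["retrieval", "library", "search", "user"],
]
train_labels = [{"IR"}, {"IR"}, {"LIB"}, {"LIB"}, {"IR", "LIB"}]

idf = fit_idf(train_docs)
centroids = train_centroids([weigh(d, idf) for d in train_docs], train_labels)

print(predict(["query", "retrieval", "ranking"], idf, centroids)[0])
print(predict(["library", "user", "search", "retrieval"], idf, centroids)[0])
```

In this sketch the threshold is what turns a single-label ranker into a multi-label assigner: raising it yields fewer labels per article, lowering it yields more. Any real threshold would need to be tuned on held-out data rather than fixed at the illustrative value used here.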

Journal of the Korean Society for Information Management (정보관리학회지), Vol.35 No.2