A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning

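This study selected features using the LDA topic model and Doc2Vec, a word-embedding technique that applies deep learning, and evaluated how classification performance differs with the size and type of the feature set and with the classification algorithm. It also identified an appropriate feature set size and examined which feature sets perform best when their composition is varied by location within the document. Finally, the deep learning experiments compared classification performance by number of training iterations and by the presence or absence of context inference information. For the experimental collection, biomedical scholarly documents provided by PMC were gathered and grouped according to a disease category scheme to build Disease-35083. The study identified the type and size of feature set with the highest performance and proposed feature sets that, being efficient in training time, have potential for extended use as features. It also compared deep learning with existing methods and proposed the method suited to each classification environment.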

keywords: document classification, feature selection, text categorization, topic model, deep learning, LDA, Doc2Vec, text mining

Abstract

This research evaluated how classification performance differs across feature selection methods using the LDA topic model and Doc2Vec, which is based on word embedding using deep learning, as well as across feature corpus sizes and classification algorithms. To find the feature corpus with the highest classification performance, experiments were conducted with feature corpora composed differently according to location within the document and with the size of the feature corpus adjusted. Finally, the experiments using deep learning evaluated the effect of training frequency and of information for context inference. This study constructed a biomedical document dataset, Disease-35083, which consists of biomedical scholarly documents provided by PMC and categorized by disease category. The study verifies which type and size of feature corpus produces the highest performance and also suggests feature corpora that, by showing efficiency in training time, could be extended for use as features. Additionally, this research compares deep learning with existing methods and suggests an appropriate method for each classification environment.
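The abstract does not spell out how topic model output becomes a feature corpus, so the following is only a minimal sketch of one plausible reading: take the highest-weighted terms of each topic, merge them into a feature set of adjustable size, and represent a document by its counts over that set. The topic-word weights, the function names `select_features` and `vectorize`, and the sample document are all hypothetical and stand in for a trained LDA model and real PMC text.

```python
from collections import Counter

# Hypothetical topic-word weights standing in for the output of a trained
# LDA model; the terms and values are made up for illustration only.
TOPIC_WORD_WEIGHTS = {
    "topic_0": {"tumor": 0.30, "cancer": 0.25, "cell": 0.10, "patient": 0.05},
    "topic_1": {"insulin": 0.28, "glucose": 0.22, "patient": 0.08, "cell": 0.04},
}


def select_features(topic_word_weights, terms_per_topic):
    """Return the sorted union of the top-weighted terms of every topic."""
    features = set()
    for weights in topic_word_weights.values():
        ranked = sorted(weights, key=weights.get, reverse=True)
        features.update(ranked[:terms_per_topic])
    return sorted(features)


def vectorize(tokens, features):
    """Count how often each selected feature occurs in a tokenized document."""
    counts = Counter(tokens)
    return [counts[term] for term in features]


# Changing terms_per_topic changes the size of the feature set, which is the
# kind of size adjustment the abstract describes (assumed mechanism).
features = select_features(TOPIC_WORD_WEIGHTS, terms_per_topic=2)
doc = "the tumor cell count rose while glucose stayed stable in tumor tissue".split()
vector = vectorize(doc, features)
```

Vectors built this way could then be passed to any classifier. Composing feature sets by document location would amount to running the same selection separately over, for example, title, abstract, and body text, although the abstract does not name the locations used.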

Vol.35 No.2

Journal of the Korean Society for Information Management