  • P-ISSN1013-0799
  • E-ISSN2586-2073
  • KCI

A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning

Journal of the Korean Society for Information Management, (P)1013-0799; (E)2586-2073
2018, v.35 no.2, pp.63-88
https://doi.org/10.3743/KOSIM.2018.35.2.063


Abstract

This study evaluated how classification performance differs across feature selection methods, feature corpus sizes, and classification algorithms. The feature selection methods compared were the LDA topic model and Doc2Vec, which is based on word embedding using deep learning. To find the feature corpus that yields the highest classification performance, experiments were conducted with feature corpora composed differently according to location within the document and with the size of the feature corpus adjusted. In the deep learning experiments, the study also evaluated the training frequency and the information specifically considered for context inference. The study constructed a biomedical document dataset, Disease-35083, consisting of biomedical scholarly documents provided by PMC and categorized by disease category. The study verifies which type and size of feature corpus produces the highest performance, and also suggests feature corpora that are extensible to specific features by showing efficiency in training time. Additionally, the study compares deep learning with existing methods and suggests an appropriate method for each classification environment.
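The idea of composing a feature corpus by document location and then adjusting its size can be illustrated with a minimal sketch. It is not the authors' implementation: the section names (`title`, `abstract`, `body`), the whitespace tokenizer, and the `build_feature_corpus` helper are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List


def build_feature_corpus(doc: Dict[str, str], locations: List[str], max_tokens: int) -> List[str]:
    """Collect tokens from the chosen document locations, then cap the corpus size.

    `doc` maps a hypothetical section name to its text; `locations` lists the
    sections to draw features from, in order; `max_tokens` limits corpus size.
    """
    tokens: List[str] = []
    for name in locations:
        # Naive lowercase whitespace tokenization (an assumption for this sketch).
        tokens.extend(doc.get(name, "").lower().split())
    return tokens[:max_tokens]


# Toy document with hypothetical sections.
doc = {
    "title": "Gene expression in lung cancer",
    "abstract": "We study tumor markers in patients",
    "body": "Methods and results are reported in detail here",
}

# Two feature corpora that differ in location and size.
title_only = build_feature_corpus(doc, ["title"], max_tokens=50)
title_abstract = build_feature_corpus(doc, ["title", "abstract"], max_tokens=8)

print(title_only)
print(title_abstract)
```

Each resulting token list could then be passed to a topic model or a document embedding model to produce the features used for classification, so that performance can be compared across corpus types and sizes.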

keywords
document classification, feature selection, text categorization, topic model, deep learning, LDA, Doc2Vec, text mining
