A Study on Patent Literature Classification Using Distributed Representation of Technical Terms

최윤수; 최성필

doi:10.4275/KSLIS.2019.53.2.179

P-ISSN 1225-598X
E-ISSN 2982-6292


Vol.53 No.2


Korean title: 기술용어 분산표현을 활용한 특허문헌 분류에 관한 연구

Journal of the Korean Society for Library and Information Science (한국문헌정보학회지), P-ISSN 1225-598X; E-ISSN 2982-6292

2019, v.53 no.2, pp.179-199

https://doi.org/10.4275/KSLIS.2019.53.2.179

최윤수 (Department of Library and Information Science, Graduate School, Kyonggi University)
최성필 (Kyonggi University)

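최윤수, & 최성필. (2019). A Study on Patent Literature Classification Using Distributed Representation of Technical Terms. Journal of the Korean Society for Library and Information Science, 53(2), 179-199. https://doi.org/10.4275/KSLIS.2019.53.2.179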


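Abstract (translated from the Korean)

The purpose of this study is to examine various feature extraction methods together with machine learning and deep learning models in order to find the methodology best suited to patent literature classification, and to analyze through experiments which methodology delivers the best performance. For feature extraction, we compared the traditional BoW method with word embedding vectors, a distributed representation approach; for constructing the document collection, we compared morphological analysis with multigrams. We also verified classification performance using traditional machine learning models and deep learning models. The experiments showed that classification performance was best when a deep learning model was applied on top of feature extraction based on distributed representation and morphological analysis. In the Section, Class, and Subclass classification experiments, this configuration outperformed traditional machine learning methods by 5.71%, 18.84%, and 21.53%, respectively.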

keywords: Patent Literature Classification, Distributed Representation, Word Embedding Vector, Deep Learning

Abstract

In this paper, we examine various feature extraction methods together with machine learning and deep learning models to identify the methodology best suited to classifying patent literature, and we determine through experiments which combination performs best. For feature extraction, we compared the traditional bag-of-words (BoW) method with a distributed representation method (word embedding vectors); for constructing the document collection, we compared morphological analysis with multigrams. In addition, classification performance was verified using traditional machine learning models and deep learning models. The experimental results show that the best performance is achieved when a deep learning model is applied with feature extraction based on distributed representation and morphological analysis. In the Section, Class, and Subclass classification experiments, this configuration improved performance by 5.71%, 18.84%, and 21.53%, respectively, compared with traditional machine learning methods.
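To make the contrast between the two feature extraction approaches concrete, the following is a minimal sketch in Python, not the authors' implementation. The toy documents, the two-dimensional word vectors, and the helper names `bow_vector` and `mean_embedding` are all hypothetical. Averaging word vectors into one document vector is only one simple way to use embeddings and is assumed here for illustration; the abstract does not state how the vectors were fed to the models.

```python
from collections import Counter

# Toy tokenized "patent" documents (hypothetical; not the paper's data).
docs = [
    ["battery", "electrode", "lithium"],
    ["engine", "piston", "fuel"],
]

# Hypothetical 2-dimensional word vectors. In practice these would be
# learned from a corpus by a model such as word2vec, GloVe, or fastText.
embeddings = {
    "battery": [0.9, 0.1],
    "electrode": [0.8, 0.2],
    "lithium": [0.7, 0.1],
    "engine": [0.1, 0.9],
    "piston": [0.2, 0.8],
    "fuel": [0.3, 0.7],
}


def bow_vector(tokens, vocab):
    """Bag-of-words: one raw count per vocabulary term (sparse, high-dimensional)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def mean_embedding(tokens, table):
    """Distributed representation: average the vectors of the known tokens."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        # No known token: fall back to a zero vector of the right size.
        return [0.0] * len(next(iter(table.values())))
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


vocab = sorted({t for doc in docs for t in doc})
bow = [bow_vector(doc, vocab) for doc in docs]
dense = [mean_embedding(doc, embeddings) for doc in docs]

print("BoW dimensions:", len(bow[0]))
print("Embedding dimensions:", len(dense[0]))
```

In this sketch the BoW vector grows with the vocabulary and treats every term as unrelated to every other, while the averaged embedding stays at a fixed, small size and places documents with related terms close together. Either kind of vector could then be passed to a classifier.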


