Classification Performance Analysis of Cross-Language Text Categorization using Machine Translation

이용구

doi:10.4275/KSLIS.2009.43.1.313

P-ISSN 1225-598X
E-ISSN 2982-6292

Vol.43 No.1

Korean title: 기계번역을 이용한 교차언어 문서 범주화의 분류 성능 분석

Journal of the Korean Society for Library and Information Science (한국문헌정보학회지), P-ISSN 1225-598X; E-ISSN 2982-6292

2009, v.43 no.1, pp.313-332

https://doi.org/10.4275/KSLIS.2009.43.1.313

이용구 (University of Pittsburgh)

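이용구. (2009). Classification Performance Analysis of Cross-Language Text Categorization using Machine Translation. Journal of the Korean Society for Library and Information Science, 43(1), 313-332. https://doi.org/10.4275/KSLIS.2009.43.1.313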

Abstract (Korean original, translated)

Cross-language text categorization (CLTC) makes it possible to classify documents automatically using a training set written in another language. This study extracted a test collection suitable for CLTC from KTSET and used a machine translator to compare the classification performance of several feasible CLTC methods. An SVM classifier was used. In the experiments, poly-lingual training achieved the best classification performance among the CLTC methods, followed by training-set translation and then test-set translation. However, training-set translation is efficient in terms of machine translation, is easy to apply in general settings, and performs relatively well, which makes it the most practical of the CLTC methods. Meanwhile, using machine translation in CLTC degraded performance, because translation reduced the number of features or rendered terms as features with no topical character.

keywords: Cross-Language Text Categorization, CLTC, Document Classification, Multilingual Classification, Poly-Lingual Training, Cross-Language Training

Abstract

Cross-language text categorization (CLTC) can classify documents automatically using a training set in another language. In this study, collections appropriate for CLTC were extracted from KTSET, and the classification performance of various CLTC methods was compared using machine translation and an SVM classifier. Poly-lingual training performed best, followed by training-set translation and then test-set translation. However, training-set translation can be regarded as the most useful CLTC method, because it is efficient with respect to machine translation and adapts easily to general environments. On the other hand, machine translation lowered performance, either through feature reduction or by producing features with no subject characteristics.
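
The three CLTC setups named in the abstract differ only in which side of the data is translated before training and classification. The following is a minimal sketch of that data flow, not the study's implementation: the word-for-word dictionaries, the toy documents, the English-to-Korean direction, and the word-overlap classifier (standing in for the SVM) are all assumptions made for illustration. Poly-lingual training is approximated here as training on the original training set plus its translation.

```python
from collections import Counter

# Hypothetical toy "machine translator": a word-for-word dictionary.
KO_TO_EN = {"축구": "soccer", "경기": "match", "주식": "stock", "시장": "market"}
EN_TO_KO = {en: ko for ko, en in KO_TO_EN.items()}


def translate(text, table):
    """Translate word by word; words missing from the table are dropped,
    mimicking the feature reduction the abstract attributes to translation."""
    return " ".join(table[w] for w in text.split() if w in table)


def train(docs):
    """Build one bag-of-words profile per category from (text, label) pairs."""
    profiles = {}
    for text, label in docs:
        profiles.setdefault(label, Counter()).update(text.split())
    return profiles


def classify(profiles, text):
    """Pick the category whose profile overlaps most with the text.
    This simple overlap score stands in for the SVM classifier."""
    words = text.split()
    return max(profiles, key=lambda label: sum(profiles[label][w] for w in words))


# Assumed scenario: labeled English training set, unlabeled Korean test document.
en_train = [("soccer match", "sports"), ("stock market", "economy")]
ko_test = "주식 시장"

# The training set rendered into the test language.
en_train_in_ko = [(translate(t, EN_TO_KO), y) for t, y in en_train]

results = {
    # Translate the training set, classify the test document as-is.
    "training-set translation": classify(train(en_train_in_ko), ko_test),
    # Keep the training set, translate each test document instead.
    "test-set translation": classify(train(en_train), translate(ko_test, KO_TO_EN)),
    # Train on both languages at once, classify the test document as-is.
    "poly-lingual training": classify(train(en_train + en_train_in_ko), ko_test),
}
print(results)
```

In this sketch, training-set translation translates the training set once, whereas test-set translation must translate every incoming document, which is one way to read the abstract's remark that the former is efficient with respect to machine translation.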
