
Classification Performance Analysis of Cross-Language Text Categorization using Machine Translation

Journal of the Korean Society for Library and Information Science (한국문헌정보학회지), ISSN (P) 1225-598X; (E) 2982-6292
2009, v.43 no.1, pp.313-332
https://doi.org/10.4275/KSLIS.2009.43.1.313
이용구 (University of Pittsburgh)

Abstract

Cross-language text categorization (CLTC) classifies documents automatically using a training set written in a different language. This study extracted test collections suited to CLTC from KTSET and used machine translation to compare the classification performance of several CLTC methods, with an SVM as the classifier. Among the CLTC methods, poly-lingual training achieved the best classification performance, followed by training-set translation and then test-set translation. Training-set translation nevertheless appeared to be the most practical CLTC method: it uses machine translation efficiently, is easy to apply in general environments, and performs relatively well. When machine translation was used for CLTC, performance dropped because translation reduced the number of features or produced features that carry no subject information.
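The three CLTC configurations named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the toy word dictionaries `KO_EN` and `EN_KO` stand in for a machine translation engine, a simple term-frequency profile classifier stands in for the SVM, the sample documents are invented, and the direction (Korean training documents, English documents to classify) is chosen arbitrarily. The poly-lingual case further assumes that labeled documents exist in both languages.

```python
from collections import Counter

# Hypothetical word dictionaries standing in for a machine translation engine.
# Note that two Korean words ("경기", "시합") map to the same English word.
KO_EN = {"축구": "soccer", "경기": "game", "시합": "game",
         "주식": "stock", "시장": "market"}
EN_KO = {"soccer": "축구", "game": "경기", "stock": "주식", "market": "시장"}


def translate(tokens, dictionary):
    """Word-for-word translation; words missing from the dictionary are dropped."""
    return [dictionary[t] for t in tokens if t in dictionary]


def train(labeled_docs):
    """Build one term-frequency profile per category (a stand-in for SVM training)."""
    profiles = {}
    for tokens, label in labeled_docs:
        profiles.setdefault(label, Counter()).update(tokens)
    return profiles


def classify(profiles, tokens):
    """Return the category whose profile gives the document's terms the most weight."""
    def score(label):
        return sum(profiles[label][t] for t in tokens)
    return max(sorted(profiles), key=score)


# Invented sample data: labeled Korean and English documents, one English test document.
ko_train = [(["축구", "경기"], "sports"),
            (["축구", "시합"], "sports"),
            (["주식", "시장"], "finance")]
en_train = [(["soccer", "game"], "sports"),
            (["stock", "market"], "finance")]
en_test = ["stock", "market"]

# 1) Training-set translation: translate the training set into the test language.
translated_train = [(translate(toks, KO_EN), label) for toks, label in ko_train]
pred_train_translation = classify(train(translated_train), en_test)

# 2) Test-set translation: keep the training set as is, translate the test document.
pred_test_translation = classify(train(ko_train), translate(en_test, EN_KO))

# 3) Poly-lingual training: train a single classifier on documents in both languages.
pred_poly = classify(train(ko_train + en_train), en_test)

# Feature reduction: the vocabulary can shrink when distinct source words
# are translated to the same target word.
ko_vocab = {t for toks, _ in ko_train for t in toks}
en_vocab = {t for toks, _ in translated_train for t in toks}

print(pred_train_translation, pred_test_translation, pred_poly)
print(len(ko_vocab), "features before translation,", len(en_vocab), "after")
```

In this toy setting all three configurations agree, so the sketch shows only how the data flows in each method, not the performance ordering reported in the abstract. The vocabulary count at the end shows one way translation can reduce features: two distinct Korean terms collapse into one English term.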

keywords
Cross-Language Text Categorization, CLTC, Document Classification, Multilingual Classification, Poly-Lingual Training, Cross-Language Training

