Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities

Journal of the Korean Society for Information Management, (P)1013-0799; (E)2586-2073
2007, v.24 no.1, pp.251-271
https://doi.org/10.3743/KOSIM.2007.24.1.251



Abstract

This paper studies the problem of classifying documents using both labeled and unlabeled training data, with particular attention to inter-document similarity features. Using unlabeled data is of practical importance because in many information systems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. A general semi-supervised learning algorithm has two steps. First, it trains a classifier on the available labeled documents and uses it to classify the unlabeled documents. Then it trains a new classifier on all the training documents, whether they were labeled manually or automatically. We propose two types of semi-supervised learning algorithm that use document similarity features. The first is one-step semi-supervised learning, which uses unlabeled documents only to generate document similarity features. The second is two-step semi-supervised learning, which uses unlabeled documents as training examples as well as for similarity features. Experimental results obtained with support vector machines and a naive Bayes classifier show that a small set of labeled documents combined with a large set of unlabeled documents yields better performance than supervised learning on labeled data alone. When the efficiency of a classifier system is taken into account, the one-step semi-supervised learning algorithm proposed in this study could be a good solution for improving classification performance with unlabeled documents.
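The abstract gives no implementation details, so the following is only a minimal sketch of the one-step and two-step schemes it describes, under stated assumptions. Each document is represented by its cosine similarity to every document in a pool of labeled and unlabeled documents; a toy nearest-centroid classifier stands in for the SVM and naive Bayes classifiers used in the paper. All function names, the raw term-frequency weighting, and the toy documents are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (whitespace tokenization, assumed)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_features(doc, pool):
    """Represent a document by its similarity to every document in the pool."""
    v = tf_vector(doc)
    return [cosine(v, tf_vector(p)) for p in pool]

def train_centroids(features, labels):
    """Toy stand-in classifier: one mean feature vector per class."""
    sums, counts = {}, Counter(labels)
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, x in enumerate(f):
            acc[i] += x
    return {y: [x / counts[y] for x in acc] for y, acc in sums.items()}

def predict(centroids, f):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((x - y) ** 2 for x, y in zip(f, c))
    return min(centroids, key=lambda y: dist(centroids[y]))

def one_step(labeled, labels, unlabeled):
    """Unlabeled documents only widen the similarity feature space."""
    pool = labeled + unlabeled
    feats = [similarity_features(d, pool) for d in labeled]
    return pool, train_centroids(feats, labels)

def two_step(labeled, labels, unlabeled):
    """Unlabeled documents are also auto-labeled and used as training examples."""
    pool, model = one_step(labeled, labels, unlabeled)
    guessed = [predict(model, similarity_features(d, pool)) for d in unlabeled]
    feats = [similarity_features(d, pool) for d in pool]
    return pool, train_centroids(feats, labels + guessed)

# Hypothetical toy data: two labeled and four unlabeled documents.
labeled = ["stock market shares trading", "football match goal score"]
labels = ["finance", "sports"]
unlabeled = [
    "market shares investors bank",
    "goal score league team",
    "investors bank interest rates",
    "league team coach season",
]
query = "bank interest rates rise"  # shares no words with any labeled document

pool, model = one_step(labeled, labels, unlabeled)
print(predict(model, similarity_features(query, pool)))

pool, model2 = two_step(labeled, labels, unlabeled)
print(predict(model2, similarity_features(query, pool)))
```

In this toy setting the query has no terms in common with either labeled document, so it can only be separated through its similarity to the unlabeled documents. The one-step variant trains once, whereas the two-step variant adds a second training pass over the automatically labeled documents, which is the efficiency trade-off the abstract alludes to.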

Keywords
automatic classification, text categorization, semi-supervised learning, unlabeled documents, document similarities, SVM classifier, naive Bayes classifier

