

A Study on Keyword Extraction From a Single Document Using Term Clustering

Journal of the Korean Society for Library and Information Science, (P)1225-598X; (E)2982-6292
2010, v.44 no.3, pp.155-173


This study applies a new keyword extraction algorithm based on term clustering to a single document. The document is divided into multiple passages, and two ways of calculating the similarity between two terms are investigated: first-order similarity and second-order distributional similarity. In the first experiment, the best clustering performance was achieved with 50-term passages and second-order distributional similarity. Based on this result, second-order distributional similarity was then applied to various keyword extraction methods that use statistical information about terms. In the second experiment, paragraph frequency and term frequency by inverse paragraph frequency were found to improve the overall performance of keyword extraction. The results show that the algorithm satisfies the conditions that good keywords should meet.
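The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes cosine similarity over passage-count vectors for first-order similarity, cosine similarity between rows of the first-order matrix for second-order similarity, and a logarithmic inverse paragraph frequency, with passages standing in for paragraphs. All function names (`split_passages`, `first_order`, `second_order`, `pf`, `tf_ipf`) are hypothetical.

```python
import math
from collections import Counter


def split_passages(tokens, size=50):
    """Cut a token list into consecutive passages of `size` terms."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def term_vectors(passages):
    """Map each term to its vector of counts across the passages."""
    vocab = sorted({t for p in passages for t in p})
    counts = [Counter(p) for p in passages]
    return {t: [c[t] for c in counts] for t in vocab}


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def first_order(vectors):
    """Assumed first-order similarity: terms are similar if they co-occur in passages."""
    terms = list(vectors)
    return {a: {b: cosine(vectors[a], vectors[b]) for b in terms} for a in terms}


def second_order(first):
    """Assumed second-order similarity: terms are similar if their
    first-order similarity profiles over all terms are alike."""
    terms = list(first)
    profiles = {t: [first[t][o] for o in terms] for t in terms}
    return {a: {b: cosine(profiles[a], profiles[b]) for b in terms} for a in terms}


def pf(vectors):
    """Paragraph frequency: number of passages in which a term occurs."""
    return {t: sum(1 for c in v if c > 0) for t, v in vectors.items()}


def tf_ipf(vectors):
    """Term frequency times an assumed log inverse paragraph frequency."""
    n = len(next(iter(vectors.values())))
    freq = pf(vectors)
    return {t: sum(v) * math.log(n / freq[t]) for t, v in vectors.items()}
```

Under these assumptions, two terms that never share a passage have a first-order similarity of zero but can still have a positive second-order similarity when both co-occur with the same third term.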

Term Clustering, Keyword Extraction, Single Document, Second-order Similarity, Text Mining



