Performance Comparison of Automatic Classification Using Word Embeddings of Book Titles

Lee Yong-Gu; 이용구

doi:10.3743/KOSIM.2023.40.4.307

P-ISSN 1013-0799
E-ISSN 2586-2073
KCI

Journal of the Korean Society for Information Management, (P)1013-0799; (E)2586-2073

2023, v.40 no.4, pp.307-327

https://doi.org/10.3743/KOSIM.2023.40.4.307

Yong-Gu Lee (Kyungpook National University)

Lee, Y. (2023). Performance Comparison of Automatic Classification Using Word Embeddings of Book Titles. Journal of the Korean Society for Information Management, 40(4), 307-327, https://doi.org/10.3743/KOSIM.2023.40.4.307

Abstract

To analyze the effect of word embeddings of book titles on automatic classification, this study used three word embedding models (Word2vec, GloVe, and fastText) to generate embedding vectors from book titles and used those vectors as classification features. The classifier was the k-nearest neighbors (kNN) algorithm, and the target categories were based on Dewey Decimal Classification (DDC) main class 300, as assigned to the books by libraries. In the experiments, the Skip-gram architectures of Word2vec and fastText yielded better kNN classification performance than TF-IDF features. Across the hyperparameter settings tested for the three models, the Skip-gram architecture of fastText performed well overall; within this model, performance was better with hierarchical softmax and with larger embedding dimensions. In terms of performance, fastText can generate embeddings for substrings (subwords) using the n-gram method, which was shown to increase recall. The Skip-gram architecture of Word2vec generally performed well at low dimensions (size 300) and with small negative sampling sizes (3 or 5).
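The pipeline described in the abstract (title words to embedding vectors to kNN classification) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the 3-dimensional vectors in WORD_VECTORS are invented toy values standing in for a trained Word2vec, GloVe, or fastText model, the training titles are made up, averaging word vectors and voting by cosine similarity are assumed choices, and the function names (char_ngrams, title_vector, knn_classify) are hypothetical. The labels 330, 340, and 370 are DDC divisions within main class 300. The char_ngrams helper shows the kind of subword units fastText derives from a word, which is what lets it build vectors for words it has not seen.

```python
import math
from collections import Counter

# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
# In practice these would come from a trained Word2vec, GloVe, or fastText model.
WORD_VECTORS = {
    "economics": [0.9, 0.1, 0.0],
    "market":    [0.8, 0.2, 0.1],
    "trade":     [0.7, 0.3, 0.0],
    "law":       [0.1, 0.9, 0.1],
    "court":     [0.0, 0.8, 0.2],
    "rights":    [0.2, 0.7, 0.1],
    "school":    [0.1, 0.1, 0.9],
    "teaching":  [0.0, 0.2, 0.8],
    "learning":  [0.1, 0.0, 0.9],
}

# Made-up training titles labeled with DDC divisions inside main class 300.
TRAINING = [
    ("Market Economics", "330"),
    ("Trade and Market", "330"),
    ("Court and Law", "340"),
    ("Law of Rights", "340"),
    ("School Teaching", "370"),
    ("Learning in School", "370"),
]


def char_ngrams(word, min_n=3, max_n=6):
    """Return fastText-style character n-grams of a word wrapped in < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def title_vector(title):
    """Average the vectors of the title's known words; None if none are known."""
    vectors = [WORD_VECTORS[w] for w in title.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_classify(title, training, k=3):
    """Majority vote among the k training titles most similar to the query."""
    query = title_vector(title)
    if query is None:
        return None
    scored = []
    for train_title, label in training:
        vector = title_vector(train_title)
        if vector is not None:
            scored.append((cosine(query, vector), label))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0] if votes else None
```

With these toy values, knn_classify("Economics of Trade", TRAINING) returns "330", since the two nearest training titles are both economics titles, and char_ngrams("law", 3, 3) returns ["<la", "law", "aw>"]. A title made up entirely of unknown words returns None here; a subword-based model such as fastText could still compose a vector for such words from their n-grams.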

keywords: Word2vec, GloVe, fastText, word embedding, automatic classification, Dewey Decimal Classification (DDC)

Submission Date: 2023-11-20

Revised Date: 2023-12-08

Accepted Date: 2023-12-13
