This study presents the design and implementation of a subword-based morphological analyzer for the automated analysis of modern Sino-Korean mixed texts. Large-scale databases of modern literature are difficult to process with existing morphological analyzers because the texts mix Sino-Korean characters with archaic Korean. To address this problem, we propose a new approach based on subword tokenization using the sw_tokenizer module of the kiwipiepy library. Using approximately 2.3 million newspaper and magazine articles (about 771.5 million syllables) from 1890-1940 as training data, we implemented three models with different vocabulary sizes (32,000, 48,000, and 64,000). The experimental results show that larger vocabulary sizes better preserve the semantic units of Sino-Korean compounds, so researchers can select the analysis unit appropriate to their research purposes. This study contributes to digital humanities research by providing a practical tool for the automated analysis of modern Sino-Korean mixed texts and suggests new directions for future work in this field.
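As an illustration of the training setup described above, the following is a minimal sketch using the SwTokenizer.train interface of kiwipiepy's sw_tokenizer module (available in recent versions of the library). The corpus file name, the iter_articles helper, and the default configuration are hypothetical placeholders, not the paper's exact settings.

```python
# Minimal sketch: train one subword tokenizer per vocabulary size
# compared in the study. Assumes kiwipiepy's sw_tokenizer API;
# corpus path and iter_articles() are hypothetical.
from kiwipiepy.sw_tokenizer import SwTokenizer, SwTokenizerConfig

def iter_articles(path):
    """Yield one article per line from a plain-text corpus dump (hypothetical format)."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

config = SwTokenizerConfig()  # default options; the paper's exact settings are not given here

# Train a model for each of the three vocabulary sizes.
for vocab_size in (32000, 48000, 64000):
    SwTokenizer.train(
        f'sw_tokenizer_{vocab_size}.json',        # save path for the trained tokenizer
        iter_articles('corpus_1890_1940.txt'),    # iterable of raw article texts
        config=config,
        vocab_size=vocab_size,
    )

# Load a trained model and tokenize a Sino-Korean mixed sentence.
tokenizer = SwTokenizer('sw_tokenizer_64000.json')
token_ids = tokenizer.encode('國家의 獨立을 宣言하노라')
print(token_ids)
print(tokenizer.decode(token_ids))
```

Under this setup, comparing the token boundaries produced by the three models on the same sentence is what allows researchers to judge which vocabulary size best preserves Sino-Korean compounds as single units.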