Design and Implementation of a Subword-based Morphological Analyzer for Modern Sino-Korean Mixed Texts

Korean Journal of Digital Humanities, (E)3058-311X
2024, v.1 no.2, pp.68-76
https://doi.org/10.23287/KJDH.2024.1.2.5
Byungjun Kim (The Academy of Korean Studies)

Abstract

This study proposes the design and implementation of a subword-based morphological analyzer for the automated analysis of modern Sino-Korean mixed texts. Existing morphological analyzers struggle to process current large-scale modern literature databases, because these texts mix Chinese characters with Korean script and contain archaic Korean. To address this issue, we present a new approach based on subword tokenization, using the sw_tokenizer of the kiwipiepy library. We trained three models with different vocabulary sizes (32,000, 48,000, and 64,000) on approximately 2.3 million newspaper and magazine articles (about 771.5 million syllables) published between 1890 and 1940. The experimental results show that larger vocabulary sizes better preserve the semantic units of multi-character Sino-Korean compounds, and that researchers can select the analysis unit appropriate to their research purposes. This study contributes to digital humanities research by providing a practical tool for the automated analysis of modern Sino-Korean mixed texts, and it suggests new directions for future research in this field.
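
The effect of vocabulary size on the analysis unit can be illustrated with a toy example. The sketch below is not the sw_tokenizer of kiwipiepy and does not reproduce the paper's models; it assumes a simple greedy longest-match segmenter, and the function `segment`, the two hand-written vocabularies, and the sample string are all hypothetical. It only shows the general tendency that a larger subword vocabulary keeps more Sino-Korean compounds intact.

```python
def segment(text, vocab, max_len=4):
    """Greedy longest-match segmentation (illustrative only).

    At each position, take the longest substring found in `vocab`;
    if none matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical vocabularies standing in for a "small" and a "large" model.
small_vocab = {"新聞"}
large_vocab = {"新聞", "獨立", "獨立新聞", "民族", "民族主義"}

# Hypothetical mixed-script sample: Hanja compounds joined by a Hangul particle.
sentence = "獨立新聞과民族主義"

small_tokens = segment(sentence, small_vocab)
large_tokens = segment(sentence, large_vocab)

print(small_tokens)  # compounds mostly split into single characters
print(large_tokens)  # compounds kept as whole units
```

With the small vocabulary the compounds break into individual characters, while the large vocabulary keeps 獨立新聞 and 民族主義 as single tokens. This is the kind of trade-off a researcher would weigh when choosing among models of different vocabulary sizes.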

Keywords: modern Sino-Korean mixed text, subword tokenization, morphological analysis, corpus construction
Submission Date: 2024-11-10
Revised Date: 2024-11-15
Accepted Date: 2024-11-22
