Design and Implementation of a Subword-based Morphological Analyzer for Modern Sino-Korean Mixed Texts

Kim Byungjun; 김병준

doi:10.23287/KJDH.2024.1.2.5



Korean Journal of Digital Humanities, E-ISSN 3058-311X

2024, v.1 no.2, pp.68-76

https://doi.org/10.23287/KJDH.2024.1.2.5

Byungjun Kim (The Academy of Korean Studies)

Kim, B. (2024). Design and Implementation of a Subword-based Morphological Analyzer for Modern Sino-Korean Mixed Texts. Korean Journal of Digital Humanities, 1(2), 68-76. https://doi.org/10.23287/KJDH.2024.1.2.5


Abstract

This study presents the design and implementation of a subword-based morphological analyzer for the automated analysis of modern Sino-Korean mixed texts. Existing morphological analyzers handle today's large-scale databases of modern literature poorly, because these texts combine Sino-Korean characters with archaic Korean. To address this, we propose an approach based on subword tokenization, using the sw_tokenizer of the kiwipiepy library. We trained three models with different vocabulary sizes (32,000, 48,000, and 64,000) on approximately 2.3 million newspaper and magazine articles (about 771.5 million syllables) published between 1890 and 1940. The experimental results show that larger vocabulary sizes better preserve the semantic units of compound Sino-Korean words, so researchers can choose the unit of analysis that suits their research purpose. The study contributes to digital humanities research by providing a practical tool for the automated analysis of modern Sino-Korean mixed texts, and it suggests directions for future work in this field.

keywords: modern Sino-Korean mixed text, subword tokenization, morphological analysis, corpus construction

Submission Date: 2024-11-10

Revised Date: 2024-11-15

Accepted Date: 2024-11-22
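
The abstract's observation that a larger vocabulary keeps longer Sino-Korean units intact can be illustrated with a small, self-contained sketch. This is not the kiwipiepy sw_tokenizer and not the training procedure used in the study: a toy byte-pair-encoding (BPE) learner stands in for it, the number of merges stands in for vocabulary size, and the six-word corpus and the function names `train_bpe`, `apply_merge`, and `segment` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def segment(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; NOT data from the study.
corpus = ["獨立新聞"] * 3 + ["獨立"] * 2 + ["新聞"]

for budget in (0, 1, 2, 3):
    print(budget, segment("獨立新聞", train_bpe(corpus, budget)))
```

With no merges the word is split into single characters; as the merge budget grows, 獨立 and 新聞 and finally 獨立新聞 survive as single tokens. The analogy to the 32,000, 48,000, and 64,000 vocabularies is qualitative only, since the real tokenizer uses a different training algorithm and a vastly larger corpus.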
