Automatic Generation of Training Character Samples for OCR Systems

Ha Le; 김수형; 나인섭; Yen Do; 박상철; 정선화

doi:10.5392/IJoC.2012.8.3.083

P-ISSN: 1738-6764
E-ISSN: 2093-7504
KCI


Vol.8 No.3


INTERNATIONAL JOURNAL OF CONTENTS, (P)1738-6764; (E)2093-7504

2012, v.8 no.3, pp.83-93

https://doi.org/10.5392/IJoC.2012.8.3.083

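Ha Le (Chonnam National University)
김수형 (Chonnam National University)
나인섭 (Chonnam National University)
Yen Do (Chonnam National University)
박상철 (Samsung Medison Co., Ltd.)
정선화 (Electronics and Telecommunications Research Institute)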

Ha, L., 김수형, 나인섭, Yen, D., 박상철, & 정선화. (2012). Automatic Generation of Training Character Samples for OCR Systems. INTERNATIONAL JOURNAL OF CONTENTS, 8(3), 83-93. https://doi.org/10.5392/IJoC.2012.8.3.083


Abstract

In this paper, we propose a novel method that automatically generates real character images to familiarize existing OCR systems with new fonts. First, we generate synthetic character images using a simple degradation model. The synthetic data is used to train an OCR engine, and the trained engine is then used to recognize and label real character images segmented from ideal document images. Since the OCR engine cannot accurately recognize all real character images, a substring matching method is employed to fix wrongly labeled characters by comparing two strings: one is the string formed from the characters recognized in an ideal document image, and the other is the ordered string of characters that the system is being trained to recognize. Based on our method, we build a system that automatically generates samples of the 2350 most common Korean characters and 117 alphanumeric characters from new fonts. The ideal document images used in the system are postal envelope images with characters printed in ascending order of their codes. The proposed system achieved a labeling accuracy of 99%. Therefore, we believe that our system is effective in facilitating the generation of numerous character samples to enhance the recognition rate of existing OCR systems for fonts on which they have never been trained.

Keywords: Character Sample Generation, Optical Character Recognition, Postal Envelope Images, Training Samples, Degradation Model, Substring Matching
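
The label-correction step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the substring matching algorithm, so this example assumes a generic sequence alignment from Python's standard `difflib` module rather than the authors' method. The function name `correct_labels` and the sample strings are hypothetical. The idea is that characters on the document are printed in a known order, so a misrecognized character sitting between two correctly recognized runs can take its label from the expected string.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def correct_labels(recognized: str, expected: str) -> List[Optional[str]]:
    """Return one label per recognized character image.

    Assumption: `expected` is the known printing order of the characters,
    and `recognized` is the OCR output for the segmented images in reading
    order. This is an illustrative alignment, not the paper's algorithm.
    """
    labels: List[Optional[str]] = list(recognized)
    matcher = SequenceMatcher(None, recognized, expected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # OCR output agrees with the expected order; keep the labels.
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of images and expected characters between two
            # agreeing runs: relabel one-to-one from the expected string.
            labels[i1:i2] = list(expected[j1:j2])
        elif tag in ("replace", "delete"):
            # No one-to-one counterpart (e.g. a segmentation error):
            # leave these images unlabeled instead of guessing.
            labels[i1:i2] = [None] * (i2 - i1)
        # tag == "insert": expected characters with no image; nothing to do.
    return labels
```

For example, calling `correct_labels("ABCOEFG", "ABCDEFG")` relabels the fourth image from `O` to `D`, while an extra segmented image with no counterpart in the expected string is returned as `None` so that it can be discarded or reviewed.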
