A Study on Collecting and Structuring Language Resource for Named Entity Recognition and Relation Extraction from Biomedical Abstracts

Journal of the Korean Society for Library and Information Science, (P)1225-598X; (E)2982-6292
2017, v.51 no.4, pp.227-248
https://doi.org/10.4275/KSLIS.2017.51.4.227



Abstract

This paper introduces an integrated model for systematically constructing a linguistic resource database for use by machine learning-based biomedical information extraction systems. The proposed method defines a step-by-step process for collecting and constructing dictionaries and training sets for both named-entity recognition and relation extraction. The heterogeneous structures of the resources, which are collected from diverse sources, are analyzed to derive the essential items and fields of the integrated database. All collected resources are then converted and refined to build an integrated linguistic resource store. We constructed entity dictionaries for genes, proteins, diseases, and drugs, which are considered core named entities in the biomedical domain, and conducted verification tests to measure their acceptability.
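The conversion step described above, mapping records from differently structured sources onto one set of dictionary fields and then refining the result, might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the schema or procedure used in the paper: the `EntityEntry` fields, the source names `source_a` and `source_b`, their field names, the "|"-separated synonym format, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EntityEntry:
    """One row of a hypothetical integrated entity dictionary."""
    entity_type: str                      # "gene", "protein", "disease", or "drug"
    name: str                             # preferred name
    synonyms: list = field(default_factory=list)
    source: str = ""                      # which resource the record came from
    source_id: str = ""                   # identifier within that resource


# Hypothetical mappings: unified field -> field name used by each source.
FIELD_MAPS = {
    "source_a": {"name": "symbol", "synonyms": "alias_symbol", "source_id": "id"},
    "source_b": {"name": "Name", "synonyms": "Synonyms", "source_id": "Accession"},
}


def to_entry(record: dict, source: str, entity_type: str) -> EntityEntry:
    """Convert one source-specific record into the unified schema."""
    fmap = FIELD_MAPS[source]
    raw_synonyms = record.get(fmap["synonyms"], "")
    # Assumption: every source stores synonyms as a "|"-separated string.
    synonyms = [s.strip() for s in raw_synonyms.split("|") if s.strip()]
    return EntityEntry(
        entity_type=entity_type,
        name=record[fmap["name"]].strip(),
        synonyms=synonyms,
        source=source,
        source_id=str(record[fmap["source_id"]]),
    )


def merge(entries):
    """Refine the collection: merge entries with the same type and case-folded name."""
    merged = {}
    for entry in entries:
        key = (entry.entity_type, entry.name.casefold())
        if key not in merged:
            merged[key] = entry
            continue
        kept = merged[key]
        for synonym in entry.synonyms:
            if synonym not in kept.synonyms:
                kept.synonyms.append(synonym)
    return list(merged.values())


# Made-up sample records, one per hypothetical source.
records = [
    to_entry({"symbol": "GENE1", "alias_symbol": "G1|GN1", "id": 1}, "source_a", "gene"),
    to_entry({"Name": "gene1", "Synonyms": "G1|G-1", "Accession": "X001"}, "source_b", "gene"),
]
dictionary = merge(records)
```

Keeping the per-source field mapping in a table rather than in code means a new resource can be added by describing its fields, which is one plausible way to handle heterogeneous inputs; the paper itself may organize this differently.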

Keywords
Information Extraction, Named-Entity Recognition, Relation Extraction, Bio-text Mining, Training Set

