This paper introduces a software system and process model for semi-automatically constructing relation extraction training corpora tailored to specific subfields of biomedicine. The system takes as input a set of terms for the target field (for example, gene, protein, and disease names). It first uses a large-scale biological interaction database to generate candidate relations between these terms, and then uses the resulting term pairs as queries against scholarly literature databases to retrieve sentences that contain both terms of a pair. To assess its usefulness, we applied the system to the construction of a gene-gene interaction corpus for Alzheimer's disease: from an input set of 140 gene names, it extracted 3,510 interaction-related sentences together with their gene pairs as a training set specific to this field. The results of this case study suggest that the system can improve the efficiency of building training corpora for relation extraction, a task that has so far been carried out entirely by hand, and can support the construction of corpora suited to a variety of biomedical subfields.
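The two-stage process described above (term set to known interaction pairs, then pairs to sentences) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the system's actual implementation: the interaction database and the literature database are stood in for by small in-memory lists, the function names `candidate_pairs`, `mentions`, and `sentences_for_pairs` are hypothetical, and the gene names, interactions, and sentences are toy data rather than results from the study.

```python
import re


def candidate_pairs(terms, interactions):
    """Keep known interactions whose two participants are both input terms."""
    wanted = {t.upper() for t in terms}
    pairs = set()
    for a, b in interactions:
        a, b = a.upper(), b.upper()
        # Skip self-interactions; store each pair once, in sorted order.
        if a != b and a in wanted and b in wanted:
            pairs.add(tuple(sorted((a, b))))
    return pairs


def mentions(term, sentence):
    """True if the term occurs as a whole token (not inside a longer name)."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(term)}(?![A-Za-z0-9])"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def sentences_for_pairs(pairs, sentences):
    """Return (term_a, term_b, sentence) for sentences naming both terms."""
    hits = []
    for sentence in sentences:
        for a, b in sorted(pairs):
            if mentions(a, sentence) and mentions(b, sentence):
                hits.append((a, b, sentence))
    return hits


# Toy stand-ins (assumptions): a real run would query an interaction
# database and a literature database instead of these lists.
terms = ["APP", "PSEN1", "APOE"]
interactions = [("APP", "PSEN1"), ("PSEN1", "APP"),
                ("APP", "BACE1"), ("APOE", "APOE")]
sentences = [
    "PSEN1 mutations alter the processing of APP.",
    "APOE is a risk factor.",
    "APPL1 binds PSEN1.",
]

pairs = candidate_pairs(terms, interactions)
hits = sentences_for_pairs(pairs, sentences)
```

In this sketch the whole-token check keeps `APPL1` from being counted as a mention of `APP`. Sentences retrieved this way are only candidates: co-occurrence of two terms does not guarantee that the sentence states an interaction, so some manual review would still be needed.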