
A Study on the Semiautomatic Construction of Domain-Specific Relation Extraction Datasets from Biomedical Abstracts: Mainly Focusing on a Genic Interaction Dataset in Alzheimer’s Disease Domain

Journal of Korean Library and Information Science Society, (P)2466-2542;
2016, v.47 no.4, pp.289-307
https://doi.org/10.16981/kliss.47.4.201612.289



Abstract

This paper introduces a software system and process model for semi-automatically constructing domain-specific relation extraction datasets. The system takes a set of terms, such as genes, proteins, and diseases, as input and, by exploiting a large biological interaction database, generates a set of term pairs. These pairs are then used as queries to retrieve sentences containing them from scientific databases. To assess the usefulness of the proposed system, this paper applies it to the construction of a genic interaction dataset for the Alzheimer’s disease domain, extracting 3,510 interaction-related sentences from 140 gene names in the area. The results of this case study indicate that the system and process could substantially improve the efficiency of dataset construction in various subfields of biomedical research.
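The pipeline described in the abstract (input terms, pair generation from an interaction database, sentence retrieval by pair) can be illustrated with a minimal Python sketch. This is not the authors' system: the function names, the in-memory `known_interactions` list standing in for an interaction database, and the `sentences` list standing in for a scientific database are all assumptions made for illustration.

```python
import re
from itertools import combinations


def candidate_pairs(terms, known_interactions):
    """Keep only unordered term pairs listed in the (stand-in) interaction database."""
    known = {frozenset(pair) for pair in known_interactions}
    return [
        (a, b)
        for a, b in combinations(sorted(terms), 2)
        if frozenset((a, b)) in known
    ]


def contains_term(sentence, term):
    """Whole-token, case-sensitive match, so 'APP' does not match 'APPLY'."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(term)}(?![A-Za-z0-9])"
    return re.search(pattern, sentence) is not None


def sentences_for_pairs(pairs, sentences):
    """Map each pair to the sentences that mention both of its terms."""
    hits = {}
    for a, b in pairs:
        matched = [
            s for s in sentences
            if contains_term(s, a) and contains_term(s, b)
        ]
        if matched:
            hits[(a, b)] = matched
    return hits


# Toy inputs (hypothetical; a real run would query external resources).
terms = {"APP", "PSEN1", "APOE"}
known_interactions = [("PSEN1", "APP"), ("APOE", "TREM2")]
sentences = [
    "PSEN1 mutations alter the processing of APP.",
    "APOE is a major genetic risk factor.",
    "APPLY the buffer before adding PSEN1.",
]

pairs = candidate_pairs(terms, known_interactions)
hits = sentences_for_pairs(pairs, sentences)
```

Restricting the queries to pairs already recorded in an interaction database keeps the number of queries far below the number of all possible term pairs. The retrieved sentences are still only candidates that a human annotator would need to confirm, which is what makes the process semi-automatic.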

Keywords
Relation extraction, Dataset construction, Genic interactions, Machine learning, Text mining

