Submitted by behrouz on Mon, 10/17/2016 - 14:07

Attachment	Size
SNPPhenA.zip	585.38 KB

SNPPhenA corpus

The SNPPhenA corpus

The SNPPhenA corpus consists of medical and biological texts annotated for SNP-phenotype associations, negation, modality markers, and the degree of confidence of associations. It was built to support the development and comparison of systems that extract associations and estimate their degree of confidence and strength. The corpus is publicly available for research purposes.

The annotation guidelines: pdf
Annotation principles are also discussed in the following paper:

Corpus download

Information provided by the http://www.gopubmed.org/ search engine was used to collect genome-wide association abstracts. GoPubMed is a web server that allows users to explore PubMed search results with the Gene Ontology. Here is the DTD for the XML files containing the annotations: DTD

Abstracts of the SNPPhenA corpus: xml v1.0

The full corpus in XML and BRAT formats is available in one file: zip

An online association extraction system that utilizes the SNPPhenA corpus is available here.

Inter-annotator agreement analysis

To evaluate the quality of the corpus and the reliability of the annotations, an inter-annotator agreement score was measured for two tasks: classifying candidate sentences into positive, negative, and neutral classes, and determining the confidence level of the association. Two annotators tagged the corpus independently. When their tags disagreed, a third annotator was asked to decide on the correct one. For the task of classifying the type of association, inter-annotator agreement was 86%, meaning that the two annotators agreed in 86% of cases. Additionally, we computed Cohen's Kappa coefficient for the two annotators, which takes into account the amount of agreement that could be expected to occur by chance. For the type-of-association task, the Kappa value was 0.79. For the task of annotating the confidence level of the association, the Kappa value was 0.80.

The results show that annotating the confidence level of an association is a more difficult task than simply classifying candidate sentences into positive, negative, and neutral classes.
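The two agreement measures reported above can be illustrated with a minimal Python sketch. The label lists below are invented toy data, not taken from the corpus, and the function names are hypothetical. The sketch assumes the standard definition of Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels at random with their own label frequencies.
    p_e = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels (invented for illustration only).
annotator_a = ["positive", "positive", "negative", "neutral", "positive",
               "neutral", "negative", "positive", "neutral", "positive"]
annotator_b = ["positive", "positive", "negative", "neutral", "neutral",
               "neutral", "negative", "positive", "positive", "positive"]

print("observed agreement:", observed_agreement(annotator_a, annotator_b))
print("Cohen's Kappa:", round(cohen_kappa(annotator_a, annotator_b), 3))
```

Because Kappa discounts chance agreement, it is lower than the raw agreement rate whenever some agreement is expected by chance, which is consistent with the 86% agreement and 0.79 Kappa reported for the type-of-association task.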

Corpus statistics

The table below presents detailed statistics on the linguistic and nonlinguistic properties of the corpus, broken down by training and test sets.

Item	Train	Test	Total
Abstracts	270	90	360
Key Sentences	362	121	483
SNP	691	244	935
Phenotypes	496	158	654
SNP-Phenotype association candidates	935	365	1300
Neutral Candidates	142	166	308
Positive Candidates	702	170	872
Negative Candidates	91	29	120
Strong degree of confidence candidates	213	20	233
Medium degree of confidence candidates	92	32	124
Weak degree of confidence candidates	390	125	515
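As a quick consistency check on the table, the following Python sketch verifies that the neutral, positive, and negative rows add up to the candidate totals and computes each class's share per split. The counts are copied from the table above. The variable and function names are hypothetical and are not part of any corpus tooling.

```python
# Counts copied from the statistics table as (train, test, total).
ALL_CANDIDATES = (935, 365, 1300)
BY_CLASS = {
    "neutral": (142, 166, 308),
    "positive": (702, 170, 872),
    "negative": (91, 29, 120),
}
SPLITS = ("train", "test", "total")


def split_sums():
    """Sum the per-class counts for each split."""
    return tuple(sum(row[i] for row in BY_CLASS.values()) for i in range(3))


def class_shares(split):
    """Return each class's fraction of the candidates in the given split."""
    i = SPLITS.index(split)
    return {label: row[i] / ALL_CANDIDATES[i] for label, row in BY_CLASS.items()}


# The per-class rows should add up to the "association candidates" row.
assert split_sums() == ALL_CANDIDATES

for split in SPLITS:
    shares = class_shares(split)
    print(split, {label: round(share, 3) for label, share in shares.items()})
```

Comparing the printed shares for the training and test splits shows how the class balance differs between them, which is worth keeping in mind when interpreting per-class evaluation scores.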

SNPPhenA: A corpus for extracting ranked associations of SNP and phenotypes from literature
