SNPPhenA: A corpus for extracting ranked associations of SNP and phenotypes from literature

The SNPPhenA corpus

The SNPPhenA corpus consists of medical and biological texts annotated for SNP-phenotype associations, negation, modality markers, and the degree of confidence of associations. It is intended to support the development and comparison of systems that extract associations and estimate their degree of confidence and strength. The corpus is publicly available for research purposes.

The annotation guidelines: pdf 
Annotation principles are also discussed in the following paper:

Corpus download

Genome-wide association abstracts were collected using information provided by the GoPubMed search engine. GoPubMed is a web server that allows users to explore PubMed search results with the Gene Ontology. The DTD for the XML files containing the annotations: DTD

Abstracts of the SNPPhenA corpus: xml v1.0

The full corpus in XML and BRAT formats is available in one file: zip
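
The element names of the XML files are defined by the DTD and are not listed on this page, so a first look at a downloaded file is easiest without assuming any schema. The following is a minimal sketch using Python's standard library. `tag_histogram` is a hypothetical helper, and the sample document and its tag names are invented placeholders, not the corpus schema.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_histogram(xml_data):
    """Count how often each element tag occurs in an XML document.

    xml_data may be str or bytes, e.g. the content of one corpus file.
    """
    root = ET.fromstring(xml_data)
    # root.iter() walks the root element and all of its descendants
    return Counter(element.tag for element in root.iter())


if __name__ == "__main__":
    # Placeholder document: these tag names are invented for the demo
    sample = b"<doc><item><part/><part/></item><item/></doc>"
    for tag, count in tag_histogram(sample).most_common():
        print(tag, count)
```

For a real file, the bytes read from the downloaded XML can be passed in place of `sample`. Comparing the resulting tag names with the DTD is a quick way to confirm which elements carry the annotations.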

An online association extraction system that utilizes the SNPPhenA corpus is available here.

Inter-annotator agreement analysis

To evaluate the quality of the corpus and the reliability of the annotations, the inter-annotator agreement score was measured for two tasks: classifying candidate sentences into positive, negative, and neutral classes, and determining the confidence level of the association. Two annotators tagged the corpus independently. When their tags disagreed, a third annotator was asked to decide on the correct one. For the task of classifying the type of association, inter-annotator agreement was 86%, meaning that the two annotators agreed in 86% of cases. Additionally, we computed Cohen's Kappa coefficient for the two annotators, which takes into account the amount of agreement that could be expected to occur by chance. For the type-of-association task, the Kappa value was 0.79. For the task of annotating the confidence level of the association, the Kappa value was 0.80.

The results show that annotating the confidence level of an association is a more difficult task than simply classifying candidate sentences into positive, negative, and neutral classes.
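
The two agreement measures used in this section can be illustrated with a short sketch. It computes raw (observed) agreement and Cohen's Kappa for two lists of labels, using the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement expected by chance from each annotator's label frequencies. The function names and the toy labels are assumptions made for illustration; they are not the corpus annotations or the authors' evaluation script.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label everywhere
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy labels, invented for the demo
    annotator_1 = ["positive", "positive", "negative", "neutral", "positive", "neutral"]
    annotator_2 = ["positive", "neutral", "negative", "neutral", "positive", "positive"]
    print(observed_agreement(annotator_1, annotator_2))
    print(cohens_kappa(annotator_1, annotator_2))
```

Because Kappa subtracts the chance term, it is always at most the raw agreement, which is why an 86% agreement rate corresponds to a lower Kappa value.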

Corpus statistics

The table below presents detailed statistics of the linguistic and non-linguistic properties of the corpus for the training and test parts.

Item                                    Train  Test  Total
Abstracts                                 270    90    360
Key Sentences                             362   121    483
SNP                                       691   244    935
Phenotypes                                496   158    654
SNP-Phenotype association candidates      935   365   1300
Neutral Candidates                        142   166    308
Positive Candidates                       702   170    872
Negative Candidates                        91    29    120
Strong degree of confidence candidates    213    20    233
Medium degree of confidence candidates     92    32    124
Weak degree of confidence candidates      390   125    515
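
The counts in the table can be checked and summarized with a few lines of arithmetic. The sketch below copies the class rows from the table, verifies that train and test add up to the total, and computes the share of each association class per split. The dictionary layout and function names are assumptions chosen for illustration.

```python
# (train, test, total) counts copied from the statistics table
STATS = {
    "SNP-Phenotype association candidates": (935, 365, 1300),
    "Neutral Candidates": (142, 166, 308),
    "Positive Candidates": (702, 170, 872),
    "Negative Candidates": (91, 29, 120),
}

CLASSES = ["Neutral Candidates", "Positive Candidates", "Negative Candidates"]


def inconsistent_rows(stats):
    """Return the names of rows where train + test differs from total."""
    return [name for name, (train, test, total) in stats.items()
            if train + test != total]


def class_shares(stats, split_index):
    """Share of each association class in one split (0=train, 1=test, 2=total)."""
    split_total = sum(stats[name][split_index] for name in CLASSES)
    return {name: stats[name][split_index] / split_total for name in CLASSES}


if __name__ == "__main__":
    print("inconsistent rows:", inconsistent_rows(STATS))
    for index, split in enumerate(["train", "test", "total"]):
        shares = class_shares(STATS, index)
        print(split, {name: round(value, 3) for name, value in shares.items()})
```

Comparing the shares per split shows how differently the classes are distributed between the training and test parts, which matters when interpreting accuracy figures reported on the test set.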