The SNPPhenA corpus
The SNPPhenA corpus consists of medical and biological texts annotated for SNP-phenotype associations, negation, modality markers and the degree of confidence of associations. It is intended to support the development and comparison of systems that extract associations and identify their degree of confidence and strength. The corpus is publicly available for research purposes.
The annotation guidelines: pdf
Information provided by the http://www.gopubmed.org/ search engine was used to collect genome-wide association abstracts. GoPubMed is a web server that allows users to explore PubMed search results with the Gene Ontology. Here is the DTD for the XML files containing the annotations: DTD
Abstracts of the SNPPhenA corpus: xml v1.0
The full corpus in XML and BRAT formats is available in one file: zip
An online association extraction system that utilizes the SNPPhenA corpus is available here.
To evaluate the quality of the corpus and the reliability of the annotations, the inter-annotator agreement score was measured for two tasks: classifying candidate sentences into positive, negative and neutral classes, and determining the confidence level of the association. Two annotators independently tagged the corpus. When their tags disagreed, a third annotator was asked to decide on the correct one. For the task of classifying types of association, inter-annotator agreement was 86%, meaning that the two annotators agreed in 86% of cases. Additionally, we computed Cohen's Kappa coefficient for the two annotators, which takes into account the amount of agreement that could be expected to occur by chance. For the type of association task, the Kappa value was 0.79. For the task of annotating the confidence level of the association, the Kappa value was 0.80.
The results show that annotating the confidence level of an association is a more difficult task than simply classifying candidate sentences into positive, negative and neutral classes.
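The two agreement measures described above can be illustrated with a minimal Python sketch. It computes the raw agreement rate and Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The function names and the toy label lists are assumptions made for illustration; they are not taken from the corpus or from the tools used to build it.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same class
    # if each labels at random according to their own class frequencies.
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical class; agreement is trivial.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical, not from the corpus).
annotator_1 = ["positive", "positive", "negative", "negative"]
annotator_2 = ["positive", "negative", "negative", "negative"]

print(observed_agreement(annotator_1, annotator_2))  # 0.75
print(cohens_kappa(annotator_1, annotator_2))        # 0.5
```

In this toy case the annotators agree on 3 of 4 items (75%), but half of that agreement is expected by chance, so the Kappa value drops to 0.5. The same effect explains why the 86% raw agreement reported above corresponds to a lower Kappa value.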
In the table below, some detailed statistics of the linguistic and nonlinguistic properties of the corpus, in terms of test and training parts, are presented.