Extraction method

On the evaluation set, precision ranged from 79.3% to 100.0% and recall from 80% to 100.0%. Study design achieved the best precision (100.0%), although this value should be treated with caution because of the relatively small number of study design mentions. Covariate achieved the best recall (100.0%); similarly, since the number of annotated covariates is limited, this value should also be treated with caution. Study design returned the highest F-score overall (96%). The micro-averaged F-score was 87.0%, suggesting that the method generates relatively reliable results. The performance is shown in more detail in the table below.
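As a minimal sketch of how a micro-averaged F-score of this kind is usually computed, the snippet below pools true positives, false positives and false negatives across all characteristics before calculating precision and recall. The `Counts` class, the function names and the counts used in the usage example are illustrative assumptions, not the figures or code behind the reported results.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Counts:
    """Per-characteristic counts (hypothetical container)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def precision(c: Counts) -> float:
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f_score(p: float, r: float) -> float:
    # Balanced F-score: harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0


def micro_average(per_class: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Pool the counts over all characteristics, then score once."""
    pooled = Counts(
        tp=sum(c.tp for c in per_class.values()),
        fp=sum(c.fp for c in per_class.values()),
        fn=sum(c.fn for c in per_class.values()),
    )
    p, r = precision(pooled), recall(pooled)
    return p, r, f_score(p, r)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    example = {
        "study design": Counts(tp=8, fp=2, fn=0),
        "covariate": Counts(tp=10, fp=0, fn=2),
    }
    print(micro_average(example))
```

Because the counts are pooled, characteristics with many annotated mentions weigh more heavily in the micro-averaged score than characteristics with few, which is why a perfect score on a small class has limited influence on the overall figure.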

Normalization method

Evaluation results are not reported for the exposure, outcome and covariate characteristics, since the performance of MetaMap has been extensively reported, with an overall F-score of 81.3% for the normalization of biomedical concepts, suggesting reliable results (Aronson, 2001; Denny et al., 2003; Aronson et al., 2010).

At the document level, study design, population and effect size mentions are considered to have been normalized correctly only if all their reported attributes have been recognised. For example, in “boys aged from 13-15 years old”, the gender attribute is “male” and the age attribute is “juvenile”. If any one of these attributes has not been identified, the mention is not considered to have been normalized correctly.
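The all-or-nothing criterion described above can be sketched as follows. This is only an illustration under stated assumptions: attributes are represented as plain dictionaries, and the function names `normalized_correctly` and `document_level_accuracy` are hypothetical rather than part of the described system.

```python
from typing import Dict, List, Tuple

Attributes = Dict[str, str]


def normalized_correctly(predicted: Attributes, gold: Attributes) -> bool:
    """True only if every annotated (gold) attribute was recognised with the same value."""
    return all(predicted.get(name) == value for name, value in gold.items())


def document_level_accuracy(pairs: List[Tuple[Attributes, Attributes]]) -> float:
    """Fraction of (predicted, gold) mention pairs that are fully correct."""
    if not pairs:
        return 0.0
    correct = sum(normalized_correctly(pred, gold) for pred, gold in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Gold attributes for "boys aged from 13-15 years old", as in the text.
    gold = {"gender": "male", "age": "juvenile"}
    print(normalized_correctly({"gender": "male", "age": "juvenile"}, gold))
    # A mention with the age attribute missing counts as incorrect.
    print(normalized_correctly({"gender": "male"}, gold))
```

Under this criterion a partially recognised mention contributes nothing to the accuracy, so the measure is stricter than scoring each attribute separately.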

Because of the small number of concepts for study design (13) and population (24) (each being the longest span chosen from its abstract), a further random sample of 100 MEDLINE epidemiological abstracts was selected from a larger corpus for each characteristic in order to evaluate the normalization method more meaningfully.

If an effect size span contains more than one mention, these are treated as independent in the normalization process (e.g., the span “the risk for overweight (or = 0.89) and abdominal obesity (or = 0.91)” contains two mentions, “overweight (or = 0.89)” and “abdominal obesity (or = 0.91)”). This explains why the number of effect size mentions in both the evaluation set and the random sample is larger than the numbers for study design and population.
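To illustrate how one span can yield several independent mentions, the sketch below splits a span at each parenthesised odds ratio and pairs the value with the text preceding it. It is a simplified, regex-based stand-in and not the method used in this work; the patterns, the short list of lead-in words to trim, and the function name `split_effect_size_span` are assumptions made for the example.

```python
import re
from typing import List, Tuple

# Matches a parenthesised odds ratio such as "(or = 0.89)" or "(OR=0.91)".
OR_PATTERN = re.compile(r"\(\s*or\s*=\s*(\d+(?:\.\d+)?)\s*\)", re.IGNORECASE)

# Hypothetical lead-in words to trim from the text preceding each value.
LEAD_IN = re.compile(r"^(?:and|the risk for)\s+", re.IGNORECASE)


def split_effect_size_span(span: str) -> List[Tuple[str, float]]:
    """Return one (concept, odds ratio) pair per value found in the span."""
    mentions = []
    start = 0
    for match in OR_PATTERN.finditer(span):
        # The concept is the text between the previous value and this one.
        concept = span[start:match.start()].strip(" ,;")
        # Trim lead-in words until none is left at the start.
        while True:
            trimmed = LEAD_IN.sub("", concept)
            if trimmed == concept:
                break
            concept = trimmed
        mentions.append((concept, float(match.group(1))))
        start = match.end()
    return mentions


if __name__ == "__main__":
    span = "the risk for overweight (or = 0.89) and abdominal obesity (or = 0.91)"
    for concept, value in split_effect_size_span(span):
        print(concept, value)
```

Each returned pair would then be normalized and scored on its own, which is why spans with several values increase the mention count.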

The accuracy at the document level was above 91.0% in both the evaluation set and the random sample. The values obtained on the larger sample, however, describe the performance of the normalization method more accurately, owing to the random selection of mentions and their larger number. Hence, the accuracy values for the random sample suggest that the normalization method effectively detects the related attributes of study design, population and effect size.

