Epidemiological Text Mining

Performance

Extraction method

On the evaluation set, the observed precision ranged from 79.3% to 100.0%, while recall ranged from 80.3% to 100.0%. The best precision was observed for study design (100.0%), although this value should be interpreted with caution because of the relatively small number of study designs. Covariate returned the best recall (100.0%); as with study design, the number of annotated covariates is limited, so this value should also be interpreted with caution. Study design returned the highest F-score overall (95.9%). The micro F-score was 87.4%, suggesting that our method generates relatively reliable results. The performance is shown in more detail in the table below.

**Values of precision, recall and F-score for the evaluation set**
characteristic	precision(%)	recall(%)	F-score(%)
study design	100.0	92.3	95.9
population	97.2	89.7	93.3
exposure	84.9	80.3	82.5
outcome	79.3	84.8	82.4
covariate	89.4	100.0	94.4
effect size	97.0	86.6	91.5
micro	88.5	86.3	87.4
macro	91.3	88.9	90.0
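
The relationship between the columns and the macro row can be checked with a short sketch. It assumes that the F-score is the balanced F-score (harmonic mean of precision and recall) and that the macro row is the unweighted mean over the six characteristics; neither is stated explicitly here, and the function names are illustrative. The micro row is not recomputed because it requires the raw counts, which are not given. Values recomputed from rounded percentages may not match the reported ones exactly.

```python
# Precision and recall (%) per characteristic, copied from the table above.
scores = {
    "study design": (100.0, 92.3),
    "population": (97.2, 89.7),
    "exposure": (84.9, 80.3),
    "outcome": (79.3, 84.8),
    "covariate": (89.4, 100.0),
    "effect size": (97.0, 86.6),
}


def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall (assumed)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over characteristics (assumed macro definition)."""
    values = list(values)
    return sum(values) / len(values)


macro_precision = macro_average(p for p, _ in scores.values())
macro_recall = macro_average(r for _, r in scores.values())

for name, (p, r) in scores.items():
    print(f"{name}: F-score = {f_score(p, r):.1f}")
print(f"macro precision = {macro_precision:.1f}")
print(f"macro recall = {macro_recall:.2f}")
```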

Normalization method

Evaluation results are not reported for the exposure, outcome and covariate characteristics, since the performance of MetaMap has been extensively reported, with an overall F-score of 81.3% for the normalization of biomedical concepts, suggesting reliable results (Aronson, 2001; Denny et al., 2005; Aronson and Lang, 2010).

At the document level, study design, population and effect size mentions are considered to have been normalized correctly only if all of their reported attributes have been recognised. For example, in “boys aged from 13-15 years old”, the attribute gender is “male” and the attribute age is “juvenile”. If either of these has not been identified, the mention has not been normalized correctly.
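
The all-or-nothing criterion can be illustrated with a minimal sketch. The function name and the dictionary representation of attributes are hypothetical and chosen only for illustration; they are not the system's actual data structures.

```python
def normalized_correctly(gold, predicted):
    """Return True only if every gold attribute was recognised with its value.

    `gold` and `predicted` map attribute names to normalized values
    (a hypothetical representation used for this example).
    """
    return all(predicted.get(attr) == value for attr, value in gold.items())


# Gold attributes for the mention "boys aged from 13-15 years old".
gold = {"gender": "male", "age": "juvenile"}

full = normalized_correctly(gold, {"gender": "male", "age": "juvenile"})
partial = normalized_correctly(gold, {"gender": "male"})

print(full)     # True: both attributes recognised
print(partial)  # False: age is missing, so the mention counts as incorrect
```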

Because of the small number of concepts in study design (13) and population (24) (each one being the longest span chosen from its abstract), a further random sample of 100 MEDLINE epidemiological abstracts was selected from a larger corpus for each characteristic in order to perform a more meaningful evaluation of the normalization method.

If an effect size span contains more than one mention, these are considered independent in the normalization process (e.g., the span “the risk for overweight (or = 0.89) and abdominal obesity (or = 0.91)” has two mentions, “overweight (or = 0.89)” and “abdominal obesity (or = 0.91)”). This explains the larger number of effect size mentions in both the evaluation set and the random sample when compared with those of study design and population.
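
As a rough illustration of why one span can contribute several mentions, the sketch below counts the odds ratio values in the example span. The regular expression is an assumption made for this example (it only handles the “(or = number)” form shown above) and is not the extraction rule used by the system.

```python
import re

# Hypothetical pattern: an odds ratio written as "(or = <number>)".
OR_PATTERN = re.compile(r"\(\s*or\s*=\s*(\d+(?:\.\d+)?)\s*\)", re.IGNORECASE)


def effect_size_values(span):
    """Return every odds ratio value found in a span, in order of appearance."""
    return [float(value) for value in OR_PATTERN.findall(span)]


span = "the risk for overweight (or = 0.89) and abdominal obesity (or = 0.91)"
values = effect_size_values(span)

# Each value corresponds to one mention that is normalized independently.
print(len(values), values)
```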

The accuracy at the document level was at least 91.0% in both the evaluation set and the random sample. The values obtained on the larger sample, however, describe the performance of the normalization method more reliably, because the mentions were selected at random and are more numerous. Hence, the accuracy values of the random sample suggest that the normalization reliably detects the related attributes of study design, population and effect size.

**Normalization accuracy for study design, population and effect size mentions at the document level**
evaluation method	result	study design	population	effect size
evaluation set	correct	13	23	113
evaluation set	incorrect	0	1	5
evaluation set	accuracy	100.0%	95.8%	95.7%
random evaluation sample	correct	99	91	232
random evaluation sample	incorrect	1	9	11
random evaluation sample	accuracy	99.0%	91.0%	95.4%
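
The accuracy rows follow from the counts as correct / (correct + incorrect). The sketch below recomputes them from the counts in the table; the variable and function names are illustrative, and the recomputed values may differ from the reported ones in the last digit depending on how they were rounded.

```python
# (correct, incorrect) counts per evaluation method and characteristic,
# copied from the table above.
counts = {
    ("evaluation set", "study design"): (13, 0),
    ("evaluation set", "population"): (23, 1),
    ("evaluation set", "effect size"): (113, 5),
    ("random evaluation sample", "study design"): (99, 1),
    ("random evaluation sample", "population"): (91, 9),
    ("random evaluation sample", "effect size"): (232, 11),
}


def accuracy(correct, incorrect):
    """Percentage of mentions whose attributes were all normalized correctly."""
    return 100.0 * correct / (correct + incorrect)


for (method, characteristic), (correct, incorrect) in counts.items():
    print(f"{method}, {characteristic}: {accuracy(correct, incorrect):.1f}%")
```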

References

Aronson AR. MetaMap evaluation. Bethesda MD: National Library of Medicine, 2001. [http://skr.nlm.nih.gov/papers/references/mm.evaluation.pdf].
Denny JC, Miller RA, Spickard III A, Schildcrout J, Darbar D, Rosenbloom ST, Peterson JF. Identifying UMLS concepts from ECG Impressions using KnowledgeMap. In AMIA Annual Symposium Proceedings, vol. 2005, p. 196. American Medical Informatics Association, 2005.
Aronson AR, Lang FM. An Overview of MetaMap: Historical Perspective and Recent Advances. J Am Med Inform Assoc. 2010 May-Jun; 17(3):229-36.