Named-entity recognition and term mining

Recognizing terms and named entities in research articles and mapping them to unique identifiers is an important first step in most text mining software. The task is challenging because the names of entities and concepts are ambiguous and vary widely in form and usage, particularly in the biological literature.
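To make the two difficulties concrete, here is a minimal dictionary-lookup sketch in Python. It is only an illustration, not how GNAT or any other system mentioned on this page works: the lexicon, the `EX:` identifiers, and the function names are all made up for the example. Variation is handled by ignoring case and treating hyphens and spaces alike; ambiguity shows up as a name that maps to more than one candidate identifier.

```python
import re

# Toy lexicon: normalized name -> candidate identifiers.
# All identifiers are hypothetical placeholders, not real database IDs.
LEXICON = {
    "p53": ["EX:0001", "EX:0002"],  # ambiguous: same name used for two genes
    "tp53": ["EX:0001"],
    "tumor protein p53": ["EX:0001"],
    "trp53": ["EX:0002"],
}


def normalize(term):
    """Lowercase and collapse runs of hyphens/whitespace to a single space."""
    return re.sub(r"[\s\-]+", " ", term.lower()).strip()


def build_pattern(names):
    """Compile one regex that matches any name, trying longer names first."""
    ordered = sorted(names, key=len, reverse=True)
    # Allow hyphens or whitespace wherever the lexicon entry has a space.
    parts = [r"[\s\-]+".join(re.escape(w) for w in n.split()) for n in ordered]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)


PATTERN = build_pattern(LEXICON)


def find_entities(text):
    """Return (matched text, candidate identifiers) for each mention found."""
    return [
        (m.group(0), LEXICON[normalize(m.group(0))])
        for m in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sentence = "Tumor-protein p53 (TP53) and Trp53 were compared; p53 was lost."
    for mention, candidates in find_entities(sentence):
        print(mention, candidates)
```

A mention that returns several candidates, such as the bare "p53" here, still has to be disambiguated from its context (for example, from the species discussed in the article), which a simple lookup like this cannot do.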
Our main focus in this theme is the identification of gene and protein names. The project has three aims: to critically evaluate open source gene name identification systems on full-text articles and their associated supplementary data files; to integrate these systems into general bioinformatics workflows; and to develop real-time methods for gene name identification that assist authors during the publication process. The project will also establish best practices, provide middleware for gene name identification, and integrate gene name identification into the Semantic Web infrastructure for the life sciences. We have been collaborating with J. Hakenberg to improve GNAT.
We are also interested in identifying other semantic classes, such as species names (see LINNAEUS and its demo page), anatomical locations (see GETM), and clinical concepts (see i2b2).