Integration of text and data mining in life sciences
There have been numerous efforts to provide tools for storing, extracting and analysing data in the life sciences. Making these efforts interoperable and integrating them is challenging, not only technically (e.g. different formats, protocols, encodings) but also, more importantly, semantically. We are involved in a number of community-driven initiatives to provide better integration for life science research.
One initiative is to provide harmonised ways of representing and tagging named entities in the life science literature. As a complement to the development of interoperable tools, we propose establishing common document formats that make it easier to exchange annotation results embedded in the literature. We are working towards (a) recommendations for a common syntax for embedding entity mentions in publishers’ document formats (e.g. PMC), and (b) a common way to reference semantic types. The initial results have been implemented in the IeXML proposal, which has already been used in some community-wide projects (e.g. CALBC). The original IeXML paper is available here.
Involved: D. Rebholz (EBI), G. Nenadic
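To illustrate the idea of embedding entity mentions inline in a document, the following is a minimal Python sketch. It is not the IeXML syntax itself: the element name `e`, the attribute names `type` and `ref`, and the identifiers `EX:0001` and `EX:0002` are hypothetical placeholders chosen for illustration only.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_entities(text, mentions):
    """Wrap each mention in an inline element.

    `mentions` is a list of (start, end, semantic_type, identifier) tuples
    with character offsets into `text`. The <e type=... ref=...> markup is a
    hypothetical format for illustration, not the actual IeXML syntax.
    """
    parts = []
    cursor = 0
    for start, end, sem_type, ident in sorted(mentions):
        if start < cursor:
            raise ValueError("overlapping mentions are not handled in this sketch")
        # Copy the untagged text before the mention, escaping XML specials.
        parts.append(escape(text[cursor:start]))
        # Emit the mention with its semantic type and a reference identifier.
        parts.append(
            "<e type=%s ref=%s>%s</e>"
            % (quoteattr(sem_type), quoteattr(ident), escape(text[start:end]))
        )
        cursor = end
    parts.append(escape(text[cursor:]))
    return "".join(parts)


sentence = "BRCA1 interacts with TP53."
# Placeholder identifiers; a real annotation would point at database entries.
annotated = tag_entities(
    sentence,
    [(0, 5, "gene", "EX:0001"), (21, 25, "gene", "EX:0002")],
)
print(annotated)
```

Keeping the semantic type and the identifier in separate attributes reflects the two goals named above: a common syntax for marking mentions, and a common way to reference semantic types.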
Another initiative is to use ontologies and text mining to integrate and mark up data (both structured and unstructured) and provide semantics-based faceted browsing to help users navigate, query and retrieve data. The Ontogrator platform has been developed by the NERC Environmental Bioinformatics Centre and the University of Manchester. A pilot implementation, developed in collaboration with the Genomic Standards Consortium (GSC), includes integrated content from the StrainInfo, GOLD, CAMERA, Silva and PubMed databases.
Involved: D. Field (NEBC), N. Morrison (Manchester), D. Hancock, L. Hirschman, G. Nenadic, et al.
As part of the BBSRC-funded pubmed2ensembl project, we have developed a customised and extended version of the Ensembl BioMart by adding gene-related publication information, i.e. PubMed IDs and PubMed Central IDs, including URL link-outs and other information. The pubmed2ensembl BioMart has an enhanced interface that lets users run interactive full-text search queries via NCBI’s Entrez Utilities (eUtils); the search results are then applied as an additional filter on the mart datasets. The system also provides DAS link-outs into the Ensembl Genome Browser, where a custom DAS track summarises the publication data accumulated on a per-gene basis.
Involved: J. Baran, C. Bergman, G. Nenadic, M. Gerner
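The way an eUtils search result can act as a filter on gene-to-publication data can be sketched as follows. This is a minimal illustration, not the pubmed2ensembl implementation: the function names, the gene names `geneA` and `geneB`, the sample IDs and the in-memory gene table are all assumptions. The sketch only builds the ESearch request URL and parses a response of the JSON shape that ESearch returns with `retmode=json`; it does not contact NCBI.

```python
import json
from urllib.parse import urlencode

# Public NCBI ESearch endpoint (part of the Entrez Utilities).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=100):
    """Build an ESearch URL that asks for matching IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


def parse_ids(esearch_json):
    """Extract the set of matching IDs from an ESearch JSON response."""
    return set(json.loads(esearch_json)["esearchresult"]["idlist"])


def filter_genes(gene_to_ids, hit_ids):
    """Keep only genes linked to at least one publication in the search hits."""
    return {
        gene: sorted(ids & hit_ids)
        for gene, ids in gene_to_ids.items()
        if ids & hit_ids
    }


# Hypothetical stand-ins for a mart dataset and an ESearch response.
gene_table = {"geneA": {"111", "333"}, "geneB": {"444"}}
sample_response = '{"esearchresult": {"idlist": ["111", "222"]}}'

url = build_esearch_url("heat shock")
filtered = filter_genes(gene_table, parse_ids(sample_response))
print(url)
print(filtered)
```

Passing `db="pmc"` instead of the default would target PubMed Central, in which case the gene table would need to hold PubMed Central IDs rather than PubMed IDs.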