Our research projects are focused on developing techniques for large-scale extraction and management of unstructured and semi-structured textual resources. Our interests include:
- Health-related information synthesis
- Synthesis of information from unstructured electronic health-care records, patient narratives and the literature to support clinical decision making
- Clinical documentation, text mining and terminology management
- Integration and analytics of health-care linked-data
- Sentiment mining of health-related social media
- Temporal text mining
- Large-scale extraction and contextualization of biomolecular events
- Identification of conflicting statements in scientific texts
- Extraction of host-pathogen interactions
- Mining of scientific methodologies from literature
- Capturing and analysing the use of in-silico experimental methods in computational biology and bioinformatics
The current TEAM projects include:
- Healthcare text mining projects (>>)
- Integration of dynamic documentation knowledge services into Siemens framework (>>)
- Linked2Safety - semantically linking electronic health records and clinical trials systems (>>)
- Epidemiological text mining (>>)
- HeRC - harnessing electronic health data to improve care for patients and communities (>>)
- Clinical temporal expression mining (>>)
- Mining molecular interaction data and its context (>>)
- Logical modelling of molecular interactions in the development of thyroid cancer (>>)
- Mining scientific methodologies (in computational biology) (>>)
- Contrasts and contradictions in scientific texts (>>)
- Data quality in question and answer reports (>>)
- Text analytics and sentiment analysis in healthcare Web 2.0 (>>)
Healthcare text mining projects: mining clinical narratives and patient-generated data
We currently run a number of projects to extract various types of structured data from unstructured clinical narratives and electronic healthcare records (EHRs). In previous projects we have developed a combination of rule-based and machine-learning methods to identify diseases that a patient has or does not have ("disease status"), including identification of various co-morbidities, problems, tests and treatments. We also work on the extraction of medication-related information (such as medication name, dosage, reason for taking, frequency, duration, etc.) and on clinical temporal text mining. These tasks were assessed as part of international text mining challenges in the clinical/health-care domain, where we have shown consistent success: for more detail, see here (>>).
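To give a flavour of the rule-based side of this work, here is a minimal sketch of medication-attribute extraction. The patterns and attribute names are simplified, hypothetical examples, not our actual extraction rules, which are far richer:

```python
import re

# Hypothetical, highly simplified patterns -- real clinical rules cover many
# more surface forms (units, abbreviations, routes, schedules, etc.).
DOSAGE = re.compile(r'(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g|ml)\b', re.I)
FREQUENCY = re.compile(r'\b(?P<freq>once|twice|three times)\s+(?:a|per)\s+day\b', re.I)

def extract_medication_attributes(text):
    """Return dosage and frequency attributes found in a clinical sentence."""
    result = {}
    m = DOSAGE.search(text)
    if m:
        result['dose'] = m.group('dose')
        result['unit'] = m.group('unit').lower()
    f = FREQUENCY.search(text)
    if f:
        result['frequency'] = f.group('freq').lower()
    return result
```

In practice such hand-written patterns are combined with machine-learned sequence taggers, which handle the variability that fixed patterns miss.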
Another strand in healthcare text mining is the extraction of subjective information from patient-generated data, such as tweets, blogs or patients' narratives. We have also done some work on the analysis of suicide notes (as part of the i2b2 challenges >>).
In collaboration with The Christie Hospital and the University of Salford, we are running "A study using techniques from clinical text mining to compare the narrative experiences of patients with medulloblastoma with factors identified from their hospital records". This project aims to capture the narrative experiences of patients and compare them to the themes that are identified by text mining of the Christie Hospital health records. The findings are intended to provide an evidence-base for clinical service development. This work is funded by The Christie Charity Fund (£25K), and is part of Azad's PhD.
As a continuation of The Christie Hospital's project, we are part of a project led by the University of Salford (Prof Tony Long) on "Systematic analysis of healthcare records and the narrative experiences of children with tumours of the central nervous system and their carers – informing the evolution of self-esteem and health-related outcomes for future targeted interventions". This project is funded by the Kidscan Charity (£62K).
The NIHR-funded project on "Enhanced occupational therapy interventions for children and adolescents with central nervous system tumours" aims to use advanced text-mining techniques to support patient involvement and decision making in clinical practice. It is a £250K collaboration with the Royal Manchester Children's Hospital and the University of Salford, due to start in 2013.
Our team currently involves A. Dehghan, G. Karystianis, J. Keane, S. Stivaros, E. Estlin, A. Kovacevic (collaborator), M. Filannino, G. Nenadic. More information on healthcare text mining tools is available on our HECTA pages (>>).
Integration of Dynamic Documentation Knowledge Services into Siemens Framework
This project is carried out in collaboration with the Clinical Knowledge Management Research Team at Manchester. The team, originated by Professor Alan Rector, has a thirty-year history of leading innovations in clinical user interfaces, development of health terminology/ontologies, and medical records. The Clinical Knowledge Management Research Team and Siemens Medical Solutions have an established collaboration to develop the next generation of context-sensitive clinical systems and associated knowledge management technologies. The core of the methodology is built around the Web Ontology Language (OWL) and related Semantic technologies, which are used to manage context, merge content, and implement associations and rules.
An initial version of the Knowledge Management Environment was designed and developed in the previous phase of the project, and the current project is focused on integration with Siemens' Hospital Information System, Soarian Clinicals. The project will also explore the role of data analytics, including text/association mining, statistics and probabilistic modelling for knowledge acquisition.
In addition to Prof Rector, the project involves Drs Bijan Parsia (PI) and Goran Nenadic (CoI), with a team of six clinical knowledge engineers and software developers.
Linked2Safety - a next-generation, secure linked-data medical information space for semantically-interconnecting electronic health records and clinical trials systems
The main aim of the Linked2Safety project is to explore Semantic Web and Linked Data technologies to facilitate the semantic interlinking of electronic health records (EHRs) and clinical trials systems, gathering and sharing knowledge to support decision making in medical and clinical research. The vision is to facilitate early detection of patient safety issues, the identification of adverse events, and the identification of a suitable critical mass of patients to participate in small (Phase II and III) or larger-scale (Phase IV) clinical trials.
Our role is focused on the design of an interoperable EHR data space and the development of bio-marker data mining techniques for early detection of adverse events. We will also provide several clinical trials showcases and organise the Clinical Research and Patient Safety Special Interest Group. More details are available here (>>).
Epidemiological text mining
In this project we aim to extract the main characteristics of epidemiological studies from the associated medical literature (abstracts only). We focus on the design of the study, population, exposure(s), outcome(s), odds ratios and covariates. The domain focus of this research is obesity, as it involves a wide variety of factors that have been studied. The project is part of George's PhD (>>) with Prof I. Buchan (Medicine, HeRC) and G. Nenadic (SCS).
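As an illustration of one of the simpler extraction targets, a minimal sketch of odds-ratio detection in abstract text might look as follows. The pattern is a hypothetical simplification, not the method used in the project:

```python
import re

# Hypothetical pattern for statements such as "OR = 2.31 (95% CI 1.10-4.85)";
# real abstracts use many more notational variants.
OR_PATTERN = re.compile(
    r'\bOR\b\s*=?\s*(?P<or>\d+(?:\.\d+)?)'
    r'(?:\s*\(95%\s*CI[:\s]*(?P<lo>\d+(?:\.\d+)?)\s*[-–]\s*(?P<hi>\d+(?:\.\d+)?)\))?',
    re.I)

def extract_odds_ratios(abstract):
    """Pull odds ratios (and confidence intervals, when present) from text."""
    results = []
    for m in OR_PATTERN.finditer(abstract):
        entry = {'or': float(m.group('or'))}
        if m.group('lo'):
            entry['ci'] = (float(m.group('lo')), float(m.group('hi')))
        results.append(entry)
    return results
```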
Health eResearch Centre (HeRC) - harnessing electronic health data to improve care for patients and communities
Our research on clinical text mining, processing patient-generated data and building interoperable clinical data processing infrastructures is part of a new multimillion-pound centre of excellence that has been awarded to a consortium led by The University of Manchester. The consortium brings together partners from academia, the NHS, local authorities and industry in a five- to ten-year programme. The Medical Research Council (MRC), along with nine other government and charity funders, is investing £4.5 million in HeRC over the next five years, and the total activity, with investments from industry and academia, will be around £18 million.
One of the main objectives is to enable different research teams to collaborate across different organisations to produce more powerful and timely analyses of anonymised healthcare records. The aim is to combine clinical, social and research data to identify more effective treatments, improve drug safety, assess risks to public health and study the causes of diseases and disability. The Centre will make use of patient data sets available through the Clinical Practice Research Datalink, a £60 million service recently announced by the Medicines and Healthcare Products Regulatory Agency and the National Institute for Health Research. The centre is led by Prof Iain Buchan, and Dr Goran Nenadic is one of the co-investigators, co-leading the development of CHIP-SET (Community Health Intelligence Partnership – Semantic Epidemiology Toolkit) along with Prof Carole Goble and John Ainsworth, in particular in the area of text mining and the overall bio-health informatics input.
Clinical Temporal Expression Mining
The aim of this project is to extract mentions of temporal expressions in clinical narratives (and patient-generated data) using a combination of rule-based and machine-learning methods. We also aim to normalise those mentions by mapping them to their value (using the ISO-8601 representation, e.g. "2012-10-31T09:00") and type (e.g. Date, Duration, Frequency). This work is part of Michele's (>>) and Azad's (>>) PhDs.
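A minimal sketch of the normalisation step is shown below; the surface forms handled here are a hypothetical fragment of what a real clinical TIMEX system covers (relative dates, document-time anchoring, fuzzy expressions, etc.):

```python
import re

# Illustrative normaliser for two simple surface forms; real systems cover
# many more patterns and resolve expressions relative to the document time.
MONTHS = {'january': 1, 'february': 2, 'march': 3, 'april': 4, 'may': 5,
          'june': 6, 'july': 7, 'august': 8, 'september': 9, 'october': 10,
          'november': 11, 'december': 12}

DATE = re.compile(r'(?P<d>\d{1,2})\s+(?P<m>[A-Za-z]+)\s+(?P<y>\d{4})')
DURATION = re.compile(r'(?P<n>\d+)\s+(?P<u>day|week|month|year)s?', re.I)

def normalise_timex(text):
    """Map a temporal expression to an ISO-8601-style value and a type."""
    m = DATE.search(text)
    if m and m.group('m').lower() in MONTHS:
        value = '%04d-%02d-%02d' % (int(m.group('y')),
                                    MONTHS[m.group('m').lower()],
                                    int(m.group('d')))
        return {'value': value, 'type': 'Date'}
    m = DURATION.search(text)
    if m:
        unit = {'day': 'D', 'week': 'W', 'month': 'M', 'year': 'Y'}[m.group('u').lower()]
        return {'value': 'P%s%s' % (m.group('n'), unit), 'type': 'Duration'}
    return None
```

Durations use the ISO-8601 period notation (e.g. "P3W" for three weeks), matching the value/type scheme described above.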
Mining molecular interaction data and its context
The project involves extraction of the context of molecular interaction data from the scientific literature. So far, little attempt has been made to capture the context of a molecular interaction, how reliable it is, what the nature of the interaction is, etc. The project aims to study the way findings, experiments and knowledge about molecular interactions are presented in the literature, and in particular how contextual information that details molecular interactions is encoded and presented. The project implements a text mining framework to extract contextual information from full-text articles and link it with data in other resources to support informed decisions for understanding the complexity of interactions. The project is a collaboration with Pfizer, and the focus is therefore placed on pharmaceutically relevant data sets, including various pathogens such as HIV, hepatitis viruses and malaria.
In previous work (with M. Gerner, S. Farzaneh, C. Bergman) we have developed BioContext, a system for extracting and integrating information about molecular processes in biomedical articles. Using the data extracted by BioContext, it is possible to get an overview of a range of biomolecular processes relating to a particular gene or anatomical location. The current project is part of Dan's PhD (>>), with Prof D. Robertson (Bioinformatics) and Drs B. Sidders (Pfizer) and G. Nenadic as supervisors.
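As a toy illustration of what interaction extraction involves, a purely pattern-based extractor is sketched below. This is not how BioContext works (which combines named-entity recognition, event extraction and integration steps); the patterns and entity shape are illustrative assumptions:

```python
import re

# Toy patterns for interaction statements such as "X interacts with Y" or
# "X binds (to) Y"; entity names are approximated as capitalised tokens.
PATTERN = re.compile(
    r'(?P<a>[A-Z][A-Za-z0-9-]+)\s+(?:interacts with|binds(?: to)?)\s+'
    r'(?P<b>[A-Z][A-Za-z0-9-]+)')

def extract_interactions(sentence):
    """Return (entity_a, entity_b) pairs asserted in the sentence."""
    return [(m.group('a'), m.group('b')) for m in PATTERN.finditer(sentence)]
```

The hard part, and the focus of the project, is what such a pattern ignores: the experimental context, hedging and reliability of each reported interaction.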
Logical modelling of molecular interactions in the development of thyroid cancer
The main goal of the project is to generate new hypotheses for the understanding and treatment of thyroid cancer with the help of Text Mining and Logical Modelling. Text Mining is used to extract information related to the molecular interactions for thyroid cancer from the biomedical literature (based on BioContext). The second step is to use the extracted information to construct a logical model for thyroid cancer and use it to make predictions about medically useful pathways or drug targets; the hypotheses formed in the second phase will be validated experimentally in the third year. This work is part of Chengkun's PhD, with J-M. Schwartz, G. Brabant and G. Nenadic as supervisors.
Mining scientific methodologies (in computational biology)
The methodologies we use both enable and help define our research. The choice of methodology has a significant impact on the value and quality of the subsequent results. However, as experimental complexity has increased, choosing appropriate methodologies has become an increasingly difficult task. We have therefore been working on a system for extracting and analysing methodologies in computational biology from full-text journal literature. These are used to analyse the methodological habits of a whole research community (including methods, data sets, tools, etc.) and to identify methodologies with specific characteristics (e.g. the most popular, most novel, most appropriate or most cited methodology). This is Geraint's PhD topic (>>), with R. Stevens, D. Robertson and G. Nenadic as supervisors.
Contrasts and contradictions in scientific texts
Detecting and analysing contrasts and contradictions in scientific texts is essential for suggesting further research potential and discoveries. Finding contrasts and contradictions in text by automatic means is a relatively new area in text mining. Specifically, most biological text mining research has so far focused on mining affirmative statements about the relations amongst entities, although there is growing interest in finding reports on weak or negative relations, or the lack thereof. Negation detection is an intermediate step towards finding contrasts and contradictions, and has been of special interest in medical text mining because of the abundance of negative patterns in medical descriptions. The aim of this research is to develop text mining methods to detect and analyse contrasting facts in the biomedical literature, specifically around molecular interactions. This work was part of Fa's thesis (>>).
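The negation-detection step mentioned above is often approached with NegEx-style trigger rules: a concept mention is treated as negated if a negation cue appears within a small window of preceding tokens. A minimal sketch follows; the cue list is abridged and illustrative, not a complete trigger set:

```python
# Abridged, illustrative cue list -- NegEx-style systems use much larger
# trigger lexicons and also handle post-positioned and pseudo-negation cues.
NEGATION_CUES = {'no', 'not', 'without', 'denies', 'absence'}
WINDOW = 5  # tokens to look back from the concept mention

def is_negated(tokens, concept_index):
    """Return True if a negation cue precedes tokens[concept_index]."""
    start = max(0, concept_index - WINDOW)
    return any(t.lower() in NEGATION_CUES for t in tokens[start:concept_index])
```

For example, "patient denies chest pain" negates the mention of chest pain, whereas "patient reports chest pain" does not.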
Investigating Data Quality Aspects of Question and Answer Reports
As the quantity of available data increases, its level of “quality” varies significantly, and this becomes a critically important factor for the effectiveness of organisations and individuals. Most business and scientific data is represented in unstructured and semi-structured formats, yet most current data quality methodologies work solely on structured data from a conceptual perspective. Furthermore, question and answer reports are gaining momentum as a way to collect responses that can be used by data brokers, for instance in business (customer satisfaction reports and FAQs). However, these reports suffer from many data quality issues that affect their performance and efficiency in use. We have therefore been working on a data quality methodology, with an associated data quality assistant tool, that linguistically analyses such reports to track and identify data quality problems before the reports are deployed into a data store system or used for data analysis. This is Mona's PhD work, with G. Nenadic and B. Theodoulidis as supervisors.
Text Analytics and Sentiment Analysis in Healthcare Web 2.0
Sentiment analysis is a field in computational linguistics involving identification, extraction and classification of opinions and emotions expressed in natural language. The capture and analysis of such attitudes and opinions in an automated and structured fashion might offer a powerful technology to a number of problem domains, including business intelligence, marketing, national security, crime prevention and biomedicine. This is part of Rosyzie's and Azad's PhD projects.
Named-entity recognition and term mining
Recognizing terms and named entities in research articles and mapping them to unique identifiers is an important first step in most text mining software. This is a challenging task because of ambiguity and variation in how entities and concepts are named and used, in particular in the biological literature.
Our main focus in this theme is the identification of gene and protein names. This project aims to conduct a critical evaluation of open source gene name identification systems in full-text articles and associated supplementary data files, to integrate gene name identification systems into general bioinformatics workflows, and to develop real-time methods for gene name identification that will assist authors in the publication process. This project will also establish best practices, provide middleware for gene name identification, and integrate gene name identification into the Semantic Web infrastructure for the life sciences. We have been collaborating with J. Hakenberg to improve GNAT.
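At its simplest, gene name identification can be framed as dictionary lookup with orthographic variant generation, as sketched below. This is an illustrative toy; systems such as GNAT add disambiguation, species resolution and context-based filtering on top:

```python
import re

# Illustrative dictionary-based recognition: generate simple surface variants
# of each known gene name and match them as whole words in the text.
def variants(name):
    """Generate simple orthographic variants of a gene name."""
    return {name, name.lower(), name.upper(),
            name.replace('-', ''), name.replace('-', ' ')}

def find_gene_mentions(text, gene_dictionary):
    """Return (mention, gene_id) pairs found in text by dictionary lookup."""
    hits = []
    for gene_id, name in gene_dictionary.items():
        for form in variants(name):
            for m in re.finditer(r'\b%s\b' % re.escape(form), text):
                hits.append((m.group(0), gene_id))
    return hits
```

The ambiguity problem mentioned above shows up immediately: many gene symbols collide with common English words or with genes of other species, which is why lookup alone is not enough.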
We are also interested in identification of other semantic classes, such as species names (see LINNAEUS and its demo page), anatomical locations (see GETM), clinical concepts (see i2b2), etc.
This is a long-term project that aims at developing text mining methods that can provide efficient and sophisticated knowledge acquisition, offer plausible hypotheses for testing, prevent unnecessary repetition of previous work, and help in experimental design for specific research scenarios. We investigate various text mining approaches to establishing literature-based associations and links among various biological entities such as proteins, genes, species, cells, and experiments.
The work was partially funded by BBSRC ("Mining term associations to support knowledge discovery in biology") to explore suitable technologies for modelling user-elicited biological text mining scenarios to support hypothesis generation, and builds on a previous BBSRC project ("Protein Functional Classification using Text Data-mining") that developed automatic text-based classification of proteins into functional categories (based on the Gene Ontology) using machine learning techniques and various textual features.
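To illustrate the general idea of text-based classification, here is a minimal multinomial Naive Bayes over bag-of-words features. It is a toy stand-in, with invented example categories, not the classifiers or features developed in the project:

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial Naive Bayes with add-one smoothing over bag-of-words.
class TextClassifier:
    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(documents, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, document):
        words = document.lower().split()
        best, best_score = None, float('-inf')
        total_docs = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best
```

In the actual project setting the "documents" would be texts associated with a protein and the labels would be Gene Ontology categories.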
We are specifically interested in extraction of various molecular events, including gene expressions (see GETM), positive and negative regulations, bindings, etc. (see BioNLP as well as "Mining molecular events and their context" below).
There are a number of services and resources available to the bioinformatics community, but the metadata that describe them are typically scarce. This project aims to develop text mining techniques to automatically describe, locate, retrieve and reason about bioinformatics services and resources. We investigate methods that extract descriptions from various document types (articles, reviews, application notes, email archives, discussion forums, etc.), and map them to service descriptions using both general service ontologies and domain-specific ontologies. As a working and target environment, the project uses the myGRID/Taverna infrastructure.
There have been numerous efforts to provide tools for storing, extracting and analysing data in life sciences. Interoperability and integration of such efforts is a challenging issue, not only technically (e.g. different formats, protocols, encodings) but also more importantly semantically. We are involved in a number of community-driven initiatives to provide better integration for life science research.
One initiative is to provide harmonised ways for representing and tagging named entities in the life science literature. We are proposing to establish common document formats that facilitate the exchange of annotation results contained in the literature as a complementary approach to the development of interoperable tools. We work towards (a) recommendations for a common syntax to embody entity mentions in publishers' document formats (e.g., into PMC), and (b) provision of a common way to reference semantic types. The initial results have been implemented in the IeXML proposal, which has already been used in some community-wide projects (e.g. CALBC). The original IeXML paper is available here.
Involved: D. Rebholz (EBI), G. Nenadic
Another initiative is to use ontologies and text mining to integrate and mark up data (both structured and unstructured) and provide semantics-based faceted browsing to help users navigate, query and retrieve data. The Ontogrator platform has been developed by the NERC Environmental Bioinformatics Centre and the University of Manchester, with a pilot implementation developed in collaboration with the Genomic Standards Consortium (GSC) that includes integrated content from the StrainInfo, GOLD, CAMERA, Silva and Pubmed databases.
Involved: D. Field (NEBC), N. Morrison (Manchester), D. Hancock, L. Hirschman, G. Nenadic, et al.
As part of the BBSRC-funded pubmed2ensembl project, we have developed a customised and extended version of the Ensembl BioMart by adding gene-related publication information, i.e. PubMed-IDs and PubMed Central-IDs, including URL link-outs and other information. The pubmed2ensembl BioMart has an enhanced interface that permits interactive full-text search queries via NCBI's Entrez Utilities (eUtils), whose search results are applied as an additional filter on the mart datasets. The system also provides DAS link-outs into the Ensembl Genome Browser, where a custom DAS track summarises the publication data that have been accumulated on a per-gene basis.
Sentiment analysis is the extraction of attitudes and opinions from human-authored documents. The capture and analysis of such attitudes and opinions in an automated and structured fashion might offer a powerful technology to a number of problem domains, including business intelligence, marketing, national security, and crime prevention. This project aims to develop technologies for extraction and analysis of sentiment from free text using a combination of natural language processing (NLP), text mining and machine learning techniques. The work will involve building models of sentiment from which suitable templates for extraction will be designed. The current main focus is the health-care domain.
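The simplest baseline for this kind of analysis is lexicon-based polarity scoring, sketched below. The lexicon here is a tiny invented example; real systems use large, domain-adapted lexicons and handle negation and intensity:

```python
# Toy polarity lexicons -- illustrative only; health-care sentiment systems
# need domain-specific vocabularies and context handling.
POSITIVE = {'helpful', 'effective', 'relieved', 'improved'}
NEGATIVE = {'pain', 'worse', 'tired', 'anxious'}

def sentiment_score(text):
    """Return a polarity score: positive minus negative word counts."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```

A positive score suggests overall positive sentiment, a negative score the opposite; template-based extraction, as described above, goes beyond such aggregate scores to attach sentiment to specific targets.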
The ultimate aim of Internet search engines is to index the entire Web by following the links found within known pages. Topic-focused crawlers specialise this task by indexing only the subset of the Web that is relevant to some topic or information need. Typically, the user specifies the topic and a set of start (seed) pages, and the topic-focused crawler traverses the Web via hyperlinks as general crawlers do. However, a topic-focused crawler makes a relevance decision at each step as to whether a page should be included in the resulting index, and also prioritises out-going links, based on various features, according to their expected gains. This project aims to use publicly available hierarchical Web directories to facilitate the prioritisation of out-going links according not only to their relevance to the desired topic, but also to their relevance to related ones. This should help the crawl stay as relevant as possible even when no 'positive' links are available.
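The frontier logic of such a crawler can be sketched as a best-first search over a priority queue. Fetching and relevance scoring are stubbed out as caller-supplied functions, and the scoring scheme is an assumption for illustration:

```python
import heapq

# Best-first topic-focused crawl sketch: always expand the highest-scoring
# frontier URL next. fetch_links and score_relevance are supplied by the
# caller (here they would wrap HTTP fetching and a topic classifier).
def crawl(seed_pages, fetch_links, score_relevance, max_pages=100):
    """Return the list of pages visited, in crawl order."""
    frontier = [(-score_relevance(url), url) for url in seed_pages]
    heapq.heapify(frontier)  # min-heap; negated scores give max-first order
    visited = []
    seen = set(seed_pages)
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score_relevance(link), link))
    return visited
```

On a toy link graph, a link scored as on-topic is visited before an off-topic sibling, which is exactly the prioritisation behaviour described above.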
Document clustering is a generic problem with widespread applications within natural language engineering. Our present research focuses on using text summarisation techniques as a pre-processing step for document clustering, in the context of automated assessment of student essays. One of the major problems in natural language processing is that a document can contain a very large number of words. If each of these words is represented as a vector coordinate, the number of dimensions would be too high for the document clustering algorithm. Hence, it is crucial to apply pre-processing methods (such as summarisation) that reduce the number of dimensions (words) given to the document clustering algorithm, while keeping both the information and the quality of what was presented in the original documents.
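The pre-processing step described above can be sketched with a simple frequency-based extractive summariser: sentences whose words are frequent in the document are kept, and only those surviving sentences feed the clustering algorithm. This is an illustrative baseline, not the summarisation method used in the research:

```python
import re
from collections import Counter

# Frequency-based extractive summarisation as dimensionality reduction:
# rank sentences by average word frequency and keep the top-scoring ones.
def summarise(document, max_sentences=2):
    """Keep the sentences whose words are most frequent in the document."""
    sentences = [s.strip() for s in document.split('.') if s.strip()]
    freq = Counter(re.findall(r'[a-z]+', document.lower()))
    def score(sentence):
        words = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)
    kept = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    return '. '.join(s for s in sentences if s in kept)
```

Because the summary contains far fewer distinct words than the full document, the vector space handed to the clustering algorithm has correspondingly fewer dimensions.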