Gerner, Lars Martin Anders
The University of Manchester; 2012
The biomedical literature is growing at an exponential rate and is becoming increasingly difficult to navigate. Text-mining methods can potentially mitigate this problem through the systematic, large-scale extraction of structured information from inherently unstructured biomedical text. This thesis reports the development of four text-mining systems that, by building on each other, have enabled the extraction of information about a large number of published statements in the biomedical literature. The first system, LINNAEUS, enables highly accurate detection (“recognition”) and identification (“normalization”) of species names in biomedical articles. Building on LINNAEUS, we implemented a range of improvements in the GNAT system, enabling high-throughput gene/protein detection and identification. Using gene/protein identification from GNAT, we developed the Gene Expression Text Miner (GETM), which extracts information about gene expression statements. Finally, building on GETM as a pilot project, we constructed the BioContext integrated event extraction system, which was used to extract information about over 11 million distinct biomolecular processes in 10.9 million abstracts and 230,000 full-text articles. The ability to detect negated statements in the BioContext system enables the preliminary analysis of potential contradictions in the biomedical literature. All tools (LINNAEUS, GNAT, GETM, and BioContext) are available under open-source software licenses, and LINNAEUS and GNAT are available as online web services. All extracted data (36 million BioContext statements, 720,000 GETM statements, 72,000 contradictions, 37 million mentions of species names, 80 million mentions of gene names, and 57 million mentions of anatomical location names) is available for bulk download. In addition, the data extracted by GETM and BioContext is available to biologists through easy-to-use search interfaces.