Sophic, Biomax Hone Text-Mining, Curation Tools for Final Phase of Cancer Gene Index
Original story in BioInform
Building a comprehensive cancer gene index
Sophic and Biomax have teamed with the National Cancer Institute (NCI) to develop the "Cancer Gene Index," a single source intended to help cancer researchers accelerate the search for cancer cures. The NCI Cancer Gene Data Curation Pilot is an attempt to create a database of associations between genes and diseases, and between genes and drug compounds, derived from the biomedical literature. The project combines automatic text mining, semi-automatic verification, and manual validation and scoring of results. A pilot study involving 1,000 genes was completed in March 2005 by the NCICB, ScenPro, Biomax Solutions Inc. (now Sophic Systems Alliance Inc.), and Biomax Informatics AG. The next phase of the study began on 20 September 2005 under the direction of the prime integrator, Sophic Systems Alliance Inc. (Falmouth, MA), teaming with Biomax Informatics AG (Martinsried, Germany), and involves the curation of 1,500 additional genes (early 2006).
In the project, Biomax scientists are using the BioLT system with highly refined ontologies and dictionaries to text mine MEDLINE abstracts from 1975 (when the first oncogene was discovered) through 2005. Associations between genes and cancer diseases are identified automatically, and the results are then manually validated by 20 annotators to identify all cancer-related genes. In the first phase of the project, detailed manual annotation of gene-disease and gene-compound relationships was carried out for 1,000 cancer genes using controlled vocabularies. The data for these genes are open source and are available on the NCI website.
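The automatic identification step described above can be illustrated with a minimal sketch of dictionary-based co-occurrence matching. This is not the BioLT implementation, whose internals are not described here; it only assumes that a gene and a disease term appearing in the same sentence form a candidate association for later manual validation. The dictionaries, the names `GENE_DICT`, `DISEASE_DICT`, and `candidate_associations`, and the placeholder terms are all hypothetical.

```python
import re

# Hypothetical dictionaries: canonical name -> lowercase synonyms.
# Real curation dictionaries would be far larger and ontology-backed.
GENE_DICT = {"GENE_X": {"gene_x", "gx1"}}
DISEASE_DICT = {"example carcinoma": {"example carcinoma"}}


def split_sentences(abstract):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]


def find_terms(sentence, dictionary):
    """Return the canonical names whose synonyms occur as whole words in the sentence."""
    lowered = sentence.lower()
    hits = set()
    for canonical, synonyms in dictionary.items():
        for synonym in synonyms:
            if re.search(r"\b" + re.escape(synonym) + r"\b", lowered):
                hits.add(canonical)
                break
    return hits


def candidate_associations(abstract):
    """List (gene, disease, sentence) triples for sentences mentioning both."""
    results = []
    for sentence in split_sentences(abstract):
        genes = sorted(find_terms(sentence, GENE_DICT))
        diseases = sorted(find_terms(sentence, DISEASE_DICT))
        for gene in genes:
            for disease in diseases:
                results.append((gene, disease, sentence))
    return results


if __name__ == "__main__":
    text = ("GX1 expression was measured. "
            "Gene_X was elevated in example carcinoma samples.")
    for triple in candidate_associations(text):
        print(triple)
```

Simple co-occurrence of this kind over-generates, which is consistent with the project's decision to have annotators validate the automatically identified associations.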
For each of these genes, annotators read all the cancer-term-associated and drug-term-associated statements (sentences and/or abstracts) and added evidence and role codes to each relevant statement. Evidence codes qualify the assertion made in the sentence about the association of a cancer or drug term with a gene name. Role codes describe, in general terms, the semantic association between the gene and the corresponding cancer or drug term. The role codes were chosen so that the combination of gene name, role code, and cancer or drug term links the concepts and reads as a meaningful sentence.
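The way a gene name, role code, and term combine into a readable statement can be sketched as a small record type. The field names, the role and evidence code values, and the placeholder gene and disease below are illustrative assumptions, not the actual codes or schema used in the Cancer Gene Index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One curated statement; all field names are hypothetical."""
    gene: str           # gene name found in the statement
    role_code: str      # semantic link between gene and term (placeholder value)
    term: str           # cancer or drug term
    evidence_code: str  # qualifies the assertion (placeholder value)

    def as_sentence(self) -> str:
        """Join gene name, role code, and term into a readable sentence."""
        role_phrase = self.role_code.replace("_", " ").lower()
        return f"{self.gene} {role_phrase} {self.term}"


# Placeholder values only; no real biological claim is intended.
example = Annotation(
    gene="GENE_X",
    role_code="is_biomarker_of",
    term="example carcinoma",
    evidence_code="EXAMPLE_EVIDENCE",
)

if __name__ == "__main__":
    print(example.as_sentence())
```

Keeping the evidence code as a separate field mirrors the description above: the role code carries the meaning of the link, while the evidence code qualifies how the source sentence supports it.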
Using the BioXM Knowledge Management Environment, the "Cancer Gene Index" knowledge base can be related to experimental data (e.g., expression or metabolite analysis) and clinical data. To fully leverage the high-quality content, the NCI Cancer Gene Index has been combined with the manually annotated Biomax Human Genome Database within the BioXM system.
The following Biomax software is being used to prepare and perform the analysis:
- The BioLT Literature Mining Tool has been used to automatically identify gene and cancer disease relations.
- Project data have been integrated with clinical data and experimental data using the BioXM Knowledge Management Environment.
Reference
Documentation at the NCI:
Cancer Gene Index End User Documentation
Creation of the Cancer Gene Index
Towards a comprehensive catalog of gene-disease and gene-drug relationships in cancer.
Schueller CME, Fritz A, Torres Schumann E, Wenger K, Albermann K, Komatsoulis GA, Covitz PA, Wright LW and Hartel F
Poster presented at the Thirteenth ISMB 2005 International Conference on Intelligent Systems for Molecular Biology, Detroit, MI, USA (PDF 860 KB)