Cancer Gene Index

Building a comprehensive resource for cancer research

Biomax has teamed with Sophic and the National Cancer Institute (NCI) to develop the “Cancer Gene Index”, a single resource to help cancer researchers accelerate the search for cancer cures. The NCI Cancer Gene Data Curation Pilot created a database of gene–disease and gene–drug compound associations derived from the biomedical literature. The project combined automatic text mining, semi-automatic verification, and manual validation and scoring of results. A pilot study involving 1000 genes was completed in March 2005 by the NCICB, ScenPro, Biomax Solutions Inc. (now Sophic Systems Alliance Inc.) and Biomax. The next phase of the study began in September 2005 and involved the curation of 1500 additional genes (early 2006).

In the project, Biomax scientists used a text-mining system with highly refined ontologies and dictionaries to mine the text of MEDLINE abstracts from 1975 (when the first oncogene was discovered) through 2005. Associations between genes and cancer diseases were identified automatically, and the results were manually validated by 20 annotators to identify all cancer-related genes. In the first phase of the project, detailed manual annotation of gene–disease and gene–compound relationships was carried out for 1000 cancer genes using controlled vocabularies. The annotations for these genes are openly available on the NCI website.
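The automatic identification step described above can be illustrated with a minimal sketch of dictionary-based co-occurrence mining. This is not the Biomax text-mining system: the tiny dictionaries, the function names, the naive sentence splitter, and the case-insensitive matching are all assumptions made for illustration. The sketch only produces candidate gene–disease pairs per sentence, which would then go to manual validation.

```python
import re
from itertools import product

# Hypothetical miniature dictionaries (canonical name -> synonyms).
# The real project used much larger, curated ontologies and dictionaries.
GENE_TERMS = {
    "TP53": ["TP53", "p53"],
    "BRCA1": ["BRCA1"],
}
DISEASE_TERMS = {
    "breast cancer": ["breast cancer", "breast carcinoma"],
    "leukemia": ["leukemia", "leukaemia"],
}


def find_terms(sentence, dictionary):
    """Return the canonical names whose synonyms occur in the sentence."""
    found = set()
    for canonical, synonyms in dictionary.items():
        for synonym in synonyms:
            pattern = r"\b" + re.escape(synonym) + r"\b"
            # Case-insensitive matching is a simplification for this sketch.
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found.add(canonical)
                break
    return found


def split_sentences(abstract):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def candidate_associations(abstract):
    """List (gene, disease, sentence) for every co-occurrence in a sentence."""
    results = []
    for sentence in split_sentences(abstract):
        genes = sorted(find_terms(sentence, GENE_TERMS))
        diseases = sorted(find_terms(sentence, DISEASE_TERMS))
        for gene, disease in product(genes, diseases):
            results.append((gene, disease, sentence))
    return results


if __name__ == "__main__":
    text = ("Mutations in p53 are frequent in breast carcinoma. "
            "BRCA1 was also sequenced.")
    for gene, disease, sentence in candidate_associations(text):
        print(gene, "|", disease, "|", sentence)
```

Co-occurrence alone over-generates, which is why the project paired the automatic step with semi-automatic verification and manual validation.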

For each of these genes, annotators read all the cancer-term-associated and drug-term-associated statements (sentences and abstracts) and added evidence and role codes to each relevant statement. Evidence codes qualify the assertion made in the sentence about the association of a cancer or drug term with a gene name. Role codes describe the semantic association between the gene and the corresponding cancer or drug term. The role codes were chosen so that the combination of gene name, role code, and cancer or drug term links the concepts and reads as a meaningful sentence.
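The gene name, role code, and term combination can be pictured as a small record, sketched below. The class name, field names, and the example role and evidence code values are hypothetical and are not taken from the project's controlled vocabularies; the sketch only shows how such a triple can be read back as a sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotated statement linking a gene to a cancer or drug term."""
    gene: str
    role_code: str      # hypothetical label, e.g. "is_overexpressed_in"
    term: str           # cancer or drug term
    evidence_code: str  # hypothetical label qualifying the assertion
    statement: str      # the source sentence or abstract text

    def as_sentence(self) -> str:
        """Join gene name, role code, and term into a readable sentence."""
        role = self.role_code.replace("_", " ")
        return f"{self.gene} {role} {self.term}"


if __name__ == "__main__":
    example = Annotation(
        gene="ERBB2",
        role_code="is_overexpressed_in",   # illustrative value only
        term="breast carcinoma",
        evidence_code="experimental",      # illustrative value only
        statement="ERBB2 overexpression was observed in breast carcinoma samples.",
    )
    print(example.as_sentence())
```

Keeping the evidence code as a separate field from the role code mirrors the distinction in the text: one qualifies how well the statement supports the association, the other says what the association is.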

Using the BioXM™ Knowledge Management Environment, the “Cancer Gene Index” knowledge base can be integrated with experimental data (e.g., expression or metabolite analysis) and clinical data. To fully leverage the high-quality content, the NCI Cancer Gene Index has been combined with the Biomax human genome database within the BioXM system, where the text-mining module was used to automatically identify relations between genes and cancer diseases. Project data have already been integrated with clinical and experimental data in this way.

Towards a comprehensive catalog of gene-disease and gene-drug relationships in cancer.

