Literature Mining

Literature mining – optional

The literature mining module is a customizable tool for intuitive and structured text mining. It combines biological and medical term dictionaries with powerful free-text querying capabilities. The tool provides comprehensive and structured answers to complex questions. Search results can be used for iterative refinement and extension of queries.

The literature mining module was used as the core data-mining component in a project to create an Oncology Knowledge Base in cooperation with a US health organization. In this project, the BioLT literature mining software was used to create an up-to-date index of all human cancer-related genes, including compound and disease relationships.

The literature mining results become part of larger semantic networks within the BioXM Knowledge Management Environment. Text-mining information, such as relationships between genes, diseases and compounds, can be graphically displayed, curated and linked. For example, Biomax provides the Oncology Knowledge Base data in the BioXM system.

Benefits

  • Overcomes cumbersome and laborious search strategies
  • Supports keyword-driven and hypothesis-driven searches
  • Provides fast generation of domain-specific overviews
  • Extracts comprehensive and structured information using extendable biomedical dictionaries
  • Supports project-driven biomedical text analyses
  • Provides literature data for inclusion in knowledge management systems

Features

  • Complete and precise data mining from free text
  • Powerful query language (using Boolean operators, dictionaries, wildcards and topics)
  • Large and extendable biomedical dictionary foundation (covers diseases, genes, compounds, taxonomy, etc.)
  • Automatic vocabulary generation for any domain
  • Acronym detection tool for greater precision
  • Several ranking methods (chi-square, cosine, etc.)
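
The page does not say how the module computes its rankings, so the following is only a minimal sketch of the two named measures in their textbook form, applied to co-occurrence counts. The function names, parameters and counts are hypothetical and are not part of the BioLT or BioXM API.

```python
import math


def chi_square(n_both, n_term, n_query, n_total):
    """Pearson chi-square statistic for a 2x2 co-occurrence table.

    n_both  : text units containing both the dictionary term and the query
    n_term  : text units containing the dictionary term
    n_query : text units containing the query
    n_total : all text units in the corpus
    (All parameter names are illustrative assumptions.)
    """
    a = n_both                                 # term and query
    b = n_term - n_both                        # term only
    c = n_query - n_both                       # query only
    d = n_total - n_term - n_query + n_both    # neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n_total * (a * d - b * c) ** 2 / denom


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(count * vec_b.get(key, 0) for key, count in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A higher chi-square value indicates that a dictionary term and the query co-occur more (or less) often than independence would predict. Cosine compares the overall occurrence profiles of two terms across documents.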

Conceptual and technical background

The literature mining module for the BioXM Knowledge Management Environment integrates the BioLT engine developed by Biomax. The BioLT engine is used for large-scale data-mining projects and provides a robust, scalable infrastructure for mining text corpora. Corpora can be integrated into the BioXM system in a number of formats, such as .pdf, .doc, .ppt, .xls and .txt.

Within the BioXM system, the literature mining module extends the core system with additional search criteria for the advanced query builder and new view-items for the report builder. A number of administrative script extensions allow convenient processing and updating of text-mining analyses. Text corpora and dictionaries can be defined and maintained directly through the main BioXM client application.

Because the literature mining module extends the capabilities of the BioXM query builder, text-mining functionality can be combined with any standard search feature. For example, one can search for “any gene involved in signal transduction co-occurring with any disease within the scope of a sentence within a set of documents mentioning a compound class of interest”. Here, the “co-occurring” criterion is provided by the literature mining module, whereas the “involved in signal transduction” criterion uses the inference criterion that the BioXM system provides out of the box.

For a given free-text search term, the module searches for co-occurring dictionary objects. Matching sentences (or complete documents) are assigned to the dictionary objects found in your document set. A dictionary object is identified by any one of its synonyms, and the synonyms that matched are attached to the resulting dictionary term as well. Co-occurrence is tested within a scope: either sentence scope or abstract/document scope.
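
The co-occurrence search described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the BioLT implementation: the dictionary, the sample documents, the naive sentence splitter and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical dictionary: canonical term -> list of synonyms.
DICTIONARY = {
    "TP53": ["TP53", "p53", "tumor protein p53"],
    "breast cancer": ["breast cancer", "breast carcinoma"],
}

# Hypothetical document set: document id -> text.
DOCS = {
    "doc1": "Mutations in p53 are linked to apoptosis. "
            "Breast carcinoma is common.",
}


def split_sentences(text):
    """Naive sentence splitter; a real system would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_synonyms(text, synonyms):
    """Return the synonyms that occur in text as whole words (case-insensitive)."""
    return [
        syn for syn in synonyms
        if re.search(r"\b" + re.escape(syn) + r"\b", text, re.IGNORECASE)
    ]


def cooccurrences(documents, search_term, dictionary, scope="sentence"):
    """Find dictionary objects co-occurring with search_term.

    scope is "sentence" or "document". Each hit records the matched
    synonyms and the (document id, text unit) pairs where it was found.
    """
    results = defaultdict(lambda: {"synonyms": set(), "matches": []})
    term_re = re.compile(r"\b" + re.escape(search_term) + r"\b", re.IGNORECASE)
    for doc_id, text in documents.items():
        units = split_sentences(text) if scope == "sentence" else [text]
        for unit in units:
            if not term_re.search(unit):
                continue  # the free-text term must occur in this unit
            for term, synonyms in dictionary.items():
                matched = find_synonyms(unit, synonyms)
                if matched:
                    results[term]["synonyms"].update(matched)
                    results[term]["matches"].append((doc_id, unit))
    return dict(results)
```

Calling `cooccurrences(DOCS, "apoptosis", DICTIONARY, scope="sentence")` only considers dictionary objects in the same sentence as the search term, while `scope="document"` widens the test to the whole text, which typically yields more but weaker associations.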

References
  • Knowledge Networks of Biological and Medical Data: An Exhaustive and Flexible Solution to Model Life Science Domains.
    Losko S, Wenger K, Kalus W, Ramge A, Wiehler J and Heumann K (2006) Lecture Notes in Computer Science 4075: 232-239