Advanced data-mining methods for knowledge extraction in biological databases
Title: Advanced data-mining methods for knowledge extraction in biological databases
Funding Source: General Secretariat of Research & Technology
“HRAKLEITOS” is a national scholarship provided by the Greek Ministry of Education for the completion of the PhD dissertation of Sotiris Diplaris. The context and scope of the project are explained below.
In recent years biological databases have grown considerably and have become a practical tool for biologists. There are many reasons to search for information in biological databases, e.g.:
- While decoding a DNA sequence, we should know whether it has already been entirely or partially decoded, or whether it contains homologous sequences (sequences that descend from a common ancestor).
- Some databases contain code and comments referring to the same sequences. Knowing the sequence code or the homologous sequences can facilitate research.
- Discovery of similar non-coding DNA sequences.
- Search for homologous proteins.
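As a rough illustration of the kind of similarity search listed above, the sketch below ranks the entries of a small sequence collection by how many short substrings (k-mers) they share with a query. This is only a simplified stand-in, not the project's method: real homology searches normally rely on sequence alignment, and the names `kmers`, `jaccard`, `rank_by_similarity` and `toy_db`, as well as the toy sequences, are hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_similarity(query, database, k=3):
    """Rank database entries by k-mer Jaccard similarity to the query."""
    q = kmers(query, k)
    scored = [(name, jaccard(q, kmers(seq, k))) for name, seq in database.items()]
    # Highest similarity first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy database; real entries would come from a sequence database.
toy_db = {
    "seq_a": "ATGGCGTACGTTAGC",
    "seq_b": "ATGGCGTACGATAGC",
    "seq_c": "TTTTTTCCCCCCGGG",
}

if __name__ == "__main__":
    for name, score in rank_by_similarity("ATGGCGTACGTTAGC", toy_db):
        print(f"{name}: {score:.2f}")
```

A shared k-mer set is a crude signal: it can flag candidates that have already been decoded in whole or in part, but it says nothing about common ancestry on its own.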
The development of advanced data mining techniques in such databases will constitute a powerful tool for biologists.
The aim of the “HRAKLEITOS” project is to provide biologists with useful knowledge, by mining DNA and protein databases, as well as biological ontologies.
Bioinformatics can exploit data mining techniques to detect interesting relations within the huge mass of biological data. For example, data mining methods can identify a cluster of genes that corresponds to a certain behaviour observed in an organism.
Various methods are used to develop such techniques, including graphical models (Bayesian networks, hidden Markov models) and relational algorithms (inductive reasoning methods for discovering such clusters of genes and modelling their expression networks). In this context, rule-based data mining techniques are exploited, as well as clustering and classification methods and neural networks.
Data mining techniques are applied to both DNA and protein databases. Classification algorithms are being developed to address various biological problems, e.g. protein family classification or the discovery of proteins responsible for carcinogenesis. Data mining techniques are also applied to biological ontologies. Using rule-based or statistical methods of text mining or semantic tree mining, it is possible to determine the correlation between two genes or proteins, which facilitates feature selection when applying data mining techniques.
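To make the idea of ontology-based correlation more concrete, here is a minimal sketch that scores two genes by the overlap of their ontology terms, after extending each gene's terms with all of their ancestors in the term hierarchy. It is an illustrative assumption rather than the measure used in the project: the toy ontology `PARENTS`, the annotations in `ANNOTATIONS`, and the function names are all hypothetical.

```python
# Hypothetical toy ontology: each term maps to its direct parent terms.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "signaling": ["root"],
    "lipid_metabolism": ["metabolism"],
    "glucose_metabolism": ["metabolism"],
    "kinase_signaling": ["signaling"],
}

# Hypothetical annotations: the ontology terms attached to each gene.
ANNOTATIONS = {
    "gene1": {"lipid_metabolism"},
    "gene2": {"glucose_metabolism"},
    "gene3": {"kinase_signaling"},
}


def with_ancestors(terms):
    """Return the given terms plus every ancestor reachable via PARENTS."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def semantic_similarity(gene_a, gene_b):
    """Jaccard overlap of the ancestor-closed term sets of two genes."""
    a = with_ancestors(ANNOTATIONS[gene_a])
    b = with_ancestors(ANNOTATIONS[gene_b])
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(semantic_similarity("gene1", "gene2"))  # share "metabolism" and "root"
    print(semantic_similarity("gene1", "gene3"))  # share only "root"
```

Scores of this kind could serve as one input to feature selection, for instance by grouping genes whose annotations overlap strongly before a classifier is trained.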