In the gene function prediction task, functions are described with Gene Ontology (GO) terms. GO forms a directed acyclic graph, which makes gene function prediction an ideal benchmark for hierarchical multi-label classification algorithms.
The link below leads to nine large hierarchical multi-label classification data sets published in:
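To make the hierarchy constraint concrete: in a GO-style graph, a gene annotated with a term is implicitly annotated with all ancestors of that term, and a term may have more than one parent. The sketch below illustrates this on a toy graph. The term names and the helpers `ancestors`, `propagate` and `is_consistent` are hypothetical and chosen only for illustration; they are not real GO accessions and not part of the published data sets.

```python
from collections import deque

# Hypothetical toy fragment of a GO-like DAG: term -> set of parent terms.
# The identifiers are placeholders, not real GO accessions.
PARENTS = {
    "term:root": set(),
    "term:metabolism": {"term:root"},
    "term:transport": {"term:root"},
    "term:sugar_metabolism": {"term:metabolism"},
    # A DAG (unlike a tree) allows more than one parent.
    "term:sugar_transport": {"term:transport", "term:sugar_metabolism"},
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


def propagate(labels, parents):
    """Close a label set upward so that it satisfies the hierarchy constraint."""
    closed = set(labels)
    for term in labels:
        closed |= ancestors(term, parents)
    return closed


def is_consistent(labels, parents):
    """Check that every label is accompanied by all of its direct parents."""
    label_set = set(labels)
    return all(parents[term] <= label_set for term in label_set)


full_annotation = propagate({"term:sugar_transport"}, PARENTS)
```

Here `full_annotation` contains the leaf term together with every ancestor up to the root, which is the kind of label set a hierarchical multi-label classifier is expected to predict for one gene.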
Vidulin, V., Šmuc, T., Džeroski, S., Supek, F. (2018) The evolutionary signal in metagenome phyletic profiles predicts many gene functions. Microbiome, 6(1), 129.
Vidulin, V., Šmuc, T., Supek, F. (2016) Extensive complementarity between gene function prediction methods. Bioinformatics, 32(23), 3645-3653.
We are surrounded by a hidden world of microorganisms that influence human health, play a role in food and beverage preparation, act as decomposers and have many other important functions. Modern sequencing techniques provide insight into their genes. Knowing the biological function of those genes improves our understanding of the role microorganisms play in the environment. However, the function of many genes is still unknown. My primary research interest is machine learning methods that automatically annotate the genes of microorganisms with functions.
Main research directions:
Web searching is typically performed by typing keywords into a search engine, which returns web pages on the topic defined by those keywords. However, a user can obtain more precise results by specifying a web page genre in addition to the keywords. A web genre describes the form and function of a web page; specifying it lets a user find, for example, a “Scientific” paper on the topic of text mining.
My research focuses on web genre classification using machine learning methods. Because a web page is a complex document that can share the conventions of several genres or contain parts from different genres, web genre classification is a multi-label classification problem. For example, a story for children belongs to both the “Children” and “Prose fiction” genres. Furthermore, web genres naturally form a hierarchy: “Prose fiction” is a type of “Fiction”. These properties map directly onto the machine learning task of hierarchical multi-label classification. However, data sets that capture web genre as a hierarchical multi-label concept are missing. We therefore proposed an approach that automatically constructs a hierarchy of web genre labels from the data and then applies a hierarchical multi-label classification algorithm to build an accurate web genre classifier.
In the analysis of complex domains, data mining methods that construct interpretable models frequently find relations that are statistically significant but meaningless to a human. We propose a novel method, named Human-Machine Data Mining (HMDM), that combines human understanding with computer data mining methods to extract credible relations: relations that are both meaningful to the human and statistically supported by the data. The method defines a procedure and a toolbox that the human uses in an interactive and iterative manner to direct the computer search towards those parts of the search space that contain credible relations. Based on credible relations, the human can draw correct conclusions about the domain. HMDM has been successfully applied to problems from the macroeconomic, demographic and web genre classification domains.
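As a rough illustration of what constructing a label hierarchy from data can look like, the sketch below greedily merges the two genre labels (or label groups) whose page sets overlap most, measured by Jaccard distance. This is only one plausible technique, assumed here for illustration, and not necessarily the procedure used in the work described above. The toy page assignments, the “FAQ” label and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: genre label -> ids of web pages carrying that label.
PAGES = {
    "Children": {1, 2, 3},
    "Prose fiction": {2, 3, 4},
    "Scientific": {5, 6},
    "FAQ": {6, 7},
}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; zero when both sets are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def build_label_hierarchy(label_to_pages):
    """Greedy agglomerative merging of labels into a binary hierarchy.

    Returns a dict mapping each new internal node to its two children.
    A merged node is represented by the union of its children's page sets.
    """
    clusters = {label: set(pages) for label, pages in label_to_pages.items()}
    children = {}
    while len(clusters) > 1:
        # Pick the pair of current nodes with the smallest Jaccard distance.
        left, right = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: jaccard_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        parent = f"({left}+{right})"
        children[parent] = (left, right)
        clusters[parent] = clusters.pop(left) | clusters.pop(right)
    return children


hierarchy = build_label_hierarchy(PAGES)
```

With a hierarchy of this kind in place, each page's flat label set can be extended with the ancestors of its labels, which turns a flat multi-label data set into input for a hierarchical multi-label classification algorithm.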
Searching for credible relations through interactive data mining – Information Sciences, 2014