Vedrana Vidulin, Department of Knowledge Technologies, Jožef Stefan Institute

Large gene function prediction data sets for hierarchical multi-label classification

Data sets

Name      Instances             Features            Labeled instances
          Labeled   Unlabeled   Binary   Numeric    Labels   Cardinality   Density
PP        15,318    6,308       2,071    0          4,145    21.812        0.005
CGN       15,318    6,308       0        5,891      4,145    21.812        0.005
EKM       15,318    6,308       0        8,447      4,145    21.812        0.005
TEP       15,318    6,308       0        7,957      4,145    21.812        0.005
BPS       15,318    6,308       0        1,170      4,145    21.812        0.005
MPP-H     9,556     6,236       0        1,267      3,886    21.494        0.006
MPP-O     14,331    13,487      0        139        4,087    19.799        0.005
MPP-I     3,536     1,095       0        5,049      3,358    27.582        0.008
MPP-16S   3,536     1,095       0        20,570     3,358    27.582        0.008
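
The Cardinality and Density columns appear to follow the usual multi-label definitions: cardinality is the average number of labels per labeled instance, and density is cardinality divided by the total number of labels. The page does not state these definitions, so treat the sketch below as an assumption; the function names and the toy label sets are hypothetical. The tabulated values are consistent with this reading.

```python
def label_cardinality(label_sets):
    """Average number of labels per instance (assumed definition)."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)


def label_density(label_sets, n_labels):
    """Cardinality divided by the total number of distinct labels (assumed definition)."""
    return label_cardinality(label_sets) / n_labels


# Toy example with three instances and four possible labels (hypothetical data).
toy = [{"a", "b"}, {"a"}, {"b", "c", "d"}]
toy_cardinality = label_cardinality(toy)   # (2 + 1 + 3) / 3
toy_density = label_density(toy, n_labels=4)

# Consistency check against two rows of the table (values copied from the table).
pp_density = round(21.812 / 4145, 3)
mpp_i_density = round(27.582 / 3358, 3)
```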


For the data sets starting with MPP please cite:
Vidulin, V., Šmuc, T., Džeroski, S., Supek, F. (2018) The evolutionary signal in metagenome phyletic profiles predicts many gene functions. Microbiome, 6(1), 129.

For the rest of the data sets please cite:
Vidulin, V., Šmuc, T., Supek, F. (2016) Extensive complementarity between gene function prediction methods. Bioinformatics, 32(23), 3645-3653.


Labels are Gene Ontology (GO) terms. GO forms a directed acyclic graph, which makes the gene function prediction problem an ideal benchmark for hierarchical multi-label classification algorithms. GO has three domains: Biological process, Molecular function and Cellular component. We connected them under a common root node to obtain a single machine learning problem per feature set.
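
As a rough illustration of the common-root construction, the sketch below joins the three GO domain roots under one artificial node and then looks up all ancestors of a term in the resulting graph. The node name "root", the function names and the two child edges are illustrative assumptions, not taken from the data sets.

```python
from collections import defaultdict

# Root terms of the three GO domains.
DOMAIN_ROOTS = {
    "GO:0008150": "biological_process",
    "GO:0003674": "molecular_function",
    "GO:0005575": "cellular_component",
}


def add_common_root(edges, domain_roots, root="root"):
    """Return (parent, child) edges with every domain root attached to one root node."""
    return [(root, term) for term in domain_roots] + list(edges)


def ancestors(term, edges):
    """Collect all ancestors of a term in a DAG given as (parent, child) edges."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    seen, stack = set(), [term]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Two illustrative edges: cellular process and binding under their domain roots.
toy_edges = [("GO:0008150", "GO:0009987"), ("GO:0003674", "GO:0005488")]
merged = add_common_root(toy_edges, DOMAIN_ROOTS)
process_ancestors = ancestors("GO:0009987", merged)
```

With the common root in place, every term in any of the three domains reaches the same top node, so one classifier can cover all labels.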


Instances are COG/NOG gene families. We divided the instances into two sets: labeled and unlabeled. The labeled set contains instances associated with at least one GO term in the UniProt-GOA database as of November 2013. The unlabeled set can be used for semi-supervised learning.


The data sets differ mainly in their features. Features in the PP (phyletic profiles), CGN (conserved gene neighborhoods), EKM (empirical kernel map), TEP (translation efficiency profiles) and BPS (biophysical and protein sequence properties) data sets are computed from genomic data representing 2,071 bacterial and archaeal organisms (from the NCBI genome database). In contrast, features in the MPP (metagenome phyletic profiles) data sets are computed from metagenomic data (MPP-H, MPP-O, MPP-I) and metataxonomic data (MPP-16S) representing the microbial world of various natural and engineered environments.

File format

The data sets are in the ARFF format used by the Weka machine learning toolbox. The format is extended to encode the hierarchical structure of the labels, which is defined in the ARFF header as “@ATTRIBUTE class hierarchical” followed by parent-child pairs. Labels assigned to an instance are separated by the @ sign. Unlabeled instances are assigned to the root node. CLUS-HMC is a hierarchical multi-label classification algorithm that can be readily applied to these data sets.
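
A minimal sketch of reading the hierarchical extension described above. It assumes that parent-child pairs are written as "parent/child" and separated by commas, and that the root node is named "root". Neither detail is stated here, so check both against the actual files; the separators are exposed as parameters for that reason. The function names and the sample header line are hypothetical.

```python
def parse_hierarchy(header_line, pair_sep=",", edge_sep="/"):
    """Extract (parent, child) edges from an '@ATTRIBUTE class hierarchical' line.

    pair_sep and edge_sep are assumptions about the file layout.
    """
    parts = header_line.strip().split(None, 3)
    if len(parts) < 4 or [p.lower() for p in parts[:3]] != [
        "@attribute", "class", "hierarchical"
    ]:
        raise ValueError("not a hierarchical class attribute line")
    edges = []
    for pair in parts[3].split(pair_sep):
        pair = pair.strip()
        if pair:
            parent, child = pair.split(edge_sep)
            edges.append((parent.strip(), child.strip()))
    return edges


def parse_labels(class_field):
    """Split the class field of one instance on the @ sign."""
    return [label for label in class_field.strip().split("@") if label]


def is_unlabeled(class_field, root="root"):
    """Unlabeled instances carry only the root node (root name is an assumption)."""
    return parse_labels(class_field) == [root]


# Hypothetical header line and class fields, for illustration only.
header = "@ATTRIBUTE class hierarchical root/GO0008150,root/GO0003674,GO0008150/GO0009987"
edges = parse_hierarchy(header)
labels = parse_labels("GO0009987@GO0003674")
```

Splitting instances with `is_unlabeled` gives the labeled set for supervised training and the unlabeled set for semi-supervised use.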