Vedrana Vidulin

Data Sets Constructed for Economic Analysis


Data

We collected data representing the high-level knowledge sectors from several statistical databases provided by the following:
- UNESCO Institute for Statistics
- WIPO

The data comprises:
- 108 numerical attributes: 48 describing the inputs (personnel and financial resources) and outputs of the R&D sector, and 60 describing the higher education sector
- 167 examples (countries)

Class

Economic welfare is represented by the attribute GNI per capita, calculated according to the World Bank Atlas method.

GNI stands for Gross National Income: the total income earned by a country's residents, that is, the value of goods and services produced within the country plus net income received from abroad.

GNI per capita was collected from The World Bank database in two forms:
- numerical – in US$
- discrete:
  - low – $745 or less (2001); $1,005 or less (2010)
  - middle – $746-9,205 (2001); $1,006-12,276 (2010)
  - high – $9,206 or more (2001); $12,277 or more (2010)
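
The discretization above can be sketched as a small lookup. This is only an illustration of the listed thresholds; the names `THRESHOLDS` and `gni_class` are hypothetical and do not come from the data sets themselves.

```python
# Inclusive upper bounds (in US$) of the "low" and "middle" classes,
# copied from the thresholds listed above for each year.
THRESHOLDS = {
    2001: (745, 9205),
    2010: (1005, 12276),
}


def gni_class(gni_per_capita, year):
    """Map a GNI per capita value (US$) to 'low', 'middle' or 'high'."""
    low_max, middle_max = THRESHOLDS[year]
    if gni_per_capita <= low_max:
        return "low"
    if gni_per_capita <= middle_max:
        return "middle"
    return "high"
```

For example, `gni_class(9205, 2001)` falls in the middle class, while the same value plus one dollar is classified as high.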

Description of attributes

Data sets

Data was collected for the years 2001 and 2010. Since countries are not obliged to report all of the data every year, a significant proportion of values was missing. For the 2001 data, the problem was alleviated with an approximation technique: a missing value was substituted with the value for the closest available year in the interval between 1999 and 2006. For the 2010 data, the problem was alleviated by retaining only those examples (countries) with fewer missing values.
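
A minimal sketch of the two strategies follows. It is not the original preprocessing code: the function names, the tie-break between equally distant years, and the `max_missing_fraction` cut-off are all assumptions made for illustration, since the text does not specify them.

```python
def fill_from_closest_year(values_by_year, target_year=2001, first=1999, last=2006):
    """Return the value for target_year or, if it is missing, the value from
    the closest year in [first, last] that has one. Returns None if no year
    in the interval has a value.

    values_by_year: dict mapping year -> value, with None marking a missing value.
    """
    candidates = [
        (abs(year - target_year), year)
        for year, value in values_by_year.items()
        if first <= year <= last and value is not None
    ]
    if not candidates:
        return None
    # Assumption: when two years are equally close (e.g. 2000 and 2002),
    # the earlier one wins. The source does not state a tie-break rule.
    _, best_year = min(candidates)
    return values_by_year[best_year]


def keep_complete_examples(rows, max_missing_fraction=0.5):
    """Keep only countries whose share of missing values is small enough.

    rows: dict mapping country -> non-empty list of attribute values (None = missing).
    max_missing_fraction: hypothetical cut-off; the source gives no threshold.
    """
    return {
        country: values
        for country, values in rows.items()
        if sum(v is None for v in values) / len(values) <= max_missing_fraction
    }
```

The first function corresponds to the 2001 substitution, the second to the 2010 filtering of examples.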

The 2010 data was intended for testing. It contains the same set of attributes and modifications as the 2001 data. The 2010 data sets are marked with that year. All other data sets contain data for 2001.

Besides the “standard” data sets, which contain attributes collected from the statistical databases, there are two types of data sets that contain constructed attributes. The data sets marked as “modified” contain attributes constructed by a human to test hypotheses posed during preliminary analysis of the “standard” data (while executing the interactive Human-Machine Data Mining algorithm). In contrast, the data sets marked as “constructed” contain automatically constructed attributes, obtained by applying the sum, min and max functions to pairs of attributes.
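
The automatic construction can be sketched as follows. The function names and the attribute naming scheme are hypothetical, and keeping the original attributes alongside the constructed ones is an assumption. That assumption is at least consistent with the sizes in the table: 60 + 3·C(60,2) = 5370 and 48 + 3·C(48,2) = 3432.

```python
from itertools import combinations
from math import comb


def construct_attributes(example):
    """Add sum, min and max of every unordered attribute pair to one example.

    example: dict mapping attribute name -> numeric value (no missing values;
    how the original data sets treat missing values here is not stated).
    """
    result = dict(example)  # assumption: original attributes are retained
    for a, b in combinations(sorted(example), 2):
        result[f"sum({a},{b})"] = example[a] + example[b]
        result[f"min({a},{b})"] = min(example[a], example[b])
        result[f"max({a},{b})"] = max(example[a], example[b])
    return result


def expected_size(n_attributes):
    """Attribute count after construction: originals plus 3 per unordered pair."""
    return n_attributes + 3 * comb(n_attributes, 2)
```

With three attributes the result has 3 + 3·3 = 12 entries, which makes the quadratic growth in the number of attributes easy to see.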

| name | num. | nom. | ex. | class | download |
|---|---|---|---|---|---|
| Higher education | 60 | 0 | 167 | discrete | csv, arff |
| Higher education-modified | 43 | 6 | 167 | discrete | csv, arff, description of modifications |
| Higher education-modified | 40 | 0 | 167 | numerical (1,2,3) | csv, arff |
| Higher education-constructed | 5370 | 0 | 167 | discrete | csv, arff |
| Higher education-2010 | 60 | 0 | 125 | discrete | csv, arff |
| Higher education-modified-2010 | 43 | 6 | 125 | discrete | csv, arff |
| Higher education-constructed-2010 | 5370 | 0 | 125 | discrete | csv, arff |
| R&D | 48 | 0 | 167 | discrete | csv, arff |
| R&D | 48 | 0 | 104 | discrete | csv, arff |
| R&D | 48 | 0 | 104 | numerical (1,2,3) | csv, arff |
| R&D-modified | 62 | 5 | 167 | discrete | csv, arff, description of modifications |
| R&D-constructed | 3432 | 0 | 167 | discrete | csv, arff |
| R&D-2010 | 48 | 0 | 78 | discrete | csv, arff |
| R&D-modified-2010 | 62 | 5 | 78 | discrete | csv, arff |
| R&D-constructed-2010 | 3432 | 0 | 78 | discrete | csv, arff |
| High-level knowledge | 108 | 0 | 167 | discrete | csv, arff |
| High-level knowledge | 108 | 0 | 167 | numerical (in US$) | csv, arff |

In the table:
- num. = numerical attributes
- nom. = nominal attributes
- ex. = examples
- numerical (1,2,3) = numerical class obtained by encoding the values of the discrete class: low as 1, middle as 2, and high as 3
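
The numerical (1,2,3) encoding described in the legend amounts to a fixed mapping. A minimal sketch, with the hypothetical names `CLASS_CODES` and `encode_classes`:

```python
# Ordinal codes for the discrete class, as given in the legend above.
CLASS_CODES = {"low": 1, "middle": 2, "high": 3}


def encode_classes(labels):
    """Replace each discrete class label with its numerical code."""
    return [CLASS_CODES[label] for label in labels]
```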

Publications

Vidulin, V., Bohanec, M. and Gams, M. (2014) Combining Human Analysis and Machine Data Mining to Obtain Credible Data Relations. Information Sciences, 288: 254-278.

paper

Vidulin, V. (2012) Searching for Credible Relations in Machine Learning, PhD Thesis, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia.

thesis, presentation

Vidulin, V. and Gams, M. (2011) Impact of High-Level Knowledge on Economic Welfare Through Interactive Data Mining. Applied Artificial Intelligence, 25(4): 267-291.

paper
