Structurally a dataframe is a 2D data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table.
Functionally it recalls the corresponding pandas object, but it:
- allows just a subset of the operations of the pandas DataFrame. ultra::dataframe covers the basic use cases and isn't intended as a replacement for tools that allow extensive data pre-processing;
- by default automatically splits each example into features (input) and label (output). It also supports storing unlabeled examples (e.g. for unsupervised learning tasks or for storing examples to be classified);
- is more row oriented (whereas the pandas DataFrame is quite column oriented).
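The row-oriented, labeled layout described above can be pictured with a minimal sketch. The names value, example, table and make_example are hypothetical stand-ins chosen for illustration; they are not the ultra types and say nothing about how the library is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical cell type: a column may hold integers, reals or text.
// std::monostate stands for "no value".
using value = std::variant<std::monostate, int, double, std::string>;

// One row of the table: the features (input) and the label (output).
struct example
{
  std::vector<value> input;
  value output;  // holds std::monostate when the example is unlabeled
};

// Row-oriented storage: iterating the container yields whole examples,
// not whole columns.
using table = std::vector<example>;

// Splits a raw row into label and features, given the label position.
example make_example(std::vector<value> row, std::size_t output_index)
{
  assert(output_index < row.size());

  example e;
  e.output = std::move(row[output_index]);
  row.erase(row.begin() + static_cast<std::ptrdiff_t>(output_index));
  e.input = std::move(row);
  return e;
}
```

With this layout a learning algorithm reads one example at a time and finds label and features already separated, which is the access pattern the list above refers to.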
Basic functionality
Import data (CSV)
std::istringstream dataset(R"(
A, B, C, D
a0, 0.0, 0, d0
a1, 0.1, 1, d1
a2, 0.2, 2, d2)");
dataframe d;
d.read_csv(dataset);
Here we have:
d.columns[0].name() == "A"
d.columns[1].name() == "B"
d.columns[2].name() == "C"
d.columns[3].name() == "D"
By default the first column (column 0) is the output column. The user can specify a different column:
std::istringstream dataset(R"(
A, B, C, D
a0, 0.0, 0, d0
a1, 0.1, 1, d1
a2, 0.2, 2, d2)");
d.read_csv(dataset, dataframe::params().output(2));
The output column is shifted to the first position, so:
d.columns[0].name() == "C"
d.columns[1].name() == "A"
d.columns[2].name() == "B"
d.columns[3].name() == "D"
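The reordering is a rotation of the leading columns: the output column moves to the front and the others keep their relative order. A minimal sketch of that effect on the column names follows; output_first is a hypothetical helper written for illustration, not a function of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Moves the name at `output_index` to the front and keeps the relative
// order of the remaining names. Hypothetical helper, not part of ultra.
std::vector<std::string> output_first(std::vector<std::string> names,
                                      std::size_t output_index)
{
  assert(output_index < names.size());

  const auto first = names.begin();
  const auto out = first + static_cast<std::ptrdiff_t>(output_index);

  // Rotates [first, out] so that the element at `out` becomes the first.
  std::rotate(first, out, out + 1);
  return names;
}
```

Called with the names A, B, C, D and output index 2, the helper returns C, A, B, D, matching the order listed above. With index 0 the names are left unchanged.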
The parser sniffs the presence of column headers. If it guesses wrong (CSV is a textbook example of how not to design a textual file format), the user can signal the correct situation via the params::header() / params::no_header() member functions.
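To see why sniffing can fail, consider one simple heuristic: the first row is taken as a header if some column has a non-numeric first cell while every other cell in that column is numeric. The sketch below is an assumption made for illustration and is not the algorithm used by ultra; is_number and looks_like_header are hypothetical names, and fields are assumed to be already split and trimmed.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

using row = std::vector<std::string>;

// True when the whole (already trimmed) field parses as a number.
bool is_number(const std::string &s)
{
  if (s.empty())
    return false;

  char *end = nullptr;
  std::strtod(s.c_str(), &end);
  return end == s.c_str() + s.size();
}

// Hypothetical heuristic: a column "votes" for a header when its first
// cell is text and all the cells below it are numeric.
bool looks_like_header(const std::vector<row> &rows)
{
  if (rows.size() < 2)
    return false;  // nothing to compare the first row with

  const row &head = rows.front();
  for (std::size_t c = 0; c < head.size(); ++c)
  {
    if (is_number(head[c]))
      continue;  // numeric first cell: no evidence from this column

    bool body_numeric = true;
    for (std::size_t r = 1; r < rows.size() && body_numeric; ++r)
      body_numeric = c < rows[r].size() && is_number(rows[r][c]);

    if (body_numeric)
      return true;
  }

  return false;
}
```

On the dataset used in this page the B and C columns give the header away. A file made only of text columns gives this heuristic no evidence either way, which is the kind of situation where stating the header explicitly is the safer choice.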
To access the label (output value) and the features (input values) of an example:
std::cout << "Label of the first example is: " << lexical_cast<double>(d.front().output)
<< "\nFeatures are:"
<< "\nA: " << lexical_cast<std::string>(d.front().input[0])
<< "\nB: " << lexical_cast<double>( d.front().input[1])
<< "\nD: " << lexical_cast<std::string>(d.front().input[2]) << '\n';
For unlabeled examples use the no_output modifier:
std::istringstream dataset(R"(
A, B, C, D
a0, 0.0, 0, d0
a1, 0.1, 1, d1
a2, 0.2, 2, d2)");
d.read_csv(dataset, dataframe::params().no_output());
In this case:
d.columns[0].name() == ""
d.columns[1].name() == "A"
d.columns[2].name() == "B"
d.columns[3].name() == "C"
d.columns[4].name() == "D"
That is, a surrogate empty output column is added at the beginning and has_value(d.front().output) == false.
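One way to picture an empty output is a cell type with a dedicated "no value" state. The sketch below assumes a std::variant whose first alternative is std::monostate. The has_value function here is a stand-in that mirrors the name used above, and unlabeled_columns and make_unlabeled are hypothetical helpers; none of them is the library's own code.

```cpp
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical cell type; std::monostate stands for "no value".
using value = std::variant<std::monostate, int, double, std::string>;

// Stand-in for the check used above: true when a value is stored.
bool has_value(const value &v)
{
  return !std::holds_alternative<std::monostate>(v);
}

struct example
{
  std::vector<value> input;
  value output;  // default-constructed, so it holds std::monostate
};

// Column names of an unlabeled dataset: an unnamed surrogate output
// column is placed in front of the names read from the file.
std::vector<std::string> unlabeled_columns(std::vector<std::string> names)
{
  names.insert(names.begin(), std::string());
  return names;
}

// Every field of the row becomes a feature; the output stays empty.
example make_unlabeled(std::vector<value> row)
{
  example e;
  e.input = std::move(row);
  return e;
}
```

Keeping the surrogate column means that the features start at the same column index whether or not the dataset is labeled.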
Columns
To access information about the column structure, use the columns member:
std::cout << "Name of the first column: " << d.columns[0].name()
<< "\nCategory of the first column: " << d.columns[0].domain();
std::cout << "\nThere are " << d.columns.size() << " columns\n";