Yet another knapsack problem
Given the files in a directory, by what method can I receive the list of the files that in size is closest to, but not larger than, an arbitrary
x
value, given in bytes? Each file can be listed at most once.
(communicated by Peter Sisak, on the Usenet forum comp.softโsys.math.mathematic)
This is readily recognised as a traditional knapsack problem. One might attempt to solve it using standard integer linear programming methods, such as a branch-and-bound strategy. Unfortunately, the actual file sizes make this quite challenging.
Instead, we will use Evolutionary Algorithms to find a reasonable approximation.
The given data are as follows:
const std::vector<std::intmax_t> file_sizes =
{
1305892864, 1385113088, 856397968, 1106152425, 1647145093,
1309917696, 1096825032, 1179242496, 1347631104, 696451130,
746787826, 1080588288, 1165223499, 1181095818, 749898444, 1147613713,
1280205208, 1242816512, 1189588992, 1232630196, 1291995024,
911702020, 1678225920, 1252273456, 934001123, 863237392, 1358666176,
1714134790, 1131848814, 1399329280, 1006665732, 1198348288,
1090000441, 716904448, 677744640, 1067359748, 1646347388, 1266026326,
1401106432, 1310275584, 1093615634, 1371899904, 736188416,
1421438976, 1385125391, 1324463502, 1489042122, 1178813212,
1239236096, 1258202316, 1364644352, 557194146, 555102962, 1383525888,
710164700, 997808128, 1447622656, 1202085740, 694063104, 1753882504,
1408100352
};
const std::intmax_t target_size(8547993600);
Our first approach will be via Genetic Algorithms. Here,
ga::problem prob(file_sizes.size(), {0, 2});
sets up a problem where individuals are vectors of binary digits ({0, 2}
is a half open interval, 2
is excluded).
If the i
-th component of the vector is 1
, the corresponding file is included in the knapsack (hence the length of the vector is file_sizes.size()
).
The choice of the size of population, as well as many other parameters, is left to the framework.
Last but not least the fitness function:
double fitness(const ultra::ga::individual &x)
{
std::intmax_t sum(0);
for (std::size_t i(0); i < x.parameters(); ++i)
if (x[i] && sum + file_sizes[i] <= target_size)
sum += file_sizes[i];
return sum;;
}
This function evaluates an individual (ga::individual
) by calculating the sum of file sizes, ensuring the total remains less than or equal to the target size. Files that are marked as included but would exceed the allowed size are ignored.
Now, to start the evolution:
ga::search search(prob, fitness);
auto result(search.run(5).best_individual);
As with any stochastic algorithm, we could encounter a poor initial random setup. Thus, we instruct the framework to perform five different runs.
The full source for this example can be found in the examples/knapsack/knapsack.cc file.
Differential evolution and integer variables
In its canonical form, the Differential Evolution (DE) algorithm is designed to handle continuous variables only. However, extending it for the optimization of integer variables is relatively easy and requires only minor modifications.
First, for the evaluation of fitness, integer values should be used. Despite this, DE can still internally operate with continuous floating-point values:
double fitness(const ultra::de::individual &x)
{
std::intmax_t sum(0);
for (std::size_t i(0); i < x.parameters(); ++i)
if (x[i] > 0.0 && sum + file_sizes[i] <= target_size)
sum += file_sizes[i];
return sum;
}
The comparison x[i] > 0.0
is used to map a real value (x[i]
) to an integer value in the range \([0;1]\). This mapping is performed only for the purpose of evaluating the fitness value. The mapped values are not assigned elsewhere. Thus, DE works with a population of continuous variables, regardless of whether the corresponding object variable is actually integer-based. This is essential for maintaining population diversity and the algorithm's robustness.
Additionally, we modify the problem
and search
classes:
de::problem prob(file_sizes.size(), {-1.0, 1.0});
de::search search(prob, fitness);
Note the different initialization range for variables: {-1.0, 1.0}
. It is important to emphasize that while the reproduction operation in DE can extend the search beyond the initialized range of the search space, the initialization range plays a critical role in shaping the early stages of the search process, influencing exploration, diversity, and convergence. The search extends beyond the initialized range gradually as the algorithm evolves. Poor initialization can lead to exploration in suboptimal areas or cause slower convergence.
The full source for this example can be found in the examples/knapsack/knapsack02.cc file.
Brute force approach
For those who believe a brute force approach might suffice, consider the following Prolog program:
% Generate valid subsequences whose sum does not exceed Target
valid_subseq([], [], 0, _).
valid_subseq([H|T], [H|Subseq], Sum, Target) :-
valid_subseq(T, Subseq, RestSum, Target),
Sum is RestSum + H,
Sum =< Target.
valid_subseq([_|T], Subseq, Sum, Target) :-
valid_subseq(T, Subseq, Sum, Target).
% Find the best solution that maximises the sum but is =< Target
knapsack(Solution, SolutionSum, Files, Target) :-
aggregate_all(max(SS, SL), valid_subseq(Files, SL, SS, Target), max(SolutionSum, Solution)).
% Example query
?- knapsack(L, S, [1305892864, 1385113088, 856397968, 1106152425, 1647145093,
1309917696, 1096825032, 1179242496, 1347631104, 696451130,
746787826, 1080588288, 1165223499, 1181095818, 749898444, 1147613713,
1280205208, 1242816512, 1189588992, 1232630196, 1291995024,
911702020, 1678225920, 1252273456, 934001123, 863237392, 1358666176,
1714134790, 1131848814, 1399329280, 1006665732, 1198348288,
1090000441, 716904448, 677744640, 1067359748, 1646347388, 1266026326,
1401106432, 1310275584, 1093615634, 1371899904, 736188416,
1421438976, 1385125391, 1324463502, 1489042122, 1178813212,
1239236096, 1258202316, 1364644352, 557194146, 555102962, 1383525888,
710164700, 997808128, 1447622656, 1202085740, 694063104, 1753882504,
1408100352], 8547993600).
... and then take a long vacation ๐.