Performance vs correctness

When performance and correctness collide, the usual fear is that one must be sacrificed for the other. This article tells the story of a refactoring in ULTRA where tightening correctness guarantees around interpreter lifetimes did not slow the system down... it made it faster.

This was not obvious in advance. It only became clear after carefully rethinking invariants, ownership, cache semantics and then validating everything with a concrete benchmark.

The original design problem

In ULTRA (a design inherited from Vita), a gp::interpreter executes a gp::individual over many examples. Internally, the interpreter owns:

  • a pointer to the program (gp::individual);
  • a cache matrix;
  • an instruction pointer.

The cache is expensive to initialise but cheap to reuse, so interpreters are typically constructed once and reused across many evaluations.

To support this, early versions of reg_oracle_storage embedded an interpreter directly:

gp::individual ind_;
mutable src::interpreter int_;

This worked... until it didn't.

The hidden invariant

classDiagram
  class `gp::individual`
  class `src::interpreter`
  class reg_oracle_storage {
    ind_: gp::individual
    int_: src::interpreter
    run()
  }
  reg_oracle_storage --> `gp::individual` : owns
  reg_oracle_storage --> `src::interpreter` : owns
  `src::interpreter` --> `gp::individual` : references

The interpreter contains a pointer to the individual it evaluates. That pointer must always refer to the ind_ stored inside the same reg_oracle_storage object.

Unfortunately, C++ does not enforce that invariant automatically.

After a copy, move or container reallocation, the interpreter could still point to the old individual. The object looked valid, but calling run() could explode spectacularly.

This was not a theoretical concern: std::vector reallocation alone was enough to break the invariant.

For example, consider:

struct reg_oracle_storage
{
  gp::individual ind_;
  mutable src::interpreter int_;

  explicit reg_oracle_storage(const gp::individual& ind)
    : ind_(ind), int_(ind_)
  {}

  value_t run(const std::vector<value_t>& input) const
  {
    return int_.run(input);
  }
};

At construction time, everything looks correct:

  • ind_ stores the individual;
  • int_ stores a pointer to ind_;
  • run() simply delegates to the interpreter.

The implicit invariant is:

int_ must always refer to ind_ stored in the same object.

Unfortunately, this invariant is not preserved automatically by the language. Placing these objects inside a std::vector:

std::vector<reg_oracle_storage> v;
v.emplace_back(ind1);
v.emplace_back(ind2);
v.emplace_back(ind3);  // may trigger reallocation

During reallocation:

  • reg_oracle_storage objects are moved (or copied);
  • ind_ is relocated to a new address;
  • int_ is bitwise-moved and still points to the old ind_.

After reallocation, the object looks valid but the invariant is broken:

&v[i].int_.program() != &v[i].ind_;  // dangling reference

Calling run() now results in undefined behaviour. The bug is subtle because:

  • all members are private;
  • the type is trivially copyable/movable;
  • the failure depends on container growth patterns;
  • the crash may occur far from the actual mistake.

This is a classic example of a hidden lifetime invariant: correctness depends on an unstated relationship between two members.
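The failure mode is easy to reproduce with a stripped-down model of the two types. Everything below is a hypothetical stand-in (mini_individual, mini_interpreter, mini_storage are not the real ULTRA classes), kept just large enough to exhibit the broken invariant:

```cpp
#include <vector>

// Toy stand-in for gp::individual.
struct mini_individual { int code; };

// Toy stand-in for src::interpreter: it keeps a raw pointer to the
// program it evaluates, exactly like the real interpreter does.
struct mini_interpreter
{
  const mini_individual *prg;  // dangles if the individual moves
};

struct mini_storage
{
  mini_individual ind_;
  mini_interpreter int_;

  explicit mini_storage(int code) : ind_{code}, int_{&ind_} {}
};

// True only while every interpreter still points at the individual
// stored in the *same* object.
bool invariant_holds(const std::vector<mini_storage> &v)
{
  for (const auto &s : v)
    if (s.int_.prg != &s.ind_)
      return false;

  return true;
}
```

Pushing elements past the vector's capacity relocates every mini_storage via the implicitly generated move constructor, which copies int_.prg bitwise; invariant_holds then reports the same mismatch as the &v[i].int_.program() != &v[i].ind_ check above. (Strictly speaking, even comparing a pointer into freed storage has unspecified results, which is part of what makes this class of bug so slippery.)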

sequenceDiagram
  participant V as std::vector
  participant R as reg_oracle_storage
  participant I as src::interpreter
  participant G as gp::individual
  V->>R: move / copy
  R->>G: ind_ relocated
  R->>I: int_ bitwise-moved
  I-->>G: still points to old address

First reactions (and why they were unsatisfying)

Several obvious fixes were considered:

  • delete copy and move constructors;
  • store everything behind std::unique_ptr;
  • reconstruct the interpreter on every call.

All of these "worked" but at a cost:

  • reduced composability;
  • more heap allocations;
  • unnecessary interpreter rebuilds in hot paths.

What we really wanted was simpler: treat the interpreter as a cache, not as part of the logical state.

That idea turned out to be the key.

A better invariant: the interpreter is ephemeral

The breakthrough was to stop trying to keep the interpreter always valid.

Instead, we adopted a much stronger and simpler rule: the interpreter is only required to be valid inside run().

Every other member function must behave correctly even if no interpreter exists at all.

This immediately suggested a new design. The interpreter is stored as an optional, lazily constructed on first use, and rebound on every execution:

template<>
class reg_oracle_storage<gp::individual, true>
{
public:
  explicit reg_oracle_storage(const gp::individual &ind) : ind_(ind)
  {
    Ensures(is_valid());
  }

  reg_oracle_storage(std::istream &in, const symbol_set &ss)
  {
    if (!ind_.load(in, ss))
      throw exception::data_format("Cannot load individual");

    int_.reset();
    Ensures(is_valid());
  }

  template<class ...Args>
  [[nodiscard]] value_t run(Args&&... args) const
  {
    if (int_) [[likely]]
      int_->rebind(ind_);
    else
      int_.emplace(ind_);

    return int_->run(std::forward<Args>(args)...);
  }

  [[nodiscard]] bool is_valid() const
  {
    return ind_.is_valid();
  }

  bool save(std::ostream &out) const
  {
    return ind_.save(out);
  }

private:
  gp::individual ind_;
  mutable std::optional<src::interpreter> int_{};
};

Key properties of this design:

  • the interpreter is not part of the observable state;
  • copying or moving the object is always safe;
  • the invariant is restored every time it matters;
  • cache reuse is preserved;
  • future maintainers cannot "forget" to rebind.

Correctness is now enforced by structure, not by convention.

What about performance?

This is where things got interesting.

To ensure this refactoring did not introduce a regression, a dedicated benchmark was written to compare four strategies:

  1. Embedded interpreter (baseline)
  2. Interpreter constructed on the fly
  3. Single interpreter with explicit rebind
  4. Optional interpreter with lazy construction and rebind

Each strategy evaluated:

  • from 100 to 5,000 examples;
  • from 100 to 50,000 GP individuals;
  • identical programs and inputs.
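The shape of such a benchmark is simple. This is a sketch, not the actual ULTRA benchmark code (time_strategy and its parameters are hypothetical): each strategy is wrapped in a callable and timed over the full individuals × examples grid.

```cpp
#include <chrono>
#include <cstddef>

// Run one evaluation strategy over every (individual, example) pair
// and measure the elapsed wall-clock time.
template<class Strategy>
std::chrono::milliseconds time_strategy(Strategy &&eval,
                                        std::size_t individuals,
                                        std::size_t examples)
{
  const auto start(std::chrono::steady_clock::now());

  for (std::size_t i(0); i < individuals; ++i)
    for (std::size_t e(0); e < examples; ++e)
      eval(i, e);  // strategy-specific: embedded / on-the-fly / rebind...

  return std::chrono::duration_cast<std::chrono::milliseconds>(
    std::chrono::steady_clock::now() - start);
}
```

Using steady_clock (rather than system_clock) avoids wall-clock adjustments skewing the measurement; each strategy only differs in what the eval callable does per pair.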

The results for each (examples, individuals) pair (Xeon E-2146G, clang++ -O2):

              Embedded     On the fly       Rebind      Rebind optional
( 100,  100)       1ms            3ms          1ms                  1ms
( 100, 1000)      18ms           40ms         18ms                 18ms
( 100,10000)     372ms          412ms        339ms                374ms
( 100,30000)    1241ms         1501ms       1216ms               1248ms
( 100,50000)    2140ms         2625ms       2133ms               2154ms
( 500,  100)       8ms           18ms          8ms                  8ms
( 500, 1000)      87ms          199ms         88ms                 87ms
( 500,10000)    1756ms         1995ms       1665ms               1762ms
( 500,30000)    6115ms         7187ms       6067ms               6103ms
( 500,50000)   10611ms        12943ms      10592ms              10531ms
(1000,  100)      16ms           38ms         17ms                 17ms
(1000, 1000)     172ms          410ms        174ms                172ms
(1000,10000)    3391ms         4077ms       3249ms               3389ms
(1000,30000)   12116ms        14490ms      12080ms              12130ms
(1000,50000)   21263ms        25998ms      21324ms              21168ms
(5000,  100)      78ms          192ms         80ms                 78ms
(5000, 1000)     852ms         2026ms        858ms                853ms
(5000,10000)   16801ms        20575ms      16694ms              16723ms
(5000,30000)   60418ms        71138ms      60482ms              60190ms
(5000,50000)  105140ms       128479ms     105374ms             104777ms
xychart-beta
  title "Interpreter execution strategies (1000 examples × 10000 individuals)"
  x-axis ["Embedded", "On-the-fly", "Rebind", "Rebind optional"]
  y-axis "Elapsed time (ms)" 0 --> 4500
  bar [3391, 4077, 3249, 3389]

This is the representative mid-scale configuration (1000 examples, 10000 individuals). It's large enough to amortise overheads, small enough to stay readable, and it matches the pattern seen at every other scale.

xychart-beta
  title "Interpreter execution strategies (5000 examples × 50000 individuals)"
  x-axis ["Embedded", "On-the-fly", "Rebind", "Rebind optional"]
  y-axis "Elapsed time (ms)" 0 --> 140000
  bar [105140, 128479, 105374, 104777]

And this holds at scale: the gap only widens as the workload grows.

Embedded, rebind and rebind optional are statistically indistinguishable; on-the-fly is consistently and increasingly worse. Why?

  • interpreter construction is not free;
  • cache reuse dominates execution cost;
  • the branch on std::optional is perfectly predictable;
  • rebinding is cheaper than reconstruction.

In other words, tightening the correctness model has zero structural cost.

This is a rare and satisfying outcome. Across all tested scales, from tens of thousands to hundreds of millions of evaluations, lazy interpreter rebinding matches the performance of an embedded interpreter while eliminating a fragile hidden invariant. The cost of restoring correctness is below measurement noise, while the cost of reconstructing interpreters is substantial and scales with population size.

Note on thread safety

Although run() is marked const, it mutates a mutable std::optional<src::interpreter> used as an execution cache.

As a result, reg_oracle_storage is not thread-safe for concurrent calls to run() on the same instance.

In a genetic programming context (including ULTRA), this is usually acceptable: individuals are evaluated in parallel by giving each worker its own oracle or by partitioning the population.

Sharing a single reg_oracle_storage across threads would require additional synchronisation (e.g. a std::mutex or thread-local interpreters), which would reintroduce overhead in the hot path.
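The "one oracle per worker" pattern is straightforward precisely because the new design made copies safe. A hypothetical sketch (evaluate_parallel, toy_oracle and the strided partition are illustrative, not ULTRA API):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// `Oracle` is any copyable type whose const run() mutates only a
// private cache (like reg_oracle_storage).  Each thread gets its own
// copy, so the cache is never shared and no mutex is needed in the
// hot path.
template<class Oracle>
std::vector<int> evaluate_parallel(const Oracle &oracle,
                                   const std::vector<int> &inputs,
                                   unsigned n_threads)
{
  std::vector<int> results(inputs.size());
  std::vector<std::thread> pool;

  for (unsigned t(0); t < n_threads; ++t)
    pool.emplace_back([&, t]
    {
      Oracle local(oracle);  // private copy: safe thanks to the new design

      // Strided partition: thread t handles inputs t, t+n, t+2n, ...
      for (std::size_t i(t); i < inputs.size(); i += n_threads)
        results[i] = local.run(inputs[i]);
    });

  for (auto &worker : pool)
    worker.join();

  return results;
}
```

Each thread writes to a disjoint set of results slots, so no synchronisation is required on the output either; the only per-thread cost is one oracle copy, which the benchmark above suggests is negligible next to the evaluations themselves.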