Let Random Failures Challenge Your Assumptions
Randomness doesn’t usually go hand in hand with automated tests. It feels counterintuitive, almost the opposite of what we expect from tests: consistency and determinism. But it’s not that simple. Controlled randomness, a deliberate and bounded form of non-determinism, can actually help us understand our system better. While it might make tests seem less reliable at first, it often reveals implicit assumptions or unclear specs. That, in turn, helps us improve both the tests and the code behind them.
This article explores how using controlled non-determinism in tests can lead to stronger specs and better software.
Tests Are Probes, Not Proofs
Let’s get one thing clear: tests are not proofs. They don’t guarantee correctness. They only verify the behaviour under the specific conditions we’ve defined, not every path the system might take. In fact, compared to formal methods or mathematical reasoning, tests are a poor substitute. While testing strengthens our beliefs and gives us confidence about how the system behaves, it does not provide the certainty that formal proofs can offer.
So why do we rely on them so heavily in enterprise development?
Because they’re cheap, or at least cheaper than trying to prove our code correct through formal verification. Of course, that’s not always the case. We’ve all seen tests that are brittle, hard to set up, and painful to maintain. Some offer so little value that it’s hard to justify their existence. But in general, the assumption holds: writing tests is far less expensive than attempting formal proofs of correctness.
Tests act as probes. They are small experiments we run against our code to gain confidence that it behaves correctly, at least under the specific conditions we’ve spelled out.
And that’s the tradeoff. We trade completeness for speed. In fast-moving business environments, we don’t have the luxury of formally verifying every feature. Tests give us a practical safety net. Not a wall of truth, but a set of warning signs that help us catch problems early and correct the course before they become expensive.
The Fragility of Incomplete Probes
Tests are only as good as the conditions we write them for. They verify that behaviour is correct given specific inputs, but they don't say anything about inputs we didn’t think to test.
We call this the probing limitation: tests don't generalise. They only confirm. That confirmation is comforting but sometimes misleading. Take, for example, a simple method like the one below:
public int Divide(int a, int b)
{
    return a / b;
}
A basic division method that divides two integers and returns the result.
We could write a test like this:
[Fact]
public void Divide_ten_by_two_returns_five()
{
    var result = Divide(10, 2);
    Assert.Equal(5, result);
}
A unit test verifying that dividing 10 by 2 correctly returns 5.
This test passes, confirming that Divide(10, 2) works. But it doesn't tell you anything about Divide(10, 0), Divide(0, 10), or Divide(-4, 2). It gives confidence for one specific case, but not general correctness.
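To make that gap concrete, here is a sketch of one additional probe we could write for the same hypothetical Divide method; it leans on the fact that integer division by zero in .NET throws DivideByZeroException:
[Fact]
public void Divide_ten_by_zero_throws()
{
    // One of the inputs the original test never probes:
    // in .NET, integer division by zero throws DivideByZeroException.
    Assert.Throws<DivideByZeroException>(() => Divide(10, 0));
}
A sketch of an extra probe covering one of the untested inputs.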
What we need, then, is a way to make those probes as trustworthy as possible. That’s where determinism enters the conversation.
Why Determinism Matters
A deterministic test is one that always produces the same result given the same code, the same inputs, and the same environment. This is essential for:
- Regression testing: ensuring the system hasn’t changed unintentionally.
- Trust: when tests pass, we want to believe that things are still okay.
- Refactoring: allowing us to safely restructure without altering behaviour.
Non-deterministic tests, those that fail randomly, undermine all of that. They destroy confidence in the test suite and, in practice, turn debugging into a guessing game.
In short: you can’t rely on a safety net that sometimes disappears.
Is All Non-Determinism Bad?
Here’s where things get interesting. Sometimes, controlled randomness can help us discover gaps in our specification. Consider tests that randomly generate input values within a certain range.
[Fact]
public void Divide_two_integers()
{
    var a = RandomInt.AnyInteger();
    var b = RandomInt.NonZeroInteger();

    int expected = a / b;
    int actual = Divide(a, b);

    Assert.True(
        expected == actual,
        $"Failed for inputs a={a}, b={b}");
}
A test using random integers that verifies Divide returns the correct result and avoids dividing by zero.
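The RandomInt helper is not shown in the article; a minimal sketch of what it could look like, assuming System.Random is good enough as a source of test data, might be:
public static class RandomInt
{
    private static readonly Random Rng = new Random();

    // Any 32-bit integer, negative values included
    // (int.MaxValue itself is excluded by Random.Next's exclusive upper bound).
    public static int AnyInteger() => Rng.Next(int.MinValue, int.MaxValue);

    // Any integer except zero, so the division in the test above is always defined.
    public static int NonZeroInteger()
    {
        int value;
        do
        {
            value = AnyInteger();
        } while (value == 0);
        return value;
    }
}
The exact implementation doesn't matter; what matters is the contract each helper promises about the values it produces.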
At first glance this might seem risky. Aren’t we reintroducing non-determinism? Not necessarily.
If we constrain the random values to a well-defined group of inputs that should all produce the same output, then even if the specific inputs vary, the outcomes should not.
If they do, that’s not a failing of the randomness. It’s a sign that our implementation or our specification is ambiguous or incomplete. For example, by using any integers as input arguments, we will eventually encounter cases like Divide(5, 0) or Divide(0, 0), revealing that such scenarios require special handling in our implementation of the Divide method.
[Fact]
public void Divide_two_integers()
{
    var a = RandomInt.AnyInteger();
    var b = RandomInt.AnyInteger();

    int expected = a / b;
    int actual = Divide(a, b);

    Assert.True(
        expected == actual,
        $"Failed for inputs a={a}, b={b}");
}
A test using random integers for both inputs, which may cause a divide-by-zero error.
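What that special handling looks like is a design decision; one possible sketch, making the precondition explicit instead of letting the runtime throw DivideByZeroException, could be:
public int Divide(int a, int b)
{
    // Surface the implicit precondition that the randomised test exposed.
    if (b == 0)
        throw new ArgumentException("Divisor must not be zero.", nameof(b));

    return a / b;
}
Whether we throw, return a result type, or tighten the parameter's type is exactly the kind of specification question the failing run forces us to ask.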
This idea transforms tests from mere probes into tools of discovery.
Example: A Usage-Based Upgrade System
Let’s consider a slightly more complex and realistic example where we can apply the technique to discover specification gaps. We’ll model a usage-based upgrade system, such as one used by a SaaS platform to nudge customers toward higher pricing tiers. The initial assumptions are deceptively simple:
- Users accumulate measurable usage (API calls, data volume etc.).
- Each pricing tier defines a minimum and maximum threshold of usage it can accommodate.
- A user is considered for upgrade when their usage approaches the upper limit of their current tier.
The system calculates a normalised "upgrade pressure" score between 0.0 and 1.0, indicating how close the user is to their maximum allowed usage. These rules sound straightforward, and it’s easy to write code that appears to satisfy them. Let’s follow a TDD approach and start by writing a test that captures these requirements.
[Theory]
[InlineData(0, 1000, 500, 0.5)]   // Midpoint
[InlineData(0, 1000, 0, 0.0)]     // Exactly at min
[InlineData(0, 1000, 1000, 1.0)]  // Exactly at max
[InlineData(100, 1000, 0, 0.0)]   // Below min, clamped
[InlineData(0, 1000, 1500, 1.0)]  // Above max, clamped
public void Calculates_upgrade_pressure_correctly_for_specific_cases(
    int minThreshold, int maxThreshold, int usage, decimal expected)
{
    var calculator = new UpgradePressureCalculator(minThreshold, maxThreshold);
    var result = calculator.Calculate(usage);

    Assert.Equal(expected, result);
}
Unit tests verifying the upgrade pressure calculation for boundary values and inputs outside the defined range.
To make this test pass, let’s implement the UpgradePressureCalculator class:
public class UpgradePressureCalculator
{
    private readonly int _minThreshold;
    private readonly int _maxThreshold;

    public UpgradePressureCalculator(int minThreshold, int maxThreshold)
    {
        _minThreshold = minThreshold;
        _maxThreshold = maxThreshold;
    }

    public decimal Calculate(int usage)
    {
        var range = _maxThreshold - _minThreshold;
        var delta = usage - _minThreshold;
        return Math.Clamp((decimal)delta / range, 0.0m, 1.0m);
    }
}
Calculates a normalised upgrade pressure value between 0 and 1 based on usage relative to minimum and maximum thresholds.
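A quick usage sketch (values borrowed from the InlineData rows above) shows the happy path, and also hints at what the calculator silently accepts:
var calculator = new UpgradePressureCalculator(0, 1000);

Console.WriteLine(calculator.Calculate(500));  // 0.5 (midpoint)
Console.WriteLine(calculator.Calculate(1500)); // 1.0 (above max, clamped)

// Nothing stops us from constructing a calculator that makes no business sense:
var reversed = new UpgradePressureCalculator(1000, 0); // reversed range, silently accepted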
This naive implementation satisfies the test we wrote in the first step. However, it’s based on some implicit assumptions about the behaviour we’re testing. To surface those assumptions more explicitly, let’s introduce an element of randomness and generate a wider range of input values that are acceptable, at least from the method signature’s point of view:
[Theory]
[MemberData(nameof(CalculateUsageTestData))]
public void Calculates_upgrade_pressure_correctly_for_randomized_inputs(
    int minThreshold, int maxThreshold, int usage, decimal expected)
{
    var calculator = new UpgradePressureCalculator(minThreshold, maxThreshold);
    var result = calculator.Calculate(usage);

    Assert.True(
        result == expected,
        $"Failed for minThreshold={minThreshold}, maxThreshold={maxThreshold}, usage={usage}. Expected: {expected}, Actual: {result}");
}

public static IEnumerable<object[]> CalculateUsageTestData()
{
    int min = AnyInt();
    int max = AnyInt();
    int range = max - min;

    yield return new object[] { min, max, min + range / 2, 0.5m }; // Midpoint
    yield return new object[] { min, max, min, 0.0m };             // Exactly at min
    yield return new object[] { min, max, max, 1.0m };             // Exactly at max
    yield return new object[] { min, max, min - 1, 0.0m };         // Below min (clamped)
    yield return new object[] { min, max, max + 1, 1.0m };         // Above max (clamped)
}
Tests the upgrade pressure calculation with various randomised input ranges, ensuring correct handling of values at, below, and above the thresholds.
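As with RandomInt earlier, the AnyInt helper is assumed rather than shown; a minimal sketch, living in the test class, could be as simple as:
private static readonly Random Rng = new Random();

// Any 32-bit integer, with no constraints at all. That lack of constraints is exactly
// what lets the data generator produce negative thresholds, reversed ranges and
// overflow-prone combinations.
private static int AnyInt() => Rng.Next(int.MinValue, int.MaxValue);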
Since tests verify behaviour under specific, discrete input conditions rather than serving as a general proof of correctness, problems might not show up immediately. But once we introduce randomised inputs, we may eventually hit a combination that causes the test to fail. That non-determinism might be hard to swallow at first, but it often reveals a hidden assumption or a gap in the specification. And it's always better to uncover that in a development environment than in production. Let's analyse a couple of such scenarios:
Error Message: Failed for minThreshold=-889706730, maxThreshold=1264934496, usage=1264934497. Expected: 1.0, Actual: 0.9999999995327814700682499279
Error Message: Failed for minThreshold=789508071, maxThreshold=-1008130390, usage=789508070. Expected: 0.0, Actual: 0.000000000556285383126323753
Example failure messages showing what look like precision issues and unexpected results when input thresholds have extreme or reversed values.
This failed test run raised several important questions:
- Should we allow negative values in threshold and usage definitions? Most likely not. However, our current implementation does allow them.
- Should we allow the minimum threshold to be greater than or equal to the maximum? Probably not. While allowing them to be equal might be debatable and could warrant clarification from the business, allowing max < min is clearly a bug.
- Why does the normalised value fall just short of 1.0m or 0.0m when it logically shouldn't? This is a red flag. When the usage exceeds the maximum threshold (or is below the minimum), we would expect a clean clamp to 1.0m (or 0.0m). The fact that we're seeing values like 0.9999999995m indicates subtle issues in the math.
The initial tests based on discrete, "typical" values are still passing though.
Example: Refactoring
Clearly, we haven’t captured all the invariants related to usage and threshold ranges. To address this, we’ll encapsulate these rules in dedicated, domain-specific types, eliminating primitive obsession. Each type will be supported by its own set of focused tests.
During our investigation, we also discovered the root cause of those subtle miscalculations: the range and the delta are computed with plain 32-bit integer arithmetic. With extreme thresholds, such as a large negative minimum combined with a large positive maximum, the subtractions silently overflow and wrap around, and a reversed range (max < min) makes both values negative. The subsequent decimal division then yields a quotient just short of the expected 1.0m or 0.0m, so the clamp never kicks in. The decimal type itself is not to blame; the damage is done before the division ever runs.
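A minimal console sketch, using the values from the first failure message above and the same arithmetic as Calculate, shows the overflow at work:
int min = -889706730;
int max = 1264934496;
int usage = 1264934497;

// Both subtractions exceed int.MaxValue and silently wrap around:
int range = max - min;   // 2,154,641,226 wraps to -2,140,326,070
int delta = usage - min; // 2,154,641,227 wraps to -2,140,326,069

// Two nearly equal negative numbers divide to just under 1.0m,
// so the clamp never triggers.
Console.WriteLine(Math.Clamp((decimal)delta / range, 0.0m, 1.0m));
// 0.9999999995327814700682499279
A sketch reproducing the first failure: integer overflow, not decimal rounding, produces the near-1.0 result.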
To fix this, we'll revise the implementation of the Calculate method so that boundary cases are clamped explicitly before any division takes place, and we’ll move the logic closer to the domain types it operates on. Let's take a look at the UsageAmount and ThresholdRange classes, which represent the volume of data accumulated by the user and the corresponding threshold ranges defining each pricing tier in our SaaS offering.
public class UsageAmount
{
    public int Value { get; }

    public UsageAmount(int value)
    {
        if (value < 0)
            throw new ArgumentOutOfRangeException(nameof(value), "Usage amount cannot be negative.");

        Value = value;
    }

    public static implicit operator int(UsageAmount usage) => usage.Value;
}

public class ThresholdRange
{
    public UsageAmount Min { get; }
    public UsageAmount Max { get; }

    public ThresholdRange(UsageAmount min, UsageAmount max)
    {
        if (min.Value >= max.Value)
            throw new ArgumentException($"Invalid range: min ({min.Value}) must be less than max ({max.Value}).");

        Min = min;
        Max = max;
    }

    public decimal CalculateUpgradePressure(UsageAmount usage)
    {
        if (usage.Value <= Min.Value)
            return 0.0m;
        if (usage.Value >= Max.Value)
            return 1.0m;

        var range = Max.Value - Min.Value;
        var delta = usage.Value - Min.Value;
        return (decimal)delta / range;
    }
}
Defines strong types for usage and threshold ranges with validation, ensuring safe and precise calculation of upgrade pressure.
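A quick usage sketch of the new types (the values are illustrative only):
var tier = new ThresholdRange(new UsageAmount(0), new UsageAmount(1000));

Console.WriteLine(tier.CalculateUpgradePressure(new UsageAmount(500)));  // 0.5
Console.WriteLine(tier.CalculateUpgradePressure(new UsageAmount(1500))); // 1.0 (clamped before any division)

// Invalid states are now unrepresentable:
// new UsageAmount(-1)                                          -> ArgumentOutOfRangeException
// new ThresholdRange(new UsageAmount(10), new UsageAmount(5))  -> ArgumentException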
Let's look at the tests. What’s important here is that, even though we still use randomised input parameters, we’ve organised them into well-defined groups that should trigger the same logic in the system (based on input equivalence classes). This ensures that test outcomes are stable and predictable. If any test still fails unexpectedly, it's likely another gap in the specification or a bug in the code. In such cases, we trigger our discovery process again, this time focused specifically on the narrowed-down problem.
public class ThresholdRangeTests
{
    [Fact]
    public void Ctor_allows_valid_min_and_max()
    {
        var min = new UsageAmount(AnyIntBiggerOrEqualThan(0));
        var max = new UsageAmount(AnyIntBiggerThan(min.Value));

        var range = new ThresholdRange(min, max);

        Assert.Equal(min.Value, range.Min.Value);
        Assert.Equal(max.Value, range.Max.Value);
    }

    [Fact]
    public void Ctor_throws_when_max_is_less_than_min_or_equal()
    {
        var min = new UsageAmount(AnyIntBiggerOrEqualThan(0));
        var max = new UsageAmount(AnyInt(0, min.Value + 1));

        Assert.Throws<ArgumentException>(() => new ThresholdRange(min, max));
    }

    [Theory]
    [MemberData(nameof(CalculateUsageTestData))]
    public void Calculates_upgrade_pressure_correctly_for_randomized_inputs(
        int minThreshold, int maxThreshold, int usage, decimal expected)
    {
        var range = new ThresholdRange(
            new UsageAmount(minThreshold),
            new UsageAmount(maxThreshold));

        var result = range.CalculateUpgradePressure(new UsageAmount(usage));

        Assert.True(
            result == expected,
            $"Failed for minThreshold={minThreshold}, maxThreshold={maxThreshold}, usage={usage}. Expected: {expected}, Actual: {result}");
    }

    public static IEnumerable<object[]> CalculateUsageTestData()
    {
        int min = AnyIntBiggerThan(0);
        int max = AnyIntBiggerThan(min);
        int evenRange = AnyInt(2, (int.MaxValue - min) / 2) * 2; // guarantees an even difference without causing overflow

        yield return new object[] { min, min + evenRange, min + evenRange / 2, 0.5m }; // Midpoint
        yield return new object[] { min, max, min, 0.0m };                             // Exactly at min
        yield return new object[] { min, max, max, 1.0m };                             // Exactly at max
        yield return new object[] { min, max, min - 1, 0.0m };                         // Below min (clamped)
        yield return new object[] { min, max, max + 1, 1.0m };                         // Above max (clamped)
    }

    [Fact]
    public void Calculate_upgrade_pressure_returns_expected_value_when_usage_is_within_range()
    {
        int minThreshold = AnyIntBiggerOrEqualThan(0);
        int maxThreshold = AnyIntBiggerThan(minThreshold);
        int usage = AnyInt(minThreshold, maxThreshold);

        var range = new ThresholdRange(
            new UsageAmount(minThreshold),
            new UsageAmount(maxThreshold));

        var result = range.CalculateUpgradePressure(new UsageAmount(usage));

        Assert.InRange(result, 0.0m, 1.0m);
    }
}
Unit tests validating the construction and behaviour of ThresholdRange, including edge cases and randomised inputs for CalculateUpgradePressure.
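The bounded random helpers used in these tests (AnyInt(min, max), AnyIntBiggerThan, AnyIntBiggerOrEqualThan) are, again, assumed rather than shown; in the same spirit as the earlier sketches, they might look like this:
private static readonly Random Rng = new Random();

// Inclusive lower bound, exclusive upper bound, mirroring Random.Next.
private static int AnyInt(int min, int max) => Rng.Next(min, max);

// Strictly greater than the given value (no overflow guard; good enough for a sketch).
private static int AnyIntBiggerThan(int value) => Rng.Next(value + 1, int.MaxValue);

// Greater than or equal to the given value.
private static int AnyIntBiggerOrEqualThan(int value) => Rng.Next(value, int.MaxValue);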
Final Thoughts: From Tests to Types
I must say this approach is especially valuable in complex, unfamiliar domains where misunderstandings and gaps are common. It shifts how we think about tests, not just as confirmations of specific scenarios but as deliberate attempts to generalise and capture invariants, thereby improving the specifications and, consequently, the code they define. This mindset requires discipline and rigour, but it helps prevent bugs and leads to clearer, more maintainable code.
The point of this example isn’t just to highlight a test strategy. It’s to show how this technique can become a valuable asset in your toolbox and how tests can drive deeper design improvements:
- They surface ambiguous or missing requirements.
- They help make implicit assumptions explicit.
- They guide us toward better domain modelling.
- They prevent bugs, not just catch them.
In the end, randomness in inputs shouldn’t lead to randomness in outcomes. If it does, we’ve either found a bug or a blind spot in our understanding. Use that insight. Let your tests evolve from simple checks into powerful discovery tools.
And when they do? Don’t stop at fixing the test. Refine the spec. Strengthen the types. Let your code speak your intent more clearly. Because the real goal isn’t just passing tests, it’s understanding what we’re building and building it right.
The code examples discussed in this article are available on GitHub. You can find the complete implementation here.