Testing that your scopes aren't lying
A GPU compute shader renders a vectorscope 100x faster than the CPU. But if the GPU gets it wrong, you have a fast liar.
A GPU compute shader can render a vectorscope for a 1920x1080 frame in under a millisecond. The equivalent CPU implementation takes roughly 80 to 120 milliseconds, depending on the machine and the complexity of the trace. That is not a subtle difference. It is the difference between a scope that updates at video rate and one that updates like a progress bar.
So you write the GPU path. You port the math to WGSL or GLSL, you dispatch a compute shader that bins every pixel into the correct chrominance coordinate, you accumulate intensity into a texture, and you ship it. The scope renders beautifully. It looks right.
But “looks right” is doing a lot of work in that sentence.
Why the CPU path exists
Oscillio renders scopes on the GPU because it has to. Real-time video monitoring at 24fps or higher demands it. But every scope type also has a CPU implementation that produces the same output, pixel for pixel. This is not a fallback for machines without a GPU. It is the reference.
The CPU path is written in straightforward Rust. No SIMD intrinsics, no threading tricks, no clever bit manipulation. It processes pixels in scanline order, applies the colour math with f64 precision, and writes into a plain Vec<u32> framebuffer. It is slow. It is also unambiguous. When the CPU path says a particular input pixel maps to a particular position on the vectorscope at a particular intensity, that answer is definitionally correct, because the CPU path is the spec.
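To make that concrete, here is a minimal sketch of the shape of a CPU reference pass for the vectorscope, assuming 8-bit BT.709 input and a square bin grid. The function name, the colour constants, and the binning convention are illustrative, not Oscillio's actual code.

    // Minimal sketch of a CPU reference vectorscope pass: scanline order,
    // colour math in f64, sequential accumulation into a plain Vec<u32>.
    // Assumes 8-bit BT.709 RGB input and a square `size` x `size` bin grid.
    fn accumulate_vectorscope(rgb: &[[u8; 3]], size: usize) -> Vec<u32> {
        let mut bins = vec![0u32; size * size];
        for px in rgb {
            let r = px[0] as f64 / 255.0;
            let g = px[1] as f64 / 255.0;
            let b = px[2] as f64 / 255.0;
            // BT.709 luma, then chroma differences scaled into [-0.5, 0.5].
            let y = 0.2126 * r + 0.7152 * g + 0.0722 * b;
            let cb = (b - y) / 1.8556;
            let cr = (r - y) / 1.5748;
            // Map chroma to bin indices; the clamp keeps gamut-edge values in range.
            let bx = (((cb + 0.5) * (size as f64 - 1.0)).round() as usize).min(size - 1);
            let by = (((0.5 - cr) * (size as f64 - 1.0)).round() as usize).min(size - 1);
            // Sequential accumulation: no atomics, no ordering ambiguity.
            bins[by * size + bx] += 1;
        }
        bins
    }

The GPU path has to reproduce exactly this mapping, just two million times in parallel.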
This might seem like a lot of code to maintain for something users never see. It is. But the alternative is worse: a GPU-only pipeline where correctness is defined by “does it look right to a human squinting at the output.” That is not a definition. That is a hope.
What goes wrong on the GPU
GPU compute shaders introduce three categories of precision problems that do not exist in scalar CPU code.
Floating point differences. GPUs typically operate in f32. Many GPU architectures implement fused multiply-add (FMA) operations that round differently than a separate multiply and add on the CPU. For most graphics work this is irrelevant. For a scope, it means a pixel that should land at chrominance coordinate (0.312, 0.329) might land at (0.312, 0.330). One bin of error in the index, caused by a single bit of rounding difference in the coordinate. Multiply that across two million pixels and you get a vectorscope that is subtly, systematically wrong. Not visibly wrong on any single frame. Wrong in a way that would mislead a colourist making precise skin tone judgments.
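A small, self-contained illustration of how little it takes (not Oscillio code, and the constants are contrived): a coordinate one ULP below a bin edge lands in a different bin than the edge value itself.

    // A one-ULP difference in an f32 coordinate is enough to change the bin index
    // when the value sits on a bin boundary.
    fn bin_index(coord: f32, bins: u32) -> u32 {
        ((coord * bins as f32) as u32).min(bins - 1)
    }

    fn main() {
        let on_edge = 0.5f32; // exactly on a boundary for 1024 bins
        let one_ulp_below = f32::from_bits(on_edge.to_bits() - 1);
        assert_eq!(bin_index(on_edge, 1024), 512);
        assert_eq!(bin_index(one_ulp_below, 1024), 511);
    }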
Atomic operation ordering. Vectorscope rendering is fundamentally a histogram problem. Multiple pixels map to the same bin, and their intensities need to be accumulated. On the CPU, this is a simple bins[index] += intensity. On the GPU, thousands of shader invocations run simultaneously, so accumulation requires atomic operations. Atomic adds on integer values are well-defined and consistent. But the order in which those atomics execute is not deterministic. If you are accumulating floating point values with atomicAdd, the result depends on execution order because floating point addition is not associative. The sum 0.1 + 0.2 + 0.3 does not necessarily equal 0.3 + 0.1 + 0.2 in IEEE 754.
Oscillio sidesteps this by quantising intensity to integer values before atomic accumulation. But that quantisation introduces its own error budget that needs to be tracked and tested.
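The idea, expressed as a host-side Rust sketch rather than the actual WGSL (the names and the 16-bit scale are illustrative): quantise each contribution to a fixed-point integer, accumulate integers, and account for a half-step worst-case error per sample. Integer sums are order-independent, so the non-associativity problem disappears; what remains is a bounded, trackable quantisation error.

    // Host-side sketch of the quantise-then-accumulate idea. In the real shader
    // the accumulation is an atomicAdd on an integer bin; here it is a plain sum.
    const INTENSITY_SCALE: f32 = 65536.0; // illustrative 16-bit fixed-point scale

    fn quantise(intensity: f32) -> u64 {
        (intensity * INTENSITY_SCALE).round() as u64
    }

    fn main() {
        // The same contributions summed in two different orders.
        let values: Vec<f32> = (0..1_000_000).map(|i| i as f32 * 1e-4).collect();

        let float_fwd: f32 = values.iter().sum();
        let float_rev: f32 = values.iter().rev().sum();
        let int_fwd: u64 = values.iter().map(|&v| quantise(v)).sum();
        let int_rev: u64 = values.iter().rev().map(|&v| quantise(v)).sum();

        // The f32 totals typically differ because addition order matters; the
        // quantised integer totals are identical by construction.
        println!("f32: {float_fwd} vs {float_rev}");
        assert_eq!(int_fwd, int_rev);

        // The price: each sample carries up to half a quantisation step of error.
        println!("worst-case error per sample: {}", 0.5 / INTENSITY_SCALE);
    }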
Driver-specific behaviour. The same WGSL shader running on the same logical GPU can produce different output depending on the driver version. We have seen a case where an Intel Arc driver change altered the rounding behaviour of f32-to-u32 conversion inside a compute shader, shifting scope traces by one pixel on certain input values. The shader was correct. The driver was correct. The combination was different from what it was three months earlier.
Building the test harness
Testing image-producing code is not like testing a function that returns a number. You cannot write assert_eq!(gpu_output, expected) because “expected” is a 1024x1024 texture with millions of possible correct values, and exact equality is the wrong metric anyway.
Oscillio’s scope test harness works in three layers.
Layer 1: Golden master comparison
For each scope type (waveform, vectorscope, histogram, CIE diagram), a set of reference input frames is rendered through the CPU path. The output textures are stored as 16-bit PNG files in the test fixtures directory. These are the golden masters.
The test runs the same inputs through the GPU path and compares the output against the golden master. But “compares” is not pixel-exact equality. It is a per-pixel tolerance check: for each pixel, the absolute difference in each channel must be below a threshold. For Oscillio, that threshold is 2 in 8-bit space (roughly 0.8% intensity error) and 1 in the bin-index domain.
When a golden master comparison fails, the harness writes three files: the expected image, the actual image, and a difference map that amplifies the error by 10x so it is visible to a human reviewer. This matters because scope precision bugs often produce differences that are invisible at 1:1 zoom but meaningful at the signal level.
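A sketch of that comparison step, assuming single-channel 8-bit buffers for brevity (the real golden masters are 16-bit, and the names here are illustrative):

    // Per-pixel tolerance check plus the amplified difference map written on failure.
    struct CompareResult {
        pass: bool,
        diff_map: Vec<u8>, // error amplified so sub-visible differences show up
    }

    fn compare_to_golden(actual: &[u8], golden: &[u8], tolerance: u8) -> CompareResult {
        assert_eq!(actual.len(), golden.len());
        let mut pass = true;
        let diff_map = actual
            .iter()
            .zip(golden)
            .map(|(&a, &g)| {
                let err = a.abs_diff(g);
                if err > tolerance {
                    pass = false;
                }
                err.saturating_mul(10) // 10x amplification for the human reviewer
            })
            .collect();
        CompareResult { pass, diff_map }
    }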
Layer 2: Statistical validation
Golden master tests catch regressions, but they do not catch systematic bias. If the GPU path consistently shifts the vectorscope trace 0.5 pixels to the right, it might pass the per-pixel tolerance on every individual pixel while still being wrong in aggregate.
The statistical layer computes summary metrics across the entire output: mean error, max error, error standard deviation, and the 99th percentile error. These metrics are compared against budgets defined per scope type. The vectorscope budget is tighter than the waveform budget because chrominance precision matters more than luminance precision for the use cases where people rely on vectorscopes.
The harness also checks for bias. If the mean signed error (not absolute error) is consistently positive or negative, that indicates a systematic offset, not random noise. The bias threshold is set at 0.1 in 8-bit space. Anything above that triggers a failure even if every individual pixel is within tolerance.
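Sketched in Rust, with illustrative names and budgets defined per scope type:

    // Sketch of the statistical layer: aggregate metrics over the whole output,
    // checked against a per-scope budget. Names and fields are illustrative.
    struct ErrorBudget {
        mean_abs: f64,
        max_abs: f64,
        std_dev: f64,
        p99_abs: f64,
        bias: f64, // allowed magnitude of the mean *signed* error
    }

    fn within_budget(actual: &[u8], golden: &[u8], budget: &ErrorBudget) -> bool {
        let n = actual.len() as f64;
        let signed: Vec<f64> = actual
            .iter()
            .zip(golden)
            .map(|(&a, &g)| a as f64 - g as f64)
            .collect();

        let mut abs: Vec<f64> = signed.iter().map(|e| e.abs()).collect();
        abs.sort_by(|x, y| x.partial_cmp(y).unwrap());

        let mean_signed = signed.iter().sum::<f64>() / n; // systematic offset, not noise
        let mean_abs = abs.iter().sum::<f64>() / n;
        let variance = signed.iter().map(|e| (e - mean_signed).powi(2)).sum::<f64>() / n;
        let p99 = abs[((n * 0.99) as usize).min(abs.len() - 1)];

        mean_abs <= budget.mean_abs
            && *abs.last().unwrap() <= budget.max_abs
            && variance.sqrt() <= budget.std_dev
            && p99 <= budget.p99_abs
            && mean_signed.abs() <= budget.bias // the bias check: sign matters
    }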
Layer 3: Synthetic edge cases
Real video frames are poor test inputs because they exercise the common case. Most pixels in most frames produce similar chrominance values, cluster around similar luminance levels, and do not stress the boundaries of the scope’s coordinate space.
The synthetic test suite generates inputs designed to exercise the edges: single-pixel inputs at extreme chrominance coordinates, flat fields at specific code values, gradients that sweep through the entire gamut, patterns that force every pixel into the same histogram bin (maximum atomic contention), and frames with values at the exact boundary between adjacent bins.
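One of those generators, sketched with illustrative names and dimensions: a gradient that sweeps the gamut so the trace visits far more of the chroma plane than a real frame would.

    // Sketch of a gamut-sweep generator. Red ramps horizontally, blue vertically,
    // green along the diagonal, which exercises a wide spread of Cb/Cr
    // combinations rather than the narrow cluster a real frame produces.
    fn gamut_sweep_frame(width: usize, height: usize) -> Vec<[u8; 3]> {
        let mut frame = Vec::with_capacity(width * height);
        for y in 0..height {
            for x in 0..width {
                let r = (x * 255 / (width - 1)) as u8;
                let b = (y * 255 / (height - 1)) as u8;
                let g = ((x + y) * 255 / (width + height - 2)) as u8;
                frame.push([r, g, b]);
            }
        }
        frame
    }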
The single-bin-contention test is particularly valuable. It creates an input where every pixel maps to the same vectorscope coordinate. The CPU path produces a clean result because accumulation is sequential. The GPU path dispatches thousands of atomic adds to the same memory location simultaneously. If the atomic accumulation has any precision loss, it shows up here as a discrepancy between the GPU sum and the CPU sum. This test caught a bug in our original implementation where we were using atomicAdd on f32 values instead of quantised integers. The error was tiny per pixel but compounded to a visible intensity difference at high contention.
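The contention input itself is trivial to construct. A sketch, reusing the accumulate_vectorscope sketch from earlier as the reference side (the GPU call it would be compared against is omitted):

    // Every pixel carries the same colour, so every shader invocation targets the
    // same bin. The colour choice is arbitrary; any constant chroma works.
    fn contention_frame(width: usize, height: usize) -> Vec<[u8; 3]> {
        vec![[191, 0, 0]; width * height] // roughly 75% red
    }

    #[test]
    fn contention_reference_is_a_single_bin() {
        let frame = contention_frame(1920, 1080);
        let bins = accumulate_vectorscope(&frame, 1024);
        // Exactly one bin is hit, and it holds every pixel. The GPU output is
        // then required to match this count at the same bin.
        assert_eq!(bins.iter().filter(|&&v| v > 0).count(), 1);
        assert_eq!(*bins.iter().max().unwrap(), 1920 * 1080);
    }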
Running it
The test suite runs on CI against three GPU backends: Vulkan (Linux, discrete GPU), Metal (macOS, Apple Silicon), and DX12 (Windows, NVIDIA). Each backend gets its own tolerance budgets because driver precision varies. The Metal path is slightly more precise than the Vulkan path on the same logical operations, which we attribute to Apple’s tighter shader compiler behaviour, though we have not confirmed that.
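The per-backend budgets are plain data. A sketch of the shape, reusing the ErrorBudget struct from the statistical layer above; the numbers are placeholders, not Oscillio's actual budgets.

    // Placeholder budgets keyed on the backend under test. The real values are
    // tuned per scope type as well as per backend.
    fn budget_for(backend: &str) -> ErrorBudget {
        match backend {
            // Metal has measured slightly tighter for us, so its budget can be stricter.
            "metal" => ErrorBudget { mean_abs: 0.05, max_abs: 1.0, std_dev: 0.3, p99_abs: 0.5, bias: 0.1 },
            _ => ErrorBudget { mean_abs: 0.08, max_abs: 2.0, std_dev: 0.5, p99_abs: 1.0, bias: 0.1 },
        }
    }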
Tests run on every commit that touches the scope rendering code. A full run takes about 40 seconds, most of which is GPU setup and teardown. The actual rendering and comparison is fast. The CI machines have GPUs specifically because headless CPU-only testing would not catch the bugs that matter.
When a test fails, the review process is manual. A developer looks at the difference map, decides whether the error is a real bug or a legitimate precision change (like a driver update), and either fixes the shader or updates the golden master. Updating a golden master requires a comment in the commit explaining why the old reference was wrong or why the new precision characteristic is acceptable.
This is deliberately not automated. The decision to accept a new precision baseline is a judgment call about signal integrity, and we do not want that decision made by a threshold.
The cost of two paths
Maintaining both CPU and GPU implementations of every scope type is real work. When the rendering algorithm changes, both paths need updating. When a new scope feature ships, it ships twice. The CPU path is typically 3x more code than the GPU path because scalar code cannot lean on implicit parallelism.
The alternative is trusting the GPU output because it looks right. We tried that early on. It produced scopes that were wrong by 1 to 2 pixels on certain input values, in ways that no human reviewer caught during development, but that a colourist would eventually notice when making critical adjustments near the edges of legal gamut. A scope that is wrong at the margins is worse than no scope at all, because it teaches you to distrust the tool.
The CPU reference path costs us roughly two weeks of additional engineering per scope type. The confidence it buys is worth more than that.