Oscillio

Integer-accurate colour math

When a scope shows 940, it had better mean exactly 940. Not 939.9997 rounded up.

You’ve measured a white bar. Your waveform reads 940. The question nobody asks: is that 940, or is it 939.9997 that got rounded up at the last moment before display?

For creative grading, the difference is invisible. For measurement, it’s the difference between a signal that passes QC and one that doesn’t. Between a scope you can trust and one that’s guessing.

The answer depends on whether your colour math is running in floating point or fixed-point integer arithmetic. The two give different results more often than you’d expect.

The BT.709 matrix

Every HD video signal encodes colour as YCbCr. The conversion from RGB is defined in ITU-R BT.709, and the matrix coefficients are specific:

Y  =  0.2126 R + 0.7152 G + 0.0722 B
Cb = -0.1146 R - 0.3854 G + 0.5000 B
Cr =  0.5000 R - 0.4542 G - 0.0458 B

These are the numbers. They come from the CIE luminous efficiency function weighted for the 709 primaries. They’re not negotiable and they’re not approximate. The standard defines them to four decimal places.

For 10-bit video, the output values are scaled into a defined range. Luma (Y) occupies code values 64 to 940. Chroma (Cb and Cr) occupies 64 to 960, with 512 representing zero. These are also exact. A full-white signal is Y = 940, not 939 or 941.
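In code, these levels are nothing more than integer constants. A minimal Python sketch of a legality check (the names are illustrative, not taken from any particular scope):

```python
# 10-bit BT.709 narrow-range levels, as exact integers per the standard.
Y_MIN, Y_MAX = 64, 940    # legal luma range
C_MIN, C_MAX = 64, 960    # legal chroma range
C_ZERO = 512              # chroma neutral point (zero chroma)

def luma_legal(y: int) -> bool:
    """True when a 10-bit luma code value is inside the legal range."""
    return Y_MIN <= y <= Y_MAX

print(luma_legal(940), luma_legal(941))  # True False
```

A single code value separates a passing signal from a failing one, which is why the constants themselves must be exact integers rather than scaled floats.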

Where floating point breaks

Take a simple case: convert a 10-bit RGB triplet of (940, 940, 940) to YCbCr. This is peak white. The luma result should be exactly 940 and both chroma channels should be exactly 512.

In floating point, the computation looks straightforward. Multiply each component by its coefficient, sum the results, scale to the output range, and round. But floating point doesn’t represent these coefficients exactly. The number 0.7152 has no finite binary expansion; the nearest representable IEEE 754 double is roughly 5 × 10^-17 below it. That difference seems negligible. Multiply it by a 10-bit value, add it to two other similarly imprecise products, scale the result by 876 (the luma range), and the accumulated error starts to matter.

The result isn’t 940.0. It’s 939.9999999999998 or 940.0000000000002, depending on the platform, the compiler’s floating-point optimisation flags, and whether the CPU is using 80-bit extended precision internally before rounding to 64-bit. On some paths you get 940. On others you get 939 after truncation, or 941 after an unlucky rounding cascade.
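Python’s decimal module can show the exact value a double actually stores, which makes the misrepresentation concrete (a small illustrative check, not production code):

```python
from decimal import Decimal

# Decimal(float) reveals the exact binary value a double stores.
stored = Decimal(0.7152)
print(stored)  # prints the exact stored value: close to, but not, 0.7152

# The stored double is not the coefficient the standard defines,
# though the error is tiny (well under 1e-15).
assert stored != Decimal("0.7152")
assert abs(stored - Decimal("0.7152")) < Decimal("1e-15")
```

The error is invisible at this stage; it only becomes visible after multiplication, summation, scaling, and a final round to an integer code value.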

This isn’t hypothetical. Different implementations of the same formula produce different integer results for the same input because the floating-point intermediate values aren’t identical. Two scopes built by different developers, both implementing BT.709 correctly, can disagree by one code value. For a creative tool, that’s irrelevant. For a measurement tool, it means one of them is wrong.

The problem compounds at the edges

The worst cases aren’t peak white. They’re the values near the legal limits where a single code value determines whether a signal passes or fails. A luma value of 940 is legal. A value of 941 is not. If your conversion produces 940.4999 and you round down, you get 940: legal. If the same input on a different code path produces 940.5001 and you round up, you get 941: illegal.

The same problem appears at the bottom of the range. Code value 64 is legal black. Code value 63 is below legal range. A conversion that produces 64.0001 and one that produces 63.9999 will give you different answers for the same input.
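Both knife edges are easy to reproduce with Python’s built-in rounding (the values are illustrative, chosen to sit just either side of the halfway point):

```python
# Near the top of the legal range: a sub-half-code-value difference in
# the float intermediate flips the result between legal and illegal.
assert round(940.4999) == 940   # legal
assert round(940.5001) == 941   # illegal

# Near legal black: rounding keeps 63.9999 at 64, but truncation
# drops it below the legal range.
assert round(63.9999) == 64     # legal black
assert int(63.9999) == 63       # truncated: below legal range
```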

Chroma has the same issue. The neutral point is 512 for both Cb and Cr. A pure grey input should produce exactly 512 on both chroma channels. In floating point, you often get 511.9999 or 512.0001. After rounding, that’s either 512 or 511 or 513. The error is one code value. In a creative context, nobody would ever see it. In a scope, it means your chroma trace has a DC offset on grey that shouldn’t exist.

Float vs integer YCbCr
BT.709 conversion — same input, different rounding

Input RGB (8-bit): R = 128, G = 128, B = 128

Y  = 0.2126 × R + 0.7152 × G + 0.0722 × B
Cb = (B − Y) / 1.8556 + 128
Cr = (R − Y) / 1.5748 + 128

Channel   Float (exact)   Float (rounded)   Integer path   Δ
Y         128.0000        128               128            0
Cb        128.0000        128               128            0
Cr        128.0000        128               128            0

How integer math fixes it

The fix is to do what the hardware did: work entirely in integers. The BT.709 coefficients can be represented as ratios of integers with a fixed denominator. The standard approach uses a power-of-two denominator so the final division can be a bit shift.

For example, using a 16-bit scale factor of 2^16 (65536):

0.2126 * 65536 = 13932.9536  -> round to 13933
0.7152 * 65536 = 46871.3472  -> round to 46871
0.0722 * 65536 =  4731.6992  -> round to 4732

Now the luma calculation becomes:

Y_scaled = 13933 * R + 46871 * G + 4732 * B
Y = (Y_scaled + 32768) >> 16

That + 32768 is the rounding bias. Adding half the denominator before the right shift gives you correct rounding (round-half-up) instead of truncation. Every intermediate value is an integer. Every operation is exact. There’s no accumulated floating-point error because there are no floating-point operations.
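A minimal Python sketch of that integer path, using the coefficients derived above (full-range, luma only; a real converter would also apply the 64–940 range scaling):

```python
# Fixed-point BT.709 luma coefficients, scaled by 2^16 as shown above.
KR, KG, KB = 13933, 46871, 4732   # sum is exactly 65536
SHIFT = 16
HALF = 1 << (SHIFT - 1)           # 32768: the round-half-up bias

def rgb_to_luma(r: int, g: int, b: int) -> int:
    """Integer-only luma: every intermediate value is an exact integer."""
    y_scaled = KR * r + KG * g + KB * b
    return (y_scaled + HALF) >> SHIFT

print(rgb_to_luma(940, 940, 940))  # 940: peak white in, peak white out
```

Because the coefficients sum to exactly 65536, every equal-RGB input maps back to itself with zero error, on every platform, every time.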

The key detail is choosing the integer coefficients so they sum to exactly the denominator. 13933 + 46871 + 4732 = 65536. This guarantees that equal RGB inputs produce an output that’s exactly equal to the input (after scaling), because the sum of the products equals the input value times the denominator. Peak white in, peak white out. No drift.

When the coefficients don’t sum exactly, you adjust the largest one. Since the green coefficient has the most significant bits, nudging it by one count has the smallest relative effect. Better to be off by 1/65536 on the green coefficient than to have the entire system drift by one code value on common inputs.
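The nudge-the-largest rule takes only a few lines. A sketch in Python (the function name is illustrative):

```python
def quantize_coeffs(coeffs, denom=65536):
    """Round coefficients to fixed point, then nudge the largest one so the
    integer coefficients sum to exactly the denominator (no drift on greys)."""
    q = [round(c * denom) for c in coeffs]
    q[q.index(max(q))] += denom - sum(q)   # absorb the residue in the largest
    return q

print(quantize_coeffs([0.2126, 0.7152, 0.0722]))  # [13933, 46871, 4732]
```

For the BT.709 luma coefficients the rounded values happen to sum to 65536 already, so the nudge is zero; for other matrices or denominators it is not.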

The rounding bias matters

The choice of rounding strategy isn’t cosmetic. Truncation (just shifting without adding the bias) introduces a systematic negative error. Every result is biased toward the lower code value. On average, you’re off by half a code value, and the error is always in the same direction. A scope that truncates will read consistently low.

Round-half-up (adding 0.5, or the integer equivalent of half the denominator) eliminates the systematic bias. The error on any individual sample is at most half a code value, and it’s equally likely to round up or down. There’s no DC offset in the error.
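The difference is easy to demonstrate with the integer coefficients from earlier (an illustrative sketch, full-range luma only):

```python
KR, KG, KB = 13933, 46871, 4732   # fixed-point BT.709 luma, scaled by 2^16

def luma_trunc(r, g, b):
    return (KR * r + KG * g + KB * b) >> 16            # truncation: biased low

def luma_round(r, g, b):
    return (KR * r + KG * g + KB * b + 32768) >> 16    # round-half-up: unbiased

# Truncation can never read high, and on many inputs it reads one code
# value lower than the correctly rounded result.
assert luma_trunc(3, 0, 0) == 0
assert luma_round(3, 0, 0) == 1   # exact value is 41799/65536, about 0.64
```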

For a waveform monitor, this distinction is the difference between a trace that sits cleanly on the graticule line and one that’s consistently one pixel low. On a Tektronix hardware scope, the conversion was done in dedicated silicon that used integer math with correct rounding. The trace sat on the line. If your software scope truncates instead of rounding, your trace won’t match the hardware, and the hardware was right.

The inverse matters too

Conversion is bidirectional. YCbCr comes in from the video signal, gets converted to RGB for display and analysis, then possibly back to YCbCr for output. Each conversion is an opportunity for error. If the forward conversion uses integer math and the inverse uses floating point (or vice versa), the round-trip isn’t lossless even when mathematically it should be.
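That the round-trip should be lossless is easy to verify in exact rational arithmetic, where no rounding exists at all. A Python sketch using fractions.Fraction (illustrative only; a real-time pipeline obviously cannot afford rationals):

```python
from fractions import Fraction as F

# Exact BT.709 coefficients as rationals. In this arithmetic the round-trip
# is mathematically the identity, so any real-world loss comes purely from
# finite-precision rounding in the forward or inverse path.
KR, KB = F("0.2126"), F("0.0722")
KG = 1 - KR - KB                     # exactly 0.7152

def forward(r, g, b):
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2 * (1 - KB))    # denominator 1.8556
    cr = (r - y) / (2 * (1 - KR))    # denominator 1.5748
    return y, cb, cr

def inverse(y, cb, cr):
    r = y + 2 * (1 - KR) * cr
    b = y + 2 * (1 - KB) * cb
    g = (y - KR * r - KB * b) / KG
    return r, g, b

rgb = (F(700), F(512), F(64))
assert inverse(*forward(*rgb)) == rgb   # exact identity in rationals
```

If both directions use the same fixed-point coefficients and the same rounding rule, the integer pipeline stays as close to this ideal as the code-value grid allows.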

This is why Oscillio runs the entire signal path in integer arithmetic. The YCbCr-to-RGB conversion, the measurement calculations, the histogram binning, the trace rendering coordinates. Every operation that touches a code value uses integer math with explicit rounding control. A value that enters the pipeline as 940 is still 940 when it reaches the display, not 939 and not 941.

The alternative is to hope that the floating-point path through your particular compiler on your particular CPU produces the right answer for every possible input. Hope is not a measurement strategy.

Why this affects you

If you’re building a creative tool (a colour grading panel, a preview window, a thumbnail generator) none of this matters. The visual difference between Y = 940 and Y = 939 is zero. Nobody can see it. Use whatever math is most convenient.

If you’re building a measurement tool, or using one, it matters completely. A waveform monitor exists to tell you the exact code values in your signal. If the monitor’s internal math introduces a one-code-value error, the monitor is reporting a signal that isn’t the signal you sent it. It’s a measurement instrument that doesn’t measure accurately. The fact that the error is small doesn’t make it acceptable. Voltmeters don’t get to be off by one millivolt because the difference is small.

The hardware scope vendors understood this. The Tektronix WFM7120 didn’t use floating point for signal conversion. Neither did the Leader LV-5800. They used dedicated integer pipelines with controlled rounding, because measurement is not approximation.

Software scopes that want to be taken seriously as measurement tools need to meet the same bar. When the trace says 940, the signal should be 940. Not 939.9997 rounded up. Not 940.0003 rounded down. Exactly 940, because the math that produced it was exact.