Container surgery without a DAW

A sound editor asks you to trim two seconds of silence off the head of a field recording. The file is a 48kHz/24-bit polyphonic BWF, eight channels, 47 minutes long. At 1,152,000 bytes per second of audio, that is about 3.2 gigabytes.

The normal approach: open the file in Pro Tools or a similar DAW, select the region, bounce it out. The DAW decodes the entire file into its internal mixing engine, applies the edit, then re-encodes and writes a new file. For a trim operation that removes silence, this involves billions of unnecessary sample conversions. It also takes time, consumes memory, and introduces an encode cycle that, depending on the DAW’s export settings, might not even be bit-identical to the original.

There is another way. If you understand the structure of the container, you can do the trim with byte arithmetic. No decoding. No encoding. No quality loss. Just moving a pointer.

The anatomy of a WAV file

A WAV file is a RIFF container. RIFF (Resource Interchange File Format) is one of the simplest container formats ever designed. The entire structure is built from chunks, each following the same pattern:

[4-byte ID] [4-byte size (little-endian)] [payload bytes...]

Every chunk declares what it is and how big it is. A parser can walk through the file by reading 8 bytes, noting the chunk ID and size, then either parsing the payload or skipping ahead by size bytes (plus one pad byte when size is odd, because chunks start on even offsets) to reach the next chunk. That’s the whole protocol.
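As a minimal sketch of that walk, assuming a plain RIFF/WAVE file under 4 GB (no ds64) and a seekable binary stream: the function name walk_chunks and the synthetic demo file are illustrative, not taken from any particular tool. It also accounts for the RIFF rule that an odd-sized payload is followed by one pad byte.

```python
import io
import struct

def walk_chunks(f):
    """Yield (chunk_id, payload_offset, size) for each top-level chunk."""
    f.seek(0)
    riff_id, riff_size, form = struct.unpack("<4sI4s", f.read(12))
    if riff_id != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    end = 8 + riff_size              # riff_size counts everything after the first 8 bytes
    pos = 12
    while pos + 8 <= end:
        f.seek(pos)
        chunk_id, size = struct.unpack("<4sI", f.read(8))
        yield chunk_id, pos + 8, size
        pos += 8 + size + (size & 1)  # odd-sized payloads carry one pad byte

# Tiny synthetic file: a 3-byte chunk (padded to 4) followed by a 4-byte data chunk.
body = (b"WAVE"
        + b"abcd" + struct.pack("<I", 3) + b"xyz\x00"
        + b"data" + struct.pack("<I", 4) + b"\x01\x02\x03\x04")
demo = io.BytesIO(b"RIFF" + struct.pack("<I", len(body)) + body)
print(list(walk_chunks(demo)))
```

Only the 8-byte headers are read; the payloads are never touched, which is what keeps the walk cheap even on a multi-gigabyte file.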

A standard WAV file contains at minimum two chunks inside the RIFF envelope:

fmt (note the trailing space; it’s a 4-byte ID) describes the audio format: sample rate, bit depth, channel count, block alignment. This chunk is small, typically 16 or 18 bytes of payload, or 40 for the WAVE_FORMAT_EXTENSIBLE variant that some multichannel files use. It holds everything a decoder needs to know about how to interpret the raw samples.
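For illustration, the first 16 bytes of the fmt payload can be unpacked in one call. This is a sketch assuming plain PCM; the Fmt tuple and its field names are chosen here for readability and are not prescribed by the format.

```python
import struct
from typing import NamedTuple

class Fmt(NamedTuple):
    format_tag: int       # 1 = PCM
    num_channels: int
    sample_rate: int
    byte_rate: int        # sample_rate * block_align
    block_align: int      # bytes per sample frame
    bits_per_sample: int

def parse_fmt(payload: bytes) -> Fmt:
    """Decode the 16-byte core of a fmt chunk payload (all fields little-endian)."""
    return Fmt(*struct.unpack_from("<HHIIHH", payload, 0))

# The article's file: 8 channels, 48 kHz, 24-bit, so 24 bytes per frame.
payload = struct.pack("<HHIIHH", 1, 8, 48000, 48000 * 24, 24, 24)
fmt = parse_fmt(payload)
print(fmt.num_channels, fmt.block_align, fmt.byte_rate)
```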

data contains the actual interleaved PCM samples. In a 47-minute 8-channel 48kHz/24-bit file, this chunk is nearly the entire file. The samples sit there in order, channel by channel within each sample frame, frame after frame. No compression, no indexing, no headers within the data. Just numbers.

A Broadcast Wave Format (BWF) file adds more chunks to this base:

bext carries the broadcast extension metadata defined in EBU Tech 3285. The critical field is time_reference, a 64-bit sample count from midnight that gives the file its position on the timeline. Also holds originator information, origination date and time, UMID, and optionally loudness measurements per EBU R 128.
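A sketch of reading that field, assuming the fixed bext layout in which 338 bytes of fixed-width text fields (256 + 32 + 32 + 10 + 8) precede the time reference. The spec stores it as a low and a high 32-bit word, which is equivalent to one little-endian 64-bit integer. The helper names here are illustrative.

```python
import struct

TIME_REFERENCE_OFFSET = 338  # description, originator, reference, date, time come first

def read_time_reference(bext_payload: bytes) -> int:
    """Return the sample count since midnight stored in a bext payload."""
    (value,) = struct.unpack_from("<Q", bext_payload, TIME_REFERENCE_OFFSET)
    return value

def samples_to_clock(samples: int, sample_rate: int) -> str:
    """Render a sample count as HH:MM:SS plus leftover samples."""
    seconds, rem = divmod(samples, sample_rate)
    h, rest = divmod(seconds, 3600)
    m, s = divmod(rest, 60)
    return f"{h:02d}:{m:02d}:{s:02d} + {rem} samples"

# Synthetic 602-byte bext payload stamped at 10:00:00 for a 48 kHz file.
payload = bytearray(602)
struct.pack_into("<Q", payload, TIME_REFERENCE_OFFSET, 10 * 3600 * 48000)
print(samples_to_clock(read_time_reference(bytes(payload)), 48000))
```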

iXML holds production metadata as XML. Scene, take, tape name, track names, channel assignments, circled-take flags, recorder model, project name. This is the chunk that lets a conform tool reconnect a recording to the script supervisor’s notes.

You might also find axml, LIST, JUNK (padding for future in-place edits), or ds64 (the 64-bit size extension for files over 4 GB). But the principle never changes. Chunks, one after another, each self-describing.

BWF chunk structure (154.95 MB)

Offset       Chunk   Size        Contents
0x0          RIFF    154.95 MB   Container header: type WAVE
0xC          fmt     16 B        Format: 8ch / 48kHz / 24-bit PCM
0x24         data    154.94 MB   Interleaved PCM audio samples
0x93C422C    bext    602 B       Broadcast extension: timecode, UMID, originator
0x93C448E    iXML    3.3 KB      Production metadata: scene, take, track names

Trimming without touching the audio

Back to the trim operation. We need to remove two seconds of silence from the head of an 8-channel, 48kHz, 24-bit file.

First, the math. Two seconds at 48,000 samples per second is 96,000 sample frames. Each frame contains 8 channels at 3 bytes per sample (24-bit), so one frame is 24 bytes. The number of bytes to remove: 96,000 * 24 = 2,304,000.
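The same arithmetic as a small helper, in case the numbers need to come from a file's fmt chunk rather than from memory. The function name is hypothetical.

```python
def head_trim_bytes(seconds, sample_rate, channels, bits_per_sample):
    """Return (frames, bytes) to drop for a head trim of the given duration."""
    frames = round(seconds * sample_rate)
    block_align = channels * (bits_per_sample // 8)  # bytes per sample frame
    return frames, frames * block_align

print(head_trim_bytes(2, 48000, 8, 24))
```

Working in whole frames matters: a cut that is not a multiple of the frame size would shift every channel in the remaining audio.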

The trim operation is:

1. Parse the chunk table. Walk through the file, note every chunk ID, its offset, and its size. Don’t read the data payloads, just the 8-byte headers.
2. Write a new RIFF header. The total RIFF size decreases by 2,304,000 bytes.
3. Copy the fmt chunk verbatim. Nothing about the format has changed.
4. Write the data chunk header with its new size (original data size minus 2,304,000), then seek past the first 2,304,000 bytes of the original data chunk and copy the rest.
5. Update the bext chunk. The time_reference field needs to advance by 96,000 samples, because the file now starts two seconds later on the timeline.
6. Copy all other chunks (iXML, axml, JUNK, anything unknown) verbatim. One caveat: if the iXML also carries a sample-accurate timestamp, it needs the same 96,000-sample adjustment as bext.
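The steps above can be sketched as a single streaming pass. This is an illustration under stated assumptions, not a production tool: the file is plain RIFF under 4 GB (no ds64), dst is seekable so the RIFF size can be patched at the end, the caller supplies block_align, the bext time reference sits at payload offset 338, and iXML is copied verbatim without any timestamp adjustment. The names trim_head, _copy, and _header are hypothetical.

```python
import struct

TIME_REFERENCE_OFFSET = 338   # time_reference position inside the bext payload
BUF = 1 << 20                 # 1 MiB copy buffer

def _copy(src, dst, count):
    """Stream count bytes from src to dst without loading them all at once."""
    while count > 0:
        block = src.read(min(BUF, count))
        if not block:
            raise EOFError("file ended inside a chunk")
        dst.write(block)
        count -= len(block)

def _header(dst, chunk_id, size):
    dst.write(struct.pack("<4sI", chunk_id, size))

def trim_head(src, dst, trim_frames, block_align):
    """Copy src to dst, dropping trim_frames sample frames from the head of data."""
    cut = trim_frames * block_align
    src.seek(0)
    riff_id, riff_size, form = struct.unpack("<4sI4s", src.read(12))
    if riff_id != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    dst.write(struct.pack("<4sI4s", b"RIFF", 0, b"WAVE"))  # size patched below
    pos, end = 12, 8 + riff_size
    while pos + 8 <= end:
        src.seek(pos)
        chunk_id, size = struct.unpack("<4sI", src.read(8))
        new_size = size
        if chunk_id == b"data":
            if cut > size:
                raise ValueError("trim is longer than the audio")
            new_size = size - cut
            _header(dst, chunk_id, new_size)
            src.seek(pos + 8 + cut)            # skip the trimmed frames
            _copy(src, dst, new_size)
        elif chunk_id == b"bext":
            payload = bytearray(src.read(size))
            (tref,) = struct.unpack_from("<Q", payload, TIME_REFERENCE_OFFSET)
            struct.pack_into("<Q", payload, TIME_REFERENCE_OFFSET, tref + trim_frames)
            _header(dst, chunk_id, size)
            dst.write(payload)
        else:
            _header(dst, chunk_id, size)
            _copy(src, dst, size)              # fmt, iXML, JUNK, unknown: verbatim
        if new_size & 1:
            dst.write(b"\x00")                 # keep chunks word-aligned
        pos += 8 + size + (size & 1)
    total = dst.tell()
    dst.seek(4)
    dst.write(struct.pack("<I", total - 8))    # final RIFF size
    dst.seek(total)
```

For the article's file the call would be trim_head(src, dst, 96000, 24), with src and dst opened in binary mode. Memory use stays at one copy buffer plus the bext payload, regardless of file size.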

The result is a bit-identical subset of the original audio with corrected metadata. No sample was decoded. No sample was re-encoded. The PCM values in the output file are the exact same bytes that existed in the input file, just starting from a different offset.

This is what makes it lossless in the strictest sense. Not “perceptually lossless” or “lossless compression.” Literally the same bytes.

Channel extraction

The same chunk-level thinking applies to extracting individual channels from a polyphonic file. A dialogue editor might need just the boom track (channel 1) from an 8-channel poly.

PCM interleaving means the samples are laid out as:

[ch1 s0] [ch2 s0] [ch3 s0] ... [ch8 s0] [ch1 s1] [ch2 s1] ...

Extraction walks through the data chunk one frame at a time, copying only the bytes for the desired channel. For our 24-bit 8-channel file, each frame is 24 bytes. Channel 1 occupies bytes 0 through 2 of each frame. The operation reads 24 bytes, writes 3 bytes, advances, repeats.
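A minimal sketch of that stride copy, operating on a buffer of whole frames. The function name is hypothetical, and channel is a zero-based index here, so the article's channel 1 is index 0.

```python
def extract_channel(data: bytes, channel: int, num_channels: int,
                    bytes_per_sample: int) -> bytes:
    """Return only the bytes of one channel from interleaved PCM frames."""
    block_align = num_channels * bytes_per_sample
    if len(data) % block_align:
        raise ValueError("data does not end on a frame boundary")
    start = channel * bytes_per_sample
    out = bytearray()
    for frame in range(0, len(data), block_align):
        # Copy this channel's sample bytes; skip the rest of the frame.
        out += data[frame + start : frame + start + bytes_per_sample]
    return bytes(out)

# Two frames of 8-channel 24-bit audio; each sample is tagged (frame, channel, 0xAA).
frames = bytes(b for f in range(2) for c in range(8) for b in (f, c, 0xAA))
print(extract_channel(frames, 0, 8, 3).hex())  # 0000aa0100aa
```

For a large file, the same function can be applied to successive reads whose length is a multiple of the frame size, so the data chunk never has to fit in memory.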

The new file gets a fmt chunk with num_channels set to 1, and with block_align and the average byte rate recalculated accordingly. The data chunk shrinks to one-eighth its original size, and the RIFF size shrinks with it. The bext timecode stays the same, because the temporal position hasn’t changed. The iXML gets updated to reflect a mono file with the correct track name.
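The fmt rewrite is small enough to show in full. This sketch assumes a plain PCM fmt payload (format tag 1) and deliberately refuses anything else, since the WAVE_FORMAT_EXTENSIBLE variant also carries a channel mask that would need updating. The function name mono_fmt is hypothetical.

```python
import struct

def mono_fmt(payload: bytes) -> bytes:
    """Return a 16-byte PCM fmt payload describing one channel of the input format."""
    tag, _channels, rate, _byte_rate, _align, bits = struct.unpack_from("<HHIIHH", payload, 0)
    if tag != 1:
        raise ValueError("sketch only handles plain PCM (format tag 1)")
    block_align = bits // 8                    # one channel per frame now
    return struct.pack("<HHIIHH", tag, 1, rate, rate * block_align, block_align, bits)

poly = struct.pack("<HHIIHH", 1, 8, 48000, 48000 * 24, 24, 24)
print(struct.unpack("<HHIIHH", mono_fmt(poly)))
```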

No decode. No encode. The sample values in the output are, again, the same bytes that existed in the input, just with the other channels’ bytes removed from between them.

Metadata rewrite

Sometimes the audio is fine and the metadata is wrong. Track names that don’t match the actual content. A scene/take field left over from the previous setup. A timecode reference that’s off by a frame because the recorder’s word clock drifted.

In a chunk-based format, metadata rewrite is the simplest operation of all. Parse the chunk table. Modify the bext or iXML payload. Recalculate the chunk size (it might change if the XML is shorter or longer, and an odd-sized payload needs a trailing pad byte). Write the file back: RIFF header with the new total size, then every chunk in order, with the modified metadata chunk carrying its updated payload.

The data chunk passes through untouched. You can pipe it from input to output without even reading it into memory. Just copy the bytes.

For small metadata corrections, you can sometimes operate in-place. If the new iXML payload is the same length as the old one, or shorter (padded with nulls), you can seek to the chunk’s payload offset and overwrite it directly. This is why many production recorders write a JUNK chunk after iXML: the padding reserves space for later metadata edits without rewriting the entire file.
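A sketch of the in-place case, assuming a seekable stream opened for reading and writing and a plain RIFF file under 4 GB. The function name and the pad parameter are illustrative. The default pad is a null byte to match the description above; for an XML chunk, passing pad=b" " keeps the payload well-formed, since trailing whitespace is legal XML and NUL is not.

```python
import struct

def overwrite_payload_in_place(f, chunk_id: bytes, new_payload: bytes,
                               pad: bytes = b"\x00") -> bool:
    """Overwrite the first chunk named chunk_id if new_payload fits.

    Returns False (and leaves the file untouched) when the chunk is missing
    or the new payload is longer than the existing one.
    """
    f.seek(0)
    riff_id, riff_size, form = struct.unpack("<4sI4s", f.read(12))
    if riff_id != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos, end = 12, 8 + riff_size
    while pos + 8 <= end:
        f.seek(pos)
        cid, size = struct.unpack("<4sI", f.read(8))
        if cid == chunk_id:
            if len(new_payload) > size:
                return False                    # needs a full rewrite instead
            f.seek(pos + 8)
            # Chunk size stays the same; the unused tail is filled with padding.
            f.write(new_payload + pad * (size - len(new_payload)))
            return True
        pos += 8 + size + (size & 1)
    return False
```

Because no chunk size changes, nothing after the edited payload moves, and the multi-gigabyte data chunk is never read or written.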

Why this matters

The alternative to all of this is to treat audio files as opaque blobs that only a DAW can open. Need to trim? Launch Pro Tools. Need a mono extraction? Launch Pro Tools. Need to fix a track name? Launch Pro Tools.

For a facility processing hundreds of files per day, that workflow does not scale. A DAW is an editing environment, not a file manipulation tool. Using it for container surgery is like opening Photoshop to rename a JPEG.

The chunk structure of RIFF/BWF is simple enough that a purpose-built tool can perform these operations in seconds, streaming data from input to output without holding the entire file in memory. A 3.2 GB poly trim allocates a few kilobytes for the chunk table and a read buffer. The operation is I/O-bound, not CPU-bound, because there is no processing. Just byte copying with arithmetic.

This is the kind of operation Spool handles: batch trim, channel extraction, metadata correction, format validation, all operating at the container level. Not because we wanted to avoid using a DAW, but because the operations themselves don’t require one. They never did.

Understanding the container format well enough to operate on it directly isn’t a shortcut. It’s the correct level of abstraction for the problem. A trim is a byte offset change. A channel extraction is a stride operation. A metadata fix is a chunk rewrite. When you see them that way, the 30-second DAW bounce for a two-second trim starts to look like the workaround, not the other way around.