WAV, BWF, AIFF, PCM: what's actually inside an audio file

PCM is the encoding. WAV and AIFF are containers. BWF is WAV with professional metadata. It's not complicated, but nobody draws it clearly.

You record production sound at 48 kHz, 24-bit. The file lands on a drive as a .wav. Your metadata tool calls it a BWF. The delivery spec says “uncompressed PCM.” A colleague asks if AIFF would work instead.

These terms get used interchangeably, and they shouldn’t be. They describe different things. Once you sort out which is which, the confusion disappears and you can make informed decisions about format, bit depth, sample rate, and file size.

PCM is the encoding, not a format

Pulse Code Modulation is a method of representing sound as numbers. That’s all it is. Take an analogue audio signal, measure its amplitude at regular intervals, write down each measurement as a binary number. The result is a stream of values that, played back in order at the original rate, reconstructs the signal.

PCM was invented by Alec Reeves at a Paris laboratory in 1937, originally for telephone systems. The idea is elegant: if you sample fast enough and with enough precision, the numbers ARE the sound. No compression, no transform, no information discarded.

A raw PCM stream is just numbers. No header, no metadata, no indication of how to interpret them. You need three pieces of information before you can play it back:

  1. Sample rate. How many measurements per second. 48,000 is common.
  2. Bit depth. How many bits per measurement. 24 is common.
  3. Channel count. How many interleaved streams. Mono is 1, stereo is 2, a production sound poly file might be 8 or 24.

Without all three, the numbers are meaningless. Open a raw PCM file in a hex editor and you’ll see bytes. You won’t hear audio until something tells you the shape of the data.
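A quick sketch of the point, using Python's standard `struct` module: the same eight bytes decode to completely different audio depending on what you assume about them. The byte values here are made up for illustration.

```python
import struct

# Eight raw bytes. Without external knowledge, they're just bytes.
raw = bytes([0xE8, 0x03, 0x18, 0xFC, 0x00, 0x00, 0xFF, 0x7F])

# Interpreted as 16-bit little-endian mono PCM: four samples.
as_16bit_mono = struct.unpack("<4h", raw)
print(as_16bit_mono)   # (1000, -1000, 0, 32767)

# The same bytes as 32-bit little-endian PCM: two samples, entirely different audio.
as_32bit = struct.unpack("<2i", raw)
print(as_32bit)
```

Nothing in the bytes themselves says which interpretation is right. That decision has to come from somewhere else.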

This is exactly the problem that containers solve.

WAV wraps PCM in a structure

WAV (Waveform Audio File Format) is a container. Microsoft and IBM designed it in 1991 as part of RIFF, the Resource Interchange File Format. The word “WAV” is short for “waveform.” The three-letter extension comes from the 8.3 filename limit in DOS.

The entire structure is built from chunks. Every chunk follows the same pattern:

[4-byte ID] [4-byte size] [payload...]

A chunk says what it is and how big it is. A parser walks through the file reading chunk headers, either processing the payload or skipping ahead to the next chunk. The outermost chunk is RIFF with a form type of “WAVE.” Inside it, a minimal WAV file has two chunks:

fmt (four bytes, with a trailing space) carries the format metadata: sample rate, bit depth, channel count, byte rate, block alignment. Everything a decoder needs to interpret the PCM data. Its payload is tiny, usually 16 or 18 bytes.

data carries the actual PCM samples, interleaved channel by channel within each sample frame. In a long recording, this chunk is nearly the entire file. No compression, no indexing within it. Just numbers in order.

That’s the whole format. Two chunks, one small and one large, inside a RIFF envelope. Everything else in the WAV ecosystem is optional chunks layered on top of this base.
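The chunk walk is simple enough to sketch in a few lines of Python. This is a sketch, not a production parser: it assumes a well-formed file and handles only the word-alignment pad byte. The in-memory file it walks is a minimal 48 kHz, 24-bit, stereo WAV built from the two chunks described above.

```python
import io
import struct

def walk_wav_chunks(f):
    """Walk the chunks of a RIFF/WAVE file, yielding (chunk_id, size, payload)."""
    riff, _total, form = struct.unpack("<4sI4s", f.read(12))
    assert riff == b"RIFF" and form == b"WAVE"
    while True:
        header = f.read(8)
        if len(header) < 8:
            break
        chunk_id, size = struct.unpack("<4sI", header)
        payload = f.read(size)
        if size % 2:           # chunks are word-aligned; skip the pad byte
            f.read(1)
        yield chunk_id, size, payload

# Build a minimal 48 kHz / 24-bit / stereo PCM WAV in memory.
fmt = struct.pack("<HHIIHH", 1, 2, 48000, 48000 * 2 * 3, 2 * 3, 24)
data = b"\x00" * 12            # two sample frames of silence
body = (b"WAVE"
        + b"fmt " + struct.pack("<I", len(fmt)) + fmt
        + b"data" + struct.pack("<I", len(data)) + data)
wav = b"RIFF" + struct.pack("<I", len(body)) + body

for cid, size, _ in walk_wav_chunks(io.BytesIO(wav)):
    print(cid.decode(), size)
```

Note that every size and numeric field is packed little-endian (`<` in the format strings), which is the next point.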

One important detail: WAV is little-endian throughout. Byte order was determined by the Intel x86 architecture that RIFF was built for. This will matter when we get to AIFF.

BWF is WAV with professional metadata

Broadcast Wave Format is not a different format from WAV. It IS a WAV file. The only structural difference is the addition of a bext chunk, defined by the European Broadcasting Union in EBU Tech 3285.

The bext chunk carries the metadata that broadcast and film production need:

  • Time reference: a 64-bit sample count from midnight that gives the file its position on the timeline. At 48 kHz, midnight is sample zero. If time_reference is 172,800,000, the file starts at exactly 01:00:00:00 timecode.
  • Originator: which recorder or system created the file.
  • Origination date and time: when it was recorded.
  • UMID: a 64-byte Unique Material Identifier per SMPTE 330M, intended to make every recording globally unique.
  • Loudness values: integrated loudness, loudness range, maximum true peak, maximum momentary, maximum short-term. All per EBU R 128.
  • Coding history: a free-text field documenting the encode/decode chain the file has passed through.

Production BWF files typically also carry an iXML chunk with scene/take metadata, track names, and channel assignments. While iXML is not part of the BWF specification itself, it’s nearly universal in professional production sound. The combination of bext and iXML is what makes a production sound file self-describing: not just audio, but context.

The critical design decision: BWF is backward-compatible. Any tool that reads WAV can open a BWF file. The bext chunk is simply ignored by tools that don’t understand it. The audio plays fine. You lose the metadata, but you don’t lose access to the recording.
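The time-reference arithmetic from the list above is worth making concrete. A hedged sketch (the function name is mine, and it assumes a non-drop-frame timecode at a stated frame rate):

```python
def timecode_from_time_reference(time_reference, sample_rate, fps=25):
    """Convert a bext time_reference (samples since midnight) to HH:MM:SS:FF.

    Assumes non-drop-frame timecode; fps=25 is an arbitrary example default.
    """
    seconds, remainder = divmod(time_reference, sample_rate)
    frames = remainder * fps // sample_rate
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}:{frames:02d}"

# The example from the text: 172,800,000 samples at 48 kHz is one hour.
print(timecode_from_time_reference(172_800_000, 48_000))  # 01:00:00:00
```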

Format relationships
PCM is the encoding. Everything else is a container.

  • PCM encoding: the audio data itself. A stream of sample values (amplitude at each time step) with no header, no metadata, no structure. It requires external knowledge of sample rate, bit depth, and channels, and it is the raw material inside every uncompressed audio file.
  • WAV: RIFF container. Chunks: fmt, data.
  • BWF: WAV plus metadata. Chunks: fmt, bext, iXML, data.
  • RF64 / BW64: WAV with 64-bit sizes. Chunks: ds64, fmt, bext, data.
  • AIFF: IFF container, big-endian (Motorola heritage). Chunks: COMM, SSND.

AIFF: Apple’s container

Audio Interchange File Format was designed by Apple Computer in 1988, based on Electronic Arts’ IFF (Interchange File Format) from the Commodore Amiga. It solves exactly the same problem as WAV: wrapping PCM audio in a structured container with format metadata.

The structure is nearly identical in concept. Chunks with IDs and sizes. An outer “FORM” container with type “AIFF.” Two essential inner chunks:

COMM (Common) carries the format metadata: channel count, number of sample frames, sample size (bit depth), and sample rate stored as an 80-bit extended precision floating-point number. This last detail is pure Apple eccentricity. An 80-bit float for the sample rate when an integer would do fine.

SSND (Sound Data) carries the PCM samples.
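That 80-bit sample rate can be decoded by hand: one sign bit, a 15-bit exponent biased by 16,383, and a 64-bit mantissa with an explicit integer bit. A sketch (the helper name is mine; this ignores infinities and NaNs, which never appear as sample rates):

```python
import struct

def decode_extended80(b):
    """Decode the 80-bit extended float AIFF uses for the COMM sample rate."""
    sign_exp, mantissa = struct.unpack(">HQ", b)   # big-endian, like AIFF itself
    sign = -1 if sign_exp & 0x8000 else 1
    exponent = sign_exp & 0x7FFF
    # Explicit integer bit: value = mantissa * 2^(exponent - 16383 - 63)
    return sign * mantissa * 2.0 ** (exponent - 16383 - 63)

# 48,000 Hz as it appears on disk in a COMM chunk:
print(decode_extended80(bytes.fromhex("400EBB80000000000000")))  # 48000.0
```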

Why AIFF exists at all

Byte order. In the 1980s, Apple’s Macintosh and Commodore’s Amiga used Motorola 68000 processors, which store the most significant byte first (big-endian). IBM PCs used Intel x86 processors, which store the least significant byte first (little-endian). RIFF/WAV stores everything little-endian because it was built for Intel. IFF/AIFF stores everything big-endian because it was built for Motorola.

Same audio. Same numbers. Different order of bytes within each number. A 16-bit sample with value 1000 is stored as E8 03 in WAV and 03 E8 in AIFF.
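The two byte orders are a one-character difference in Python's `struct` notation, which makes the example above easy to verify:

```python
import struct

sample = 1000
print(struct.pack("<h", sample).hex(" "))   # e8 03  (WAV: little-endian)
print(struct.pack(">h", sample).hex(" "))   # 03 e8  (AIFF: big-endian)
```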

Today, every modern processor handles both byte orders efficiently, so the distinction is academic. But the formats persist. If you work primarily in Apple environments with Logic Pro or other macOS tools, you’ll encounter AIFF. In broadcast and film, WAV and BWF dominate.

There’s also AIFF-C (or AIFC), a variant that supports compressed audio in the container. In practice, it’s rarely used for compression. Some tools use AIFF-C with compression type “none” to store little-endian PCM in an AIFF container (internally called “sowt,” which is “twos” backwards). It’s a historical footnote more than a practical concern.

Bit depth: how many levels per sample

Each PCM sample is a measurement of amplitude, stored as a binary number. The bit depth determines how many possible values that number can take, which directly determines the dynamic range of the recording.

The rule: every additional bit doubles the number of possible values, adding approximately 6.02 dB of dynamic range. Precisely: 20 * log10(2) = 6.0206 dB.

16-bit gives you 65,536 possible values and roughly 96 dB of dynamic range. This is the CD standard (Red Book, 1980, Sony and Philips). 96 dB is enough for consumer playback when combined with dithering, but it leaves limited headroom for processing.

24-bit gives you 16,777,216 possible values and roughly 144 dB of dynamic range. This is the professional recording standard. For context, 144 dB exceeds the dynamic range of any microphone or analogue preamp ever built. The best front-end electronics top out around 130 dB. You have more precision in the numbers than in the physics of the recording chain.

32-bit integer gives you about 4.3 billion values and 192 dB. Rarely used in practice because 32-bit float is better in every way that matters.
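The figures above all fall out of the one formula. A quick check:

```python
import math

def dynamic_range_db(bits):
    """Dynamic range of an n-bit integer format: 20 * log10(2^n)."""
    return 20 * math.log10(2 ** bits)

for bits in (16, 24, 32):
    print(f"{bits}-bit: {2**bits:>13,} values, {dynamic_range_db(bits):.1f} dB")
```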

Why 32-bit float is different from 32-bit integer

This is the important one. Integer formats have a fixed scale. The maximum value is the maximum value. If the signal exceeds it, the waveform is truncated. The peaks are gone. This is clipping, and it’s permanent.

32-bit float uses IEEE 754 single-precision floating-point representation. Instead of a fixed grid of values, each sample is stored as a mantissa (about 24 bits of precision, giving roughly 144 dB of instantaneous dynamic range) and an exponent that scales the value. Numbers above 1.0 are perfectly valid. Numbers above 10.0 are perfectly valid. The representable range extends to approximately 3.4 * 10^38.

In practice, this means 32-bit float audio cannot clip during recording. If the talent suddenly yells and the signal goes 40 dB above what your gain was set for, the numbers just get bigger. Pull the fader down in post and the audio is undistorted. No information was lost because the format had room.
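The difference can be sketched with plain Python numbers (a toy model, not a recorder: signal values are normalized so 1.0 is integer full scale):

```python
INT24_MAX = 2**23 - 1

def record_int24(signal):
    # Fixed grid: anything past full scale is truncated. Information is gone.
    return [max(-INT24_MAX - 1, min(INT24_MAX, int(s * INT24_MAX))) for s in signal]

def record_float(signal):
    # Floats above 1.0 are perfectly valid; the numbers just get bigger.
    return [float(s) for s in signal]

hot = [0.5, 1.8, 3.2, 0.9]            # a take that goes well past full scale
clipped = [s / INT24_MAX for s in record_int24(hot)]
rescued = [s / 3.2 for s in record_float(hot)]   # pull the "fader" down in post
print(clipped)   # peaks flattened at 1.0
print(rescued)   # original waveform shape, undistorted
```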

This is why modern field recorders like the Sound Devices MixPre series and Zoom F-series default to 32-bit float. The trade-off is file size: 4 bytes per sample instead of 3 for 24-bit, a 33% increase. For production sound where a clipped take means a reshoot, the extra storage is trivially worth it.

Bit depth comparison
Each bit adds ~6.02 dB of dynamic range. 32-bit float is a different animal.

  • 16-bit: 65,536 possible values, 96.3 dB of dynamic range. Clips above 0 dBFS.
  • 24-bit: 16,777,216 possible values (2^24), 144.5 dB (24 bits * 6.02 dB/bit). Clips above 0 dBFS; signal above the maximum value is lost forever. The professional recording standard: it exceeds the dynamic range of any microphone or preamp (~130 dB).
  • 32-bit integer: ~4.3 billion possible values, 192.6 dB. Clips above 0 dBFS.
  • 32-bit float: representable range equivalent to roughly 1528 dB. Does not clip above 0 dBFS.

Sample rate: how often you measure

The Nyquist-Shannon sampling theorem says that to accurately reconstruct a signal, you must sample at more than twice its highest frequency. Human hearing reaches about 20 kHz, so a sample rate above 40 kHz captures everything we can hear.

The two standard rates split along industry lines, and that split is still with us.

44.1 kHz is the CD standard, and the reason for the specific number is an engineering constraint from 1979. Early digital audio recordings were stored on U-matic video tape. NTSC video has 245 usable lines per field, 60 fields per second. Three samples per line gives 245 * 3 * 60 = 44,100. Sony and Philips adopted this rate for the Compact Disc in 1982, and the music industry followed.

48 kHz was adopted by the EBU and SMPTE for broadcast television and film. It’s a rounder number with slightly more headroom above 20 kHz for anti-aliasing filter design. Every professional video format uses 48 kHz audio: DV, HDV, HDCAM, XDCAM, every broadcast codec. If you’re working in film or television, your audio is 48 kHz.

96 kHz and above are used in some music production and archival workflows. Whether the additional bandwidth (capturing frequencies up to 48 kHz) provides audible benefit is debated. The practical benefits are better anti-aliasing filter performance and more headroom for processing operations that might create harmonics above 20 kHz.

Why 44.1 and 48 persist

The music industry standardised on 44.1 kHz. Broadcast standardised on 48 kHz. Neither was willing to adopt the other’s rate. This split causes sample-rate conversion every time music meets broadcast, which happens constantly: every song in a TV show, every music video, every advertisement. The conversion is routine and nearly transparent with modern algorithms, but it’s a computational step that exists because two industries made different choices forty years ago and neither blinked.

When files get too big: RF64 and BW64

WAV has a size limit. The RIFF header stores the total file size as a 32-bit unsigned integer: maximum 4,294,967,295 bytes, just under 4 GB. For a 48 kHz, 24-bit, 8-channel recording, you hit this limit in about an hour. For 96 kHz multichannel work, minutes.
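The arithmetic behind "about an hour" is worth having on hand (the function name is mine):

```python
# How long until a recording hits the 4 GB RIFF ceiling?
RIFF_LIMIT = 2**32 - 1          # max value of a 32-bit unsigned size field

def minutes_to_limit(sample_rate, bit_depth, channels):
    bytes_per_second = sample_rate * (bit_depth // 8) * channels
    return RIFF_LIMIT / bytes_per_second / 60

print(f"{minutes_to_limit(48_000, 24, 8):.0f} min")    # 8-channel poly file: about an hour
print(f"{minutes_to_limit(96_000, 32, 24):.0f} min")   # high-rate multichannel: minutes
```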

RF64, defined by the EBU in Tech 3306 (2007), solves this by adding a ds64 chunk containing 64-bit versions of the file size, data chunk size, and sample count. The original 32-bit fields are set to 0xFFFFFFFF as a flag meaning “look in ds64 for the real value.” Everything else works the same.

BW64 extends RF64 further with support for ADM (Audio Definition Model) metadata, used in immersive audio formats like Dolby Atmos and MPEG-H. The outer chunk ID changes from “RIFF” to “BW64” and additional chunks carry object-based audio descriptions.

Professional tools handle RF64 well. Consumer tools often don’t. If you’re working with long multichannel recordings and your files are bouncing off the 4 GB wall, check whether your downstream tools support RF64 before assuming they do.

The relationship, clearly

Here it is, the thing nobody draws clearly enough:

PCM is the encoding. It’s a method of representing sound as numbers. It’s not a file format. It has no header, no metadata, no structure. Just a stream of amplitude values.

WAV is a container that wraps PCM audio inside a RIFF structure with format metadata (sample rate, bit depth, channels). It’s the Windows/Intel-heritage container.

AIFF is a container that wraps the same PCM audio inside an IFF structure with format metadata. It’s the Apple/Motorola-heritage container. Same audio, different byte order.

BWF is a WAV file with an additional bext chunk carrying broadcast metadata: timecode, originator, date, loudness measurements. It’s not a different format. It’s WAV with extra information.

RF64/BW64 are WAV extensions that break the 4 GB size limit with 64-bit size fields.

These are not competing formats. They’re layers. PCM is what the audio is. The containers are how you package it. BWF is the professional packaging. The audio inside all of them is the same: numbers, in order, representing sound.

When someone says “send me a WAV,” they mean “send me uncompressed audio in a container I can open.” When the delivery spec says “BWF, 48/24, timecoded,” it means the same audio in the same container, with the bext chunk filled in so the file knows where it belongs on the timeline.

Once you see it as layers rather than alternatives, the container becomes something you can inspect and modify rather than something opaque. And that changes how you work.