Skip to content

rameauv/jpeg-codec

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A JPEG Baseline Sequential codec written according to the ITU-T 81 spec

  • Why this project?
    • I wanted to understand how JPEG works, especially the maths behind it, and I think writing a JPEG codec is a good way to do so.
    • I wanted to learn RUST, and this kind of project seemed to be a nice way to do so.

This codec uses a by-the-book implementation of the DCT; a production-ready implementation would use the AAN algorithm instead.

What I learned/Memo for future me

This is not an exhaustive explanation about the JPEG standard; the ITU-T 81 already does this well, but rather an explanation about the things I did not understand, mostly the maths that powers it.

First, what is JPEG? JPEG stands for Joint Photographic Experts Group. The "joint" in JPEG refers to a collaboration between CCITT (now ITU-T) and ISO, and the goal was to work toward establishing the first international standard for compression of continuous-tone (Smooth, unbroken gradations of color and shade) still images. The standard contains 4 modes of operation:

  • Sequential encoding: Each image component is encoded in a single top-to-bottom, left-to-right scan.
  • Progressive encoding: The image is encoded in multiple scans for applications in which transmission time is long, and the viewer prefers to watch the image build up in multiple coarse-to-clear passes.
  • Lossless encoding 🤯 : the image is encoded to guarantee exact recovery of every source image sample value (even though the result is low compression compared to the lossy modes).
  • Hierarchical encoding: The image is encoded at multiple resolutions, so that lower-resolution versions may be accessed without first having to decompress the image at its full resolution.

Since the project is about the baseline sequential mode of operation, I will mostly focus on this mode.

How does the baseline sequential DCT mode of operation work?

The baseline sequential DCT mode of operation is not tied to a specific color space and can accept up to 255 components.

As part of the standard, it is possible to define a subsampling scheme for each component and take advantage of the fact that the human eye is more sensitive to brightness than to colors by reducing the resolution of the chrominance channels when the components are in the YCbCr color space, for example.

Then each component is divided into 8x8 blocks. If the size of a component is not a multiple of 8, the component is padded with zeros or by duplicating the last row or column.

Then each block is passed through a 2D Forward Discrete Cosine Transform II (FDCT II) with a frequency that goes from 0 to 7 in both the horizontal and vertical directions, allowing the 8x8 block to be represented by a combination of 64 basis functions, each with an influence on the block represented by the coefficients resulting from the 2D FDCT II.

The 1D FDCT is defined as follows:

$$X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}(n+\frac{1}{2})k\right] \hspace{10mm} \text{for } k = 0, \dots N - 1$$

k: The frequency of the basis function. N: The number of points

Let's decompose the formula:

  • cos: Why a cosine function? We use a cosine function because it can represent well the patterns that can be found in real photos, like solid colors with k=0 (a patch of blue sky), gradients with k=1 because the cosine function starts and ends at its maximum value when doing a full rotation. This allows for compacting most of the information in the low frequencies and discarding the high frequencies at the quantization step.

  • pi/N(n+0.5): The goal of this part is to map each point between 0 and pi, but why do we need the +0.5 shift? The reason is that we want the 8 samples to be vertically centered on the cosine wave, so we want the sum of the 8 samples to be equal to 0 for all frequencies (k) except k=0. Because we have a vertical symmetry when k is even (because we do full rotations), if we just stared at 0, we would not be able to have a horizontal symmetry. We would have 2 points at the positive max at the extremities, but no points at the negative max, and the sum of the points would be positive and not 0, meaning that the points would be off-center. By shifting by half the angle of a sample, you ensure that no sample sits directly on the peak and that the samples are nicely centered.

Figure_1
  • k: This is the frequency. Because the first part of the formula goes roughly from 0 to pi, a frequency of 1 corresponds to half a rotation, and a frequency of 2 corresponds to a full rotation. A frequency of 0 is just a constant value of 1 (cos(0) = 1)

From 1D to 2D FDCT II

The 2D FDCT II is a separable transform: we can compute it by applying the 1D FDCT II first to each row of the 8x8 block, then applying the 1D FDCT II to each column of the intermediate result, for each result from the first dimension, if the result of the second dimension is low the final result will be also low and if the result from the second dimention is high then the result from the first dimention will be kept. For an 8x8 input block $S_{yx}$ where $y,x \in {0,\dots,7}$, and the 2D FDCT II coefficient $S_{uv}$, this gives us the following formula:

1D FDCT II:

$$X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}(n+\frac{1}{2})k\right] \hspace{10mm} \text{for } k = 0, \dots N - 1$$

2D FDCT II:

$$S_{uv} = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} S_{yx} \cos\left[\frac{\pi}{N}\left(x+\frac{1}{2}\right)u\right] \cos\left[\frac{\pi}{N}\left(y+\frac{1}{2}\right)v\right] \hspace{10mm} \text{for } u, v = 0, \dots N - 1$$

If we simplify it by plugging in N, we get:

$$S_{uv} = \sum_{x=0}^{7} \sum_{y=0}^{7} S_{yx} \cos\left[\frac{\pi u (2x+1)}{16}\right] \cos\left[\frac{\pi v (2y+1)}{16}\right]$$

Scaling

The FDCT II, as defined above, is orthogonal but not orthonormal. Orthogonality means the basis functions are perpendicular to each other (their inner product is zero), but orthonormality additionally requires each basis function to have a unit norm (length = 1).

A sample set of size N can be represented as an N-dimensional vector. For the 1D FDCT II with sequence size $N=8$, we would have 8 8D vectors, in this case, basis vectors.

Each 1D FDCT II basis vector $B_k$ for frequency $k = 0\dots7$ is: $$B_k = \left[\cos\left(\frac{\pi k (2\cdot0+1)}{16}\right), \cos\left(\frac{\pi k (2\cdot1+1)}{16}\right), \dots, \cos\left(\frac{\pi k (2\cdot7+1)}{16}\right)\right] \in \mathbb{R}^8$$

To ensure that the energy of the pixels is preserved (Parseval's theorem), we need an orthonormal transform. For that, we need to calculate the current norms of our vectors when k=0 and when k>0 (non-oscillating and oscillating).

Norm of a vector using the Pythagorean theorem

The norm of a vector is the generalization of the Pythagorean theorem to N dimensions.

For a 2D vector $[a, b]$, the norm is the hypotenuse of a right triangle:

To extend to 3D, we recursively apply the Pythagorean theorem:

This pattern continues recursively for any dimension N. For an 8D vector $[v_0, v_1, \dots, v_7]$, the norm is: $$|v| = \sqrt{v_0^2 + v_1^2 + \dots + v_7^2}$$

The scaling factor for each basis vector is simply the reciprocal of its norm: $s_k = 1 / |B_k|$.

Calculating scaling factors for 1D FDCT II

DC basis vector ($k=0$): All components are $\cos(0) = 1$: $$B_0 = \left[1, 1, 1, 1, 1, 1, 1, 1\right]$$ Applying the Pythagorean theorem: $$|B_0| = \sqrt{1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2} = \sqrt{8} = 2\sqrt{2}$$ Therefore, the scaling factor is: $$s_0 = \frac{1}{|B_0|} = \frac{1}{2\sqrt{2}} = \frac{\sqrt{2}}{4} \approx 0.3536$$

AC basis vector ($k>0$):

Below are the samples for $k=1$ (odd frequency) and $k=2$ (even frequency):

---
config:
  themeVariables:
    xyChart:
      plotColorPalette: '#FFA500, #0000FF'
---
xychart
    title "FDCT samples for k=1 and k=2"
    x-axis [0, 1, 2, 3, 4, 5, 6, 7]
    y-axis -1 --> 1
    line [0.9808, 0.8315, 0.5556, 0.1951, -0.1951, -0.5556, -0.8315, -0.9808]
    line [0.9239, 0.3827, -0.3827, -0.9239, -0.9239, -0.3827, 0.3827, 0.9239]
Loading

Legend:

  • $$\textcolor{#FFA500}{■}$$ $k=1$ (odd frequency)
  • $$\textcolor{#0000FF}{■}$$ $k=2$ (even frequency)

Case 1: $k=1$ (odd frequency): For $k=1$, the basis vector samples the cosine wave at 8 points (shown in the diagram above). Formally:

$$B_1 = \left[\cos(\pi/16), \cos(3\pi/16), \cos(5\pi/16), \cos(7\pi/16), \cos(9\pi/16), \cos(11\pi/16), \cos(13\pi/16), \cos(15\pi/16)\right]$$

Applying the Pythagorean theorem to its 8 samples:

$$ \begin{aligned} |B_1| &= \sqrt{(0.9808)^2 + (0.8315)^2 + (0.5556)^2 + (0.1951)^2 + (-0.1951)^2 + (-0.5556)^2 + (-0.8315)^2 + (-0.9808)^2} \\ &= \sqrt{0.9619 + 0.6914 + 0.3086 + 0.0380 + 0.0380 + 0.3086 + 0.6914 + 0.9619} \\ &= \sqrt{4.0000} = 2 \end{aligned} $$

Therefore, the scaling factor is:

$$s_1 = \frac{1}{|B_1|} = \frac{1}{2} = 0.5$$

Case 2: $k=2$ (even frequency): For $k=2$, the basis vector samples a cosine wave with 2 cycles across the 8 points (shown in the diagram above). Formally:

$$B_2 = \left[\cos(\pi\cdot2/16), \cos(\pi\cdot6/16), \cos(\pi\cdot10/16), \cos(\pi\cdot14/16), \cos(\pi\cdot18/16), \cos(\pi\cdot22/16), \cos(\pi\cdot26/16), \cos(\pi\cdot30/16)\right]$$ Using the periodicity and symmetry of cosine ($\cos(\theta + \pi) = -\cos(\theta)$), this simplifies to:

$$B_2 = \left[\cos(\pi/8), \cos(3\pi/8), \cos(5\pi/8), \cos(7\pi/8), -\cos(\pi/8), -\cos(3\pi/8), -\cos(5\pi/8), -\cos(7\pi/8)\right]$$

Applying the Pythagorean theorem to its 8 samples:

$$\begin{aligned} |B_2| &= \sqrt{(0.9239)^2 + (0.3827)^2 + (-0.3827)^2 + (-0.9239)^2 + (-0.9239)^2 + (-0.3827)^2 + (0.3827)^2 + (0.9239)^2} \\ &= \sqrt{0.8536 + 0.1464 + 0.1464 + 0.8536 + 0.8536 + 0.1464 + 0.1464 + 0.8536} \\ &= \sqrt{4.0000} = 2 \end{aligned}$$

Therefore, the scaling factor is:

$$s_2 = \frac{1}{|B_2|} = \frac{1}{2} = 0.5$$

This result holds for all $k>0$ due to the orthogonality properties of the FDCT. For any AC frequency $k = 1\dots7$, the norm is exactly 2, so $s_k = 1/2$ for all AC components.

2D FDCT II scaling factors

For 2D, each basis matrix is the outer product of two 1D basis vectors (need to dig more on this): $B_{u,v}(x,y) = B_u(x) \cdot B_v(y)$. This creates an 8×8 = 64-dimensional basis for the 2D space.

$$|B_{u,v}| = |B_u| \cdot |B_v|$$

Therefore, the 2D scaling factor is the product of the 1D scaling factors:

$$s_{u,v} = \frac{1}{|B_{u,v}|} = \frac{1}{|B_u| \cdot |B_v|} = \frac{1}{|B_u|} \cdot \frac{1}{|B_v|} = s_u \cdot s_v$$

Applying the 1D results:

Component $|B_u|$ $|B_v|$ $|B_{u,v}| = |B_u| \cdot |B_v|$ Scaling factor $s_{u,v} = s_u \cdot s_v$
DC $(0,0)$ $2\sqrt{2}$ $2\sqrt{2}$ $2\sqrt{2} \times 2\sqrt{2} = 8$ $(1/2\sqrt{2}) \cdot (1/2\sqrt{2}) = 1/8$
Edge $(0,v>0)$ or $(u>0,0)$ $2\sqrt{2}$ $2$ $2\sqrt{2} \times 2 = 4\sqrt{2}$ $(1/2\sqrt{2}) \cdot (1/2) = 1/(4\sqrt{2})$
AC $(u>0,v>0)$ $2$ $2$ $2 \times 2 = 4$ $(1/2) \cdot (1/2) = 1/4$

Deriving the 1/4 factor and C(u), C(v) from the scaling factors

From our earlier calculations, we have three distinct 2D scaling factors:

  • DC $(0,0)$: $s = 1/8$
  • Edge $(0,v>0)$ or $(u>0,0)$: $s = 1/(4\sqrt{2})$
  • AC $(u>0,v>0)$: $s = 1/4$

Notice that the AC case has the simplest scaling: $1/4$. This is the base scaling factor. It comes from the product of two 1D AC scaling factors:

$$s_{\text{AC-2D}} = s_{\text{AC-1D}} \times s_{\text{AC-1D}} = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$

The DC case needs an additional factor of $1/\sqrt{2}$ in each dimension (because $$s_0 = 1/(2\sqrt{2}) = (1/2) \times (1/\sqrt{2})$$), and the Edge case needs it in one dimension.

Therefore, we can factor out the $1/4$ and absorb the DC-specific corrections into $C(u)$ and $C(v)$:

Component Target scaling $1/4 \times$ Required $C(u) C(v)$
DC $(0,0)$ $1/8$ $1/4$ $1/2$
Edge $1/(4\sqrt{2})$ $1/4$ $1/\sqrt{2}$
AC $1/4$ $1/4$ $1$

Since $C(u) C(v) = 1/2$ when $u=0$ and $v=0$, and $C(u) C(v) = 1/\sqrt{2}$ when exactly one of $u$ or $v$ is 0, we get:

$$C(k) = \frac{1}{\sqrt{2}} \text{ for } k=0, \quad C(k) = 1 \text{ for } k>0$$

This gives:

  • DC: $(1/4) \times (1/\sqrt{2}) \times (1/\sqrt{2}) = (1/4) \times (1/2) = 1/8$
  • Edge: $(1/4) \times (1/\sqrt{2}) \times 1 = 1/(4\sqrt{2})$
  • AC: $(1/4) \times 1 \times 1 = 1/4$

The resulting scaled 2D FDCT II formula is then:

$$S_{vu} = \frac{1}{4} C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} S_{yx} \cos\left(\frac{\pi u (2x+1)}{16}\right) \cos\left(\frac{\pi v (2y+1)}{16}\right)$$

Which is the exact FDCT II formula you will find in the ITU-T 81 specification.

For reference, the JPEG inverse DCT uses the same scaling:

$$S_{yx} = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u) C(v) S_{vu} \cos\left(\frac{\pi u (2x+1)}{16}\right) \cos\left(\frac{\pi v (2y+1)}{16}\right)$$

The orthonormality of the FDCT II ensures that applying the forward DCT followed by the inverse DCT recovers the original input perfectly (within floating-point precision).


Quantization and Zigzag Scanning

Why quantize?

The FDCT coefficients represent the "importance" of each basis function in reconstructing the original block. The DC coefficient (top-left, $$F(0,0)$$) represents the average brightness of the 8×8 block — a single value. The low-frequency AC coefficients (near the top-left) represent broad gradients and large-scale patterns. The high-frequency AC coefficients (near the bottom-right) represent fine details, edges, and rapid changes.

Human vision is much more sensitive to low-frequency information (smooth gradients, large features) than to high-frequency information (fine texture, noise). Quantization exploits this by:

  1. Dividing each coefficient by a quantization value (and rounding to an integer)
  2. Using larger quantization values for high frequencies (more aggressive rounding = more compression)
  3. Using smaller quantization values for low frequencies (less rounding = better quality preservation)

The quantization process

For each 8×8 block of FDCT coefficients $F(u,v)$, quantization is performed element-wise: $$F_{\text{quantized}}(u,v) = \text{round}\left(\frac{F(u,v)}{Q(u,v)}\right)$$

where $Q(u,v)$ is the quantization value at position $(u,v)$ from a quantization table.

Quantization tables

JPEG defines two standard quantization tables for the YCbCr color space:

Luminance (Y) table — used for the brightness component, with smaller values (less compression):

16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99

Chrominance (Cb, Cr) table — used for color components, with larger values (more compression, since human vision is less sensitive to color details):

17  18  24  47  99  99  99  99
18  21  26  66  99  99  99  99
24  26  56  99  99  99  99  99
47  66  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99

These tables were designed empirically based on psychovisual experiments. Larger values mean more quantization (more zeros, more compression).

Adjusting quality with the quantization tables

The quantization tables can be scaled by a quality factor (typically 1-100) to control the trade-off between file size and image quality.

For a given quality value $Q$ (1 = worst, 100 = best), the quantization table entries are scaled by a factor $S$:

$$S(Q) = \begin{cases} \frac{5000}{Q} & \text{for } 1 \leq Q \leq 50 \\ 200 - 2Q & \text{for } 51 \leq Q \leq 100 \end{cases}$$

The scaled quantization value at each position is: $$Q_{\text{scaled}}(u,v) = \text{round}\left(Q_{\text{base}}(u,v) \times \frac{S(Q)}{100}\right)$$

Why the piecewise formula?

The scaling factor formula is piecewise because it was designed empirically by the Independent JPEG Group (IJG) to provide perceptually uniform quality steps across the 1-100 range. The relationship between quality $Q$ and scaling factor $S$ is:

---
config:
  themeVariables:
    xyChart:
      plotColorPalette: '#FFA500, #0000FF'
      showDataLabel: true
---
xychart-beta
    title "JPEG Quality Scaling: S(Q)"
    x-axis [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
    y-axis "Scaling Factor" 0 --> 5000
    line [5000, 1000, 500, 333.33, 250, 200, 166.67, 142.86, 125, 111.11, 100, 90.91, 83.33, 76.92, 71.43, 66.67, 62.50, 58.82, 55.56, 52.63, 50]
    line [198, 190, 180, 170, 160, 150, 140, 130, 120, 110, 100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
Loading

Legend:

  • $$\textcolor{#FFA500}{■}$$ Hyperbolic scaling $S = 5000/Q$ for $1 \leq Q \leq 50$
  • $$\textcolor{#0000FF}{■}$$ Linear scaling $S = 200 - 2Q$ for $51 \leq Q \leq 100$

The role of zigzag scanning

After quantization, most high-frequency coefficients become zero (especially at higher compression levels). The 8×8 block of quantized coefficients needs to be encoded efficiently for storage.

Zigzag scanning reorders the 2D block into a 1D sequence by following a diagonal pattern. This ordering groups low-frequency coefficients (which are typically non-zero) at the beginning and high-frequency coefficients (which are typically zero) at the end, creating long runs of zeros that compress efficiently.

image

This ordering groups low-frequency coefficients (which are typically non-zero) at the beginning of the sequence and high-frequency coefficients (which are typically zero after quantization) at the end.

Entropy Coding

After being reordered in zig-zag, the quantized coefficients are entropy encoded using Huffman encoding to save space. Directly encoding each coefficient value would require a huge Huffman table with thousands of entries. Instead, JPEG uses a category-based encoding approach: coefficients are first mapped to a category (based on their magnitude, the number of bits required to represent the number), and then the Huffman table encodes the category. The actual value is then read from the bitstream by reading a category number of bits. For AC coefficients, this takes the form of a (0 runlength, category) packed byte; for DC coefficients, it's just the category itself.

DC Coefficient Encoding The DC coefficient of each 8×8 block represents its average brightness. Since adjacent blocks typically have similar brightness, JPEG uses differential encoding:

  • For the first block in the image: encode the absolute DC value directly

  • For all subsequent blocks: encode the difference from the previous block's DC

    $DC_{\text{diff}} = DC_{\text{current}} - DC_{\text{previous}}$

The DC differences are typically small integers (often 0 or ±1), making them highly compressible. A dedicated DC Huffman table optimized for this distribution is used to encode them.

AC Coefficient Encoding After zigzag scanning, the AC coefficients typically contain long runs of zeros (especially at higher frequencies). JPEG uses run-length encoding (RLE) to exploit this sparsity:

  • Each non-zero coefficient is paired with the count of consecutive zeros preceding it in the zigzag order
  • The pair format is: $(runlength, value)$

These RLE pairs are then entropy-coded using a dedicated AC Huffman table optimized for the typical distribution of (runlength, category) pairs, which feature many long zero runs and small non-zero values.


JPEG Interchange Format

The interchange format is the coded representation of compressed image data for exchange between application environments.

image image

File structure: markers and segments

It is a sequence of segments, each beginning with a marker. Markers are special 2-byte codes that identify the type of segment that follows. All markers start with 0xFF, followed by a unique byte. It contains the metadata of the image, the quantization tables, and the Huffman tables. The details for each segment and the interchange format can be found in the ITU T.81 standard.

Marker Name Description
0xFFD8 SOI Start of Image — first marker in every JPEG file
0xFFE0 APP0 Application marker 0 — contains JFIF identifier and version
0xFFC0 SOF0 Start of Frame (Baseline FDCT) — image dimensions, components
0xFFC4 DHT Define Huffman Table — Huffman coding tables
0xFFDB DQT Define Quantization Table — quantization tables
0xFFDA SOS Start of Scan — begins compressed image data
0xFFD9 EOI End of Image — last marker in every JPEG file

Negative value encoding in entropy-coded segments

In the JPEG Interchange Format, negative values in the entropy-coded segments are encoded using a specific method that is different from the regular 2's complement representation. The process involves adjusting the negative value to fit within the bit representation scheme used for positive values, thereby allowing variable-length encoding of both positive and negative values. The value is adjusted by flipping the bits of the absolute value of the negative number with a mask of size t, so Abs(V) XOR (2^t - 1) or the formula V + (2^t - 1), where V is the negative value and t is the category (number of bits required to represent the value). Because the length can vary, positive numbers always start with an MSB (most significant bit) of 1. By flipping the bits this way for negative numbers, we ensure that the MSB is always 0 for negative numbers and that we can differentiate positive from negative numbers by looking at the MSB.

example: For V = -123 Abs(V) = 123 (1111011) t = 7 (7bits) 2^t = 1 << t = 128 (10000000) 2^t - 1 = 127 (1111111) V + (2^t - 1) = -123 + 127 = 127 - 123(1111111 - 1111011) = 4 (0000100) The result is a 7-bit sequence that starts with an MSB of 0.

which is the same as:

    01111011  (Positive 123 as 8-bit int)
XOR 01111111  (Mask of size t=7)
=   00000100  (Final JPEG Suffix as 8-bit int)
     0000100  (Final JPEG Suffix)

To recover the original value, we subtract (2^t - 1) from the decoded value.

For V = 4 and t = 7 (remember that t(the category) is what is being entropy encoded) 2^t = 128 2^t - 1 = 127 V - (2^t - 1) = 4 - 127 = -123


Decoding

The decoding process is essentially just the reverse of the encoding process, with the Inverse Discrete Cosine Transform (IDCT) replacing the FDCT.

Inverse DCT (IDCT):

$$f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u) C(v) F(u,v) \cos\left(\frac{(2x+1)u\pi}{16}\right) \cos\left(\frac{(2y+1)v\pi}{16}\right)$$

where $C(k) = 1/\sqrt{2}$ for $k=0$, else $1$ — the same normalization as the forward DCT.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages