feat: unified CCITT decoder with G3 2D, byte alignment, and resource limits #26

Draft
lilith wants to merge 2 commits into pdf-rs:master from lilith:feat/unified-decoder

@lilith lilith commented Apr 9, 2026

Draft PR addressing several open issues and feature gaps.

New: unified Decoder struct

A single Decoder, configured via DecodeOptions, removes the need to choose between Group3Decoder and Group4Decoder at the type level. Its options map directly to the PDF CCITTFaxDecode parameters and the TIFF Group3Options/Group4Options tags.

let opts = DecodeOptions {
    columns: 1728,
    rows: Some(2000),
    encoding: EncodingMode::Group3_2D { k: 4 },
    rows_are_byte_aligned: true,
    end_of_line: true,
    end_of_block: false,
    black_is_1: false,
    msb_first: true,
    limits: Some(Limits {
        max_pixels: Some(100_000_000),
        max_input_bytes: Some(10 * 1024 * 1024),
    }),
};
let columns = opts.columns; // read before `opts` is moved into decode()
decode(data.iter().copied(), opts, |transitions| {
    // transitions are &[u32], so widths beyond u16::MAX are supported
    for color in pels32(transitions, columns) {
        // process pixel
    }
})?;

The existing decode_g3, decode_g4, Group3Decoder, and Group4Decoder are unchanged — fully backwards compatible.

New features

u32 transitions (partially addresses #13)

The unified decoder uses u32 internally for transitions and width, enabling images wider than 65535 pixels. For safety, the default Limits cap dimensions at the u16 range; callers opt in to larger images by raising the limits. The legacy API remains u16.
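
As a rough illustration of what a u32 transition list represents, the sketch below expands one line's transitions into per-pixel colors. It assumes each line starts white and each transition position flips the color; `Color` and `expand_transitions` are hypothetical names for this example, not the crate's `pels32`.

```rust
/// Hypothetical stand-in for the crate's pixel color type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Color {
    White,
    Black,
}

/// Expand one line's transition positions into one color per pixel.
/// Assumption: the line starts white and every transition flips the color.
fn expand_transitions(transitions: &[u32], width: u32) -> Vec<Color> {
    let mut out = Vec::with_capacity(width as usize);
    let mut color = Color::White;
    let mut next = transitions.iter().copied().peekable();
    for x in 0..width {
        // Flip once for every transition located at this column.
        while next.peek() == Some(&x) {
            color = match color {
                Color::White => Color::Black,
                Color::Black => Color::White,
            };
            next.next();
        }
        out.push(color);
    }
    out
}
```

With u32 positions, the same loop works unchanged for lines wider than 65535 pixels.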

Group 3 2D decoding (addresses #5 G3 thread)

Mixed 1D/2D coding per T.4. Reads the tag bit after each EOL to select 1D or 2D decoding per line. The 2D path reuses the existing G4 mode-code logic. Tested with a real G3 2D image generated by libtiff's tiffcp -c g3:2d.
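
A minimal sketch of the tag-bit selection described above, assuming MSB-first input. In T.4, an EOL is at least eleven 0 bits followed by a 1, and the tag bit that follows is 1 for a 1D-coded line and 0 for a 2D-coded line. `Bits`, `LineCoding`, and `read_eol_and_tag` are hypothetical names for this example, not the PR's types.

```rust
/// Hypothetical MSB-first bit cursor over a byte slice.
struct Bits<'a> {
    data: &'a [u8],
    pos: usize, // absolute bit position
}

impl<'a> Bits<'a> {
    fn read_bit(&mut self) -> Option<bool> {
        let byte = *self.data.get(self.pos / 8)?;
        let bit = (byte >> (7 - self.pos % 8)) & 1 == 1;
        self.pos += 1;
        Some(bit)
    }
}

#[derive(Debug, PartialEq, Eq)]
enum LineCoding {
    OneD,
    TwoD,
}

/// Consume an EOL (at least eleven 0 bits, then a 1) and the tag bit after it.
/// Returns None on truncated input or a malformed EOL.
fn read_eol_and_tag(bits: &mut Bits<'_>) -> Option<LineCoding> {
    let mut zeros = 0u32;
    while !bits.read_bit()? {
        zeros += 1; // extra zeros are fill bits
    }
    if zeros < 11 {
        return None;
    }
    // Tag bit: 1 selects 1D coding for the next line, 0 selects 2D.
    Some(if bits.read_bit()? {
        LineCoding::OneD
    } else {
        LineCoding::TwoD
    })
}
```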

Byte-aligned mode (rows_are_byte_aligned)

Consumes padding bits between lines to reach the next byte boundary. Needed for TIFF Group3Options bit 2 and PDF EncodedByteAlign. Tested against Go's x/image/ccitt aligned test files — both G3 and G4 aligned variants now decode correctly.
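
The padding arithmetic can be sketched as below, assuming the reader tracks an absolute bit position. The function names are hypothetical and are not the signature of the `align_to_byte()` helper added in this PR.

```rust
/// Number of padding bits to skip so that `bit_pos` lands on a byte boundary.
fn padding_to_byte_boundary(bit_pos: usize) -> usize {
    (8 - bit_pos % 8) % 8
}

/// Bit position after skipping the padding (unchanged if already aligned).
fn aligned_bit_pos(bit_pos: usize) -> usize {
    bit_pos + padding_to_byte_boundary(bit_pos)
}
```

For example, a line that ends at bit 13 is followed by 3 padding bits, and the next line starts at bit 16.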

EndOfBlock=false support

When end_of_block is false, the decoder uses rows to determine when to stop instead of requiring EOFB/RTC markers. Handles TIFF strips that end without termination markers.
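
One way to picture the stop condition, following the PDF definition of EndOfBlock (when false, decoding stops after Rows lines or when the data runs out, whichever comes first). This is only a sketch: `is_finished` and its parameters are hypothetical and do not reflect the decoder's internal state.

```rust
/// Decide whether decoding should stop after the current line.
/// Assumption: `saw_end_marker` is set once an EOFB/RTC marker has been read.
fn is_finished(
    end_of_block: bool,
    rows: Option<u32>,
    rows_decoded: u32,
    saw_end_marker: bool,
    input_exhausted: bool,
) -> bool {
    if input_exhausted {
        return true;
    }
    if end_of_block {
        // The terminating marker decides; the row count is ignored.
        saw_end_marker
    } else {
        // No marker expected: stop once the requested number of rows is reached.
        rows.map_or(false, |r| rows_decoded >= r)
    }
}
```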

EndOfLine=false support

Supports G3 data with no EOL markers between lines. Each line ends when its run lengths add up to the columns width, rather than at an EOL marker. Needed for some PDF streams.
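
To illustrate width-terminated lines, the sketch below splits an already-decoded stream of run lengths into lines of exactly `columns` pixels. It is a simplified model with hypothetical names; the real decoder works on code words rather than a pre-decoded run list.

```rust
/// Split a flat stream of run lengths into lines that each sum to `columns`.
/// Returns None if a run overshoots the line width or the last line is incomplete.
fn split_runs_into_lines(runs: &[u32], columns: u32) -> Option<Vec<Vec<u32>>> {
    let mut lines = Vec::new();
    let mut current = Vec::new();
    let mut filled: u32 = 0;
    for &run in runs {
        filled = filled.checked_add(run)?;
        if filled > columns {
            return None; // run crosses the line boundary: corrupt data
        }
        current.push(run);
        if filled == columns {
            // Line is full; the next run starts a new line.
            lines.push(std::mem::take(&mut current));
            filled = 0;
        }
    }
    if current.is_empty() {
        Some(lines)
    } else {
        None
    }
}
```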

LSB-first bit order (msb_first=false)

ByteReader::new_lsb() reverses the bits within each input byte, as needed for TIFF FillOrder=2.
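
The per-byte reversal itself can be sketched with the standard library's `u8::reverse_bits`. The helper name below is hypothetical, and the PR's reader may well reverse bytes lazily rather than building a new buffer.

```rust
/// Convert LSB-first (FillOrder=2) bytes to MSB-first by reversing each byte's bits.
fn lsb_to_msb(input: &[u8]) -> Vec<u8> {
    input.iter().map(|b| b.reverse_bits()).collect()
}
```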

Resource limits

Limits { max_pixels, max_input_bytes } rejects oversized images early and bounds input consumption. Returns DecodeError::LimitExceeded on violation. Default limits cap at u16 range.
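
A sketch of an early pixel-count check, under the assumption that `max_pixels` is a u64 and that an unknown row count cannot be checked up front. `LimitExceeded` and `check_pixel_limit` are hypothetical names for this example, not the PR's error type or API.

```rust
/// Hypothetical error value for this example.
#[derive(Debug, PartialEq, Eq)]
struct LimitExceeded;

/// Reject images whose declared dimensions exceed `max_pixels`.
/// Widening to u64 first means the u32 * u32 product cannot overflow.
fn check_pixel_limit(
    columns: u32,
    rows: Option<u32>,
    max_pixels: Option<u64>,
) -> Result<(), LimitExceeded> {
    match (rows, max_pixels) {
        (Some(rows), Some(max)) => {
            let pixels = u64::from(columns) * u64::from(rows);
            if pixels > max {
                Err(LimitExceeded)
            } else {
                Ok(())
            }
        }
        // Nothing to check without both a row count and a limit.
        _ => Ok(()),
    }
}
```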

Test coverage

9 new tests using Go's x/image/ccitt corpus (153×55 bw-gopher, BSD-3-Clause, 56KB):

  • Legacy parity: unified G4 and G3 produce identical output to decode_g4/decode_g3
  • Go corpus: G4 normal, G3 normal, G3/G4 pixel match, G4 aligned, G3 aligned
  • G3 2D: 176×8 image from tiffcp -c g3:2d
  • Limits: max_pixels rejects oversized dimensions

26 total tests pass (23 unit + 2 check + 1 errors).

What this doesn't address (yet)

Stacks on #24 (cargo fmt).

@lilith lilith force-pushed the feat/unified-decoder branch from f26fb35 to d0b17bc Compare April 9, 2026 20:50
lilith added 2 commits April 9, 2026 22:24
…limits

New `unified` module provides a single `Decoder` struct configurable
via `DecodeOptions`, handling G4, G3 1D, and G3 2D in one code path.
Maps directly to PDF CCITTFaxDecode parameters and TIFF Group3/4Options.

New capabilities: G3 2D (mixed 1D/2D with K parameter), byte-aligned
mode, EndOfBlock=false, EndOfLine=false (Modified Huffman), LSB-first
bit order, resource limits (max_pixels, max_input_bytes), native u32
transitions (default Limits caps at u16 range).

API: #[non_exhaustive] on DecodeOptions/EncodingMode/Error. Own Error
type — legacy DecodeError unchanged. decode() convenience function,
pels32() for u32 transitions.

Changes to existing code: 3 helpers made pub(crate) in decoder.rs,
new private decode_2d_line extracted. ByteReader gains new_lsb(),
bytes_consumed(), align_to_byte(). Zero changes to public API.

tests/unified.rs (9 tests): inline Go corpus data (BSD-3, ~1.2KB),
legacy parity, G3/G4/aligned/G3-2D/limits.

tests/external_corpus.rs (8 tests): Pillow TIFFs (MIT-CMU, 5KB) for
LSB bit order, G3 no-EOL, crash regression. libtiff TIFF inlined for
minimal G3. SHA-256 hash verification of all 43 committed test files
against libtiff ground truth (reference-hashes.tsv).

test-files/pillow/: 5 small TIFFs (5KB total).
test-files/reference-hashes.tsv: libtiff-verified decode hashes.
@lilith lilith force-pushed the feat/unified-decoder branch from 9f251ec to 4edfe27 Compare April 10, 2026 04:25