feat(pipeline): SparqlItemSelector.maxResults + drop requireNonEmptyData #396

Merged
ddeboer merged 3 commits into main from feat/sampling-cap-and-cleanup
May 18, 2026

ddeboer commented May 18, 2026

Summary

Three related changes to how @lde/pipeline-based pipelines express sampling caps and conformance:

  1. feat(pipeline): add a maxResults option to SparqlItemSelector that genuinely caps the total bindings yielded across all paginated pages.
  2. fix(pipeline-shacl-sampler): use that new option so samplesPerClass actually limits the sampled subjects per sh:targetClass. This was a real bug: the cap was never applied.
  3. refactor(pipeline)!: remove the requireNonEmptyData decorator. With the proper sampler cap in place, the right semantic for ‘no target class matched’ is SHACL’s vacuous-truth (conforms: true), not a synthesised non-conformance. Consumers that need to distinguish ‘untested’ from ‘tested and passed’ can read quadsValidated > 0.

Why

The sampling cap bug

Until now, shaclSampleStages built its subject selector with LIMIT N in the SELECT, intending it as a total cap on samples. SparqlItemSelector interprets a query-level LIMIT as page size, not as a total — and proceeds to paginate with OFFSET until the source is exhausted. The effect: a target class with thousands of instances was fully walked rather than capped at N.
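
To illustrate the failure mode described above, here is a minimal sketch of LIMIT-as-page-size pagination. It is not the SparqlItemSelector source: `FetchPage`, `paginateUncapped` and the in-memory `instances` array are hypothetical stand-ins, and the sketch is synchronous for brevity where a real selector would be asynchronous.

```typescript
// Hypothetical stand-in for a SPARQL endpoint: returns one page of subject IRIs.
type FetchPage = (limit: number, offset: number) => string[];

// Models the pre-fix behaviour: `pageSize` plays the role of the in-query LIMIT,
// and the loop keeps raising OFFSET until the source is exhausted.
function* paginateUncapped(fetchPage: FetchPage, pageSize: number): Generator<string> {
  for (let offset = 0; ; offset += pageSize) {
    const page = fetchPage(pageSize, offset);
    yield* page;
    // Only a short (or empty) page ends the walk; nothing caps the total.
    if (page.length < pageSize) return;
  }
}

// An in-memory "target class" with 1,000 instances (assumed data).
const instances = Array.from({ length: 1000 }, (_, i) => `urn:example:subject-${i}`);
const fetchFromMemory: FetchPage = (limit, offset) => instances.slice(offset, offset + limit);

// A caller that wrote LIMIT 10 expecting ten samples gets every instance instead.
const walked = Array.from(paginateUncapped(fetchFromMemory, 10)).length;
console.log(walked); // 1000
```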

Concrete impact: a recent end-to-end run against lod.uba.uva.nl/Cinema-Context/Cinema-Context validated 890K quads instead of an expected handful, and the on-disk SHACL report file reached 258 MB.

maxResults

Adds an explicit option to SparqlItemSelector for the "I want at most N items total" case:

new SparqlItemSelector({
  query: 'SELECT DISTINCT ?s WHERE { ?s a <Class> }',
  maxResults: 50,
});

When set:

  • The first page's LIMIT is clamped to maxResults so the endpoint never returns more than the cap.
  • Pagination stops as soon as maxResults bindings have been yielded.
  • Independent of any in-query LIMIT, which still controls page size in the multi-page case.
  • Setting maxResults: 0 is a valid no-op — the selector yields nothing without issuing any fetch.

The sampler now uses this in place of the buggy in-query LIMIT.
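
The capping behaviour listed above can be sketched roughly as follows. This is an assumption-laden model, not the library's implementation: `PageFetcher`, `selectCapped`, `runCapped` and the `subjects` array are hypothetical names. It clamps each page's limit to the remaining budget, which is one way to satisfy the bullets. It is synchronous for brevity.

```typescript
// Hypothetical stand-in for a SPARQL endpoint: returns one page of subject IRIs.
type PageFetcher = (limit: number, offset: number) => string[];

// Yields at most `maxResults` items; `pageSize` still controls how many items
// are requested per page while more than one page of budget remains.
function* selectCapped(fetchPage: PageFetcher, pageSize: number, maxResults: number): Generator<string> {
  let yielded = 0;
  while (yielded < maxResults) {
    // Never ask the endpoint for more than the remaining budget.
    const limit = Math.min(pageSize, maxResults - yielded);
    const page = fetchPage(limit, yielded);
    yield* page;
    yielded += page.length;
    // A short page means the source is exhausted before the cap was reached.
    if (page.length < limit) return;
  }
}

// An in-memory "target class" with 1,000 instances (assumed data).
const subjects = Array.from({ length: 1000 }, (_, i) => `urn:example:subject-${i}`);

// Runs the selector and records the LIMIT of every page request.
function runCapped(pageSize: number, maxResults: number) {
  const requestedLimits: number[] = [];
  const fetchPage: PageFetcher = (limit, offset) => {
    requestedLimits.push(limit);
    return subjects.slice(offset, offset + limit);
  };
  const items = Array.from(selectCapped(fetchPage, pageSize, maxResults));
  return { items, requestedLimits };
}

const capped = runCapped(10, 25);
console.log(capped.items.length, capped.requestedLimits); // 25 items; limits 10, 10, 5
```

With `maxResults` set to 0 the loop body never runs, so no page is requested at all, matching the no-op case in the list.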

Removing requireNonEmptyData

The decorator flipped a validator's report to non-conforming when quadsValidated === 0. After more thought (with the sampling cap fixed), that semantic conflates two genuinely different categories of dataset:

| Dataset | quadsValidated | Honest verdict |
| --- | --- | --- |
| Uses SCHEMA-AP-NDE classes, has violations | > 0 | non-conforming |
| Uses SCHEMA-AP-NDE classes, valid | > 0 | conforming |
| Doesn’t use SCHEMA-AP-NDE at all (e.g. Linked.Art) | 0 | not applicable |

requireNonEmptyData reported the third case as ‘non-conforming’, which is dishonest — a Linked.Art dataset isn’t non-conformant to SCHEMA-AP-NDE any more than it’s non-conformant to FOAF or DCAT-AP-SL. It’s a different model.

The proper consumer-side filter is quadsValidated > 0 AND conforms = true for ‘tested and passed’. SHACL’s vacuous-truth default for empty targets is the more honest signal; the decorator was an opt-in deviation with limited downstream value.
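
A consumer-side reading of the two signals might look like the sketch below. The field names `conforms` and `quadsValidated` come from this description; the `ValidationSummary` interface, the `Verdict` type and the `verdict` function are hypothetical and only illustrate the three categories discussed here.

```typescript
// Hypothetical report shape: only the two fields discussed in this PR.
interface ValidationSummary {
  conforms: boolean;
  quadsValidated: number;
}

type Verdict = 'conforming' | 'non-conforming' | 'not applicable';

// Maps the two signals onto the three dataset categories.
function verdict(report: ValidationSummary): Verdict {
  // Nothing was validated, so a vacuously true `conforms` says nothing about the dataset.
  if (report.quadsValidated === 0) return 'not applicable';
  return report.conforms ? 'conforming' : 'non-conforming';
}

// 'Tested and passed' is the filter quadsValidated > 0 AND conforms = true.
function testedAndPassed(report: ValidationSummary): boolean {
  return verdict(report) === 'conforming';
}

console.log(verdict({ conforms: true, quadsValidated: 0 })); // not applicable
```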

Pre-release, so the removal lands without a back-compat shim. Documented in packages/pipeline/README.md so consumers know which signal to read for which question.

Breaking change

requireNonEmptyData is removed from @lde/pipeline’s exports. Consumers that imported it (notably the DKG pipeline) need to drop the import and rely on quadsValidated for their ‘untested’ filter. Coordinated update in dataset-knowledge-graph PR #280.

Notes

  • 4 new tests in selector.test.ts cover the maxResults paths (cap across pages, first-page clamp, no extra page request, zero-maxResults no-op).
  • sampleStages.test.ts updated for the now-LIMIT-less subject selector query.
  • Coverage thresholds on @lde/pipeline adjusted downward (3 fewer functions; less code → slightly lower coverage ratio).

ddeboer added 3 commits May 18, 2026 11:48
Cap the total bindings the selector yields across all paginated
pages — useful for sampling stages that want a fixed N, for testing
and prototyping against a small slice, or for bounded pipelines
that need a safety cap.

Independent of any in-query LIMIT clause, which still controls page
size. The first page asks for the configured page size as-is; total
cap and page size stay orthogonal. The last (partial) page's LIMIT
is shrunk to the remaining cap so the endpoint doesn't over-fetch
on the remainder. maxResults: 0 is a valid no-op — the selector
yields nothing without issuing any SPARQL request.

The selector used to put 'LIMIT N' in the SELECT query intending it
as a total cap, but SparqlItemSelector reads in-query LIMIT as the
*page size* and continues paginating with OFFSET — so a target class
with thousands of instances was fully walked instead of capped at N.

Switch to the new maxResults option on SparqlItemSelector, which
caps the total bindings yielded across pages. buildSubjectSelectorQuery
drops the LIMIT clause it used to emit.

The decorator existed to flip a validator's report to non-conforming
when no quads were ever validated for a dataset. In practice that
conflates two distinct dataset categories: those that genuinely fail
the profile, and those that simply don't use any of the profile's
target classes (e.g. a Linked.Art dataset evaluated against
SCHEMA-AP-NDE).

Consumers can read quadsValidated > 0 to distinguish 'tested and
passed' from 'untested'. SHACL's vacuous-truth rule for
sh:conforms = true on an empty target set is the more honest
default; the decorator was an opt-in deviation with limited use.

Pre-release, so no back-compat shim.
ddeboer force-pushed the feat/sampling-cap-and-cleanup branch from 5a2f2cb to fa69645 May 18, 2026 09:49
ddeboer merged commit e60be56 into main May 18, 2026
2 checks passed
ddeboer deleted the feat/sampling-cap-and-cleanup branch May 18, 2026 09:52