feat(pipeline): SparqlItemSelector.maxResults + drop requireNonEmptyData #396
Merged
Cap the total bindings the selector yields across all paginated pages — useful for sampling stages that want a fixed N, for testing and prototyping against a small slice, or for bounded pipelines that need a safety cap. Independent of any in-query LIMIT clause, which still controls page size. The first page asks for the configured page size as-is; total cap and page size stay orthogonal. The last (partial) page's LIMIT is shrunk to the remaining cap so the endpoint doesn't over-fetch on the remainder. maxResults: 0 is a valid no-op — the selector yields nothing without issuing any SPARQL request.
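The capping behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual `SparqlItemSelector` code: `selectCapped` and `FetchPage` are hypothetical names, and the fetch callback is synchronous only to keep the sketch short (a real selector would await an HTTP request to the endpoint).

```typescript
// Hypothetical stand-in for one SPARQL request: returns up to `limit`
// bindings starting at `offset`. Not part of @lde/pipeline.
type FetchPage<T> = (limit: number, offset: number) => T[];

// Yield at most `maxResults` items in total, paginating with `pageSize`.
function* selectCapped<T>(
  fetchPage: FetchPage<T>,
  pageSize: number,
  maxResults?: number,
): Generator<T> {
  let yielded = 0;
  let offset = 0;
  while (true) {
    const remaining =
      maxResults === undefined ? Infinity : maxResults - yielded;
    // Cap reached (or maxResults is 0): stop without issuing a request.
    if (remaining <= 0) return;
    // Full pages use pageSize as-is; only the last partial page shrinks.
    const limit = Math.min(pageSize, remaining);
    const page = fetchPage(limit, offset);
    for (const item of page) {
      yield item;
      yielded++;
    }
    // A short page means the source is exhausted.
    if (page.length < limit) return;
    offset += page.length;
  }
}

// Usage against an in-memory "endpoint" of 1000 rows.
const rows = Array.from({ length: 1000 }, (_, i) => i);
const requests: Array<[number, number]> = [];
const fakeFetch: FetchPage<number> = (limit, offset) => {
  requests.push([limit, offset]); // record each (LIMIT, OFFSET) pair
  return rows.slice(offset, offset + limit);
};
const sample = Array.from(selectCapped(fakeFetch, 10, 25));
// sample.length is 25; requests are [10, 0], [10, 10], [5, 20]
```

With a page size of 10 and a cap of 25, the sketch issues two full-size requests and one shrunken request for the remaining 5, then stops without probing for a fourth page.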
The selector used to put 'LIMIT N' in the SELECT query intending it as a total cap, but SparqlItemSelector reads in-query LIMIT as the *page size* and continues paginating with OFFSET — so a target class with thousands of instances was fully walked instead of capped at N. Switch to the new maxResults option on SparqlItemSelector, which caps the total bindings yielded across pages. buildSubjectSelectorQuery drops the LIMIT clause it used to emit.
The decorator existed to flip a validator's report to non-conforming when no quads were ever validated for a dataset. In practice that conflates two distinct dataset categories: those that genuinely fail the profile, and those that simply don't use any of the profile's target classes (e.g. a Linked.Art dataset evaluated against SCHEMA-AP-NDE). Consumers can read quadsValidated > 0 to distinguish 'tested and passed' from 'untested'. SHACL's vacuous-truth rule for sh:conforms = true on an empty target set is the more honest default; the decorator was an opt-in deviation with limited use. Pre-release, so no back-compat shim.
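On the consumer side, the distinction between 'tested and passed' and 'untested' could be drawn along these lines. This is a sketch under assumptions: only the field names `conforms` and `quadsValidated` come from the text above, while the `ValidationSummary` interface, the `Outcome` labels, and the `classify` helper are hypothetical.

```typescript
// Assumed report shape; only the two field names are taken from the PR text.
interface ValidationSummary {
  conforms: boolean;
  quadsValidated: number;
}

type Outcome = 'tested-and-passed' | 'tested-and-failed' | 'untested';

function classify(report: ValidationSummary): Outcome {
  // Nothing was validated, so any conformance is vacuous: treat the
  // dataset as untested rather than as a pass or a failure.
  if (report.quadsValidated === 0) return 'untested';
  return report.conforms ? 'tested-and-passed' : 'tested-and-failed';
}

const reports: ValidationSummary[] = [
  { conforms: true, quadsValidated: 1200 },
  { conforms: false, quadsValidated: 300 },
  { conforms: true, quadsValidated: 0 },
];
const outcomes = reports.map(classify);
// → ['tested-and-passed', 'tested-and-failed', 'untested']
```

Keeping the three outcomes separate lets a consumer report datasets that use a different model as untested instead of counting them as failures.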
Summary
Two related changes to how `@lde/pipeline`-based pipelines express sampling caps and conformance:

- feat(pipeline): add a `maxResults` option to `SparqlItemSelector` that genuinely caps the total bindings yielded across all paginated pages.
- fix(pipeline-shacl-sampler): use that new option so `samplesPerClass` actually limits the sampled subjects per `sh:targetClass`. This was a real bug: the cap wasn't applied.
- refactor(pipeline)!: remove the `requireNonEmptyData` decorator. With the proper sampler cap in place, the right semantic for ‘no target class matched’ is SHACL’s vacuous-truth rule (`conforms: true`), not a synthesised non-conformance. Consumers that need to distinguish ‘untested’ from ‘tested and passed’ can read `quadsValidated > 0`.

Why
The sampling cap bug
Until now, `shaclSampleStages` built its subject selector with `LIMIT N` in the SELECT, intending it as a total cap on samples. `SparqlItemSelector` interprets a query-level `LIMIT` as page size, not as a total, and proceeds to paginate with `OFFSET` until the source is exhausted. The effect: a target class with thousands of instances was fully walked rather than capped at N.

Concrete impact: a recent end-to-end run against `lod.uba.uva.nl/Cinema-Context/Cinema-Context` validated 890K quads instead of an expected handful, and the on-disk SHACL report file reached 258 MB.

maxResults

Adds an explicit option to `SparqlItemSelector` for the "I want at most N items total" case.

When set:

- `LIMIT` is clamped to `maxResults` so the endpoint never returns more than the cap.
- Pagination stops once `maxResults` bindings have been yielded.
- The cap is independent of any in-query `LIMIT`, which still controls page size in the multi-page case.
- `maxResults: 0` is a valid no-op: the selector yields nothing without issuing any fetch.

The sampler now uses this in place of the buggy in-query `LIMIT`.

Removing `requireNonEmptyData`

The decorator flipped a validator's report to non-conforming when `quadsValidated === 0`. After more thought (with the sampling cap fixed), that semantic conflates two genuinely different categories of dataset: those that genuinely fail the profile, and those that simply don't use any of the profile's target classes, so that no quads are validated at all.

`requireNonEmptyData` reported the latter case as ‘non-conforming’, which is dishonest: a Linked.Art dataset isn’t non-conformant to SCHEMA-AP-NDE any more than it’s non-conformant to FOAF or DCAT-AP-SL. It’s a different model.

The proper consumer-side filter is `quadsValidated > 0 AND conforms = true` for ‘tested and passed’. SHACL’s vacuous-truth default for empty targets is the more honest signal; the decorator was an opt-in deviation with limited downstream value.

Pre-release, so the removal lands without a back-compat shim. Documented in `packages/pipeline/README.md` so consumers know which signal to read for which question.

Breaking change
`requireNonEmptyData` is removed from `@lde/pipeline`’s exports. Consumers that imported it (notably the DKG pipeline) need to drop the import and rely on `quadsValidated` for their ‘untested’ filter. Coordinated update in `dataset-knowledge-graph` PR #280.

Notes

- Tests in `selector.test.ts` cover the `maxResults` paths (cap across pages, first-page clamp, no extra page request, zero-maxResults no-op).
- `sampleStages.test.ts` updated for the now-LIMIT-less subject selector query.
- Coverage thresholds for `@lde/pipeline` adjusted downward (3 fewer functions; less code → slightly lower coverage ratio).