feat: add WDL module specification and symbolic imports#765
feat: add WDL module specification and symbolic imports#765claymcleod wants to merge 26 commits into
Conversation
Introduces a peer specification at `modules/SPEC.md` covering the `module.json` manifest, `module-lock.json`, dependency resolution from Git tags, content hashing, Ed25519 signing, and credential management. Adds the `from` keyword and a new symbolic import form to `SPEC.md` with symmetric namespacing rules for tasks, workflows, structs, and enums. Remote URL imports are soft-deprecated. Ships JSON Schemas for both the manifest and lockfile. Closes #758. Closes #226.
a29acf3 to
b4f063a
Compare
A `module-lock.json` file is no longer required for every module. It is required for modules whose consumers need reproducible builds, but may be omitted by modules intended as libraries where version resolution is deliberately left to the consumer. The spec now also states that lockfiles apply only to the module they sit in: upstream lockfiles are not consulted during downstream resolution, so consumers remain in control of their transitive version choices. Clarifies engine responsibility for verifying cached module content against the lockfile and for keeping the lockfile consistent with the resolved tree. Drops the now-stale "Lockfile as a specification requirement" rationale bullet, and the "Duplicate dependencies over conflict resolution" bullet whose premise doesn't carry its weight.
24cedf4 to
ceb0fb2
Compare
|
@peterhuene and I have agreed to land a simpler model than the original eight-form proposal. Symbolic imports are interchangeable with quoted imports — the build system resolves a symbolic path to the equivalent of a quoted import. All three forms accept either source type with identical semantics. The three forms:
What changed from the prior eight-form model:
This model fits the existing WDL mental model (no new namespace concept to teach), is easier to implement, and parses with no lookahead beyond |
|
After much discussion, @peterhuene and I have agreed that this is the approach we're going to take.
If we decide later on that namespaces are needed, then the directives to encapsulate members within a namespace ( The rules for the simpler version we're going to try first are below.
Other notables:
|
0cf7f31 to
c5db66c
Compare
c5db66c to
e5de62e
Compare
f3aa495 to
a186061
Compare
77d89ea to
3a4d9e9
Compare
998e4cb to
89d639c
Compare
DavyCats
left a comment
There was a problem hiding this comment.
Nice! Thanks for writing this up!
| An import statement takes one of three forms. Every form accepts the same two source styles, and the source style affects only how the build system locates the imported document. | ||
|
|
||
| 1. `import <source> [as <alias>] (alias <Old> as <New>)*`. User-defined types (structs and enums) from `<source>` are copied into the importing document's scope. Tasks and workflows from `<source>` are accessible only through the import's namespace, which defaults to the filename minus the `.wdl` extension for a quoted URI or to the last component of the path for a symbolic module path. `as <alias>` overrides the default namespace; `alias <Old> as <New>` renames a struct or enum as it is copied. `alias` cannot rename tasks or workflows. See [Fully Qualified Names & Namespaced Identifiers](#fully-qualified-names--namespaced-identifiers) for how the namespace is used. | ||
| 2. `import * from <source>`. Every task, workflow, and user-defined type from `<source>` enters the importing document's scope. No namespace is introduced. |
There was a problem hiding this comment.
Personally, I'm not a huge fan of these kinds of imports. They obfuscate where the imported task/workflow is actually defined. I'm wondering how others feel about this.
There was a problem hiding this comment.
I used to feel this way when I was resolving them manually / by eye. But LSPs have mostly solved this issue for me, now I just click 'jump to definition' (also in e.g. python where they're also argued against for the same reason.)
There was a problem hiding this comment.
Yeah and it's a little tricky too because of the design decisions that were made prior in WDL (i.e., #160). It's actually a split situation today where import "foo.wdl" means:
- Import all of the user-defined types (structs and enums) into the current document's scope, effectively losing the tether to where they came from, and
- Make all of the tasks and workflows available at
foo.<task>/foo.<workflow>but don't export those as part of the scope.
This asymmetry poses real challenges to making any sort of module system usable under the current syntax.
Additionally, WDL, as it stands today, has no real concept of a formal namespace (there is something referred to as a namespace in the spec, but it's not allowed to be passed between document's using their scope).
Short of introducing namespaces more formally (which has a whole other set of language implications), we had to find a new syntax that made sense to actually pull the tasks and workflows into the document's scope alongside the UDTs while still feeling WDL-like. This was the best we could come up with, but I'm open to other ways to express this.
There was a problem hiding this comment.
Agree that this can obfuscate what actually gets into the document, but It is a common pattern in most languages so I think a lot of people will expect this to exist.
|
|
||
| 1. `import <source> [as <alias>] (alias <Old> as <New>)*`. User-defined types (structs and enums) from `<source>` are copied into the importing document's scope. Tasks and workflows from `<source>` are accessible only through the import's namespace, which defaults to the filename minus the `.wdl` extension for a quoted URI or to the last component of the path for a symbolic module path. `as <alias>` overrides the default namespace; `alias <Old> as <New>` renames a struct or enum as it is copied. `alias` cannot rename tasks or workflows. See [Fully Qualified Names & Namespaced Identifiers](#fully-qualified-names--namespaced-identifiers) for how the namespace is used. | ||
| 2. `import * from <source>`. Every task, workflow, and user-defined type from `<source>` enters the importing document's scope. No namespace is introduced. | ||
| 3. `import { <member> [as <Name>], ... } from <source>`. Only the listed items enter the importing document's scope. A per-member `as <Name>` renames the selected item locally. A trailing comma after the last member is permitted. No namespace is introduced. |
There was a problem hiding this comment.
For consistency with how form 1 is explained, I guess ... should be *?
| 3. `import { <member> [as <Name>], ... } from <source>`. Only the listed items enter the importing document's scope. A per-member `as <Name>` renames the selected item locally. A trailing comma after the last member is permitted. No namespace is introduced. | |
| 3. `import { (<member> [as <Name>],)* } from <source>`. Only the listed items enter the importing document's scope. A per-member `as <Name>` renames the selected item locally. A trailing comma after the last member is permitted. No namespace is introduced. |
Or maybe + if at least one needs to be present.
Or is ... supposed to have a separate functionality, like "also import everything else but don't rename it".
There was a problem hiding this comment.
Yeah sounds good, I'll clarify this.
| version 1.4 | ||
|
|
||
| import "csvkit.wdl" # tasks via `csvkit` namespace; structs/enums in scope | ||
| import "csvkit.wdl" as csv # tasks via `csv` namespace; structs/enums in scope | ||
| import * from "csvkit.wdl" # tasks, workflows, structs, enums all in scope | ||
| import { CsvSort } from "csvkit.wdl" # only `CsvSort` in scope | ||
| import { CsvSort as MySort, CsvSortStable } from "csvkit.wdl" # `MySort` and `CsvSortStable` in scope | ||
| import openwdl/csvkit # tasks via `csvkit` namespace; structs/enums in scope | ||
| import { CsvSort } from openwdl/csvkit # only `CsvSort` in scope |
There was a problem hiding this comment.
This is missing as example of form 1's alias old as new syntax.
8743d5f to
26df2b6
Compare
Symbolic sub-paths now resolve directly to `<module-folder>/<sub-path>.wdl` rather than naming a nested module's `module.json`. Each dependency points to exactly one module folder; consumers reach additional modules in the same repository by declaring them as separate dependencies with distinct `path` values. The recursive scan that registered every nested manifest in a source is removed. The manifest's `index` field is renamed to `entrypoint`. A root module import resolves to the entrypoint file; a missing entrypoint surfaces an engine-specific error at import time. A new `exclude` field accepts gitignore-style globs identifying files unreachable via symbolic import. `exclude` governs the public import surface only and does not affect content hashing, signing, or quoted within-module imports. Sub-path components remain WDL identifiers, so `..` is rejected at parse time. The spec now states this as a normative no-escape guarantee that engines may rely on when performing sparse Git checkouts of dependency module folders.
26df2b6 to
fb174f5
Compare
Prepend the spec's domain-separation prefix `wdl-module-content\0v1\0` to the SHA-256 input in `Hasher::finalize` per openwdl/wdl#765 commit 505e954, and rewrite `detects_path_content_boundary_collision` to exercise the actual collision the length prefixes prevent (a file name absorbing what would otherwise be its content's bytes), rather than asserting determinism on two trivially-different file lists.
patmagee
left a comment
There was a problem hiding this comment.
Looking amazing! Round 1 of comments done.. I still have a lot to go :D
| "required": ["git", "commit"], | ||
| "properties": { | ||
| "git": { "type": "string", "format": "uri" }, | ||
| "commit": { "type": "string", "pattern": "^[0-9a-f]{40}$" }, |
There was a problem hiding this comment.
Should this be sha to make the it explicit what the pattern is enforcing?
| An import statement takes one of three forms. Every form accepts the same two source styles, and the source style affects only how the build system locates the imported document. | ||
|
|
||
| 1. `import <source> [as <alias>] (alias <Old> as <New>)*`. User-defined types (structs and enums) from `<source>` are copied into the importing document's scope. Tasks and workflows from `<source>` are accessible only through the import's namespace, which defaults to the filename minus the `.wdl` extension for a quoted URI or to the last component of the path for a symbolic module path. `as <alias>` overrides the default namespace; `alias <Old> as <New>` renames a struct or enum as it is copied. `alias` cannot rename tasks or workflows. See [Fully Qualified Names & Namespaced Identifiers](#fully-qualified-names--namespaced-identifiers) for how the namespace is used. | ||
| 2. `import * from <source>`. Every task, workflow, and user-defined type from `<source>` enters the importing document's scope. No namespace is introduced. |
There was a problem hiding this comment.
Agree that this can obfuscate what actually gets into the document, but It is a common pattern in most languages so I think a lot of people will expect this to exist.
|
|
||
| ## Module Directory Layout | ||
|
|
||
| A module is a directory containing a `module.json` manifest at its root and one or more `.wdl` files. There is no additional organizational construct—no workspace type, no grouping file, no hierarchy requirement. |
There was a problem hiding this comment.
For the time being this format will work, but I am sure at some point the question of module distribution and packaging is going to come up. This will be needed to solve for issues like offline dependency resolution, caching and maybe the future central dependency managment layer
- Python has wheels (
.whl) - Java has jars (
.jar) - Rust has crates (
.crate)
these are all either zipped or gzipped folders that contain the package manifest and all other files.
I would propose a similar format and build a wdl module: .wdlm that is essentially just a zip file of the Module directory
|
|
||
| ``` | ||
| csvcut-wdl/ | ||
| module.json |
There was a problem hiding this comment.
does a module ship with it's own module-lock.json? If there are conflicts how are these resolved?
| 1. Each file's relative path (computed per [Content Hashing](#content-hashing)) is treated as a sequence of `/`-separated components. The path must contain no `.` or `..` components, must not be absolute (no leading `/`, no Windows-style drive letter), and must not contain a null byte. | ||
| 2. Symbolic links inside a module are permitted only if their targets, after full resolution, fall inside the module root. A symlink whose target escapes the module root makes the module invalid. | ||
| 3. The names `module.json`, `module-lock.json`, and `module.sig` are reserved for files at the module root. A file with any of these names appearing at any other path within the module tree is a validation error. | ||
| 4. A module that, when materialized on disk, would produce any path failing these rules is invalid; engines must refuse to load it and surface a clear error identifying the offending entry. |
There was a problem hiding this comment.
If a file in a module references a relative path outside the module root directly (not in the dependencies) would the module be considered invalid?
More generally what do import statements within a module look like?
|
|
||
| - **`tag`** — a specific Git tag name (e.g., `"v1.2.0-rc1"`). Not subject to semver resolution. | ||
| - **`branch`** — a Git branch name. The resolved commit varies over time; the lockfile pins the exact commit at resolution time. | ||
| - **`commit`** — a Git commit SHA. Any prefix that uniquely identifies a commit in the source repository is accepted; the resolver expands the prefix to the full 40-character SHA at lock time and records the full SHA in `module-lock.json`. The most precise and immutable selector. |
| ```json | ||
| { | ||
| "dependencies": { | ||
| "local_utils": { "path": "../../shared/utils" } |
There was a problem hiding this comment.
the implication of this is that it breaks out of the module root. To that extent it feels like the module spec is doing double duty:
- defining a boundary for a reusable workflow
- defining the dependencies for a workflow that may or may not be a module
If we allow relative paths pointing outside of the current module root we will satisfy 2, but not 1. If we restrict to not allow that then we will only satisfy 1 and not 2
| "homepage": "https://csvkit.readthedocs.io/" | ||
| } | ||
| ], | ||
| "dependencies": {} |
There was a problem hiding this comment.
can we expand the dependencies section to include an example?
| import * from csvcut # tasks, workflows, and UDTs all in scope unqualified, no namespace | ||
| ``` | ||
|
|
||
| A module with multiple files curates its default surface by importing them from the entrypoint: |
There was a problem hiding this comment.
can an entrypoint wdl consist of ONLY imports and no workflow/task/enum etc?
|
|
||
| Sub-path resolution is a direct file lookup. Intermediate directories along the sub-path do not need to contain `module.json` files, and any nested `module.json` files that happen to be present along the way are ignored. | ||
|
|
||
| If the manifest's `exclude` field matches the path that resolution would otherwise read—either the entrypoint file for a root module import or `<sub-path>.wdl` for a sub-path import—the engine must treat the import as unresolvable and surface a missing-file error that names the path. The file's presence on disk is irrelevant; an excluded path is not part of the module's public import surface. Excluding the entrypoint therefore makes root module imports of that dependency fail while leaving non-excluded files reachable by sub-path. Quoted imports inside the module (e.g., the entrypoint's own `import "helper.wdl"`) are file-relative and are unaffected by `exclude`. |
There was a problem hiding this comment.
this feels like a weakly enforceable security feature. if the file exists they can just address it directly instead of resolving via the module
This introduces a module system to WDL in two parts: a new peer specification at
modules/SPEC.mdcovering the ecosystem (manifest, resolution, lockfile, hashing, signing, credentials, registry), and language-level additions inSPEC.md(thefromkeyword and two new import forms).The module specification defines:
module.json, the manifest format, including core fields, upstreamtoolsprovenance, and adependenciesobject supportingversion,tag,branch,commit, and localpathselectors.module-lock.json, a lockfile that pins the fully resolved tree with commit SHAs and content checksums.<dep-name>[/<sub-path>]) map to modules, including tag-to-manifest version consistency and Go-style path-prefixed tags for multi-module repositories.module.sig, with trust-on-first-use recorded in the lockfile.openwdl.github.io/registrythat aggregates metadata without becoming a resolution dependency.The language changes in
SPEC.md:fromas a keyword in WDL 1.4 documents. Earlierversiondeclarations continue to parsefromas an ordinary identifier.import <source> [as <alias>] (alias <Old> as <New>)*, the existing namespaced form.import * from <source>, which brings every task, workflow, and user-defined type from<source>into the importing document's scope, with no namespace.import { <member> [as <Name>], ... } from <source>, which brings only the listed items, with an optional per-memberas <Name>rename, also with no namespace.<source>is either a quoted URI (resolved per Import URIs) or an unquoted symbolic module path of the form<dep>[/<sub-path>](resolved through the consuming module'smodule.json). Once resolved, the two styles produce identical scoping in every form.JSON Schemas for both
module.jsonandmodule-lock.jsonship undermodules/schemas/and validate the in-spec examples.Full context: RFC discussion #700.
Before submitting this PR, please make sure:
README.mdor other documentation needs updating to account for these changes.CHANGELOG.mddescribing the change and linking back to your pull request.CONTRIBUTING.mddocument.For OpenWDL team members:
Closes #226.
Closes #758.