Rework the Python worker process spec (#378)#423
Conversation
Address review findings on the piercefreeman#378 rework (all behavior-preserving): - Re-log the resolved PYTHONPATH: once at prepare() time (with working_dir and whether the bundled package root or cwd was used) and on the per-spawn line, restoring the operability signal dropped in the rework. - Derive Clone + Debug on PreparedConfig and Spec for parity with Config and to allow debug-logging resolved launch values. - Document that the environment is snapshotted at prepare() and not re-read on worker recycle, and that a prepare_blocking panic is surfaced as PrepareError::Join rather than resumed.
- Extract the payload-wait + timeout-to-SpawnError mapping into a Timeouts::await_payload method; spawn() now calls it directly. - Simplify is_executable_file to a match with an is_file guard. - Pull the unix exec-bit test into a named is_executable helper with a named EXECUTABLE_BITS constant (no std is_executable exists).
Generalize the lessons from the piercefreeman#378 worker-python rework into the AGENTS.md code_feedback guide, per its Code Review process: keep blocking IO off the async hot path (materialize once via prepare()/spawn_blocking), and don't panic on infallible trait paths (return a typed error from an async constructor instead).
|
Thanks! I'll review this is a couple days, a bit busy currently. |
MOZGIII
left a comment
There was a problem hiding this comment.
A but preliminary, as I didn't look too deep, but nonetheless some feedback. It seems like this PR also changes timeouts - I suggest we restrict this PR to python worker refactors only, as to avoid deliberation on the timeouts structure and etc; if possible of course.
|
yes that sounds good to me, thanks for the clarification |
Reviewer (MOZGIII, piercefreeman#423) asked to restrict this PR to python worker refactors only and defer the timeout-structure rework: "this PR also changes timeouts - I suggest we restrict this PR to python worker refactors only, as to avoid deliberation on the timeouts structure". - Revert worker-process and worker-process-pool back to main (the ShutdownParams->Timeouts rename and await_payload extraction are gone). - worker-python keeps its refactor (PreparedConfig caching, panic-free prepare_spawn_params, symlink/exec-aware find_executable) but no longer owns a timeouts field; prepare_spawn_params emits the existing hardcoded SpawnParams/ShutdownParams shape, identical to main.
intendednull
left a comment
There was a problem hiding this comment.
@claude looks like the config revert clobbered a few lines. lets make sure the reversion is clean, no trace left behind
- config.rs: revert cosmetic trailing-period doc edits on the three unchanged Config fields (diff clutter, no semantic change). - lib.rs: trim the Spec doc to what it is and rely on the PreparedConfig link; restore the timeout block (comment + Duration::from_secs values) to match main byte-for-byte; re-import Duration instead of fully qualifying. - AGENTS.md: reword the panic-free rule per review wording.
intendednull
left a comment
There was a problem hiding this comment.
A small organizational nit. Please mark resolved when done
config.rs is now a pure structural definition of Config and its builders. The resolution work — default_runner and the PATH executable-lookup family (find_executable/is_executable_file/ is_executable) plus their tests — moves next to its only consumer, Config::prepare in prepared.rs. Addresses the two organizational review comments on PR piercefreeman#423.
prepared.rs is an extension of the config definition, not a peer module. Nest it as config::prepared using the project's file+folder layout (src/config.rs + src/config/prepared.rs) and re-export PreparedConfig/PrepareError through config. Crate-root public API is unchanged. Addresses PR piercefreeman#423 review comment.
main accepted any is_file(); this crate added the 0o111 exec-bit check. Assert is_executable_file rejects the crate's own (non-exec) Cargo.toml so that fix is covered. Seam-free, no temp file.
|
@intendednull Thanks for the PR! Waymark's still a really new project and we haven't yet formalized our contributor or AI guidelines yet. Will circle back to this ticket once we get that nailed. |
Both variants carried the cause only via #[source], so Display-only call sites (smoke's eprintln) printed the context line with no underlying io::Error or JoinError detail. Sibling worker crates (managed-process, worker-process, worker-remote-bringup) all embed the source in the message; match that.
The comment referenced `main`'s old is_file()-only behavior, which stops being true the moment this branch merges into main. State the invariant under test instead.
Worker launch config was resolved by Config::prepare(), welding host probing onto the declarative config type, and each of the three bins repeated the same prepare-then-wire dance. Split the two concerns: - Resolution is now a free `resolve(Config) -> SpecFactory` whose probe runs once under spawn_blocking; Config is inert declarative data. - `SpecFactory` (was `PreparedConfig`) is named for its job: `build(addr)` mints a `Spec` once the late-bound bridge address is known. `Spec` fields are private, constructible only via the factory. - New `worker-python-bringup` crate owns the compose; each bin collapses to a single `start(...)` call. `worker-remote-bringup` stays generic and python-free. `PrepareError` -> `ResolveError`; `config/prepared.rs` moves to top-level `resolve.rs` (resolution is no longer an extension of config). AGENTS.md rule examples updated to the new names.
intendednull
left a comment
There was a problem hiding this comment.
@claude minor adjustments, please fix
| let factory = &self.factory; | ||
|
|
||
| tracing::info!(python_path = %python_path, ?reservation_id, "configured python path for worker"); | ||
|
|
||
| // Build the command | ||
| let mut command = tokio::process::Command::new(&self.config.script_path); | ||
| command.args(&self.config.script_args); | ||
| let mut command = tokio::process::Command::new(&factory.program); | ||
| command.args(&factory.args); | ||
| command | ||
| .arg("--bridge") | ||
| .arg(self.bridge_server_addr.to_string()) | ||
| .arg("--worker-id") | ||
| .arg(reservation_id.to_string()); | ||
|
|
||
| // Add user modules | ||
| for module in &self.config.user_modules { | ||
| for module in &factory.user_modules { | ||
| command.arg("--user-module").arg(module); | ||
| } | ||
|
|
||
| command.env("PYTHONPATH", python_path); | ||
|
|
||
| if let Some(dir) = working_dir { | ||
| tracing::info!(?dir, "using package root for worker process"); | ||
| command.current_dir(dir); | ||
| } else { | ||
| // TODO: move this fallible initialization outside of this impl. | ||
| let cwd = std::env::current_dir().expect("failed to resolve current directory"); | ||
| tracing::info!( | ||
| ?cwd, | ||
| "package root missing, using current directory for worker process" | ||
| ); | ||
| command.current_dir(cwd); | ||
| } | ||
| command.env("PYTHONPATH", &factory.python_path); | ||
| command.current_dir(&factory.working_dir); | ||
|
|
||
| tracing::info!( | ||
| ?reservation_id, | ||
| working_dir = %factory.working_dir.display(), | ||
| python_path = %factory.python_path, | ||
| "prepared python worker spawn params" | ||
| ); |
There was a problem hiding this comment.
Here we are doing SpecFactory -> Command. Lets factor this out. It could either be a standalone fn, or maybe it would make more sense to attach to SpecFactory. Lets follow existing patterns here, whatever practice is favored in this codebase. I'm leaning towards standalone, so lets just do that for now if unclear.
|
|
||
| /// The blocking half of [`resolve`]. Must only be called from a blocking | ||
| /// context (e.g. inside `spawn_blocking`). | ||
| fn probe(config: Config) -> Result<SpecFactory, ResolveError> { |
There was a problem hiding this comment.
| fn probe(config: Config) -> Result<SpecFactory, ResolveError> { | |
| fn resolve_blocking(config: Config) -> Result<SpecFactory, ResolveError> { |
Per PR piercefreeman#423 review on the resolve/SpecFactory rework: - Rename the module file resolve.rs -> spec_factory.rs (named for the type it defines, not the verb). - Rename the private blocking half probe() -> resolve_blocking() to pair with the async resolve(). - Extract the SpecFactory -> Command assembly out of prepare_spawn_params into a standalone build_command(); the spawn log stays in prepare_spawn_params so build_command is a pure builder.
The pre-rework Spec exposed its fields publicly; keep that visibility rather than sealing construction behind a pub(crate) constructor. Drop Spec::new and have SpecFactory::build construct the struct directly.
With Spec's fields public again, build() was a trivial constructor that
direct struct construction expresses more clearly (the house style; the
pre-rework code built Spec the same way). The bringup closure now
constructs `Spec { bridge_server_addr, factory }` directly, and
SpecFactory is pure resolved data with no methods.
Polish only the docs this PR added or changed; pre-existing docs carried
from main are left untouched.
- resolve(): add the `# Errors` section the fallible-fn docs use.
- is_executable_file/is_executable: imperative mood, matching the file.
- build_command: plain intra-doc link form for the Command type.
- worker-python-bringup: add a module doc, document StartError + its
variants, and capitalize Python in the start() doc.
- config.rs script_path: intra-doc-link resolve.
- AGENTS.md: the runtime-IO rule cited var("PATH"); the code uses
var_os("PATH").
[WIP] will mark when ready for final review, but any feedback is appreciated
Moves all of the fallible I/O from config resolution to happen before worker spawn
Fixes #378