feat: initial CI/release scaffolding + Phase 0 bug fixes (vsock/serial/timer)#2

Closed
tolgakaratas wants to merge 8 commits into master from feat/01-baseline-and-bug-fixes

@tolgakaratas

Summary

Foundation layer (8 commits): the initial GitHub Actions CI + release pipeline, README + LICENSE + dev tooling scaffold, and the three production bug fixes that closed out Phase 0.

Groups

Initial scaffolding (5 commits)

  • 9c467c6 fix: code quality, musl compatibility, cross-platform module gating
  • 72e8325 build: optimize release profile and project metadata
  • d0ec713 ci: add GitHub Actions CI/CD with release automation
  • 1bc4b37 docs: add project documentation and README badges
  • f4c9ef8 chore: add development tooling and release configuration

Phase 0 bug fixes (3 commits)

  • f62c673 fix(vsock): unblock guest agent loop after PID 1 fork
  • ddecd5c fix(serial): tee guest output per-byte + replay on attach
  • fec217a fix(timer): re-enable kvmclock + tsc-deadline on fresh vCPUs

Test plan

  • make ci passes locally (fmt + clippy + test + deny + audit)
  • CI workflows green on this branch's HEAD
  • E2E pass (covered by later PRs in stack)

Stack

This is PR 1 of 8 in the CI/CD overhaul stack. Each subsequent PR builds on this one. Merge order: 1 → 2 → … → 8.

9c467c6 fix: code quality, musl compatibility, cross-platform module gating

Source code changes (no CI/infrastructure):
- Cross-platform module gating: storage/virtio keep tests portable,
  Linux-only modules gated with cfg(target_os = "linux")
- Shared compat module (IoctlReq, SendPthreadT) for glibc/musl differences
- All clippy lints resolved via cargo fix + cargo clippy --fix on Rust 1.95
- musl static build compatibility: SYS_renameat2 raw syscall, platform-
  correct ioctl types, Send wrapper for pthread_t
- Fix _host_offset naming bug in balloon inflate (compile error on Linux)
- Platform-conditional cast for libc::S_IFMT (u16 macOS, u32 Linux)
- dead_code allow on modules with forward-declared upstream API
- rustfmt applied with max_width=120

Verified: 0 clippy errors on Linux (rust:1.95) and macOS, 266+188 tests pass.
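
A minimal sketch of what a shared compat module along these lines could look like, using only std. The names IoctlReq and SendPthreadT come from the commit message; RawPthread, the linux_only module, and the exact type choices are assumptions for illustration (the actual code presumably goes through the libc crate, where pthread_t is a raw pointer on musl and therefore not Send).

```rust
/// Request argument type for ioctl(2): `int` on musl, `unsigned long` on glibc.
#[cfg(target_env = "musl")]
pub type IoctlReq = std::os::raw::c_int;
#[cfg(not(target_env = "musl"))]
pub type IoctlReq = std::os::raw::c_ulong;

/// Hypothetical stand-in for pthread_t: a pointer on musl, modelled as an
/// integer everywhere else.
#[cfg(target_env = "musl")]
pub type RawPthread = *mut std::ffi::c_void;
#[cfg(not(target_env = "musl"))]
pub type RawPthread = std::os::raw::c_ulong;

/// Wrapper that lets a thread handle cross thread boundaries on every libc.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SendPthreadT(pub RawPthread);

// SAFETY (assumed): the handle is an opaque token that is only passed back to
// pthread_* functions and never dereferenced from Rust.
unsafe impl Send for SendPthreadT {}

/// Linux-only code sits behind a cfg gate so that portable modules still
/// build and run their tests on other hosts.
#[cfg(target_os = "linux")]
pub mod linux_only {
    /// KVM_CREATE_VM request number, typed per-libc through `IoctlReq`.
    pub const KVM_CREATE_VM: super::IoctlReq = 0xAE01;
}
```

With the alias in place, call sites can write the request constant once and let the target's libc decide its width, instead of sprinkling casts through the ioctl wrappers.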

72e8325 build: optimize release profile and project metadata

- profile.release: LTO fat, codegen-units=1, panic=abort, strip=true
- Cargo.toml: homepage, repository, keywords, MSRV 1.87
- Workspace members: add rust-version = "1.87"
- rustfmt.toml: max_width=120 matching original codebase style
- .editorconfig: consistent settings across editors
- Makefile: add shift-left targets (make ci, make fix, make lint)
- .gitignore: add VM artifact patterns (*.img, *.qcow2)

d0ec713 ci: add GitHub Actions CI/CD with release automation

Workflows:
- build.yml: fmt, clippy, musl static build+test, MSRV 1.87 check,
  cargo-deny, security audit (with smart change detection)
- release-please.yml: conventional commits to automated release PRs
- release.yml: x86_64+aarch64 musl static binaries, SHA256 checksums,
  cosign keyless signing, SLSA attestation, SBOM (SPDX)
- security-scan.yml: weekly cargo audit, cargo deny, CodeQL Rust
- dependabot.yml: weekly cargo+actions updates with semantic grouping
- dependabot-auto-merge.yml: auto-squash-merge patch/minor updates

Templates:
- Issue templates (bug report, feature request)
- Pull request template with checklist

1bc4b37 docs: add project documentation and README badges

- SECURITY.md: vulnerability reporting via GitHub private advisories
- CONTRIBUTING.md: setup, shift-left local CI (make ci), pre-commit
  hooks, conventional commits, code style guide
- CHANGELOG.md: initial file for release-please automation
- README.md: CI status, license, and MSRV badges

f4c9ef8 chore: add development tooling and release configuration

- mise: rust + cargo-binstall + pre-commit; setup/ci tasks
- pre-commit: cargo autofix on commit, test+deny on push
- deny.toml: license allowlist (MIT/Apache/BSD/ISC), advisory checks
- release-please: Rust release type, version sync, changelog sections

f62c673 fix(vsock): unblock guest agent loop after PID 1 fork

clone-init forks the agent off PID 1 of the initrd before exec'ing
systemd. Inside that fork-descendant, every blocking sleep call
(usleep/nanosleep/std::thread::sleep) never wakes — the kernel timer
state for the child is wedged. The pre-execve usleep(50_000) killed
the child mid-sleep, and the agent's heartbeat loop wedged on its
first SO_RCVTIMEO recv after sending Ready.

- crates/clone-init/src/main.rs: drop the pre-execve usleep; child
  setsid + execve immediately so the kernel doesn't park it.
- crates/guest-agent/src/main.rs: replace every blocking sleep with
  libc::sched_yield() loops; mark the vsock fd O_NONBLOCK and use
  MSG_PEEK + MSG_DONTWAIT for recv pacing.
- src/virtio/vsock.rs: log every TX op so heartbeat-cadence regressions
  are visible in the VMM stderr stream.
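
The sleep-free recv pacing described above can be sketched roughly as follows. This is an illustration under assumptions, not the agent's code: a std TcpStream stands in for the vsock fd (std has no vsock type), set_nonblocking plays the role of O_NONBLOCK, peek is std's wrapper around a MSG_PEEK recv, and std::thread::yield_now stands in for libc::sched_yield. The helper name wait_readable is hypothetical.

```rust
use std::io;
use std::net::TcpStream;
use std::time::{Duration, Instant};

/// Spin until at least one byte is readable on `sock` or `timeout` elapses,
/// without ever entering a blocking sleep.
/// Returns Ok(true) when data is pending and Ok(false) on timeout.
pub fn wait_readable(sock: &TcpStream, timeout: Duration) -> io::Result<bool> {
    // Equivalent of marking the fd O_NONBLOCK: peek/read return WouldBlock
    // instead of parking the thread.
    sock.set_nonblocking(true)?;
    let deadline = Instant::now() + timeout;
    let mut probe = [0u8; 1];
    loop {
        // peek leaves the byte queued, so the real read still sees it.
        match sock.peek(&mut probe) {
            Ok(0) => return Err(io::ErrorKind::UnexpectedEof.into()),
            Ok(_) => return Ok(true),
            Err(e) => match e.kind() {
                io::ErrorKind::WouldBlock | io::ErrorKind::Interrupted => {}
                _ => return Err(e),
            },
        }
        if Instant::now() >= deadline {
            return Ok(false);
        }
        // Give up the CPU slice without arming a kernel timer.
        std::thread::yield_now();
    }
}
```

The trade-off is a busy loop: the task stays runnable and burns CPU while waiting, which is only reasonable here because any timer-backed sleep in the forked child never returns.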

ddecd5c fix(serial): tee guest output per-byte + replay on attach

Serial::write buffered guest stdout until \n or 256 bytes, so
no-trailing-newline payloads (notably the `clone login: ` prompt
agetty prints and then sleeps in ppoll) never reached the
/tmp/clone-{pid}.console socket. `clone attach` showed nothing.

- src/vmm/serial.rs: tee every byte to console_fd immediately;
  retain an 8 KiB rolling history of recent output.
- src/vmm/mod.rs: on console-client attach, replay the history before
  registering the live tee fd, so a late `clone attach` still sees
  the boot banner and login prompt that were already printed.
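
A rough sketch of the per-byte tee plus replay-on-attach behaviour, assuming a generic std::io::Write client in place of the console socket fd. ConsoleTee and its methods are hypothetical names, not the types in src/vmm/serial.rs; the real history size would be 8 KiB.

```rust
use std::collections::VecDeque;
use std::io::{self, Write};

/// Forwards every guest byte immediately and keeps a bounded history
/// so that a late client can be caught up.
pub struct ConsoleTee<W: Write> {
    history: VecDeque<u8>,
    capacity: usize,
    client: Option<W>,
}

impl<W: Write> ConsoleTee<W> {
    pub fn new(capacity: usize) -> Self {
        Self { history: VecDeque::with_capacity(capacity), capacity, client: None }
    }

    /// Record one byte and forward it right away; no waiting for '\n'.
    pub fn write_byte(&mut self, byte: u8) {
        self.history.push_back(byte);
        if self.history.len() > self.capacity {
            self.history.pop_front(); // drop the oldest byte
        }
        let failed = match self.client.as_mut() {
            Some(c) => c.write_all(&[byte]).is_err(),
            None => false,
        };
        if failed {
            self.client = None; // client went away; keep recording history
        }
    }

    /// Replay the history first, then register the client for live output.
    pub fn attach(&mut self, mut client: W) -> io::Result<()> {
        let (head, tail) = self.history.as_slices();
        client.write_all(head)?;
        client.write_all(tail)?;
        self.client = Some(client);
        Ok(())
    }

    /// Unregister the current client, if any, and hand it back.
    pub fn detach(&mut self) -> Option<W> {
        self.client.take()
    }
}
```

Replaying before registering matters for ordering: the client sees the buffered prompt first and only then the live stream, so a prompt with no trailing newline is delivered exactly once.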

fec217a fix(timer): re-enable kvmclock + tsc-deadline on fresh vCPUs

Vcpu::new masked off TSC-deadline (CPUID.1.ECX[24]) and the kvmclock
feature bits (CPUID.0x40000001.EAX[0,3,24]) to dodge a fork/restore
bug where MSR_KVM_SYSTEM_TIME_NEW didn't round-trip through GET_MSRS.
Cost: the guest fell back to TSC calibration via PIT/HPET, the
in-kernel irqchip under-delivered ticks on idle APs, and idle CPUs
received ~zero LOC interrupts. systemd then wedged in
synchronize_rcu_normal because the grace period waits for every CPU
to pass through a quiescent state, which a tickless idle AP never does.

- src/vmm/vcpu.rs: keep both TSC-deadline and kvmclock bits in fresh
  CPUID. Pin TSC frequency via set_tsc_khz(get_tsc_khz()) so the
  guest doesn't have to calibrate against PIT/HPET. Fork path
  (from_template) keeps its existing snapshot-aware MSR handling.
- src/main.rs: drop the rcupdate.rcu_expedited=1 +
  rcu_normal_after_boot=0 cmdline workaround now that the underlying
  timer path is fixed.

Verified on Ubuntu rootfs, 2 vCPUs, 1 GB RAM:
  before: LOC cpu0=102 cpu1=0 over 17 min, clocksource=tsc-early,
          systemd in D-state on synchronize_rcu_normal
  after:  LOC cpu0=17936 cpu1=1030 over ~1 min, clocksource=tsc
          (kvm-clock available), systemd S-state and reaches login.
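
For reference, the CPUID bits involved can be expressed as below. This is a sketch with assumed names: CpuidEntry is a simplified stand-in for KVM's CPUID entry struct, mask_timer_features models the old masking that this commit removes for fresh vCPUs, and timer_features_visible is a hypothetical check one might use in a test. Only the leaf numbers and bit positions are taken from the commit message.

```rust
/// Simplified, hypothetical stand-in for one KVM CPUID entry.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CpuidEntry {
    pub function: u32,
    pub eax: u32,
    pub ebx: u32,
    pub ecx: u32,
    pub edx: u32,
}

pub const LEAF_FEATURES: u32 = 0x1;
pub const LEAF_KVM_FEATURES: u32 = 0x4000_0001;
/// CPUID.1.ECX[24]: TSC-deadline mode for the local APIC timer.
pub const TSC_DEADLINE: u32 = 1 << 24;
/// CPUID.0x40000001.EAX[0,3,24]: the kvmclock feature bits.
pub const KVMCLOCK_BITS: u32 = (1 << 0) | (1 << 3) | (1 << 24);

/// Old behaviour: hide both timer features from the guest.
pub fn mask_timer_features(entries: &mut [CpuidEntry]) {
    for e in entries.iter_mut() {
        match e.function {
            LEAF_FEATURES => e.ecx &= !TSC_DEADLINE,
            LEAF_KVM_FEATURES => e.eax &= !KVMCLOCK_BITS,
            _ => {}
        }
    }
}

/// True when the guest would see TSC-deadline and all kvmclock bits.
pub fn timer_features_visible(entries: &[CpuidEntry]) -> bool {
    let tsc_deadline = entries
        .iter()
        .any(|e| e.function == LEAF_FEATURES && e.ecx & TSC_DEADLINE != 0);
    let kvmclock = entries
        .iter()
        .any(|e| e.function == LEAF_KVM_FEATURES && e.eax & KVMCLOCK_BITS == KVMCLOCK_BITS);
    tsc_deadline && kvmclock
}
```

After the fix, a fresh vCPU's CPUID would be expected to pass a check like timer_features_visible, while the fork path keeps its own snapshot-aware handling.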
@tolgakaratas
Author

Closing — wrong target repo. Will re-open against unixshells/clone:master once upstream PR strategy is finalized.
