feat(linkedin): add recommended jobs adapter with GraphQL pagination support #51
Open
RickSanchez88E wants to merge 71 commits into
Adds `linkedin recommended` adapter for crawling LinkedIn JYMBII algorithm recommended jobs via GraphQL API. Supports automatic pagination, Easy Apply detection via footerItems EASY_APPLY_TEXT, workplace type parsing, and unlimited mode (--limit 0). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…quest signatures, pagination, and test commands
Local LLM (qwen3) → structured JSON → Supabase pipeline:
- 5-module Python pipeline: config, preprocess, LLM, db, orchestrator
- Grammar-constrained generation via llama.cpp json_schema
- 3-attempt retry at temp=0: standard → repair → minimal
- Atomic claim/upsert via Supabase RPC functions
- Stale processing reaper, dead-letter queue, extraction_runs tracking
- Per-run report: console summary + failed-jobs detail + JSON report
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
linkedin recommended --limit 0 --with_jd triggers long-running commands that scroll the full job list and fetch descriptions for each, which can exceed the previous 30-second HTTP timeout. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Add clean_linkedin_jobs.py pipeline that extracts URLs from multiple fields, normalizes URLs, validates LinkedIn records (require easy_apply or external_url), and maps apply_url/source_channel/apply_type correctly. Includes:
- clean_linkedin_jobs.py: HTML cleaning, URL extraction cascade, salary parsing, batch dedup, dead letter queue
- sync_autocli_jobs.py: Supabase RPC upsert with source_channel/apply_type
- 23 unit tests with TDD (clean + sync + validation + URL mapping)
- 5 migrations: schema, url_hash, source_channel/apply_type, drop url_hash unique constraint, old data cleanup
- daemon health check wait in main.rs
bad_count invariant: 776 -> 0 (after cleanup + pipeline fix)
Chrome debugger can detach mid-command on SPA pages (e.g. LinkedIn), returning "Detached while handling command". This error was not in the retry list, causing the extension to give up immediately instead of re-attaching and retrying. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extension `WINDOW_IDLE_TIMEOUT` (30s) would fire during evaluate steps that run longer than the timeout (e.g. --limit 0 fetching all LinkedIn recommended jobs). Added activeCommands counter per workspace so the idle timer only starts when no commands are in-flight. Added `scripts/autocli-baseline.sh` with 8 pre-flight checks (autocli binary, Chrome process, daemon, extension, LinkedIn reachability, DNS, output dir, disk space) with structured timestamped logging and --json output. Includes 13-test suite at `scripts/test_baseline.sh`.
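The in-flight counter described above can be sketched as follows. The extension itself is JavaScript; this is only a Python illustration of the gating idea, and the class and method names are hypothetical rather than taken from the extension source.

```python
class WorkspaceIdleGate:
    """Tracks in-flight commands for one workspace (hypothetical name)."""

    def __init__(self) -> None:
        self.active_commands = 0

    def command_started(self) -> None:
        # A command is now in flight, so the idle timer must not run.
        self.active_commands += 1

    def command_finished(self) -> bool:
        """Return True when no commands remain and the idle timer may be armed."""
        self.active_commands = max(0, self.active_commands - 1)
        return self.active_commands == 0
```

With this shape, a long evaluate step keeps the counter above zero, so the idle timeout would only begin counting once the last command returns.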
`check_extension_freshness` compares dist/background.js mtime against a refresh marker file (.baseline-last-refresh). On first run (no marker) it warns; when dist is newer than last refresh it fails with a clear hint to use --refresh-extension. `--refresh-extension` uses browser-harness CDP to navigate to chrome://extensions, find the AutoCLI card, and click its reload button, then updates the marker. Test suite now has 15 tests covering all freshness scenarios.
sync_autocli_jobs.py looked for "apply_type" key in raw records, but LinkedIn raw data uses "easy_apply". Records from this pipeline were silently defaulted to apply_type='unknown'. Added a fallback check for the "easy_apply" field to correctly classify LinkedIn easy-apply jobs. Also ran a SQL migration to fix 271 existing rows that were affected.
…pply_url
When the same Workday (ATS) job arrives with different LinkedIn apply_url shapes, the identity_hash now uses a canonical ATS URL rather than the raw apply_url. New _extract_canonical_job_url() prefers ATS external_urls over LinkedIn referrer URLs, and _canonicalize_url() normalizes scheme/host case, strips trailing slashes, and removes tracking params (utm_*, source, share_id, gh_src, lever-source, etc.). LinkedIn URLs are preserved as metadata on the apply_url field without affecting identity.
--dry-run report now includes canonical_distinct_jobs count and duplicate groups grouped by identity_hash.
22 tests covering URL helpers, canonicalization, and the Ameresco regression case where same Workday URL produces same identity_hash regardless of LinkedIn apply_url presence.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
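A minimal sketch of the canonicalization rules listed in this commit message, assuming a standard-library implementation. The function name `canonicalize_url`, the exact parameter set, and the choice to drop the URL fragment are assumptions for illustration; the real `_canonicalize_url()` may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed subset: the commit names utm_*, source, share_id, gh_src,
# lever-source "etc.", so the real list is likely longer.
TRACKING_PARAMS = {"source", "share_id", "gh_src", "lever-source"}


def canonicalize_url(url: str) -> str:
    """Lower-case scheme/host, strip trailing slashes, drop tracking params."""
    parts = urlsplit(url.strip())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
        and key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        urlencode(kept),
        "",  # fragment dropped (assumption, not stated in the commit)
    ))
```

Two differently decorated links to the same posting should then reduce to one string, which is the property an identity hash over the canonical URL relies on.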
Create scripts/job_priority_config.py with all configuration constants, regex patterns, and keyword sets for the deterministic job priority scoring system. Contains no scoring logic -- only configuration to be imported by the scorer, sync pipeline, backfill scripts, and tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pure, deterministic scoring engine for AutoCLI jobs with 8 components: compensation, role fit, seniority, work arrangement, application path, freshness, data completeness, and source quality. Includes penalty system, hard-reject guard, and tier mapping (high/medium/low/reject). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
REPEATED_PUNCT_RE used {2,} which matches 3+ total consecutive punctuation
chars (e.g. "!!!" -> "!"). Changed to {1,} so 2+ consecutive chars are
collapsed (e.g. "!!" -> "!", "!!!" -> "!").
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
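The quantifier change can be illustrated with a small sketch. It assumes the pattern has the shape "one captured punctuation character followed by a back-reference with a quantifier", which is consistent with `{2,}` needing three characters in total; the character class here is hypothetical.

```python
import re

# Assumed shape: capture one punctuation char, then require repeats of it.
OLD_RE = re.compile(r"([!?.,])\1{2,}")  # 1 + at least 2 repeats = 3+ chars
NEW_RE = re.compile(r"([!?.,])\1{1,}")  # 1 + at least 1 repeat  = 2+ chars


def collapse(pattern: re.Pattern, text: str) -> str:
    """Replace each run matched by the pattern with a single character."""
    return pattern.sub(r"\1", text)


print(collapse(OLD_RE, "Wow!!"))   # unchanged: the old pattern skips 2-char runs
print(collapse(NEW_RE, "Wow!!"))   # collapsed to a single "!"
print(collapse(NEW_RE, "Wow!!!"))  # collapsed to a single "!"
```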
- Import score_job in sync_autocli_jobs.py and call it per-record
- Pass ScoreResult fields (priority_score, priority_tier, priority_version, priority_signals) to upsert_job RPC
- Add --disable-scoring flag for testing
- Report priority score distribution in dry-run mode
- Add comprehensive test suite (104 tests across 14 classes) covering all 8 scoring components, penalties, hard-reject guard, edge cases, and integration scenarios
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Migration 20260509182000: add priority scoring columns to jobs.jobs table (priority_score, priority_tier, priority_version, priority_signals, priority_scored_at)
- Migration 20260509184000: add update_job_priority_score RPC that only touches scoring fields (not the full row), with schema-scoped and public wrappers
- scripts/backfill_priority_scores.py: batch backfill script with --force, --limit, --dry-run, --env-file options; reconstructs job_data from raw_record or DB columns; reports per-row scores, tiers, and errors
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Codex <noreply@openai.com>
- Rename priority_version column to priority_scorer_version in both migrations
- Add 'unknown' to priority_tier check constraint
- Fix indices to include last_seen_at desc and priority_score desc per spec
- Add --min-priority-score and --priority-tier CLI flags for optional filtering
- Enhance dry-run with top_priority_jobs, low_priority_count, priority_tiers
- Add source-quality summary (recruiter/aggregator/raw-jd-fallback counts)
- Update backfill RPC param name to match column rename
Design covers:
- 6-container stack: chrome (Stagehand), daily (cron+FastAPI), cloudflared, prometheus, grafana
- Cloudflare Tunnel + Access for public exposure of /vnc /cdp /api /jobs /grafana
- GHCR + Watchtower pull-based deploy
- Phased acceptance criteria with verification commands
Worktree: feat/daily-microservice (branched from main).
Critical fixes:
- Add prereq section for autocli BrowserBridge CDP-wiring patch
- Fix cargo build to use package name 'autocli' (was '-cli')
- Switch /jobs to client.schema('jobs').table('jobs') API
- Use /json/list + page target (was /json/version, browser-level)
- Rewrite ws host localhost->autocli-chrome:9222
- Standardize on SUPABASE_SERVICE_ROLE_KEY
- Make API_RUN_TOKEN actually enforced + tested in Phase 4
- Add machine-verifiable Cloudflare Access gate before cdp ingress
High-severity fixes:
- Feature branches publish :branch-*+:sha-* only; :main from main
- Pin cloudflared/prometheus/grafana to specific semver
- Switch Cloudflare Tunnel to --token mode (no config.yml mix)
- Replace path routes with 5 subdomains (avoids prefix-strip)
- Split Access into two policies per Application (Token OR Email)
- Drop Grafana Infinity plugin dependency
- VNC password generated random in prod (no 'stagehand' default)
- shred only temp copy of operator secrets, never the source
- Unify retry to 3-attempts/15-60-240s across code+runbook+metrics
- Add explicit restart: unless-stopped to autocli-daily
- Specify Prometheus metrics_path: /api/metrics
- Unified CI build context = repo root for both Dockerfiles
- Note GHCR creds already configured on target host
Bugs:
- L103: component table referenced stale /metrics path -> /api/metrics
- L209/L236: github.ref_name with '/' produces invalid Docker tags; switch to docker/metadata-action's type=ref,event=branch which slugifies
- L321: /json/new requires PUT, not POST (Chrome >= M86)
- L354: jobs.autocli/ routed to backend root but /jobs is the actual route; drop the jobs subdomain entirely, serve via api.autocli/jobs (4 subdomains)
- L473: Phase 0 build context disagreed with CI; unify on repo root
- L522: Phase 4 step 2 implied Service Token works on vnc/grafana where no machine policy exists; split per-subdomain expectations
- L526: Phase 4 probed cdp.autocli before the spec said cdp ingress was added; split Phase 4 into 4a (pre-CDP gate) / 4b (add cdp ingress) / 4c (cdp probes)
- L549: Phase 5 status call missing Bearer
Risks:
- L486: Phase 1 status call missing Bearer; added
Also:
- Fix '6 services' / '6 new containers' counts; actual count is 5
- Update §2.2 boundaries note from /json/version to /json/list + PUT /json/new
Find or create a CDP page target on autocli-chrome:9222.
- GET /json/list, pick first type:page
- if list is empty, PUT /json/new?about:blank (Chrome >= M86)
- rewrite host (localhost:9223 -> autocli-chrome:9222) so the WS URL is reachable from the daily container's network namespace
- write to /run/cdp-endpoint.env (sourced by run-daily.sh)
- 60s retry budget; exit 1 on timeout (entrypoint exits non-zero, restart: unless-stopped recreates container until chrome ready).
- flock -n to prevent cron + /api/run from colliding
- per-attempt cdp-discover refresh (page id may have rotated)
- runs autocli linkedin recommended -> JSON -> sync_autocli_jobs.py
- unified retry: 3 attempts at 15s/60s/240s (SPEC §5.2)
- writes /data/output/last_run.json consumed by /api/status.
Boot-time cdp-discover gate, then runs supercronic + uvicorn in parallel under tini. wait -n exits as soon as either child dies, so compose's restart policy can pick up failure modes (e.g. uvicorn panic, supercronic crash).
03:00 daily LinkedIn pull + 04:00 30-day output retention sweep (SPEC §5.2). TZ resolved by the container's TZ=Europe/London.
After rebase onto local main, scripts/job_priority_scorer.py and scripts/job_priority_config.py are present. sync_autocli_jobs.py imports them at runtime, so the daily image must ship all three.
uv-managed; pins fastapi/uvicorn/supabase/prometheus-client/httpx to compatible ranges. Lockfile checked in so the Dockerfile's 'uv sync --frozen' is reproducible.
Used by POST /api/run to spawn run-daily.sh non-blockingly. is_running() is a non-destructive flock probe so /api/status can report in_progress without affecting the actual run.
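A minimal sketch of a flock-based probe like the `is_running()` described above, assuming Linux and the standard `fcntl` module. The signature taking a `lock_path` argument is hypothetical. The probe briefly takes the lock itself when nobody holds it and releases it immediately, so it never blocks or kills a run, though a run starting with `flock -n` in that instant could in principle lose the race.

```python
import fcntl
import os


def is_running(lock_path: str) -> bool:
    """Return True if another open file description holds the lock."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking attempt: fails at once if the run holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        # Nobody held it; give it back straight away.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```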
Routes per SPEC §5.1:
GET /api/health [open] chrome reachability + cdp file probe
GET /api/metrics [open] Prometheus exposition (delta-aware counters)
GET /api/status [Bearer] last_run.json + in_progress
POST /api/run [Bearer] spawn run-daily.sh, 409 if already running
GET /api/logs [Bearer] tail of latest log (default 200 lines)
GET /jobs [Bearer] Supabase 'jobs.jobs' read proxy via
client.schema('jobs').table('jobs').
Import style B: 'import trigger' (flat), because entrypoint.sh does
'cd /app/api && uvicorn main:app' — no package context, flat import works.
9 tests covering:
- /api/status, /api/run, /api/logs, /jobs all return 401 without Bearer and 401 with wrong Bearer
- /api/status default-shape + reflects last_run.json
- /api/metrics is open and contains the autocli_daily_ family
- /api/health returns 503 when chrome:9222 unreachable.
conftest.py adds deploy/daily/api to sys.path (flat import, matching entrypoint.sh's 'cd /app/api && uvicorn main:app' invocation). Prometheus registry is cleared before each fresh module import to avoid duplicate-timeseries errors across test fixtures.
Single job scraping autocli-daily:8080/api/metrics every 15s. metrics_path is required because FastAPI mounts under /api/*.
- Datasource: Prometheus at prometheus:9090 (uid prom-autocli)
- Dashboard provider points at /etc/grafana/provisioning/dashboards
- autocli.json: time-since-last-run, last exit code, rows-upserted-today, CDP-up %, daily scraped/upserted/skipped time series, duration
- No plugin dependencies (Infinity dropped per L313 review).
5 services on shared autocli-net bridge:
- autocli-chrome (Stagehand, watchtower-tracked, healthcheck on 9222)
- autocli-daily (cron+FastAPI, watchtower-tracked, depends_on chrome healthy, env scoped to Supabase creds only)
- cloudflared (Tunnel token mode, depends_on daily healthy)
- prometheus (pinned, 90-day retention)
- grafana (pinned, anon disabled, signup disabled, admin from env)
Named volumes for profile / output / tsdb / grafana state.
Binds host ports under non-conflicting numbers (6081/5902/9223/8081/ 9091/3001) so the operator can keep their existing local Chrome and Grafana running alongside. cloudflared moved to a 'disabled' profile.
All required environment variables with empty values + inline generator hints. Real .env never committed (.gitignore already covers it under '.env').
Quickstart, Cloudflare dashboard checklist, forced-run snippet, common-failure table. Points back at SPEC + PLAN for the why.
3 jobs:
1. build-autocli-binary: cargo build --release -p autocli on ubuntu-latest (linux/amd64) with Swatinem cache; uploads artifact
2. build-chrome-image: builds deploy/chrome from repo-root context; docker/metadata-action generates :main on main, :branch-<slug> on feature branches, :sha-<short> always
3. build-daily-image: downloads the autocli artifact, builds deploy/daily from repo-root context, same tag policy
Path filters include rust-toolchain.toml so a toolchain bump triggers a rebuild.
The placeholder value was wrong (build failed with 'computed checksum did NOT match'). Verified by downloading the GitHub release asset and computing sha1sum from the operator's laptop.
CI builds the binary as a separate job and uploads as artifact; Phase 0 locally rebuilds inside a Docker rust container and writes to deploy/daily/bin/. Never commit this file (it's ~8MB).
rick-ubuntu-ssh tunnel's running replica is 2026.3.0 (per Zero Trust dashboard). Our container joins as a 2nd HA replica; matching the connector version avoids mixed-version edge cases.
Prod host (100.108.80.9) already has a process bound to :5900, so the 5900:5900 mapping failed container networking. Native VNC is only a local convenience and is NOT part of the Cloudflare ingress; noVNC on 6080 (+ vnc.autocli route) is the real access path. Container still listens on 5900 internally for websockify -> noVNC.
Chrome DevTools rejects /json* and /devtools Host headers that aren't an IP or localhost. Reaching autocli-chrome by docker service name failed with 'Host header is specified and is not an IP address or localhost'.
- cdp-discover.sh: resolve CHROME_HOST -> container IP (getent, python fallback); use the IP for the /json probe AND the rewritten ws:// URL so every Host header Chrome sees is an IP. Re-resolved each run.
- main.py /api/health: send Host: localhost on the liveness probe (yes/no check, body unused).
Found during Phase 3 server bring-up; daily container was crash-looping on 'chrome unreachable after 60s' despite DNS + same-network OK.
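The resolve-then-rewrite step can be sketched in Python as below. The actual fix lives in the shell script cdp-discover.sh; the function name `rewrite_ws_host`, the injectable `resolve` parameter, and the sample IP in the usage are hypothetical and only illustrate the idea.

```python
import socket
from urllib.parse import urlsplit, urlunsplit


def rewrite_ws_host(ws_url: str, chrome_host: str, chrome_port: int,
                    resolve=socket.gethostbyname) -> str:
    """Swap the host:port of a DevTools ws:// URL for the resolved IP."""
    ip = resolve(chrome_host)  # re-resolved on every call
    parts = urlsplit(ws_url)
    return urlunsplit((
        parts.scheme,
        f"{ip}:{chrome_port}",  # Chrome accepts an IP in the Host header
        parts.path,
        parts.query,
        parts.fragment,
    ))


# Usage with a stubbed resolver (172.18.0.2 is a made-up container IP):
print(rewrite_ws_host("ws://localhost:9223/devtools/page/ABC",
                      "autocli-chrome", 9222,
                      resolve=lambda host: "172.18.0.2"))
```

Passing the resolver in as a parameter is only a convenience for testing without a Docker network; in the container the default lookup would resolve the service name.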
Free Cloudflare zones get Universal SSL covering only <zone> + one-level
*.<zone>. Two-level subdomains like vnc.autocli.<zone> handshake-fail
('Unauthorized' / sslv3 alert) until the operator upgrades to Pro,
Total TLS, or ACM.
Rename across SPEC / PLAN / README:
vnc.autocli.<zone> -> autocli-vnc.<zone>
cdp.autocli.<zone> -> autocli-cdp.<zone>
api.autocli.<zone> -> autocli-api.<zone>
grafana.autocli.<zone> -> autocli-grafana.<zone>
§9 risk #4 now documents the Free-plan SSL constraint as the reason for
the flat naming.
Host ubuntu-latest gives GLIBC 2.39 binaries that fail to load in the daily runtime image (Debian Bookworm = GLIBC 2.36) with 'GLIBC_2.39 not found'. Pin build container to rust:1.94-slim-bookworm so binary GLIBC requirements match runtime. Also adds a readelf-based check that fails the build if the binary's max GLIBC requirement exceeds 2.36.
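The comparison behind such a check can be sketched as follows. The CI check itself is readelf-based; this Python version only illustrates the version comparison, assumes the symbol-version text has already been captured (for example from `readelf -V`), and uses hypothetical function names.

```python
import re

RUNTIME_GLIBC = (2, 36)  # Debian Bookworm, per the commit message


def max_glibc(readelf_output: str):
    """Return the highest GLIBC_x.y requirement as an (x, y) tuple, or None."""
    versions = [
        (int(major), int(minor))
        for major, minor in re.findall(r"GLIBC_(\d+)\.(\d+)", readelf_output)
    ]
    return max(versions, default=None)


def exceeds_runtime(readelf_output: str, limit=RUNTIME_GLIBC) -> bool:
    """True when the binary needs a newer glibc than the runtime image ships."""
    found = max_glibc(readelf_output)
    return found is not None and found > limit
```

Comparing integer tuples rather than strings matters here, because a plain string comparison would rank "2.4" above "2.36".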
`source /run/cdp-endpoint.env` only sets a shell variable; without
export, the autocli child process never sees AUTOCLI_CDP_ENDPOINT and
falls through to BrowserBridge's daemon path
("Chrome is not running"). Wrap source with `set -a`/`set +a` so the
assignment auto-exports as an env var that survives across fork/exec.
sync_autocli_jobs.py pretty-prints its summary with indent=2:
{
"input_rows": 573,
"upserted": 573,
...
}
The old run-daily.sh did 'grep "^{" log | tail -1' which matched only
the opening '{' line, yielding invalid JSON. Subsequent jq parses
failed silently, --argjson got empty values, the final jq -n -> dev/null
overwrote LAST_RUN_JSON with an empty file.
Fix: redirect sync stdout to /tmp/sync-DATE-N.json, also append to
log, then jq parses the captured JSON directly. Status now correctly
reflects rows_scraped/upserted/skipped from each run.
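The failure mode and the fix can be reproduced in a few lines. This is a Python illustration of the shell behaviour, not the script itself; the helper names are hypothetical and the sample summary is trimmed to the two fields shown above.

```python
import json

SUMMARY = '{\n  "input_rows": 573,\n  "upserted": 573\n}\n'
LOG_TEXT = "starting sync\n" + SUMMARY


def last_brace_line(log_text: str) -> str:
    """Mimic grep '^{' log | tail -1: last line that starts with '{'."""
    matches = [line for line in log_text.splitlines() if line.startswith("{")]
    return matches[-1] if matches else ""


def parse_summary(captured_stdout: str) -> dict:
    """The fix: parse the whole captured stdout as one JSON document."""
    return json.loads(captured_stdout)


# Old approach: only the opening brace survives, which is not valid JSON.
try:
    json.loads(last_brace_line(LOG_TEXT))
except json.JSONDecodeError:
    print("old approach yields invalid JSON")

print(parse_summary(SUMMARY)["upserted"])
```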
When run-daily.sh did 'exec 9>LOCK; flock 9' and then invoked autocli, bash's FD 9 was inherited by the autocli process by default. If autocli took the daemon-path fallback (pre-env-export fix; or any future code path that spawns a daemon), the detached 'autocli --daemon' child inherited FD 9 too and held the lock for its lifetime. is_running() then returned True forever, breaking /api/status. Add '9>&-' to the autocli and uv invocations so children can't see or hold the lock. Verified by /proc/<pid>/fd inspection in production.
cdp.rs (item 1): IPage::close was sending Browser.close, which kills the
SHARED Chrome in CDP-direct mode (and every other consumer attached to
it). Made it a no-op with explanation. Callers that need per-page
cleanup should send Target.closeTarget directly.
entrypoint-vnc.sh (item 2): -nopw was overriding -rfbauth and leaving
VNC open with no password. Anyone reaching :5900/6080 (via Tailscale
or any leaked path) could drive the logged-in browser. Removed the
flag; password auth from /root/.vnc/passwd is now enforced.
docker-compose.yml (item 3 + defense-in-depth on 6080): bound both
6080 and 9222 host ports to 127.0.0.1 only. Public path is Cloudflare
Tunnel + Access; direct host-port access would bypass every auth layer.
Backup: 'ssh -L 6080:localhost:6080' from a Tailscale-connected box.
backfill_priority_scores.py (items 5 + 6): client.table('jobs.jobs')
queried a literal 'jobs.jobs' name in public schema (always 0 rows);
fixed to client.schema('jobs').table('jobs'). Filter also moved from
priority_score.is.null (already NOT NULL DEFAULT 0 post-migration, so
matches nothing) to priority_scored_at.is.null (the only honest 'never
scored' signal).
crontab + Dockerfile + .env.example (items 8 + 9): CRON_SCHEDULE and
OUTPUT_RETENTION_DAYS env vars were placebos — supercronic reads
/etc/cron.d/autocli verbatim and does not env-substitute. Dropped the
misleading env knobs from compose / Dockerfile / .env.example and added
a comment in crontab explaining the contract.
NOT addressed in this commit:
- Item 4 (migration upsert priority overwrite) — needs a follow-up
migration; pre-existing in main.
- Item 7 (/jobs schema) — empirically returns 500 rows with a loose
filter; PostgREST DOES expose the jobs schema in this project. The
reviewer's hypothesis was incorrect for this Supabase config. Pushing
back on this one with evidence.
- Items 10, 11 — pre-existing sync_autocli_jobs.py issues from main;
worth a separate cleanup PR.
Items 1, 2, 3 from PR review #4466756456:
1) New migration 20260516120000_fix_priority_upsert_data_loss.sql: recreates jobs.upsert_job so the ON CONFLICT DO UPDATE branches on the function PARAMETER (p_priority_score IS NOT NULL) instead of excluded.priority_score (which the INSERT body had already coerced from NULL to 0, making the case-when always true and silently zeroing prior scores). Same correction for priority_tier / scorer_version / signals / scored_at. Applied to production via Supabase MCP; verified success: True.
2) New migration 20260516120100_enable_jobs_jobs_rls.sql + GRANT migration: turns on RLS on jobs.jobs with a select-only policy for anon/authenticated, grants USAGE on the jobs schema and SELECT on the table to those roles. Server .env now uses the real anon JWT for SUPABASE_ANON_KEY (sync writes still use SUPABASE_SERVICE_ROLE_KEY which bypasses RLS). Combined with Cloudflare Access + Bearer this gives defence in depth.
3) /jobs endpoint now filters on created_at (database insert time) instead of post_time (LinkedIn original posting date, almost always older than today for fresh scrapes). Doc string updated; created_at added to the SELECT projection so clients can see it. Verified by direct REST against PostgREST + by python-in-container test (3 rows returned for since=today).
Companion to 20260516120100. RLS policies don't grant SELECT; PostgREST also needs the role to have USAGE on the schema and SELECT on the table. Already applied to production via Supabase MCP but the file was missing from the PR — without it a fresh project provisioning from these migrations would have count=0 on /jobs until the GRANT was applied manually.
feat: daily LinkedIn microservice + autocli CDP wiring + supporting fixes
Description
Adds a new `linkedin recommended` command that crawls LinkedIn's personalized job recommendation feed (JYMBII algorithm, at `/jobs/collections/recommended/`). Unlike the existing `linkedin search` adapter (REST Voyager API), this endpoint uses GraphQL (`/voyager/api/graphql`) and requires a browser session.
File: `adapters/linkedin/recommended.yaml`
Technical Details
- `voyagerJobsDashJobCards.*` (version-hashed, discovered dynamically via Performance API)
- `strategy: header` with CSRF token extracted from `JSESSIONID` cookie
- `start` offset
- `--limit 0` crawls until no more items (`limit > 0 ? limit - fetched : BATCH` loop)
- `footerItems[].type === "EASY_APPLY_TEXT"` (not `easyApplyUrl` which doesn't exist in this API)
- `secondaryDescription.text` parentheses, e.g. `"London (Hybrid)"` → `workplace_type: "Hybrid"`
Output Columns
`rank`, `title`, `company`, `location`, `workplace_type`, `salary`, `posted_time`, `applicant_count`, `easy_apply`, `url`
Usage
How to Test
Prerequisites: Chrome must be open with LinkedIn signed in, and the AutoCLI Chrome extension must be installed.
Known Quirks / Pitfalls
- …`:`) and parentheses to remain raw (not URL-encoded) in GraphQL variables. Full `encodeURIComponent` causes HTTP 400. The adapter uses a partial-encode-then-decode approach.
- …`totalCount` field. `--limit 0` fetches incrementally until the server returns an empty batch.
- `applicant_count`: Unlike the REST search API, this GraphQL endpoint's `jobPostingCard` doesn't include applicant count. Column is preserved but always returns `"N/A"`.
- …`easyApplyUrl` field: Easy Apply detection uses `footerItems` type, verified via 200-job crawl with ~30% Easy Apply rate.
- …`performance.getEntriesByType('resource')`, so no hardcoded ID to maintain.
- …`On-site`/`Hybrid`/`Remote` and strips it from the location field.
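The workplace-type handling in the last item can be sketched as below. The adapter itself is a YAML/JavaScript definition; this Python version only mirrors the described behaviour, and the function name and the `None` result for locations without a suffix are assumptions.

```python
import re

# Matches a trailing "(On-site)", "(Hybrid)" or "(Remote)" suffix.
WORKPLACE_RE = re.compile(r"\s*\((On-site|Hybrid|Remote)\)\s*$")


def split_location(text: str):
    """Return (location, workplace_type) with the suffix stripped."""
    match = WORKPLACE_RE.search(text)
    if match is None:
        return text.strip(), None  # assumption: no suffix means unknown
    return text[:match.start()].strip(), match.group(1)


print(split_location("London (Hybrid)"))
print(split_location("Berlin"))
```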