128 changes: 128 additions & 0 deletions .claude/skills/audit-spark-commit/SKILL.md
@@ -0,0 +1,128 @@
---
name: audit-spark-commit
description: Audit a single Apache Spark commit to determine whether it impacts DataFusion Comet. Reads the contributor guide for the rubric, fetches the commit, proposes a verdict, and updates dev/spark-commit-audit.md after the user reviews.
argument-hint: <commit-hash>
---

Audit Apache Spark commit `$ARGUMENTS` for impact on DataFusion Comet.

The full process and rubric live in
`docs/source/contributor-guide/spark_commit_audit.md`. Read that page first
so the rubric is in context. The steps below are a thin orchestration of
the per-commit workflow.

## Inputs

A single Spark commit hash, short or full. PR numbers are not accepted; the
caller must resolve them to a commit hash first.

## Steps

### 1. Read the contributor guide

Read `docs/source/contributor-guide/spark_commit_audit.md` start to finish.
The "Rubric" section is the source of truth for the verdict. The "Comet
scope reference" table tells you which subsystems Comet currently
implements.

### 2. Fetch the Spark commit

Ensure `apache/spark` is cloned to a cache dir:

```bash
SPARK_DIR="/tmp/spark-audit-clone"
if [ ! -d "$SPARK_DIR" ]; then
  git clone https://github.com/apache/spark.git "$SPARK_DIR"
else
  git -C "$SPARK_DIR" fetch origin master
fi
```
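
If disk space or clone time is a concern, a partial clone also works (a
sketch; `--filter=blob:none` defers blob downloads until files are read):

```bash
git clone --filter=blob:none https://github.com/apache/spark.git "$SPARK_DIR"
```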

Then read the commit:

```bash
git -C "$SPARK_DIR" show --stat "$ARGUMENTS"
```

For deeper investigation, read the changed files directly with `git -C "$SPARK_DIR" show "$ARGUMENTS:<path>"` or by checking out the commit in the cache dir.

If a SPARK JIRA or GitHub PR is referenced in the commit message, fetch
that for additional context using `gh` if available.
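
A minimal sketch for pulling PR context, assuming the commit message
references the PR as `#NNNNN` (Spark merge commits usually carry one):

```bash
# Extract the first PR reference from the commit message, if any.
PR=$(git -C "$SPARK_DIR" log -1 --format=%B "$ARGUMENTS" | grep -oE '#[0-9]+' | head -n 1 | tr -d '#')
if [ -n "$PR" ]; then
  gh pr view "$PR" --repo apache/spark
fi
```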

### 3. Confirm scope

Confirm the commit touches `sql/` and is not entirely under `sql/connect/`
or `sql/hive-thriftserver/`. If it is out of scope, propose `not-relevant`
with a one-line note explaining why and proceed to step 5.
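
One way to check, as a sketch (no output means the commit is out of scope):

```bash
# List changed paths, keep sql/ files, drop the excluded subtrees.
git -C "$SPARK_DIR" show --name-only --format= "$ARGUMENTS" \
  | grep '^sql/' \
  | grep -vE '^sql/(connect|hive-thriftserver)/'
```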

### 4. Apply the rubric

Walk the "Relevant" trigger list and the "Not relevant" bucket list from
the contributor guide. Cross-reference the affected subsystem against the
"Comet scope reference" table.

Propose one of:

- `relevant`: the commit affects a subsystem Comet emulates.
- `not-relevant`: the commit does not affect Comet.
- `unclear`: the rubric cannot determine impact without more research.

Compose a one-sentence prose note that explains the verdict (e.g. "Adds a
new ANSI overflow check in the `Add` expression that Comet does not
currently match"). Keep notes concise; the whole entry is a single bullet.

### 5. Update the audit log

Locate the existing `[needs-triage]` line for this commit in
`dev/spark-commit-audit.md`. Match by short hash.
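
A sketch for locating the entry by its short hash:

```bash
SHORT=$(git -C "$SPARK_DIR" rev-parse --short=8 "$ARGUMENTS")
grep -n "\`$SHORT\`" dev/spark-commit-audit.md
```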

If the line is missing, abort and tell the user:

> The commit `$ARGUMENTS` is not in `dev/spark-commit-audit.md`. Re-run
> `python dev/regenerate-spark-audit.py` from the release virtualenv to
> pick up new commits, then invoke this skill again.

Do not append the line; the bootstrap script is the single source of
truth for membership in the queue.

If the line is present, propose the updated line to the user and wait for
approval or edits. The format is:

```
- `<short-hash>` <date> [<state>] <subject>. <prose-note>[. comet#<NNNN>]
```
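
For example, a completed entry might look like this (hypothetical verdict,
note, and issue number):

```
- `84d9c842` 2026-05-02 [relevant] [SPARK-56686][FOLLOWUP][SQL] Mark CDC streaming rewrite via attribute metadata. Comet's scan does not propagate the new attribute metadata. comet#1234
```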

On approval, replace the line in place using the Edit tool. Do not commit
and do not push.

### 6. Offer to handle the Comet tracking issue

If the verdict is `relevant`, ask the user:

> This commit is `relevant`. How would you like to handle the Comet
> tracking issue?
>
> - **(a)** Draft the issue body to a local markdown file under
> `/tmp/comet-audit-issue-<hash>.md` for review.
> - **(b)** File a Comet GitHub issue immediately via `gh issue create`,
> after I show you the title and body.
> - **(c)** Skip; I will handle it later.

For **(b)**, show the title and body and confirm before running `gh`. If
the user confirms, run `gh issue create --repo apache/datafusion-comet
--title "..." --body-file <file>` and report the resulting issue URL,
then offer to add `comet#NNNN` to the audit log line.
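
For **(a)**, one possible shape for the drafted body (a sketch; the exact
sections are up to the auditor):

```
## Spark change

apache/spark@<full-hash>: <commit subject>

## Impact on Comet

<one paragraph: the behavior change and the Comet subsystem it affects>

## Suggested next step

<e.g. add a Spark-parity test for the new behavior, then port the fix>
```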

## What this skill does NOT do

- Resolve PR numbers to commits.
- Audit more than one commit per invocation.
- Append a new line if the commit is missing from the log (it tells the
user to re-run the bootstrap script instead).
- Commit, push, or open Comet PRs.

## Tone and style

- Keep prose notes to one sentence.
- Use backticks around code references.
- Avoid em dashes; use periods or restructure.
222 changes: 222 additions & 0 deletions dev/regenerate-spark-audit.py
@@ -0,0 +1,222 @@
#!/usr/bin/env python3

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Regenerate dev/spark-commit-audit.md.

Enumerates commits on Apache Spark master since branch-4.2 was cut that
touch the sql/ subtree (excluding sql/connect and sql/hive-thriftserver),
and writes them into the marker block of dev/spark-commit-audit.md.

Idempotent: existing verdicts and prose notes are preserved by short hash.
"""

from __future__ import annotations

import argparse
import os
import re
import sys


BEGIN_MARKER = "<!-- BEGIN AUDIT LIST -->"
END_MARKER = "<!-- END AUDIT LIST -->"

LINE_RE = re.compile(r"^- `([0-9a-f]{8})`")


def parse_existing_block(body: str) -> dict[str, str]:
    """Return a map of short-hash to full line for every entry in the block."""
    if BEGIN_MARKER not in body or END_MARKER not in body:
        raise ValueError(
            f"audit log is missing one or both markers: {BEGIN_MARKER!r}, {END_MARKER!r}"
        )
    start = body.index(BEGIN_MARKER) + len(BEGIN_MARKER)
    end = body.index(END_MARKER)
    block = body[start:end]
    result: dict[str, str] = {}
    for line in block.splitlines():
        line = line.rstrip()
        m = LINE_RE.match(line)
        if m:
            result[m.group(1)] = line
    return result


MAX_SUBJECT_LEN = 200


def format_new_line(*, short_hash: str, date: str, subject: str) -> str:
    """Format a fresh [needs-triage] entry."""
    if len(subject) > MAX_SUBJECT_LEN:
        subject = subject[: MAX_SUBJECT_LEN - 3] + "..."
    return f"- `{short_hash}` {date} [needs-triage] {subject}"


def is_in_scope(file_paths: list[str]) -> bool:
    """True when the commit touches sql/ outside of connect/thriftserver."""
    for path in file_paths:
        if not path.startswith("sql/"):
            continue
        if path.startswith("sql/connect/"):
            continue
        if path.startswith("sql/hive-thriftserver/"):
            continue
        return True
    return False


def merge_lines(commits: list[dict], existing: dict[str, str]) -> list[str]:
    """Merge a chronological commit list with existing audit lines.

    Existing entries are emitted verbatim. Commits not in the existing map
    get a fresh [needs-triage] line. Output preserves the order of `commits`.
    """
    result: list[str] = []
    for commit in commits:
        short = commit["short"]
        if short in existing:
            result.append(existing[short])
        else:
            result.append(
                format_new_line(
                    short_hash=short,
                    date=commit["date"],
                    subject=commit["subject"],
                )
            )
    return result


def replace_block(body: str, lines: list[str]) -> str:
    """Return body with the marker block replaced by the given lines."""
    if BEGIN_MARKER not in body or END_MARKER not in body:
        raise ValueError("audit log file is missing marker comments")
    before, _, rest = body.partition(BEGIN_MARKER)
    _, _, after = rest.partition(END_MARKER)
    block_body = "\n".join(lines)
    if block_body:
        block_body = "\n" + block_body + "\n"
    else:
        block_body = "\n"
    return before + BEGIN_MARKER + block_body + END_MARKER + after


def enumerate_spark_commits(token: str, limit: int | None = None) -> list[dict]:
    """Enumerate in-scope Spark master commits since branch-4.2 was cut.

    Returns a list of {short, date, subject} dicts in chronological order.
    """
    # Local import keeps the script importable without PyGithub installed.
    from github import Github

    gh = Github(token)
    repo = gh.get_repo("apache/spark")

    # Resolve the branch-4.2 cut point as the merge base of master and branch-4.2.
    print("Resolving branch-4.2 merge base...", file=sys.stderr)
    cmp = repo.compare("branch-4.2", "master")
    base_sha = cmp.merge_base_commit.sha
    base_date = cmp.merge_base_commit.commit.author.date
    print(f"merge base: {base_sha}", file=sys.stderr)

    # Walk master commits that touch sql/, newest first. The path filter can
    # skip the merge-base commit itself, so bound the walk by the merge-base
    # commit date as well as its sha.
    print("Listing sql/ commits on master...", file=sys.stderr)
    paginated = repo.get_commits(sha="master", path="sql/", since=base_date)

    candidates: list = []
    seen = 0
    for c in paginated:
        if c.sha == base_sha:
            break
        seen += 1
        if limit is not None and seen > limit:
            break
        candidates.append(c)

    print(f"fetched {len(candidates)} candidate commits", file=sys.stderr)

    # Filter by in-scope file paths. This costs one extra API call per commit.
    out: list[dict] = []
    for i, c in enumerate(candidates):
        if i % 50 == 0:
            print(f"filtering {i}/{len(candidates)}...", file=sys.stderr)
        files = [f.filename for f in c.files]
        if not is_in_scope(files):
            continue
        date_str = c.commit.author.date.strftime("%Y-%m-%d")
        subject = c.commit.message.split("\n", 1)[0]
        out.append(
            {
                "short": c.sha[:8],
                "date": date_str,
                "subject": subject,
            }
        )

    # Reverse to chronological order (oldest first).
    out.reverse()
    return out


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="print the updated file to stdout instead of writing it",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="only consider the most recent N candidate commits (for testing)",
    )
    parser.add_argument(
        "--audit-log",
        default=os.path.join(
            os.path.dirname(os.path.abspath(__file__)), "spark-commit-audit.md"
        ),
        help="path to the audit log file (default: dev/spark-commit-audit.md)",
    )
    args = parser.parse_args()

    token = os.environ.get("GITHUB_TOKEN")
    if not token:
        print("GITHUB_TOKEN environment variable is required", file=sys.stderr)
        return 2

    with open(args.audit_log, "r", encoding="utf-8") as f:
        body = f.read()
    existing = parse_existing_block(body)
    print(f"existing entries: {len(existing)}", file=sys.stderr)

    commits = enumerate_spark_commits(token, limit=args.limit)
    print(f"in-scope commits: {len(commits)}", file=sys.stderr)

    merged = merge_lines(commits, existing)
    new_body = replace_block(body, merged)

    if args.dry_run:
        sys.stdout.write(new_body)
    else:
        with open(args.audit_log, "w", encoding="utf-8") as f:
            f.write(new_body)
        print(f"wrote {args.audit_log} ({len(merged)} entries)", file=sys.stderr)
    return 0


if __name__ == "__main__":
    sys.exit(main())
36 changes: 36 additions & 0 deletions dev/spark-commit-audit.md
@@ -0,0 +1,36 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Spark Commit Audit Log

Tracks Apache Spark `master` commits since the `branch-4.2` cut that touch
the `sql/` subtree, recording whether each one impacts DataFusion Comet.
See `docs/source/contributor-guide/spark_commit_audit.md` for the rubric
and process.

This file is regenerated and incrementally updated by
`dev/regenerate-spark-audit.py`. Existing verdicts and prose notes are
preserved on re-run.
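
Typical invocations, assuming a `GITHUB_TOKEN` with public repository read
access:

```bash
# Preview the regenerated file on stdout without writing it.
GITHUB_TOKEN=<token> python dev/regenerate-spark-audit.py --dry-run

# While testing, only consider the most recent 100 candidate commits.
GITHUB_TOKEN=<token> python dev/regenerate-spark-audit.py --limit 100
```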

## Commits

<!-- BEGIN AUDIT LIST -->
- `84d9c842` 2026-05-02 [needs-triage] [SPARK-56686][FOLLOWUP][SQL] Mark CDC streaming rewrite via attribute metadata
- `ae5c075a` 2026-05-04 [needs-triage] [SPARK-56711][SQL] Restrict CDC `_commit_version` column to LongType or StringType
<!-- END AUDIT LIST -->