94 changes: 2 additions & 92 deletions .github/workflows/benchmark.yml
@@ -24,90 +24,6 @@ jobs:
           enable-cache: true
           cache-dependency-glob: "requirements.in"

-      - name: Set up Python
-        run: |
-          # Automatically detect compatible Python version by testing dependency installation
-          # Try default latest Python first, then downgrade up to 2 times if needed
-
-          echo "Attempting to find compatible Python version..."
-
-          # Try installing default/latest Python
-          uv python install
-
-          # Get the initial version that was installed
-          test_version=$(uv python list | grep -E 'cpython-[0-9]+\.[0-9]+' | grep -v 'pypy' | head -n1 | awk '{print $1}' | sed 's/cpython-//' | cut -d'-' -f1)
-
-          for attempt in 1 2 3; do
-            echo ""
-            echo "Attempt $attempt:"
-            echo "Testing Python $test_version..."
-
-            # Create a temporary virtual environment with the specific Python version
-            # Use attempt number to avoid path collisions
-            test_venv="/tmp/test-venv-attempt-$attempt-$$"
-            uv venv --python "$test_version" "$test_venv" 2>&1
-            source "$test_venv/bin/activate"
-
-            # Try to compile requirements (capture output separately for error reporting)
-            compile_output=$(mktemp)
-            success=false
-            if uv pip compile requirements.in > /tmp/requirements-$attempt.txt 2>"$compile_output"; then
-              # Compile succeeded, now try to sync
-              if uv pip sync /tmp/requirements-$attempt.txt 2>&1; then
-                echo "✓ Python $test_version is compatible with all frameworks"
-                echo "PYTHON_VERSION=$test_version" >> $GITHUB_ENV
-                success=true
-              else
-                echo "✗ Python $test_version: pip sync failed"
-              fi
-            else
-              echo "✗ Python $test_version: dependency compilation failed"
-            fi
-
-            # Show the error for debugging if failed
-            if [ "$success" = false ]; then
-              echo "Error output:"
-              cat "$compile_output" 2>/dev/null || true
-            fi
-
-            # Clean up test venv
-            deactivate 2>/dev/null || true
-            rm -rf "$test_venv"
-            rm -f "$compile_output"
-
-            # If successful, we're done
-            if [ "$success" = true ]; then
-              break
-            fi
-
-            # If not the last attempt, try downgrading
-            if [ $attempt -lt 3 ]; then
-              # Parse version and downgrade
-              major=$(echo "$test_version" | cut -d. -f1)
-              minor=$(echo "$test_version" | cut -d. -f2)
-
-              # Downgrade minor version
-              new_minor=$((minor - 1))
-
-              if [ $new_minor -ge 10 ]; then
-                test_version="$major.$new_minor"
-                echo "Downgrading to Python $test_version..."
-
-                # Install downgraded version
-                uv python install "$test_version"
-              else
-                echo "Cannot downgrade further (reached Python 3.10)"
-                exit 1
-              fi
-            else
-              echo "Error: No compatible Python version found after 3 attempts"
-              exit 1
-            fi
-          done
-
-          echo ""
-          echo "Using Python $PYTHON_VERSION"
-
       - name: Install system dependencies
         run: |
           sudo apt-get update
@@ -118,14 +34,8 @@ jobs:
           # Make run.sh executable
           chmod +x run.sh

-          # Set Python version for run.sh to use
-          # If PYTHON_VERSION was set by the setup step, pin it
-          if [ -n "$PYTHON_VERSION" ]; then
-            export UV_PYTHON="$PYTHON_VERSION"
-            echo "Using Python $PYTHON_VERSION for benchmark suite"
-          fi
-
           # Run the benchmark suite
+          # It will install the latest compatible Python and dependencies
           bash run.sh
         env:
           # Set environment variable to indicate GitHub Actions
@@ -219,4 +129,4 @@ jobs:
             echo "::endgroup::"
           else
             echo "Summary statistics file not found"
-          fi
+          fi
51 changes: 51 additions & 0 deletions 00_tools.py
@@ -5,6 +5,49 @@
 from typing import Any, Callable, Dict


+def normalize_result(result: Any) -> Any:
+    """
+    Normalize the result DataFrame to ensure consistent format across frameworks.
+    - Resets index (moving index to columns)
+    - Flattens MultiIndex columns
+    """
+    # Check if it looks like a pandas/fireducks DataFrame
+    if (
+        hasattr(result, "index")
+        and hasattr(result, "columns")
+        and hasattr(result, "reset_index")
+    ):
+        # 1. Reset index if it's not a RangeIndex
+        # This moves grouping keys from index to columns
+        is_range_index = False
+        # Check for RangeIndex (has start/stop/step attributes)
+        if (
+            hasattr(result.index, "start")
+            and hasattr(result.index, "stop")
+            and hasattr(result.index, "step")
+        ):
+            is_range_index = True
+
+        if not is_range_index:
+            result = result.reset_index()
+
+    # 2. Flatten MultiIndex columns
+    # Check if columns is a MultiIndex (has nlevels > 1)
+    if hasattr(result.columns, "nlevels") and result.columns.nlevels > 1:
+        new_columns = []
+        for col in result.columns.values:
+            if isinstance(col, tuple):
+                # Join non-empty parts with underscore
+                # E.g. ('total_amount', 'sum') -> 'total_amount_sum'
+                name = "_".join([str(c) for c in col if str(c) != ""]).strip("_")
+                new_columns.append(name)
+            else:
+                new_columns.append(str(col))
+        result.columns = new_columns
Comment on lines +36 to +46


⚠️ Potential issue | 🟡 Minor

Mutating input DataFrame's columns in-place.

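Line 46 assigns directly to result.columns. When the index is already a RangeIndex, reset_index() is skipped and result is still the caller's object, so this assignment mutates the input DataFrame. A caller holding a reference to the original DataFrame will then see its column names change unexpectedly.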

Proposed fix: operate on a copy
     if hasattr(result.columns, "nlevels") and result.columns.nlevels > 1:
+        result = result.copy()
         new_columns = []
         for col in result.columns.values:
             if isinstance(col, tuple):
🤖 Prompt for AI Agents
In `@00_tools.py` around lines 36 - 46, The code mutates the caller's DataFrame by
assigning to result.columns; instead, make and modify a copy to avoid side
effects: before building/assigning new_columns, call result = result.copy() (or
create a new variable like new_result = result.copy()), compute new_columns from
result.columns as shown, assign the names to the copy (new_result.columns =
new_columns) and use/return that copy rather than mutating the original; keep
the existing tuple-flattening logic and only change the assignment to operate on
the copied DataFrame.
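
The aliasing problem described in this comment can be reproduced without pandas. The sketch below is illustrative only: `FakeFrame`, `flatten_name`, `flatten_in_place`, and `flatten_on_copy` are hypothetical names that do not appear in the PR, and `FakeFrame` is a stand-in that models just `columns` and `copy()`. With a real pandas DataFrame, `DataFrame.copy()` would play the same role. The `("key", "")` label is made-up sample data.

```python
from typing import Any, List


class FakeFrame:
    """Hypothetical stand-in for a DataFrame; models only `columns` and `copy()`."""

    def __init__(self, columns: List[Any]) -> None:
        self.columns = list(columns)

    def copy(self) -> "FakeFrame":
        # A new object with its own column list, analogous to DataFrame.copy().
        return FakeFrame(self.columns)


def flatten_name(col: Any) -> str:
    """Join the non-empty parts of a tuple column label with underscores."""
    if isinstance(col, tuple):
        return "_".join(str(c) for c in col if str(c) != "").strip("_")
    return str(col)


def flatten_in_place(frame: FakeFrame) -> FakeFrame:
    # Rebinds `columns` on the caller's object: the side effect under review.
    frame.columns = [flatten_name(c) for c in frame.columns]
    return frame


def flatten_on_copy(frame: FakeFrame) -> FakeFrame:
    # Copy first, so the caller's object keeps its original labels.
    out = frame.copy()
    out.columns = [flatten_name(c) for c in out.columns]
    return out


labels = [("key", ""), ("total_amount", "sum")]

original = FakeFrame(labels)
flat = flatten_on_copy(original)
print(original.columns)  # still the tuple labels
print(flat.columns)      # flattened names

aliased = FakeFrame(labels)
flatten_in_place(aliased)
print(aliased.columns)   # the caller's labels have been overwritten
```

Copying a real DataFrame costs an extra allocation, so placing the copy inside the branch that reassigns columns, as the proposed fix does, keeps that cost away from results that need no flattening.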


+    return result
+
+
 def time_operation(
     operation_name: str,
     df_lib: Any,
@@ -49,6 +92,14 @@ def time_operation(
     if hasattr(result, "to_frame"):
         result = result.to_frame(name=operation_name)

+    # Normalize result (reset index, flatten columns) before saving
+    # This ensures consistency between Pandas (which uses Index/MultiIndex)
+    # and Polars (which uses flat DataFrames)
+    try:
+        result = normalize_result(result)
+    except Exception as e:
+        print(f"Warning: Failed to normalize result for {operation_name}: {e}")
Comment on lines +98 to +101


medium

Using a broad except Exception: can mask underlying bugs in the normalize_result function. If an unexpected error occurs, it's only printed as a warning, which might be missed. This makes debugging harder. Consider catching more specific exceptions (like AttributeError, ValueError for pandas/polars operations) or logging the full traceback to provide more context when an error does happen.
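
One way to act on this suggestion is sketched below: catch only the exception types the comment names and log the full traceback. `normalize_or_passthrough`, its `normalize` parameter, `failing_normalize`, the logger name, and the `"groupby_sum"` operation name are all hypothetical, introduced so the sketch is self-contained. Limiting the handler to `AttributeError` and `ValueError` follows the comment and is an assumption about what normalization can raise.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("benchmark")  # logger name is an assumption


def normalize_or_passthrough(
    result: Any, operation_name: str, normalize: Callable[[Any], Any]
) -> Any:
    """Apply `normalize`; on an expected failure, log it and return `result` unchanged."""
    try:
        return normalize(result)
    except (AttributeError, ValueError):
        # logger.exception logs at ERROR level and appends the active traceback.
        logger.exception("Failed to normalize result for %s", operation_name)
        return result


def failing_normalize(_: Any) -> Any:
    # Simulates an expected normalization failure.
    raise ValueError("cannot flatten columns")


# The failure is logged with its traceback and the input is passed through.
print(normalize_or_passthrough([1, 2, 3], "groupby_sum", failing_normalize))
```

Any other exception type propagates to the caller, so a genuine bug in normalization fails the run instead of being reduced to a warning.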


     if hasattr(result, "to_parquet"):
         result.to_parquet(output_filename, index=False)
     elif hasattr(result, "write_parquet"):