
240824085684611: copilot / claude-sonnet-4.6 — 4/5 A tier #116

Merged: laiso merged 1 commit into main from leaderboard-update/24081204504 on Apr 7, 2026


laiso commented Apr 7, 2026

…1204504]

🚀 New Entry: `copilot-claude-sonnet-4.6` added to results

- **Agent**: copilot
- **Model**: claude-sonnet-4.6
- **Provider**: anthropic
- **Run**: [View GitHub Actions Run](https://github.com/laiso/ts-bench/actions/runs/24081204504)

**Tier**: A (4/5)

- **Success Rate**: 80.0% (was N/A)
- **Avg Time**: 777.5s (was N/A)

| Task | Agent | Test | Overall | Duration |
|------|-------|------|---------|----------|
| 14958 | ✅ | ✅ | ✅ | 465.5s |
| 14268 | ✅ | ✅ | ✅ | 466.9s |
| 20079 | ✅ | ✅ | ✅ | 611.2s |
| 15815_1 | ✅ | ✅ | ✅ | 644.8s |
| 15193 | ✅ | ❌ | ❌ | 1699.3s |
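
The summary figures can be reproduced from the table rows. The sketch below assumes that the success rate counts the Overall column and that the average time is a plain mean over all five tasks, including the failed one; the `summarize` helper is hypothetical and not part of ts-bench.

```python
# Rows copied from the results table: (task, overall_passed, duration_seconds).
rows = [
    ("14958", True, 465.5),
    ("14268", True, 466.9),
    ("20079", True, 611.2),
    ("15815_1", True, 644.8),
    ("15193", False, 1699.3),
]


def summarize(rows):
    """Return (passed, total, success_rate_percent, avg_time_seconds)."""
    passed = sum(1 for _, ok, _ in rows if ok)
    total = len(rows)
    success_rate = 100.0 * passed / total
    # Assumption: the mean is taken over every task, not only the passing ones.
    avg_time = sum(duration for _, _, duration in rows) / total
    return passed, total, success_rate, avg_time


passed, total, success_rate, avg_time = summarize(rows)
print(f"{passed}/{total} passed, {success_rate:.1f}% success, avg {avg_time:.1f}s")
# prints "4/5 passed, 80.0% success, avg 777.5s"
```

Under these assumptions the output matches the reported 80.0% and 777.5s, which suggests the slow failing task (1699.3s) is included in the average.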

devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.


github-actions Bot commented Apr 7, 2026

🔍 Benchmark Failure Analysis

Run: unknown
Agent: copilot / Model: claude-sonnet-4.6 / Provider: anthropic
Result: 4/5 passed (80.0%)
Analysis Model: deepseek/DeepSeek-V3-0324
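
The Result line pairs a pass count with a percentage. A minimal sketch of how such a line could be formatted, keeping the ratio as a fraction and scaling to percent exactly once; `format_result` is a hypothetical name, not the workflow's actual code.

```python
def format_result(passed: int, total: int) -> str:
    """Format a pass count and its percentage, e.g. '4/5 passed (80.0%)'."""
    # Keep the ratio as a fraction; multiply by 100 only here, at format time.
    ratio = passed / total if total else 0.0
    return f"{passed}/{total} passed ({ratio * 100:.1f}%)"


print(format_result(4, 5))  # prints "4/5 passed (80.0%)"
```

For this run the sketch gives 80.0%, which agrees with the success rate reported in the results table.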


Task 15193: WRONG_FIX

| Item | Value |
|------|-------|
| agentSuccess | true |
| testSuccess | false |
| Patch | empty |
| Duration | agent 1640s + test 59s = 1699s |

Root Cause: The agent assumed the issue was font-weight inheritance in react-native-render-html, but the test shows the actual problem is bold styling being applied to code blocks.

Test Expectation: The test expected code blocks to have normal font weight (400) but found bold (700) instead.

Agent Behavior: The agent modified font weight handling in ExpensiMark.js but didn't address the core issue of bold styling being incorrectly applied to code blocks.

Suggestion: The agent should have focused on preventing bold styling from being applied to code blocks in the markdown parser, rather than trying to override it in the rendering layer.
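
To illustrate the idea in the suggestion, here is a toy sketch of applying bold only outside code spans at parse time. It is not ExpensiMark and does not reflect its rules or API; the `*text*` bold syntax, the backtick code spans, and the function name are all assumptions made for the example.

```python
import re

# Toy converter, NOT ExpensiMark: *text* becomes <strong>text</strong>,
# but anything inside a backtick code span is left untouched.
CODE_SPAN = re.compile(r"(`[^`]*`)")
BOLD = re.compile(r"\*([^*\n]+)\*")


def render_bold_outside_code(text: str) -> str:
    # Splitting on a capturing group keeps the code spans in the result,
    # at the odd indices of the list.
    parts = CODE_SPAN.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # code span: keep verbatim, no bold applied
        else:
            out.append(BOLD.sub(r"<strong>\1</strong>", part))
    return "".join(out)


print(render_bold_outside_code("*bold* and `*not bold*`"))
# prints "<strong>bold</strong> and `*not bold*`"
```

Because the code span never receives bold markup in the first place, no downstream rendering layer has to override the font weight.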


laiso merged commit c30791c into main on Apr 7, 2026 (2 checks passed)
laiso changed the title from "feat(leaderboard): copilot / claude-sonnet-4.6 — 4/5 A tier [run 2408…" to "240824085684611: copilot / claude-sonnet-4.6 — 4/5 A tier" on Apr 7, 2026