240824085684611: copilot / claude-sonnet-4.6 — 4/5 A tier by laiso · Pull Request #116 · laiso/ts-bench

laiso · 2026-04-07T13:51:13Z


🚀 New Entry: `copilot-claude-sonnet-4.6` added to results

- **Agent**: copilot
- **Model**: claude-sonnet-4.6
- **Provider**: anthropic
- **Run**: [View GitHub Actions Run](https://github.com/laiso/ts-bench/actions/runs/24081204504)

**Tier**: A (4/5)

- **Success Rate**: 80.0% (was N/A)
- **Avg Time**: 777.5s (was N/A)

| Task | Agent | Test | Overall | Duration |
|------|-------|------|---------|----------|
| 14958 | ✅ | ✅ | ✅ | 465.5s |
| 14268 | ✅ | ✅ | ✅ | 466.9s |
| 20079 | ✅ | ✅ | ✅ | 611.2s |
| 15815_1 | ✅ | ✅ | ✅ | 644.8s |
| 15193 | ✅ | ❌ | ❌ | 1699.3s |
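Both summary figures can be recomputed from the per-task table. Assuming, as every row is consistent with, that "Overall" means both the agent step and the test step succeeded, 4 of 5 tasks pass, and the average time is the mean of the five durations (3887.7 s / 5 ≈ 777.5 s). The TypeScript sketch below reproduces that arithmetic; the `TaskResult` type and `summarize` function are hypothetical names for illustration, not ts-bench's actual code.

```typescript
// Hypothetical shape of one row of the results table (not ts-bench's actual type).
interface TaskResult {
  task: string;
  agentSuccess: boolean;
  testSuccess: boolean;
  durationSec: number;
}

interface RunSummary {
  passed: number;
  total: number;
  successRatePct: number;
  avgTimeSec: number;
}

// Summarize a run: a task counts as passed only if both agent and test succeeded.
function summarize(results: TaskResult[]): RunSummary {
  const total = results.length;
  const passed = results.filter((r) => r.agentSuccess && r.testSuccess).length;
  // The pass ratio is scaled to a percentage exactly once.
  const successRatePct = total === 0 ? 0 : (passed / total) * 100;
  const totalTime = results.reduce((sum, r) => sum + r.durationSec, 0);
  const avgTimeSec = total === 0 ? 0 : totalTime / total;
  return { passed, total, successRatePct, avgTimeSec };
}

// Rows copied from the results table of this PR.
const run: TaskResult[] = [
  { task: "14958", agentSuccess: true, testSuccess: true, durationSec: 465.5 },
  { task: "14268", agentSuccess: true, testSuccess: true, durationSec: 466.9 },
  { task: "20079", agentSuccess: true, testSuccess: true, durationSec: 611.2 },
  { task: "15815_1", agentSuccess: true, testSuccess: true, durationSec: 644.8 },
  { task: "15193", agentSuccess: true, testSuccess: false, durationSec: 1699.3 },
];

const s = summarize(run);
console.log(`${s.passed}/${s.total} passed, ${s.successRatePct.toFixed(1)}%, avg ${s.avgTimeSec.toFixed(1)}s`);
// prints: 4/5 passed, 80.0%, avg 777.5s
```

Keeping the ratio unscaled until the single `* 100` step avoids accidentally applying the percentage factor twice when the value is formatted later.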

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.

github-actions · 2026-04-07T13:57:26Z

🔍 Benchmark Failure Analysis

Run: unknown
Agent: copilot / Model: claude-sonnet-4.6 / Provider: anthropic
Result: 4/5 passed (80.0%)
Analysis Model: deepseek/DeepSeek-V3-0324

Task `15193` — `WRONG_FIX`

| Item | Value |
|------|-------|
| agentSuccess | true |
| testSuccess | false |
| Patch | empty |
| Duration | agent 1640s + test 59s = 1699s |

Root Cause: The agent incorrectly assumed that the problem lay in font-weight inheritance in react-native-render-html, whereas the test shows that the actual problem is bold styling being applied to code blocks.

Test Expectation: The test expected code blocks to render with normal font weight (400), but they rendered bold (700).

Agent Behavior: The agent modified font-weight handling in ExpensiMark.js but did not address the core issue: bold styling being incorrectly applied to code blocks.

Suggestion: The agent should have prevented bold styling from being applied to code blocks in the markdown parser, rather than trying to override it in the rendering layer.
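The suggested direction, keeping bold out of code at the parsing stage instead of undoing it at render time, can be illustrated with a deliberately tiny TypeScript sketch. It is not ExpensiMark's actual rule set: the `renderInline` function, the `*text*` bold syntax, and the use of single-backtick code spans as a stand-in for code blocks are simplifying assumptions. The idea is to split the input into code and non-code segments first, then apply the bold rule only to the non-code segments.

```typescript
// Escape HTML special characters so code content is emitted literally.
function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical inline renderer: `code` spans are protected, *bold* applies elsewhere.
function renderInline(text: string): string {
  // Split on code spans; the capture group keeps the spans in the result array,
  // so odd indices are code spans and even indices are ordinary text.
  const parts = text.split(/(`[^`]*`)/);
  return parts
    .map((part, i) => {
      if (i % 2 === 1) {
        // Code span: strip the backticks and emit the content with no bold rule.
        return `<code>${escapeHtml(part.slice(1, -1))}</code>`;
      }
      // Ordinary text: the bold rule runs only here.
      return escapeHtml(part).replace(/\*([^*]+)\*/g, "<strong>$1</strong>");
    })
    .join("");
}

console.log(renderInline("*bold* and `*not bold*`"));
// prints: <strong>bold</strong> and <code>*not bold*</code>
```

Because the code segments never reach the bold rule, no override in the rendering layer is needed to restore normal font weight inside code.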

devin-ai-integration Bot reviewed Apr 7, 2026


laiso merged commit c30791c into main Apr 7, 2026
2 checks passed

laiso changed the title ~~feat(leaderboard): copilot / claude-sonnet-4.6 — 4/5 A tier [run 2408…~~ 240824085684611: copilot / claude-sonnet-4.6 — 4/5 A tier Apr 7, 2026

laiso merged 1 commit into `main` from `leaderboard-update/24081204504`

laiso commented Apr 7, 2026 • edited by devin-ai-integration Bot
