docs/docs/pr_benchmark/index.md
22 additions & 0 deletions
@@ -34,6 +34,12 @@ A list of the models used for generating the baseline suggestions, and example r
 </tr>
 </thead>
 <tbody>
+<tr>
+<td style="text-align:left;">GPT-5.2</td>
+<td style="text-align:left;">2025-12-11</td>
+<td style="text-align:left;">medium</td>
+<td style="text-align:center;"><b>80.8</b></td>
+</tr>
 <tr>
 <td style="text-align:left;">GPT-5-pro</td>
 <td style="text-align:left;">2025-10-06</td>
@@ -183,6 +189,22 @@ A list of the models used for generating the baseline suggestions, and example r

 ## Results Analysis (Latest Additions)

+### GPT-5.2 ('medium' thinking budget)
+
+Final score: **80.8**
+
+Strengths:
+
+- **Broad, context-aware coverage:** Frequently identifies multiple high-impact faults in the added lines and proposes fixes that match or surpass the best prior answer in many cases (≈60% of the 399 comparisons).
+- **Actionable, minimal patches:** Tends to supply concise before/after code snippets that compile and run, keep changes local, and respect limits (≤3 suggestions, touched-lines only), making the advice easy to apply.
+- **Clear reasoning & prioritisation:** Usually explains why an issue is critical, ranks it properly (e.g., crash > style), and avoids clutter, resulting in focused reviews that align with real test failures.
+
+Weaknesses:
+
+- **Critical omissions remain common:** In a sizeable minority of examples, the model overlooks the single most blocking error (e.g., compile-time break, nil-deref, enum mismatch), causing it to trail a sharper peer answer.
+- **Occasional inaccurate or harmful fixes:** It sometimes introduces non-compiling code, speculative refactors, or misguided changes to unchanged lines, lowering reliability.
+- **Inconsistent guideline adherence:** A non-trivial set of replies adds off-scope edits, non-critical style nits, or empty suggestion lists when clear bugs exist, leading to avoidable downgrades.