NVIDIA · rapids-bot · Jun 10, 2026
@@ -7,11 +7,11 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
 ## Evaluation Summary
 
 - Skill: `cuopt-numerical-optimization-api-c`
-- Evaluation date: 2026-05-28
+- Evaluation date: 2026-06-10
 - NVSkills-Eval profile: `external`
-- Environment: `local`
-- Dataset: 1 evaluation tasks
-- Attempts per task: 2
+- Environment: `astra-sandbox`
+- Dataset: 4 evaluation tasks
+- Attempts per task: 1
 - Pass threshold: 50%
 - Overall verdict: PASS
 
@@ -32,6 +32,7 @@ Reported benchmark dimensions:
 
 Underlying evaluation signals used in this run:
 
+- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
 - `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
 - `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
 - `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
@@ -41,9 +42,9 @@ Underlying evaluation signals used in this run:
 
 ## Test Tasks
 
-The benchmark dataset contained 1 evaluation tasks:
+The benchmark dataset contained 4 evaluation tasks:
 
-- Positive tasks: 1 tasks where the skill was expected to activate.
+- Positive tasks: 4 tasks where the skill was expected to activate.
 - Negative tasks: 0 tasks where no skill was expected.
 - Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.
 
@@ -53,17 +54,17 @@ Task composition is derived from the evaluation dataset when possible. Entries w
 
 | Dimension | Num | `claude-code` | `codex` |
 |---|---:|---:|---:|
-| Security | 2 | 100% (+0%) | 100% (+25%) |
-| Correctness | 2 | 100% (+0%) | 92% (-5%) |
-| Discoverability | 2 | 100% (+5%) | 80% (+8%) |
-| Effectiveness | 2 | 95% (-1%) | 92% (+9%) |
-| Efficiency | 2 | 93% (+13%) | 73% (+17%) |
+| Security | 4 | 100% (+0%) | 100% (+0%) |
+| Correctness | 4 | 88% (+16%) | 72% (+16%) |
+| Discoverability | 4 | 68% (+46%) | 55% (+36%) |
+| Effectiveness | 4 | 92% (+7%) | 70% (+17%) |
+| Efficiency | 4 | 66% (+48%) | 62% (+35%) |
 
 Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
 
 ## Tier 1: Static Validation Summary
 
-Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 9 total findings.
+Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 7 total findings.
 
 Top findings:
 

@@ -9,5 +9,46 @@
       "Lists C API call sequence without writing a complete source file",
       "Names cuOptCreateRangedProblem, cuOptSolve, cuOptGetObjectiveValue in order"
     ]
+  },
+  {
+    "id": "numopt-c-eval-002-parameter-function-wrong-name",
+    "question": "I am setting a time limit on my cuOpt C API solver with this call: cuOptSetIntParameter(settings, CUOPT_TIME_LIMIT, 60.0). My colleague says the function name is wrong. What is the correct function, and what other parameter-setting functions does the C API provide?",
+    "expected_skill": "cuopt-numerical-optimization-api-c",
+    "expected_script": null,
+    "ground_truth": "The function name cuOptSetIntParameter does not exist in the cuOpt C API — it is a common mistake. The correct function for float parameters (including CUOPT_TIME_LIMIT, tolerances) is cuOptSetFloatParameter. The C API provides three parameter-setting functions: cuOptSetFloatParameter for float params such as time limits and tolerances, cuOptSetIntegerParameter (not cuOptSetIntParameter) for integer params such as CUOPT_LOG_TO_CONSOLE and method selection, and cuOptSetParameter for string params. CUOPT_TIME_LIMIT is a float parameter so the correct call is cuOptSetFloatParameter(settings, CUOPT_TIME_LIMIT, 60.0).",
+    "expected_behavior": [
+      "Identifies cuOptSetIntParameter as a non-existent function — the correct name is cuOptSetIntegerParameter",
+      "States CUOPT_TIME_LIMIT is a float parameter requiring cuOptSetFloatParameter, not cuOptSetIntegerParameter",
+      "Names all three parameter functions: cuOptSetFloatParameter, cuOptSetIntegerParameter, cuOptSetParameter",
+      "Does not produce a full source file — answers the question about function names only"
+    ]
+  },
+  {
+    "id": "numopt-c-eval-003-csr-constraint-matrix",
+    "question": "I am building the constraint matrix for a cuOpt C LP. The problem has 2 constraints and 2 variables. Constraint 1: 3x1 + 4x2 <= 5.4. Constraint 2: 2.7x1 + 10.1x2 <= 4.9. Show me the row_offsets, col_indices, and values arrays for the CSR representation, and explain what each array means.",
+    "expected_skill": "cuopt-numerical-optimization-api-c",
+    "expected_script": null,
+    "ground_truth": "The CSR (Compressed Sparse Row) format uses three arrays. row_offsets has length num_constraints+1 = 3: {0, 2, 4}. Element i gives the starting index in col_indices/values for row i; the last element is the total number of nonzeros (4 here). col_indices = {0, 1, 0, 1}: the column index of each nonzero, ordered by row. values = {3.0, 4.0, 2.7, 10.1}: the nonzero values in the same order. Constraint upper bounds are {5.4, 4.9} and lower bounds are {-CUOPT_INFINITY, -CUOPT_INFINITY} since both constraints are <=. These arrays are passed to cuOptCreateRangedProblem.",
+    "expected_behavior": [
+      "Gives row_offsets = {0, 2, 4} and explains it as start indices per row plus total nnz at the end",
+      "Gives col_indices = {0, 1, 0, 1} matching the column of each nonzero by row",
+      "Gives values = {3.0, 4.0, 2.7, 10.1} in row-major order",
+      "Explains that constraint_lower_bounds should be -CUOPT_INFINITY for <= constraints",
+      "Names cuOptCreateRangedProblem as the function that receives these arrays"
+    ]
+  },
+  {
+    "id": "numopt-c-eval-004-qp-restrictions",
+    "question": "I want to solve a QP with integer variables using the cuOpt C API. A colleague says this is not supported. Is that correct, and what are the restrictions for QP in the cuOpt C API?",
+    "expected_skill": "cuopt-numerical-optimization-api-c",
+    "expected_script": null,
+    "ground_truth": "The colleague is correct — integer QP is not supported in the cuOpt C API. The QP restrictions are: (1) minimization only — CUOPT_MINIMIZE is required; to maximize a quadratic objective, negate all objective coefficients and Q matrix entries; (2) continuous variables only — all variables must use CUOPT_CONTINUOUS, integer variables are not supported for QP; (3) the Q matrix should be positive semi-definite (PSD) for a convex, well-posed problem. The same library, include paths, and build pattern as LP/MILP are used; only the problem-creation call differs for QP.",
+    "expected_behavior": [
+      "Confirms integer QP is not supported — all QP variables must be CUOPT_CONTINUOUS",
+      "States QP only supports CUOPT_MINIMIZE, not CUOPT_MAXIMIZE",
+      "Explains how to maximize: negate objective coefficients and Q entries",
+      "Mentions Q should be positive semi-definite (PSD) for a convex problem",
+      "Notes the same library/headers/build pattern as LP/MILP — only the problem creation call differs"
+    ]
   }
 ]
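
The CSR arrays in `numopt-c-eval-003-csr-constraint-matrix` can be reproduced mechanically from the dense constraint matrix. The following is a minimal sketch in plain C, not part of the cuOpt API: `dense_to_csr` is a hypothetical helper introduced here only for illustration, and it assumes a row-major dense input and caller-allocated output arrays.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a cuOpt function): convert a dense, row-major
 * matrix into CSR form.
 *
 *   row_offsets  must hold num_rows + 1 entries
 *   col_indices  must hold at least one entry per nonzero
 *   values       must hold at least one entry per nonzero
 *
 * Returns the number of nonzeros written.
 */
int dense_to_csr(const double *dense, int num_rows, int num_cols,
                 int *row_offsets, int *col_indices, double *values)
{
    int nnz = 0;
    for (int i = 0; i < num_rows; ++i) {
        /* Row i starts at the current nonzero count. */
        row_offsets[i] = nnz;
        for (int j = 0; j < num_cols; ++j) {
            double v = dense[(size_t)i * (size_t)num_cols + (size_t)j];
            if (v != 0.0) {
                col_indices[nnz] = j;  /* column of this nonzero */
                values[nnz] = v;       /* the nonzero itself */
                ++nnz;
            }
        }
    }
    /* The final entry is the total number of nonzeros. */
    row_offsets[num_rows] = nnz;
    return nnz;
}
```

Called with the 2x2 matrix `{3.0, 4.0, 2.7, 10.1}` from task 003, this fills `row_offsets` with `{0, 2, 4}`, `col_indices` with `{0, 1, 0, 1}`, and `values` with `{3.0, 4.0, 2.7, 10.1}`, matching the ground truth above.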
@@ -9,7 +9,7 @@ NVIDIA <br>
 ### License/Terms of Use: <br>
 Apache 2.0 <br>
 ## Use Case: <br>
-Developers and engineers embedding linear programming, mixed-integer linear programming, or quadratic programming solvers in C/C++ applications using the NVIDIA cuOpt GPU-accelerated optimization library. <br>
+Developers and engineers embedding LP, MILP, or QP numerical optimization into C/C++ applications using the NVIDIA cuOpt GPU-accelerated solver. <br>
 
 ### Deployment Geography for Use: <br>
 Global <br>
@@ -19,13 +19,13 @@ Risk: Review before execution as proposals could introduce incorrect or misleadi
 Mitigation: Review and scan skill before deployment. <br>
 
 ## Reference(s): <br>
-- [examples.md](references/examples.md) <br>
+- [C API Examples (LP/MILP)](references/examples.md) <br>
 - [cuOpt User Guide](https://docs.nvidia.com/cuopt/user-guide/latest/introduction.html) <br>
-- [cuopt-examples](https://github.com/NVIDIA/cuopt-examples) <br>
+- [cuOpt Examples Repository](https://github.com/NVIDIA/cuopt-examples) <br>
 
 
 ## Skill Output: <br>
-**Output Type(s):** [Code, Shell commands, Configuration instructions] <br>
+**Output Type(s):** [Code, Shell commands] <br>
 **Output Format:** [Markdown with inline C code blocks] <br>
 **Output Parameters:** [1D] <br>
 **Other Properties Related to Output:** [None] <br>
@@ -37,7 +37,7 @@ Mitigation: Review and scan skill before deployment. <br>
 
 
 ## Evaluation Tasks: <br>
-Evaluated against 1 evaluation task (positive skill-activation case) with 2 attempts per task via NVSkills-Eval 3-Tier Evaluation. <br>
+Evaluated against 4 internal evaluation tasks (positive skill-activation cases) via NVSkills-Eval with the external profile. <br>
 
 ## Evaluation Metrics Used: <br>
 Reported benchmark dimensions: <br>
@@ -48,6 +48,7 @@ Reported benchmark dimensions: <br>
 - Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work. <br>
 
 Underlying evaluation signals used in this run: <br>
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access. <br>
 - `skill_execution`: Verifies that the agent loaded the expected skill and workflow. <br>
 - `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage. <br>
 - `accuracy`: Grades final-answer correctness against the reference answer. <br>
@@ -60,11 +61,11 @@ Underlying evaluation signals used in this run: <br>
 ## Evaluation Results: <br>
 | Dimension | Num | `claude-code` | `codex` |
 |---|---:|---:|---:|
-| Security | 2 | 100% (+0%) | 100% (+25%) |
-| Correctness | 2 | 100% (+0%) | 92% (-5%) |
-| Discoverability | 2 | 100% (+5%) | 80% (+8%) |
-| Effectiveness | 2 | 95% (-1%) | 92% (+9%) |
-| Efficiency | 2 | 93% (+13%) | 73% (+17%) |
+| Security | 4 | 100% (+0%) | 100% (+0%) |
+| Correctness | 4 | 88% (+16%) | 72% (+16%) |
+| Discoverability | 4 | 68% (+46%) | 55% (+36%) |
+| Effectiveness | 4 | 92% (+7%) | 70% (+17%) |
+| Efficiency | 4 | 66% (+48%) | 62% (+35%) |
 
 ## Skill Version(s): <br>
 26.08.00 (source: frontmatter) <br>