25 changes: 13 additions & 12 deletions skills/cuopt-numerical-optimization-api-c/BENCHMARK.md
@@ -7,11 +7,11 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
## Evaluation Summary

- Skill: `cuopt-numerical-optimization-api-c`
- Evaluation date: 2026-05-28
- Evaluation date: 2026-06-10
- NVSkills-Eval profile: `external`
- Environment: `local`
- Dataset: 1 evaluation tasks
- Attempts per task: 2
- Environment: `astra-sandbox`
- Dataset: 4 evaluation tasks
- Attempts per task: 1
- Pass threshold: 50%
- Overall verdict: PASS

@@ -32,6 +32,7 @@ Reported benchmark dimensions:

Underlying evaluation signals used in this run:

- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
@@ -41,9 +42,9 @@ Underlying evaluation signals used in this run:

## Test Tasks

The benchmark dataset contained 1 evaluation tasks:
The benchmark dataset contained 4 evaluation tasks:

- Positive tasks: 1 tasks where the skill was expected to activate.
- Positive tasks: 4 tasks where the skill was expected to activate.
- Negative tasks: 0 tasks where no skill was expected.
- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.

@@ -53,17 +54,17 @@ Task composition is derived from the evaluation dataset when possible. Entries w

| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | 2 | 100% (+0%) | 100% (+25%) |
| Correctness | 2 | 100% (+0%) | 92% (-5%) |
| Discoverability | 2 | 100% (+5%) | 80% (+8%) |
| Effectiveness | 2 | 95% (-1%) | 92% (+9%) |
| Efficiency | 2 | 93% (+13%) | 73% (+17%) |
| Security | 4 | 100% (+0%) | 100% (+0%) |
| Correctness | 4 | 88% (+16%) | 72% (+16%) |
| Discoverability | 4 | 68% (+46%) | 55% (+36%) |
| Effectiveness | 4 | 92% (+7%) | 70% (+17%) |
| Efficiency | 4 | 66% (+48%) | 62% (+35%) |

Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.

## Tier 1: Static Validation Summary

Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 9 total findings.
Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 7 total findings.

Top findings:

41 changes: 41 additions & 0 deletions skills/cuopt-numerical-optimization-api-c/evals/evals.json
@@ -9,5 +9,46 @@
"Lists C API call sequence without writing a complete source file",
"Names cuOptCreateRangedProblem, cuOptSolve, cuOptGetObjectiveValue in order"
]
},
{
"id": "numopt-c-eval-002-parameter-function-wrong-name",
"question": "I am setting a time limit on my cuOpt C API solver with this call: cuOptSetIntParameter(settings, CUOPT_TIME_LIMIT, 60.0). My colleague says the function name is wrong. What is the correct function, and what other parameter-setting functions does the C API provide?",
"expected_skill": "cuopt-numerical-optimization-api-c",
"expected_script": null,
"ground_truth": "The function name cuOptSetIntParameter does not exist in the cuOpt C API — it is a common mistake. The correct function for float parameters (including CUOPT_TIME_LIMIT, tolerances) is cuOptSetFloatParameter. The C API provides three parameter-setting functions: cuOptSetFloatParameter for float params such as time limits and tolerances, cuOptSetIntegerParameter (not cuOptSetIntParameter) for integer params such as CUOPT_LOG_TO_CONSOLE and method selection, and cuOptSetParameter for string params. CUOPT_TIME_LIMIT is a float parameter so the correct call is cuOptSetFloatParameter(settings, CUOPT_TIME_LIMIT, 60.0).",
"expected_behavior": [
"Identifies cuOptSetIntParameter as a non-existent function — the correct name is cuOptSetIntegerParameter",
"States CUOPT_TIME_LIMIT is a float parameter requiring cuOptSetFloatParameter, not cuOptSetIntegerParameter",
"Names all three parameter functions: cuOptSetFloatParameter, cuOptSetIntegerParameter, cuOptSetParameter",
"Does not produce a full source file — answers the question about function names only"
]
},
{
"id": "numopt-c-eval-003-csr-constraint-matrix",
"question": "I am building the constraint matrix for a cuOpt C LP. The problem has 2 constraints and 2 variables. Constraint 1: 3x1 + 4x2 <= 5.4. Constraint 2: 2.7x1 + 10.1x2 <= 4.9. Show me the row_offsets, col_indices, and values arrays for the CSR representation, and explain what each array means.",
"expected_skill": "cuopt-numerical-optimization-api-c",
"expected_script": null,
"ground_truth": "The CSR (Compressed Sparse Row) format uses three arrays. row_offsets has length num_constraints+1 = 3: {0, 2, 4}. Element i gives the starting index in col_indices/values for row i; the last element is the total number of nonzeros (4 here). col_indices = {0, 1, 0, 1}: the column index of each nonzero, ordered by row. values = {3.0, 4.0, 2.7, 10.1}: the nonzero values in the same order. Constraint upper bounds are {5.4, 4.9} and lower bounds are {-CUOPT_INFINITY, -CUOPT_INFINITY} since both constraints are <=. These arrays are passed to cuOptCreateRangedProblem.",
"expected_behavior": [
"Gives row_offsets = {0, 2, 4} and explains it as start indices per row plus total nnz at the end",
"Gives col_indices = {0, 1, 0, 1} matching the column of each nonzero by row",
"Gives values = {3.0, 4.0, 2.7, 10.1} in row-major order",
"Explains that constraint_lower_bounds should be -CUOPT_INFINITY for <= constraints",
"Names cuOptCreateRangedProblem as the function that receives these arrays"
]
},
{
"id": "numopt-c-eval-004-qp-restrictions",
"question": "I want to solve a QP with integer variables using the cuOpt C API. A colleague says this is not supported. Is that correct, and what are the restrictions for QP in the cuOpt C API?",
"expected_skill": "cuopt-numerical-optimization-api-c",
"expected_script": null,
"ground_truth": "The colleague is correct — integer QP is not supported in the cuOpt C API. The QP restrictions are: (1) minimization only — CUOPT_MINIMIZE is required; to maximize a quadratic objective, negate all objective coefficients and Q matrix entries; (2) continuous variables only — all variables must use CUOPT_CONTINUOUS, integer variables are not supported for QP; (3) the Q matrix should be positive semi-definite (PSD) for a convex, well-posed problem. The same library, include paths, and build pattern as LP/MILP are used; only the problem-creation call differs for QP.",
"expected_behavior": [
"Confirms integer QP is not supported — all QP variables must be CUOPT_CONTINUOUS",
"States QP only supports CUOPT_MINIMIZE, not CUOPT_MAXIMIZE",
"Explains how to maximize: negate objective coefficients and Q entries",
"Mentions Q should be positive semi-definite (PSD) for a convex problem",
"Notes the same library/headers/build pattern as LP/MILP — only the problem creation call differs"
]
}
]
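
Reviewer note on eval 003: the CSR arrays in its ground truth can be sanity-checked without calling cuOpt at all. The sketch below is a minimal, illustrative check. `csr_matvec` and the `example_*` arrays are hypothetical names introduced here, not part of the cuOpt C API. It walks each row through `row_offsets` and multiplies the stored nonzeros by a trial point, which is how the three arrays are meant to be read.

```c
#include <assert.h>

/* Hypothetical helper, NOT a cuOpt function: computes y = A * x for a
 * matrix A stored in CSR (Compressed Sparse Row) form.
 * Row i owns the nonzeros at positions row_offsets[i] .. row_offsets[i+1]-1
 * of col_indices and values. */
static void csr_matvec(int num_rows,
                       const int *row_offsets,
                       const int *col_indices,
                       const double *values,
                       const double *x,
                       double *y)
{
    /* By convention the first offset is 0 and the last is the nnz count. */
    assert(row_offsets[0] == 0);

    for (int i = 0; i < num_rows; ++i) {
        double sum = 0.0;
        for (int k = row_offsets[i]; k < row_offsets[i + 1]; ++k) {
            sum += values[k] * x[col_indices[k]];
        }
        y[i] = sum;
    }
}

/* Arrays from the eval 003 ground truth:
 *   3.0*x1 +  4.0*x2 <= 5.4
 *   2.7*x1 + 10.1*x2 <= 4.9 */
static const int    example_row_offsets[] = {0, 2, 4};
static const int    example_col_indices[] = {0, 1, 0, 1};
static const double example_values[]      = {3.0, 4.0, 2.7, 10.1};
```

Calling `csr_matvec(2, example_row_offsets, example_col_indices, example_values, x, y)` with `x = {1.0, 0.0}` yields the first column of the matrix, so `y` holds the `x1` coefficients of the two constraints. Comparing `y[i]` against the upper bounds `{5.4, 4.9}` then shows whether a trial point is feasible.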
21 changes: 11 additions & 10 deletions skills/cuopt-numerical-optimization-api-c/skill-card.md
@@ -9,7 +9,7 @@ NVIDIA <br>
### License/Terms of Use: <br>
Apache 2.0 <br>
## Use Case: <br>
Developers and engineers embedding linear programming, mixed-integer linear programming, or quadratic programming solvers in C/C++ applications using the NVIDIA cuOpt GPU-accelerated optimization library. <br>
Developers and engineers embedding LP, MILP, or QP numerical optimization into C/C++ applications using the NVIDIA cuOpt GPU-accelerated solver. <br>

### Deployment Geography for Use: <br>
Global <br>
@@ -19,13 +19,13 @@ Risk: Review before execution as proposals could introduce incorrect or misleadi
Mitigation: Review and scan skill before deployment. <br>

## Reference(s): <br>
- [examples.md](references/examples.md) <br>
- [C API Examples (LP/MILP)](references/examples.md) <br>
- [cuOpt User Guide](https://docs.nvidia.com/cuopt/user-guide/latest/introduction.html) <br>
- [cuopt-examples](https://github.com/NVIDIA/cuopt-examples) <br>
- [cuOpt Examples Repository](https://github.com/NVIDIA/cuopt-examples) <br>


## Skill Output: <br>
**Output Type(s):** [Code, Shell commands, Configuration instructions] <br>
**Output Type(s):** [Code, Shell commands] <br>
**Output Format:** [Markdown with inline C code blocks] <br>
**Output Parameters:** [1D] <br>
**Other Properties Related to Output:** [None] <br>
@@ -37,7 +37,7 @@ Mitigation: Review and scan skill before deployment. <br>


## Evaluation Tasks: <br>
Evaluated against 1 evaluation task (positive skill-activation case) with 2 attempts per task via NVSkills-Eval 3-Tier Evaluation. <br>
Evaluated against 4 internal evaluation tasks (positive skill-activation cases) via NVSkills-Eval with the external profile. <br>

## Evaluation Metrics Used: <br>
Reported benchmark dimensions: <br>
@@ -48,6 +48,7 @@ Reported benchmark dimensions: <br>
- Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work. <br>

Underlying evaluation signals used in this run: <br>
- `security`: Checks for unsafe operations, secret leakage, and unauthorized access. <br>
- `skill_execution`: Verifies that the agent loaded the expected skill and workflow. <br>
- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage. <br>
- `accuracy`: Grades final-answer correctness against the reference answer. <br>
@@ -60,11 +61,11 @@ Underlying evaluation signals used in this run: <br>
## Evaluation Results: <br>
| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | 2 | 100% (+0%) | 100% (+25%) |
| Correctness | 2 | 100% (+0%) | 92% (-5%) |
| Discoverability | 2 | 100% (+5%) | 80% (+8%) |
| Effectiveness | 2 | 95% (-1%) | 92% (+9%) |
| Efficiency | 2 | 93% (+13%) | 73% (+17%) |
| Security | 4 | 100% (+0%) | 100% (+0%) |
| Correctness | 4 | 88% (+16%) | 72% (+16%) |
| Discoverability | 4 | 68% (+46%) | 55% (+36%) |
| Effectiveness | 4 | 92% (+7%) | 70% (+17%) |
| Efficiency | 4 | 66% (+48%) | 62% (+35%) |

## Skill Version(s): <br>
26.08.00 (source: frontmatter) <br>