Repositories
- ContractBench Public
ContractBench: evaluating observation contract failures (validity + integrity) in LLM agents. 33 harbor-runnable API-contract tasks with deterministic programmatic evaluation.
- ContractBench-inspect Public
Inspect AI eval for ContractBench — 33 observation-contract tasks (validity + integrity). Register-ready for inspect_evals.
- korabench Public Forked from korabench/benchmark
Everything needed to run the KORA benchmark for AI Child Safety.
- CGNTG Public Forked from FuzzAnything/PromptFuzz
PromptFuzz is an automated tool that generates high-quality fuzz drivers for libraries via a fuzz loop built on mutating LLM prompts.
- oss-fuzz-lisa Public Forked from google/oss-fuzz
OSS-Fuzz - continuous fuzzing for open source software.
- vllm Public Forked from vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
- LiveCodeBench Public Forked from LiveCodeBench/LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"