mbpp

Here are 6 public repositories matching this topic...

abhaymundhara / llm-benchmark-suite

Benchmark suite for evaluating LLMs and SLMs on coding and SE tasks. Features HumanEval, MBPP, SWE-bench, and BigCodeBench with an interactive Streamlit UI. Supports cloud APIs (OpenAI, Anthropic, Google) and local models via Ollama. Tracks pass rates, latency, token usage, and costs.

python benchmark evaluation gemini openai code-generation claude streamlit humaneval llm ollama swe-bench mbpp bigcodebench

Updated Apr 23, 2026
Python

OpenMLRL / LLM_Collab_Code_Generation

Star

LLM Collaboration for Code Generation

code-generation multi-agent-systems multi-agent-reinforcement-learning humaneval large-language-models code-agent mbpp comlrl openmlrl coophumaneval

Updated Feb 17, 2026
Python

jcartu / llm-stress-harness

Star

Diagnostic toolkit for self-hosted LLM inference: failure-taxonomic stress harness + 4-phase orchestrator + parametric vLLM launchers

python benchmarking inference stress-testing humaneval llm vllm speculative-decoding sglang mbpp

Updated May 7, 2026
Shell

Shreyash-Gaur / TensorFlow_Python_Code_Generation

Star

Fine-tuning CodeT5 for Python code generation on the MBPP dataset. Features custom TensorFlow training loops, mixed precision, XLA optimization, and distributed multi-GPU strategies.

deep-learning tensorflow transformer code-generation distributed-training mixed-precision huggingface nl2code text-to-code llm generative-ai codet5 mbpp

Updated Mar 19, 2025
Jupyter Notebook

scouzi1966 / qwen-humaneval

Star

🧪 Automated LLM coding benchmarks with Ollama - HumanEval & MBPP evaluation suite with safe execution, comprehensive logging, and detailed analysis tools

python benchmarking machine-learning evaluation coding humaneval llm ollama qwen mbpp

Updated Aug 1, 2025
Python

jcartu / qwen36-27b-blackwell-stress-validation

Star

Stress-validation of Qwen3.6-27B inference configurations on dual RTX PRO 6000 Blackwell. 5 configs x 4 phases (gates, throughput matrix, HumanEval, MBPP) = 2,105 hard coding problems, zero crashes. Headline: FP8+MTP=3 wins HumanEval (79.3%), BF16+DFlash wins MBPP (89.5%). MTP=5 dominated on correctness despite faster raw tok/s.

benchmark inference blackwell humaneval vllm qwen speculative-decoding qwen3 mbpp rtx-pro-6000

Updated May 7, 2026
Python

Improve this page

Add a description, image, and links to the mbpp topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the mbpp topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mbpp

Here are 6 public repositories matching this topic...

abhaymundhara / llm-benchmark-suite

OpenMLRL / LLM_Collab_Code_Generation

jcartu / llm-stress-harness

Shreyash-Gaur / TensorFlow_Python_Code_Generation

scouzi1966 / qwen-humaneval

jcartu / qwen36-27b-blackwell-stress-validation

Improve this page

Add this topic to your repo