MacKV-Opt

MacKV-Opt is a local tool for choosing safer context and KV cache settings when running LLMs on Apple Silicon Macs. It keeps the selected model unchanged, does not upload prompts or outputs, and helps Ollama users avoid memory pressure, swap stalls, and trial-and-error num_ctx tuning.

What It Optimizes

Long-context inference grows KV cache memory roughly linearly with context length. On unified-memory Macs, an oversized context can push the system into memory pressure and swap; an undersized context wastes available memory and forces users to shorten prompts.

MacKV-Opt improves this workflow by:

estimating model weights, KV cache, runtime overhead, and memory headroom;
choosing a num_ctx that fits the requested memory budget;
recommending llama.cpp KV cache types when that runtime is available;
warning when Ollama, llama.cpp, model metadata, or memory sampling are missing;
comparing default Ollama, manual num_ctx, and MacKV-Opt runs with the same model and prompts.

The practical effect is not magic speedup. The benefit is targeted: fewer failed long-context runs, less swap-driven slowdown, and a larger usable context when the Mac has memory headroom that default or manual settings do not use.

Quick Start

Install from the repository:

python -m pip install -e .

Check the machine:

mackv-opt doctor

Use the simplest planner entry:

mackv-opt auto llama3.1:8b --memory-budget 12GiB

Run only when you are ready to call the local Ollama API:

mackv-opt auto llama3.1:8b \
  --memory-budget 12GiB \
  --prompt "Summarize this document" \
  --execute

Plan a specific target context:

mackv-opt plan llama3.1:8b --target-context 64k --memory-budget 12GiB

If Ollama metadata is incomplete, provide model metadata manually:

mackv-opt plan llama3.1:8b \
  --target-context 64k \
  --memory-budget 12GiB \
  --model-size 4.8GiB \
  --hidden-size 4096 \
  --layers 32 \
  --heads 32 \
  --kv-heads 8 \
  --hardware-memory 16GiB

How It Works

The planner uses this budget model:

estimated_total = model_weights + KV_cache(context, layers, hidden, GQA_ratio, KV_type) + runtime_overhead

It searches from the requested context downward and selects the largest configuration that fits the memory budget. It prefers higher-quality KV cache types when they fit, falls back to smaller KV types when memory is tight, and reduces context only when no cache strategy fits.

For Ollama, MacKV-Opt emits safe num_ctx and num_gpu options. For llama.cpp, it also emits --ctx-size, --cache-type-k, --cache-type-v, and --kv-offload arguments when the local runtime exposes those flags.

Compared With Other Options

Approach	What users do	Main drawback	MacKV-Opt role
Default Ollama	Run the model with default options	May use a conservative or unsuitable context for the task	Measures the default baseline
Manual `num_ctx`	Guess a context value	Easy to overshoot memory or leave capacity unused	Estimates a safer value before running
llama.cpp flags	Tune low-level cache/context flags directly	Powerful but easy to misconfigure	Converts memory budget into concrete args
MLX/other runtimes	Use a different local inference stack	May require changing workflow or model format	Keeps Ollama-first workflow and selected model
KV compression libraries	Add algorithm-specific compression	Often requires code integration or runtime changes	Uses available runtime controls first

Validate Improvement On Your Mac

Create three comparable run folders:

mackv-opt baseline-template \
  --output-dir experiments/m2-16gb \
  --models llama3.1:8b \
  --contexts 8k,16k,32k \
  --memory-budget 12GiB

Run the generated run.sh files under:

experiments/m2-16gb/llama3.1-8b/default/
experiments/m2-16gb/llama3.1-8b/manual-num-ctx/
experiments/m2-16gb/llama3.1-8b/mackv-opt/

Compare the results:

mackv-opt compare \
  default=experiments/m2-16gb/llama3.1-8b/default/full-run.json \
  manual-num-ctx=experiments/m2-16gb/llama3.1-8b/manual-num-ctx/full-run.json \
  mackv-opt=experiments/m2-16gb/llama3.1-8b/mackv-opt/full-run.json \
  --baseline-label default \
  --format markdown

For a multi-model matrix, use:

./scripts/run_macos_matrix.sh
MACKV_EXECUTE=1 MACKV_MEMORY_BUDGET=20GiB ./scripts/run_macos_matrix.sh

See docs/MAC_VALIDATION_CHECKLIST.md for 16GB, 32GB, and 64GB validation presets.

Testing Without A Mac

You can test most of the project without Apple Silicon hardware:

run the full unit suite on Windows or Linux with python -m pytest -q;
run planner commands with manual metadata and --hardware-memory;
run bench, experiment, needle, and qa in --dry-run mode;
generate baseline folders and comparison reports using fixture JSON;
use GitHub Actions macOS runners for packaging and CLI smoke checks;
use a borrowed, remote, or self-hosted Apple Silicon Mac for executable Ollama validation.

What non-Mac testing cannot prove: Apple unified-memory pressure, Metal runtime behavior, Ollama model throughput, and max stable context on a target Mac. Those need an Apple Silicon machine with Ollama and the selected models installed.

CLI Reference

Common commands:

mackv-opt profile --json
mackv-opt doctor --json
mackv-opt capabilities --json
mackv-opt collect --output-dir experiments/m2-16gb/collect --models llama3.1:8b --json
mackv-opt audit experiments/m2-16gb/collect/manifest.json --json
mackv-opt auto llama3.1:8b --memory-budget 12GiB
mackv-opt run llama3.1:8b --target-context 64k --memory-budget 12GiB
mackv-opt bench --models llama3.1:8b --contexts 8k,16k,32k --dry-run --json
mackv-opt experiment llama3.1:8b --contexts 8k,16k,32k --memory-budget 12GiB --dry-run --json
mackv-opt report full-run.json --table performance
mackv-opt plot-memory full-run.json --output memory-series.svg

run, bench, needle, qa, and experiment call local inference only when --execute is passed.

Safety And Privacy

No model replacement.
No weight rewriting.
No automatic model quantization.
No cloud service.
No prompt or output upload.
No persistent Ollama configuration changes unless the user runs the printed command or passes --execute.

Repository Profile

About: KV cache and context planner for longer local LLM contexts on Apple Silicon Macs with Ollama-compatible benchmarks.
Homepage: https://github.com/Lin-Aurora/MacKV-Opt#readme
Topics: apple-silicon, ollama, llama-cpp, kv-cache, local-llm, llm-inference, macos, benchmark, long-context, mlx, gguf, ollama-tools.

The same values are recorded in docs/GITHUB_REPOSITORY_METADATA.md. They can be applied with python scripts/sync_github_metadata.py --apply when a GitHub token is available in the environment.

Development

python -m pip install -e ".[dev]"
python -m pytest -q

The CI workflow validates packaging, unit tests, and non-executing CLI smoke paths on Ubuntu and macOS. It does not download models or call Ollama inference.

Roadmap

Add a packaged installer and shell completions.
Add richer prompt suites for local validation.
Add clearer warning thresholds for swap, pageouts, and throughput collapse.
Add optional adapters for direct llama.cpp execution.
Improve automatic metadata extraction from Ollama and GGUF files.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github/workflows		.github/workflows
docs		docs
mackv_opt		mackv_opt
scripts		scripts
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README.zh-CN.md		README.zh-CN.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MacKV-Opt

What It Optimizes

Quick Start

How It Works

Compared With Other Options

Validate Improvement On Your Mac

Testing Without A Mac

CLI Reference

Safety And Privacy

Repository Profile

Development

Roadmap

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MacKV-Opt

What It Optimizes

Quick Start

How It Works

Compared With Other Options

Validate Improvement On Your Mac

Testing Without A Mac

CLI Reference

Safety And Privacy

Repository Profile

Development

Roadmap

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages