Build a lightweight prompt-compression service that uses a fast token-classification model at runtime, not an LLM. The service should score text, delete lower-value tokens, preserve important structure, and return compressed prompts through a local API and testing UI.
The first milestone is already in progress: this repo runs the base LLMLingua-2 model behind a FastAPI service with a browser test app. Python has been upgraded to 3.14. A Dockerfile exists, but the Docker image has not been built or validated yet.
The compressor should work like this:
input text
-> tokenizer / LLMLingua-2 model
-> per-token keep/drop decision
-> deterministic aggressiveness control
-> protected token hints
-> compressed text + stats + token labels
This is intentionally extractive. The model should not summarize, rewrite, or generate new text. It should only remove tokens while preserving original order and wording.
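The extractive rule above can be illustrated with a minimal sketch. It assumes the model has already produced one keep-score per token; the function name `compress_extractively`, its parameters, and the whitespace join are hypothetical simplifications, not the code in app/compressor.py (real subword tokens need proper detokenization).

```python
from typing import Iterable


def compress_extractively(
    tokens: list[str],
    keep_scores: list[float],
    target_rate: float,
    protected: Iterable[int] = (),
) -> tuple[str, list[bool]]:
    """Keep the highest-scoring tokens in original order; never rewrite them."""
    if len(tokens) != len(keep_scores):
        raise ValueError("tokens and keep_scores must have the same length")

    # Protected indices are always kept, even if that exceeds the budget.
    forced = {i for i in protected if 0 <= i < len(tokens)}
    budget = max(len(forced), round(target_rate * len(tokens)))

    # Rank the remaining tokens by score; ties break on position so the
    # result is deterministic for a given input.
    ranked = sorted(
        (i for i in range(len(tokens)) if i not in forced),
        key=lambda i: (-keep_scores[i], i),
    )
    kept = forced | set(ranked[: budget - len(forced)])

    labels = [i in kept for i in range(len(tokens))]
    text = " ".join(tok for tok, keep in zip(tokens, labels) if keep)
    return text, labels
```

Because the output is built only by filtering the input list, the original order and wording are preserved by construction.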
Implemented:
- app/main.py: FastAPI application with /, /health, and /compress.
- app/compressor.py: wrapper around the LLMLingua-2 MeetingBank model.
- app/protected_spans.py: basic forced-token hints for punctuation, negations, URLs, emails, numbers, money, inline code, and uppercase IDs.
- app/schemas.py: request/response models, including labeled token output.
- scripts/smoke_test.py: local API smoke test.
- tests/: unit tests for threshold mapping, protected tokens, and API response shape.
- README.md: setup, usage, API examples, Docker notes, and Cloud Run shape.
- Dockerfile: Python 3.14 container definition, not yet built.
- .vscode/: local VS Code tasks, launch config, settings, and extension recommendations.
Current runtime model:
microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank
The service defaults to CPU. It can use CUDA only if a compatible CUDA PyTorch install is available and COMPRESSOR_DEVICE=cuda is set.
Status: mostly complete.
Objective: prove compression works locally without training a model.
Steps:
- Use a Python 3.14 virtual environment.
- Install dependencies from requirements-dev.txt.
- Start the API:
uvicorn app.main:app --reload --host 127.0.0.1 --port 8000
- Open the testing UI:
http://127.0.0.1:8000/
- Test API docs:
http://127.0.0.1:8000/docs
- Run the smoke test:
python scripts\smoke_test.py
Expected behavior:
- First compression request may be slow because the Hugging Face model downloads.
- Later requests should reuse the loaded model.
- Response includes compressed text, token counts, reduction percentage, elapsed time, model name, target retention rate, and kept/dropped word labels.
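As a rough illustration of how the numeric part of that response could be derived, the sketch below computes the reduction percentage from the two token counts. The class and field names are assumptions made for this example; the actual response models live in app/schemas.py and may differ.

```python
from dataclasses import dataclass


@dataclass
class CompressionStats:
    original_tokens: int
    compressed_tokens: int
    reduction_pct: float
    elapsed_ms: float


def build_stats(
    original_tokens: int, compressed_tokens: int, elapsed_ms: float
) -> CompressionStats:
    """Derive the reduction percentage; empty input counts as 0% reduction."""
    if original_tokens <= 0:
        reduction = 0.0
    else:
        removed = original_tokens - compressed_tokens
        reduction = 100.0 * removed / original_tokens
    return CompressionStats(
        original_tokens=original_tokens,
        compressed_tokens=compressed_tokens,
        reduction_pct=round(reduction, 2),
        elapsed_ms=round(elapsed_ms, 2),
    )
```

For example, 200 input tokens compressed to 120 gives a 40% reduction.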
Objective: make the local MVP predictable enough to evaluate.
Steps:
- Verify that the current LLMLingua method call is correct for the installed llmlingua==0.2.2.
- Run the full unit test suite:
pytest -q
- Run a manual compression test through the browser UI.
- Try three aggressiveness values: 0.10, 0.25, and 0.60.
- Record example outputs and check whether critical content is preserved.
- Expand protected_spans.py only where real examples show risky deletions.
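To make the aggressiveness values concrete, here is one possible deterministic mapping from aggressiveness to a target retention rate. The linear formula and the name `target_rate` are assumptions for illustration only; the repo's actual threshold mapping is not shown in this document and may differ. The 0.45 floor mirrors the COMPRESSOR_MIN_RATE value used later in this plan.

```python
def target_rate(aggressiveness: float, min_rate: float = 0.45) -> float:
    """Map aggressiveness in [0, 1] to a retention rate in [min_rate, 1]."""
    if not 0.0 < min_rate <= 1.0:
        raise ValueError("min_rate must be in (0, 1]")
    # Clamp so out-of-range inputs cannot push the rate below the floor.
    a = min(1.0, max(0.0, aggressiveness))
    # Assumed linear interpolation: 0 keeps everything, 1 keeps min_rate.
    return round(1.0 - a * (1.0 - min_rate), 4)


for value in (0.10, 0.25, 0.60):
    print(f"aggressiveness={value:.2f} -> target rate {target_rate(value)}")
```

Under this assumed mapping, the three test values correspond to keeping roughly 94.5%, 86.25%, and 67% of tokens.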
Protected content to watch closely:
- Numbers, dates, IDs, and money.
- URLs and email addresses.
- Negations such as "not", "never", "without", and "unless".
- Required/must/shall constraints.
- Code-like fragments and inline backtick content.
- JSON, XML, SQL, and config text. These should eventually get stricter handling than plain prose.
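A regex-based detector for several of these categories might look like the sketch below. The patterns and the name `find_protected_spans` are illustrative assumptions, not the contents of app/protected_spans.py, and they are deliberately simple; structured text such as JSON or SQL would need stricter handling than this.

```python
import re

# Hypothetical patterns; tune them against real examples before relying on them.
PROTECTED_PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "inline_code": re.compile(r"`[^`]+`"),
    "money": re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?"),
    "number": re.compile(r"\b\d[\d,./:-]*\b"),
    "negation": re.compile(r"\b(?:not|never|without|unless)\b", re.IGNORECASE),
    "constraint": re.compile(r"\b(?:must|shall|required)\b", re.IGNORECASE),
    "upper_id": re.compile(r"\b[A-Z][A-Z0-9]*[-_]\d+\b"),
}


def find_protected_spans(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, kind) character spans that should never be dropped."""
    spans = []
    for kind, pattern in PROTECTED_PATTERNS.items():
        spans.extend((m.start(), m.end(), kind) for m in pattern.finditer(text))
    # Spans may overlap (for example a number inside an ID); callers can
    # treat any token touching a span as protected.
    return sorted(spans)
```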
Objective: determine whether compression helps the actual downstream workflow.
Create a small local eval set before training anything custom:
data/eval/
prompt_001.txt
prompt_002.txt
...
Start with 50-100 examples from real usage:
- Long system prompts.
- RAG context chunks.
- Chat history.
- Documentation passages.
- Support or meeting transcript text.
- Agent instructions.
For each example, save:
- Original prompt.
- Compressed prompt at safe/balanced/aggressive settings.
- Token reduction.
- Whether downstream LLM answer quality changed.
- Any critical deletion failures.
Useful MVP metrics:
- Token reduction percentage.
- Compression latency.
- Downstream answer pass/fail.
- Count of dangerous deletions.
- User-visible readability of compressed prompt.
Do not rely only on semantic similarity. The real test is whether the target LLM still answers correctly.
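The metrics above could be aggregated per setting with something like the following sketch. The `EvalResult` fields and the `summarize` helper are hypothetical names chosen for this example, and the pass/fail judgment is assumed to come from a separate downstream check that is not shown.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalResult:
    prompt_id: str
    setting: str              # "safe", "balanced", or "aggressive"
    original_tokens: int      # assumed to be greater than zero
    compressed_tokens: int
    latency_ms: float
    answer_passed: bool       # did the downstream LLM still answer correctly?
    dangerous_deletions: int


def summarize(results: list[EvalResult]) -> dict[str, dict[str, float]]:
    """Group results by setting and compute the MVP metrics for each group."""
    by_setting: dict[str, list[EvalResult]] = {}
    for result in results:
        by_setting.setdefault(result.setting, []).append(result)

    summary = {}
    for setting, rows in by_setting.items():
        reductions = [
            100.0 * (r.original_tokens - r.compressed_tokens) / r.original_tokens
            for r in rows
        ]
        summary[setting] = {
            "examples": len(rows),
            "mean_reduction_pct": mean(reductions),
            "mean_latency_ms": mean([r.latency_ms for r in rows]),
            "pass_rate": mean([1.0 if r.answer_passed else 0.0 for r in rows]),
            "dangerous_deletions": sum(r.dangerous_deletions for r in rows),
        }
    return summary
```

Readability of the compressed prompt is left out here because it needs a human judgment rather than a computed value.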
Status: pending. The Dockerfile exists but has not been built.
Objective: prove the service runs from a clean container.
Steps:
- Build the image:
docker build -t prompt-compression .
- Run it locally:
docker run --rm -p 8080:8080 prompt-compression
- Open:
http://127.0.0.1:8080/
- Run a POST test against port 8080.
- Confirm model download works inside the container.
- Confirm memory use is acceptable.
Risks to verify:
- Python 3.14 compatibility with torch, transformers, and llmlingua.
- Availability of python:3.14-slim and binary wheels for all dependencies.
- Cold-start time caused by downloading the model at first request.
- Container memory pressure during model load.
If dependency wheels are not ready for Python 3.14, the pragmatic fallback is to keep local dev on Python 3.14 but use a supported Python version in Docker until the ecosystem catches up.
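A quick way to check the compatibility risk inside a given interpreter or container is to try importing each dependency and report what fails. The script below is a hypothetical helper, not part of the repo; it only confirms that the packages import, not that they behave correctly.

```python
import importlib
import sys


def check_imports(modules: list[str]) -> dict[str, str]:
    """Try to import each module and record "ok" or the error that occurred."""
    status = {}
    for name in modules:
        try:
            importlib.import_module(name)
            status[name] = "ok"
        except Exception as exc:  # broad on purpose: native wheels can fail oddly
            status[name] = f"{type(exc).__name__}: {exc}"
    return status


if __name__ == "__main__":
    print(f"Python {sys.version}")
    for name, result in check_imports(["torch", "transformers", "llmlingua"]).items():
        print(f"{name}: {result}")
```

Running it once on the local Python 3.14 environment and once inside the container makes it easier to decide whether the Docker image needs the fallback Python version.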
Objective: deploy the Dockerized API with minimal infrastructure.
Recommended hosting path:
FastAPI
-> Docker image
-> Cloud Run
-> HTTP clients call POST /compress
Initial Cloud Run settings:
- CPU: 1-2 vCPU.
- Memory: 1-2 GB.
- Concurrency: start around 10.
- Minimum instances: 0 for cheapest testing, 1 if cold starts are unacceptable.
- Model file: initially downloaded at startup or first request; later baked into the image.
Deployment steps:
- Build Docker image.
- Push image to a registry.
- Deploy to Cloud Run.
- Set environment variables:
COMPRESSOR_MODEL=microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank
COMPRESSOR_MIN_RATE=0.45
COMPRESSOR_DEVICE=cpu
- Test /health.
- Test /compress.
- Watch logs for latency, memory, and first-request model loading errors.
Objective: replace or fine-tune the base LLMLingua-2 behavior with domain-specific examples.
Do this after the local and Docker MVPs are stable.
Training data plan:
- Collect 500-2,000 representative examples.
- Use an LLM offline to create extractive compressed versions.
- Reject examples where the compressed output cannot be aligned back to original text.
- Convert original/compressed pairs into KEEP/DROP token labels.
- Fine-tune a small token classifier.
- Evaluate against the Phase 3 eval set.
- Promote only if it beats or matches the base LLMLingua-2 model on quality and speed.
Potential bootstrap data:
microsoft/MeetingBank-LLMCompressed
Important licensing note: that dataset is listed as non-commercial/share-alike, so treat it as research/bootstrap material unless licensing is reviewed for the intended use.
Do these after the product behavior is proven:
- Export the model to ONNX.
- Quantize to INT8.
- Load the model during service startup instead of first request.
- Bake model artifacts into the Docker image.
- Add structured logs:
request_id
model
aggressiveness
target_rate
original_tokens
compressed_tokens
reduction
elapsed_ms
error
- Add a regression eval command.
- Add stricter span preservation for structured text.
- Add max input size and clear error handling.
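For the structured logs item, one simple approach is to emit a single JSON object per request with the fields listed above in a fixed order. The sketch below uses only the standard library; the logger name and the helpers `format_log_line` and `log_compression` are hypothetical.

```python
import json
import logging
import uuid

LOG_FIELDS = (
    "request_id",
    "model",
    "aggressiveness",
    "target_rate",
    "original_tokens",
    "compressed_tokens",
    "reduction",
    "elapsed_ms",
    "error",
)

logger = logging.getLogger("prompt_compression")


def format_log_line(**values) -> str:
    """Serialize one request as JSON; unset fields are emitted as null."""
    unknown = set(values) - set(LOG_FIELDS)
    if unknown:
        raise KeyError(f"unexpected log fields: {sorted(unknown)}")
    # Generate a request ID when the caller does not supply one.
    values.setdefault("request_id", uuid.uuid4().hex)
    record = {name: values.get(name) for name in LOG_FIELDS}
    return json.dumps(record)


def log_compression(**values) -> None:
    """Write the JSON line at INFO level so log tooling can parse it."""
    logger.info(format_log_line(**values))
```

Keeping one JSON object per line lets hosted log viewers filter on fields such as elapsed_ms or error without custom parsing.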
- Run pytest -q on the current Python 3.14 environment.
- Start local API and confirm http://127.0.0.1:8000/ works.
- Run python scripts\smoke_test.py.
- Test several real prompts and save results.
- Build Docker image.
- Run Docker container locally on port 8080.
- Decide whether Docker stays on Python 3.14 or temporarily falls back for dependency compatibility.
- Deploy to Cloud Run only after Docker is verified locally.
The first hosted MVP is done when:
- Local UI compresses text reliably.
- /compress returns useful stats and labeled tokens.
- Unit tests pass.
- Docker image builds.
- Docker container runs locally.
- Cloud Run deployment responds to /health and /compress.
- A small eval set shows acceptable token savings without obvious critical deletions.