Bug
`@cf/zai-org/glm-4.7-flash` is currently tagged `active, [COST_EFFECTIVE, BALANCED, TOOL_CALLING, LONG_CONTEXT]` in `model-catalog`. This causes it to be selected as the first-choice model for summary and cost-effective routing.
The model outputs chain-of-thought reasoning traces instead of direct responses, making it unsuitable for any route class where the caller expects a clean answer. Example of what users get when glm-4.7-flash is selected for a summary turn:
```
- Analyze the user's request:
- Topic: Local LLM gateway proxy.
- Constraint: "In one sentence."
- Why do people use it? ...
- Draft 1: A local LLM gateway proxy is software that...
- *Draft 2 (More technical
```
The response is cut off mid-reasoning at `max_tokens`, so the model never reaches the actual answer.
Likely affected models (same pattern)
Two additional models expected to have the same issue if added to the catalog:
- `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` — DeepSeek R1 distill, reasoning model
- `@cf/qwen/qwq-32b` — QwQ is explicitly a reasoning model
Fix
Either:
- Remove `COST_EFFECTIVE` and `BALANCED` use cases from `glm-4.7-flash` — keep it only under an explicit `REASONING` or `ANALYTICAL` use case
- Add a `thinkingModel: true` flag to the catalog entry so routers can filter these out for direct-response routing
When `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` and `@cf/qwen/qwq-32b` are added (see companion issue), apply the same handling.
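A minimal sketch of the second option, assuming a hypothetical catalog shape: the `CatalogEntry` interface, the `candidatesFor` helper, and the `example/direct-model` entry are illustrative only and are not taken from `model-catalog`. It shows a router skipping entries flagged `thinkingModel: true` when the route expects a direct response, without removing their use-case tags.

```typescript
// Hypothetical catalog types; only the use-case tag names and the
// `thinkingModel` flag come from this issue, the rest is assumed.
type UseCase =
  | "COST_EFFECTIVE"
  | "BALANCED"
  | "TOOL_CALLING"
  | "LONG_CONTEXT"
  | "REASONING";

interface CatalogEntry {
  id: string;
  status: "active" | "inactive";
  useCases: UseCase[];
  // Proposed flag: true when the model emits reasoning traces before answering.
  thinkingModel?: boolean;
}

const catalog: CatalogEntry[] = [
  {
    id: "@cf/zai-org/glm-4.7-flash",
    status: "active",
    useCases: ["COST_EFFECTIVE", "BALANCED", "TOOL_CALLING", "LONG_CONTEXT"],
    thinkingModel: true,
  },
  {
    // Placeholder entry standing in for any non-reasoning model.
    id: "example/direct-model",
    status: "active",
    useCases: ["COST_EFFECTIVE", "BALANCED"],
  },
];

// Return active entries for a use case; when the route needs a direct
// response, drop entries flagged as thinking models.
function candidatesFor(
  entries: CatalogEntry[],
  useCase: UseCase,
  directResponse: boolean,
): CatalogEntry[] {
  return entries.filter(
    (e) =>
      e.status === "active" &&
      e.useCases.some((u) => u === useCase) &&
      !(directResponse && e.thinkingModel === true),
  );
}

// Summary routing expects a clean answer, so thinking models are excluded.
const summaryCandidates = candidatesFor(catalog, "COST_EFFECTIVE", true).map(
  (e) => e.id,
);
```

With this approach the flag would also cover the two R1/QwQ entries once they are added, and routes that can tolerate reasoning output can pass `directResponse = false` to keep them eligible.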
Discovered via
bildy end-to-end smoke test: live summary routing selected `glm-4.7-flash` and returned a reasoning trace to Claude Code instead of a summary response.