Summary
Reasoning-capable chat models (Qwen3 series, DeepSeek-R1, etc.) default to emitting a long internal reasoning chain. Wiki generation is mostly "read code + write markdown description"; it doesn't benefit from a chain-of-thought, but it pays the time cost (often 60-80% of completion tokens go to `reasoning_content` rather than output).
repowise has no surface to disable reasoning per-provider. The user has to either accept the slowdown or fork the LLM provider code.
Reproduction
- Use any reasoning model via `OPENAI_BASE_URL` pointing to a runtime that serves it (e.g. Ollama with a recent Qwen3 build, or vLLM/SGLang).
- Run `repowise init` and observe `completion_tokens` vs the actual output text length in the generation phase: most tokens are reasoning, not the wiki content (a rough way to measure this is sketched below).
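A quick, informal way to see the split is a sketch like the following, assuming an OpenAI-compatible server (vLLM/SGLang style) that returns the thinking chain in a `reasoning_content` field; the base URL, model name, and whether that field is exposed at all are backend-specific assumptions, not repowise code:

```python
# Measurement sketch -- base_url, model name, and the reasoning_content field
# are assumptions about the local serving setup, not part of repowise.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="qwen3-8b",  # whatever reasoning model the runtime serves
    messages=[{"role": "user", "content": "Summarize this module as markdown: ..."}],
)

msg = resp.choices[0].message
answer = msg.content or ""
# vLLM/SGLang-style servers attach the thinking chain as an extra field;
# the official SDK keeps unknown fields, so getattr works when it is present.
reasoning = getattr(msg, "reasoning_content", None) or ""
print(f"completion_tokens={resp.usage.completion_tokens}")
print(f"answer chars={len(answer)}, reasoning chars={len(reasoning)}")
```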
Suggested fix
Add an opt-in flag to `GenerationConfig` (or per-provider config) such as:
```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    ...
    disable_reasoning: bool = False
    # When True, providers should pass backend-specific kwargs to disable
    # the model's reasoning chain (e.g. extra_body / chat_template_kwargs /
    # reasoning_effort=minimal, depending on the backend).
```
Then each LLM provider translates the flag to its own backend syntax (a sketch of such a mapping follows the list). For OpenAI-compatible chat completions, the common patterns are:

- vLLM / SGLang serving Qwen3: `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
- OpenAI Responses API o-series: `reasoning={"effort": "minimal"}`
- Some proxies expose vendor-specific envelopes for the same toggle.
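As a concrete illustration, here is a minimal sketch of the chat-completions mapping. The `OpenAIProvider`/`complete` names and constructor are hypothetical stand-ins, not repowise's actual provider interface; `extra_body` is the official `openai` Python SDK's passthrough for non-standard request fields:

```python
# Hypothetical provider sketch: class and method names are illustrative only,
# not repowise's real interface.
from openai import OpenAI

class OpenAIProvider:
    def __init__(self, config, base_url=None, api_key=None):
        self.config = config
        self.client = OpenAI(base_url=base_url, api_key=api_key)

    def complete(self, messages, model):
        extra_body = {}
        if getattr(self.config, "disable_reasoning", False):
            # vLLM / SGLang serving Qwen3: switch the thinking chain off via
            # the chat template; other backends would need their own mapping.
            extra_body["chat_template_kwargs"] = {"enable_thinking": False}
        return self.client.chat.completions.create(
            model=model,
            messages=messages,
            extra_body=extra_body or None,
        )
```

For backends driven through the Responses API, the same flag would map to the `reasoning={"effort": ...}` field instead of `extra_body`.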
Locally, we observed a 3.9× speedup translating ~3800 wiki pages (a related batch task using the same chat completions endpoint) with reasoning disabled, with no measurable quality regression on the output markdown structure or technical content.
Happy to send a PR adding the flag and a default `OpenAIProvider` mapping; backend-specific mappings can be added incrementally.