Follow-up from PR #1286 review (ananthsub).
The inference_provider model server currently issues chat-completion requests directly:
    async with self._semaphore:
        chat_completion_dict = await self._client.create_chat_completion(**body_dict)
As we add support for more hosted providers (Fireworks, Together.ai, Baseten, DeepInfra, Nebius, Friendli, OpenRouter, HF Inference, Gemini, ...), their OpenAI-compatible endpoints differ in how they surface errors (rate limits, auth failures, model-not-found, transient 5xx, etc.). We may want provider-aware / more granular error handling here rather than letting raw errors propagate uniformly.
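One possible shape for the normalization step, as a rough sketch only: map an HTTP status code onto the error classes listed above. The names `ErrorKind`, `NormalizedError`, and `classify_error` are hypothetical and do not exist in the codebase, and the status-code mapping (in particular treating 404 as model-not-found) is an assumption that the provider survey would need to confirm or replace with per-provider rules.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    """Hypothetical error classes, taken from the categories named in this issue."""

    RATE_LIMIT = "rate_limit"
    AUTH = "auth"
    MODEL_NOT_FOUND = "model_not_found"
    TRANSIENT = "transient"
    OTHER = "other"


@dataclass(frozen=True)
class NormalizedError:
    """Hypothetical provider-independent view of a failed request."""

    kind: ErrorKind
    status_code: int
    retryable: bool
    message: str


def classify_error(status_code: int, message: str = "") -> NormalizedError:
    """Classify by status code alone.

    Assumption: providers use conventional HTTP status codes. Providers that
    do not would need their own rules (for example, inspecting the error body).
    """
    if status_code == 429:
        kind = ErrorKind.RATE_LIMIT
    elif status_code in (401, 403):
        kind = ErrorKind.AUTH
    elif status_code == 404:
        kind = ErrorKind.MODEL_NOT_FOUND
    elif 500 <= status_code < 600:
        kind = ErrorKind.TRANSIENT
    else:
        kind = ErrorKind.OTHER
    # Only rate limits and server-side failures are worth retrying here.
    retryable = kind in (ErrorKind.RATE_LIMIT, ErrorKind.TRANSIENT)
    return NormalizedError(kind, status_code, retryable, message)
```

Keeping the classification in one pure function would let provider-specific overrides be added later without touching the request path.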
Scope:
- Survey error response shapes/status codes across supported providers
- Decide on a normalization / retry / surfacing strategy
- Apply consistently in responses_api_models/inference_provider/app.py
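To make the retry part of the scope concrete, here is a minimal sketch of a retry wrapper with exponential backoff and jitter. Everything in it is an assumption for discussion: `ProviderError`, `is_retryable`, and `call_with_retries` are hypothetical names, the retryable set (429 and 5xx) is a guess pending the survey, and translating the client's raw exceptions into `ProviderError` is left out because it depends on the client library.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class ProviderError(Exception):
    """Hypothetical normalized error; not an existing class in the codebase."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(f"{status_code}: {message}")
        self.status_code = status_code


def is_retryable(status_code: int) -> bool:
    # Assumption: only rate limits (429) and server errors (5xx) are retried.
    return status_code == 429 or 500 <= status_code < 600


async def call_with_retries(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await call(), retrying retryable ProviderErrors up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except ProviderError as exc:
            # Give up on the last attempt or on non-retryable errors (e.g. auth).
            if attempt == max_attempts or not is_retryable(exc.status_code):
                raise
            # Exponential backoff: base_delay, 2x, 4x, ... plus up to 50% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

In the model server, `call` would presumably wrap the existing `create_chat_completion` request. Whether the wrapper should sit inside or outside the semaphore is an open design question, since retrying while holding the semaphore keeps a concurrency slot occupied during backoff.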
Ref: PR #1286 review comment.