Provider-aware endpoint error handling in inference_provider model server #1748

Description

@cwing-nvidia

Follow-up from PR #1286 review (ananthsub).

The inference_provider model server currently issues chat-completion requests directly:

async with self._semaphore:
    chat_completion_dict = await self._client.create_chat_completion(**body_dict)

As we add support for more hosted providers (Fireworks, Together.ai, Baseten, DeepInfra, Nebius, Friendli, OpenRouter, HF Inference, Gemini, ...), their OpenAI-compatible endpoints differ in how they surface errors: rate limits, auth failures, model-not-found, transient 5xx, and so on. We may want provider-aware, more granular error handling around this call rather than letting every raw error propagate in the same way.
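To make the idea concrete, here is a minimal sketch of what a provider-independent error category could look like. Everything in it is hypothetical and not taken from the codebase: the `ErrorKind` enum, the `classify_status` helper, the `PROVIDER_OVERRIDES` table, and the `"example-provider"` entry are illustrative names, and the default status-code mapping is an assumption that the provider survey would need to confirm or correct.

```python
from enum import Enum
from typing import Dict, Optional


class ErrorKind(Enum):
    """Coarse error categories named in this issue (hypothetical type)."""
    RATE_LIMIT = "rate_limit"
    AUTH = "auth"
    MODEL_NOT_FOUND = "model_not_found"
    TRANSIENT = "transient"
    OTHER = "other"


# Hypothetical per-provider exceptions to the default mapping.
# The entry below is a placeholder, not a claim about any real provider.
PROVIDER_OVERRIDES: Dict[str, Dict[int, ErrorKind]] = {
    "example-provider": {400: ErrorKind.MODEL_NOT_FOUND},
}


def classify_status(status: int, provider: Optional[str] = None) -> ErrorKind:
    """Map an HTTP status code to a category, honoring provider overrides."""
    override = PROVIDER_OVERRIDES.get(provider or "", {})
    if status in override:
        return override[status]
    # Default mapping: an assumption based on common HTTP conventions.
    if status == 429:
        return ErrorKind.RATE_LIMIT
    if status in (401, 403):
        return ErrorKind.AUTH
    if status == 404:
        return ErrorKind.MODEL_NOT_FOUND
    if 500 <= status <= 599:
        return ErrorKind.TRANSIENT
    return ErrorKind.OTHER
```

Keeping the defaults in one function and the exceptions in a table means a new provider only needs a table entry where it deviates, which may be easier to review than per-provider branches.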

Scope:

  • Survey error response shapes/status codes across supported providers
  • Decide on a normalization / retry / surfacing strategy
  • Apply consistently in responses_api_models/inference_provider/app.py
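One possible shape for the retry part of that strategy is sketched below: exponential backoff with jitter, retrying only on HTTP 429 and 5xx. This is an illustration under stated assumptions, not a proposal for the final design. `ProviderError`, `is_retryable`, and `call_with_retry` are hypothetical names, and the sketch assumes the client's failures are first translated into an exception that carries the HTTP status.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class ProviderError(Exception):
    """Hypothetical normalized error carrying the HTTP status code."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(message or f"provider returned HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # Assumption: only rate limits and server-side errors are worth retrying.
    return status == 429 or 500 <= status <= 599


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await call(), retrying retryable ProviderErrors with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except ProviderError as exc:
            # Re-raise immediately for non-retryable errors or the last attempt.
            if not is_retryable(exc.status) or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from concurrent requests.
            await sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")
```

In the model server, the existing call could then be wrapped as `await call_with_retry(lambda: self._client.create_chat_completion(**body_dict))`. Whether the retries run inside or outside `async with self._semaphore:` is a design choice: retrying inside holds a concurrency slot during backoff, while retrying outside releases it between attempts.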

Ref: PR #1286 review comment.