
Parallel Inference #174

Description

@alpcansoydas

Hello team,

I’m looking for guidance on recommended hardware specifications for serving this model to many parallel users. Specifically, I’d like to understand:

Are there any official or suggested hardware specs for deploying this model in a production environment?

Is it possible to serve 200–300 concurrent users with low latency on a single GPU, or are multiple GPUs required?

If multiple GPUs are recommended, which type (e.g., A100, H100) and how many would typically be required?

Are there any benchmarks, reference architectures, or deployment examples available?

Any information on expected throughput, memory requirements, and scaling strategies (tensor parallelism, pipeline parallelism, model sharding, etc.) would be greatly appreciated.
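
To make the memory question concrete, here is the kind of rough estimate I have in mind: weights plus KV cache for N concurrent users. This is only a back-of-the-envelope sketch. Every number in it (parameter count, layer count, KV heads, head dimension, context length) is a hypothetical placeholder and not this model's actual configuration, and it ignores activations, framework overhead, and any cache-saving tricks a serving engine may apply.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (keys and values, all layers)."""
    # Factor of 2 accounts for storing both a key and a value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def estimate_memory_gib(num_params, num_layers, num_kv_heads, head_dim,
                        concurrent_users, tokens_per_user, bytes_per_elem=2):
    """Rough weights + KV-cache footprint in GiB (assumes fp16/bf16 by default)."""
    weights = num_params * bytes_per_elem
    kv_cache = (kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
                * concurrent_users * tokens_per_user)
    return {
        "weights_gib": weights / GIB,
        "kv_cache_gib": kv_cache / GIB,
        "total_gib": (weights + kv_cache) / GIB,
    }


if __name__ == "__main__":
    # Hypothetical placeholder values, NOT this model's real configuration.
    estimate = estimate_memory_gib(
        num_params=8_000_000_000,
        num_layers=32,
        num_kv_heads=8,
        head_dim=128,
        concurrent_users=250,
        tokens_per_user=2048,
    )
    for name, value in estimate.items():
        print(f"{name}: {value:.1f}")
```

With placeholder values like these, the KV cache for a few hundred users can exceed the weights themselves, which is why I'm asking whether a single GPU is realistic or whether the model needs to be sharded.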
