infinite s3dlio list during training (retinanet) #472

@rasar07

Issue description: 
We have 4 clients with 755 GB of memory per client. For this configuration, roughly 50 million JPEG files of about 0.3 MiB each have to be generated and then read back for training. During the training run, s3dlio lists the bucket; the listing has been running for more than 12 hours and has still not finished.
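For scale, the figures above can be turned into a rough dataset size. This is only a back-of-the-envelope sketch: it uses the rounded values from the description (50 million files, 0.3 MiB each) and assumes that "GB" means 10^9 bytes, which the report does not state.

```python
# Rounded figures from the issue description.
NUM_FILES = 50_000_000        # "50 million" files
FILE_SIZE_MIB = 0.3           # approximate size of one JPEG
CLIENTS = 4
MEM_PER_CLIENT_GB = 755

MIB = 1024 ** 2
GB = 10 ** 9                  # ASSUMPTION: decimal gigabytes

dataset_bytes = NUM_FILES * FILE_SIZE_MIB * MIB
aggregate_memory_bytes = CLIENTS * MEM_PER_CLIENT_GB * GB

# How many times larger the dataset is than the combined client memory.
ratio = dataset_bytes / aggregate_memory_bytes

print(f"dataset size:      {dataset_bytes / 1e12:.2f} TB")
print(f"aggregate memory:  {aggregate_memory_bytes / 1e12:.2f} TB")
print(f"dataset / memory:  {ratio:.2f}x")
```

The point of the ratio is only that the object count is driven by client memory, so larger hosts directly mean more objects for s3dlio to enumerate.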

  1. Datasize:
    mlpstorage closed training retinanet datasize --client-host-memory-in-gb 754 --max-accelerators 16 --accelerator-type b200 --num-client-hosts 4

gives:

RESULT: Number of training files: 50136788
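To put the file count in context for the listing problem: S3 `ListObjectsV2` returns at most 1000 keys per response, so enumerating this many objects takes tens of thousands of paginated requests. The sketch below estimates the wall-clock time if those pages are fetched strictly one after another. The per-page latencies are hypothetical placeholders, not measurements from this cluster, and whether s3dlio lists sequentially or once per rank is not something this report establishes.

```python
import math

NUM_OBJECTS = 50_136_788      # training files reported by the datasize step
MAX_KEYS_PER_PAGE = 1000      # upper limit per ListObjectsV2 response


def list_pages(num_objects: int, page_size: int = MAX_KEYS_PER_PAGE) -> int:
    """Minimum number of paginated list requests to enumerate num_objects keys."""
    return math.ceil(num_objects / page_size)


def listing_seconds(num_objects: int, seconds_per_page: float) -> float:
    """Wall-clock time if the pages are fetched one after another."""
    return list_pages(num_objects) * seconds_per_page


pages = list_pages(NUM_OBJECTS)
print(f"list pages needed: {pages}")

# ASSUMPTION: hypothetical per-page latencies, for illustration only.
for latency in (0.05, 0.5, 1.0):
    hours = listing_seconds(NUM_OBJECTS, latency) / 3600
    print(f"{latency:.2f} s/page -> {hours:.2f} h")
```

Timing a single list page against the object store and plugging it in here would show whether a 12-hour listing is consistent with one sequential pass or points at something else, such as repeated listings.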

  2. Data generation:

mlpstorage closed training retinanet datagen object
--num-processes 64
--hosts gizmoc5,gizmoc6,gizmoc7,gizmoc8
--data-dir retinanet_64p
--params dataset.num_files_train=50136788
--params storage.storage_type=s3
--params storage.storage_root=retinanet1
--mpi-params "-x AWS_ACCESS_KEY_ID -x AWS_SECRET_ACCESS_KEY -x AWS_REGION -x S3_ENDPOINT_URIS -x S3DLIO_MAX_CONCURRENCY -x S3DLIO_MULTIPART_THRESHOLD -x S3DLIO_PART_SIZE -x S3DLIO_CONNECTION_TIMEOUT -x S3DLIO_READ_TIMEOUT -x AWS_S3_MULTIPART_THRESHOLD -x RUST_LOG --map-by slot --mca pml ob1 --mca btl tcp,self --mca btl_openib_allow_ib 0 --mca btl_openib_warn_no_device_params_found 0"
--allow-run-as-root

export AWS_ACCESS_KEY_ID=user1
export AWS_SECRET_ACCESS_KEY=iOpHmc2afcbYuAfAapfWljfxtbHM6XYalMC2GwD6
export AWS_REGION=us-east-1
export S3_ENDPOINT_URIS=http://ip1:9020,http://ip2:9020,http://ip3:9020,http://ip4:9020
export S3DLIO_MAX_CONCURRENCY=32
export S3DLIO_MULTIPART_THRESHOLD=1073741824
export S3DLIO_PART_SIZE=268435456
export S3DLIO_CONNECTION_TIMEOUT=300
export S3DLIO_READ_TIMEOUT=300
export RUST_LOG=s3dlio=debug,aws_sdk_s3=debug

  3. Training:

mlpstorage open training retinanet run object
--data-dir retinanet_64p
--results-dir /tmp/result.20260618.033618/mlperf/training/retinanet/run_20260618_033720
--accelerator-type b200
--num-accelerators 256
--client-host-memory-in-gb 755
--hosts gizmoc5,gizmoc6,gizmoc7,gizmoc8
--params storage.storage_type=s3
--params storage.storage_root=retinanet1
--params dataset.num_files_train=50204282
--mpi-params "-x AWS_ACCESS_KEY_ID -x AWS_SECRET_ACCESS_KEY -x AWS_REGION -x S3_ENDPOINT_URIS -x S3DLIO_MAX_CONCURRENCY -x S3DLIO_MULTIPART_THRESHOLD -x S3DLIO_PART_SIZE -x S3DLIO_CONNECTION_TIMEOUT -x S3DLIO_OPERATION_TIMEOUT_SECS -x S3DLIO_READ_TIMEOUT -x RUST_LOG --map-by slot --mca pml ob1 --mca btl tcp,self --mca btl_openib_allow_ib 0 --mca btl_openib_warn_no_device_params_found 0"
--allow-run-as-root

export AWS_ACCESS_KEY_ID=user1
export AWS_SECRET_ACCESS_KEY=iOpHmc2afcbYuAfAapfWljfxtbHM6XYalMC2GwD6
export AWS_REGION=us-east-1
export S3_ENDPOINT_URIS=http://ip1:9020,http://ip2:9020,http://ip3:9020,http://ip4:9020
export S3DLIO_MAX_CONCURRENCY=32
export S3DLIO_MULTIPART_THRESHOLD=1073741824
export S3DLIO_PART_SIZE=268435456
export S3DLIO_CONNECTION_TIMEOUT=600
export S3DLIO_OPERATION_TIMEOUT_SECS=600
export S3DLIO_READ_TIMEOUT=600
export RUST_LOG=s3dlio=info,aws_sdk_s3=warn
