Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 21 additions & 5 deletions Analysis.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,27 @@
### Metrics

#### Stable Diffusion on Nvidia Geforce GTX 1070 GPU
#### Stable Diffusion UNet on Nvidia Geforce GTX 1070 GPU

Average inference time: 28.63s ± 3.09s
see `results/benchmarks/benchmark_results.csv`

Average CPU memory usage: 890.11MB ± 59.32MB
``` bash
INFO - 2025-09-01 16:26:26,203 - 3315694995.py - UNet Inference time: 0.10s
INFO - 2025-09-01 16:26:26,448 - 3315694995.py - UNet Inference time: 0.24s
INFO - 2025-09-01 16:26:26,711 - 3315694995.py - UNet Inference time: 0.26s
INFO - 2025-09-01 16:26:26,974 - 3315694995.py - UNet Inference time: 0.26s
INFO - 2025-09-01 16:26:27,238 - 3315694995.py - UNet Inference time: 0.26s
INFO - 2025-09-01 16:26:27,494 - 3315694995.py - UNet Inference time: 0.25s
INFO - 2025-09-01 16:26:27,760 - 3315694995.py - UNet Inference time: 0.27s
INFO - 2025-09-01 16:26:28,017 - 3315694995.py - UNet Inference time: 0.26s
INFO - 2025-09-01 16:26:28,285 - 3315694995.py - UNet Inference time: 0.27s
INFO - 2025-09-01 16:26:28,538 - 3315694995.py - UNet Inference time: 0.25s
INFO - 2025-09-01 16:26:28,538 - 3315694995.py -

Average GPU memory usage: 2486.43MB ± 0.00MB
Average inference time: 0.24s ± 0.05s

see `results/benchmarks/benchmark_results.csv`
Average CPU memory usage: 1009.82MB ± 0.03MB

Average GPU memory usage: 3817.92MB ± 0.00MB
```

---
134 changes: 100 additions & 34 deletions notebooks/baseline_generation.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,9 @@
"id": "0",
"metadata": {},
"source": [
"Choosing to use a LoRA / Distilled model because its lighter, faster, lower VRAM, easier for experimentation, perfect for a baseline and for quantization/optimization later."
"An important thing to keep in mind here is that a stable diffusion model is not a monolithic model but has different parts such as a UNet, VAE, text encoders etc. \n",
"\n",
"_Here (at least initially) we will only focus on and benchmark the UNet_ because its the most compute heavy. And this approach is more simpler than trying to export to ONNX, apply PTQ, QAT etc on all the components."
]
},
{
Expand All @@ -22,8 +24,8 @@
"import statistics\n",
"import psutil\n",
"import pandas as pd \n",
"import matplotlib.pyplot as plt\n",
"from PIL import Image"
"from tinydiffusion.utils.logger import LoggerConfig #this works in VS Code because of the .env file\n",
"from tinydiffusion.utils.constants import PROMPT"
]
},
{
Expand All @@ -32,14 +34,24 @@
"id": "2",
"metadata": {},
"outputs": [],
"source": [
"LOGGER = LoggerConfig().logger"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3",
"metadata": {},
"outputs": [],
"source": [
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"print(f\"Using device: {device}\")"
"LOGGER.info(f\"Using device: {device}\")"
]
},
{
"cell_type": "markdown",
"id": "3",
"id": "4",
"metadata": {},
"source": [
"Load a lightweight/distilled Stable Diffusion model (LoRA or small variant)"
Expand All @@ -48,29 +60,31 @@
{
"cell_type": "code",
"execution_count": null,
"id": "4",
"id": "5",
"metadata": {},
"outputs": [],
"source": [
"ROOT_DIR = os.path.dirname(os.getcwd())\n",
"print(f\"Root directory: {ROOT_DIR}\")"
"LOGGER.info(f\"Root directory: {ROOT_DIR}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5",
"id": "6",
"metadata": {},
"outputs": [],
"source": [
"from tinydiffusion.utils.constants import ModelType\n",
"\n",
"# Example: \"stabilityai/stable-diffusion-2-base\" is smaller than SD 1.5 full\n",
"model_cache_dir = os.path.join(ROOT_DIR, \"checkpoints\", \"stablediffusion\")\n",
"model_id = \"stabilityai/stable-diffusion-2-base\" # This is not LoRA checkpoint"
"model_id = ModelType.STABLE_DIFFUSION_2_BASE.value "
]
},
{
"cell_type": "markdown",
"id": "6",
"id": "7",
"metadata": {},
"source": [
"Below we load the fp16 variant (as opposed to downloading the fp32 variant and then converting to fp16). [Ref](https://huggingface.co/docs/diffusers/en/using-diffusers/loading#:~:text=There%20are%20two%20important%20arguments%20for%20loading%20variants%3A)"
Expand All @@ -79,7 +93,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "7",
"id": "8",
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -98,12 +112,13 @@
" cache_dir=model_cache_dir,\n",
" torch_dtype=torch.float32 \n",
" )\n",
"pipe = pipe.to(device)"
"pipe = pipe.to(device)\n",
"pipe.unet.eval()"
]
},
{
"cell_type": "markdown",
"id": "8",
"id": "9",
"metadata": {},
"source": [
"[StableDiffusionPipeline.enable_attention_slicing()](https://huggingface.co/docs/diffusers/v0.3.0/en/api/pipelines/stable_diffusion#diffusers.StableDiffusionPipeline.enable_attention_slicing)"
Expand All @@ -112,7 +127,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "9",
"id": "10",
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -122,7 +137,50 @@
},
{
"cell_type": "markdown",
"id": "10",
"id": "11",
"metadata": {},
"source": [
"Create inputs of UNet since as noted at the beginning of the notebook, we intend to benchmark just the UNet. So we need to explicitly pass the inputs to the UNet and get just the UNet ouput.\n",
"\n",
"See [this](https://medium.com/@onkarmishra/stable-diffusion-explained-1f101284484d) for a quick reference about the stable diffusion architecture.\n",
"\n",
"Instead of running text encoder → U-Net denoising loop → VAE decode we:\n",
"\n",
"- Generate fake random latents (the \"noisy\" image at some timestep).\n",
"- Pick a timestep (e.g. 50).\n",
"- Encode text prompt via `pipe.text_encoder` ie, Stable Diffusion's text encoder (whatever it uses).\n",
"- Run just the U-Net forward pass.\n",
"\n",
"This means we dont actually generate a final image."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12",
"metadata": {},
"outputs": [],
"source": [
"batch_size = 1\n",
"height = width = 64 # for 512x512 images\n",
"\n",
"# dummy latent image representation for 1 image\n",
"latents = torch.randn(\n",
" (batch_size, pipe.unet.config.in_channels, height, width),\n",
" device=device,\n",
" dtype=pipe.unet.dtype\n",
")\n",
"timestep = torch.tensor([10], device=device, dtype=torch.int64) # arbitrary diffusion step\n",
"text_embeddings = pipe.text_encoder(\n",
" pipe.tokenizer(PROMPT, return_tensors=\"pt\").input_ids.to(device)\n",
")[0]\n",
"\n",
"LOGGER.info(f\"Text embeddings shape: {text_embeddings.shape}\")"
]
},
{
"cell_type": "markdown",
"id": "13",
"metadata": {},
"source": [
"Benchmarking"
Expand All @@ -131,7 +189,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "11",
"id": "14",
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -144,26 +202,27 @@
{
"cell_type": "code",
"execution_count": null,
"id": "12",
"id": "15",
"metadata": {},
"outputs": [],
"source": [
"prompt = \"A whale falling through a starry sky beside a floating bowl of petunias, painted in a surreal \" \\\n",
" \"cosmic landscape, whimsical and dreamlike, detailed digital art.\"\n",
"prompt = PROMPT\n",
"num_samples = 10\n",
"\n",
"process = psutil.Process(os.getpid())\n",
"\n",
"GEN_IMG_SAVE_PATH = os.path.join(os.path.dirname(os.getcwd()), \"results\", \"generated_images\")\n",
"os.makedirs(GEN_IMG_SAVE_PATH, exist_ok=True)\n",
"#GEN_IMG_SAVE_PATH = os.path.join(os.path.dirname(os.getcwd()), \"results\", \"generated_images\")\n",
"#os.makedirs(GEN_IMG_SAVE_PATH, exist_ok=True)\n",
"\n",
"print(f\"Generating images. Will be saved to: {GEN_IMG_SAVE_PATH}\")\n",
"#LOGGER.info(f\"Generating images. Will be saved to: {GEN_IMG_SAVE_PATH}\")\n",
"\n",
"results = []\n",
"\n",
"for i in range(num_samples):\n",
" start_time = time.time()\n",
" image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0]\n",
" with torch.no_grad():\n",
" #image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0] # this is what we would typically do to generate the image\n",
" noise_pred = pipe.unet(latents, timestep, text_embeddings).sample\n",
" end_time = time.time()\n",
" inference_time.append(end_time - start_time)\n",
"\n",
Expand All @@ -178,17 +237,16 @@
" gpu_mem = 0 \n",
" # Memory usage - END\n",
"\n",
" print(f\"Saved sample_{i+1}.png | Inference time: {(end_time - start_time):.2f}s\", end=\"\\n\")\n",
" image.save(os.path.join(GEN_IMG_SAVE_PATH, f\"sample_{i}.png\"))\n",
" LOGGER.info(f\"UNet Inference time: {(end_time - start_time):.2f}s\")\n",
"\n",
"print(f\"\\nAverage inference time: {statistics.mean(inference_time):.2f}s ± {statistics.stdev(inference_time):.2f}s\")\n",
"print(f\"\\nAverage CPU memory usage: {statistics.mean(cpu_mem_usage):.2f}MB ± {statistics.stdev(cpu_mem_usage):.2f}MB\")\n",
"LOGGER.info(f\"\\nAverage inference time: {statistics.mean(inference_time):.2f}s ± {statistics.stdev(inference_time):.2f}s\")\n",
"LOGGER.info(f\"\\nAverage CPU memory usage: {statistics.mean(cpu_mem_usage):.2f}MB ± {statistics.stdev(cpu_mem_usage):.2f}MB\")\n",
"if device == \"cuda\":\n",
" print(f\"\\nAverage GPU memory usage: {statistics.mean(gpu_mem_usage):.2f}MB ± {statistics.stdev(gpu_mem_usage):.2f}MB\")\n",
" LOGGER.info(f\"\\nAverage GPU memory usage: {statistics.mean(gpu_mem_usage):.2f}MB ± {statistics.stdev(gpu_mem_usage):.2f}MB\")\n",
"\n",
"# store results\n",
"results.append({\n",
" \"desc\": \"stable_diffusion_GPU\",\n",
" \"desc\": \"stable_diffusion_UNet_GPU\",\n",
" \"avg_inference_time\": statistics.mean(inference_time),\n",
" \"std_inference_time\": statistics.stdev(inference_time),\n",
" \"avg_cpu_mem_usage\": statistics.mean(cpu_mem_usage),\n",
Expand All @@ -200,7 +258,15 @@
},
{
"cell_type": "markdown",
"id": "13",
"id": "16",
"metadata": {},
"source": [
"The inference time would be really small here because we're running only one denoising step of the UNet as opposed to say 50 denoising steps. "
]
},
{
"cell_type": "markdown",
"id": "17",
"metadata": {},
"source": [
"Save benchmark details as CSV"
Expand All @@ -209,7 +275,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "14",
"id": "18",
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -220,20 +286,20 @@
{
"cell_type": "code",
"execution_count": null,
"id": "15",
"id": "19",
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(results)\n",
"csv_path = os.path.join(BENCHMARK_SAVE_PATH, \"benchmark_results.csv\")\n",
"df.to_csv(csv_path, index=False)\n",
"print(f\"Saved benchmark results to {csv_path}\")"
"LOGGER.info(f\"Saved benchmark results to {csv_path}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16",
"id": "20",
"metadata": {},
"outputs": [],
"source": []
Expand Down
2 changes: 1 addition & 1 deletion results/benchmarks/benchmark_results.csv
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
desc,avg_inference_time,std_inference_time,avg_cpu_mem_usage,std_cpu_mem_usage,avg_gpu_mem_usage,std_gpu_mem_usage
stable_diffusion_GPU,28.794894647598266,2.808588022429784,897.65859375,1.677354443989871,2486.42822265625,0.0
stable_diffusion_UNet_GPU,0.24329500198364257,0.04984202499570953,1009.8203125,0.03146626555623586,3817.91748046875,0.0
Loading
Loading