
PTX Backend #18

Open

WillTrojak wants to merge 6 commits into PyFR:master from WillTrojak:feature/ptx

Conversation

@WillTrojak
Member

This adds a PTX backend to GiMMiK. The key features are:

  • Mild optimisation of existing CUDA algorithms.
  • Optional async loads for some sparse kernels.
  • Dense kernel generation for Hopper and above.

Optimisations have focused on FP64; FP32 is future work.

<%inherit file='base'/>

<%
pftype = 'f32' if dtype == 'float' else 'f64'
Contributor

Is it worth factoring any of this into base?

Comment thread gimmik/ptx.py
yield from self._dense_kernel_generators(dtype, dsize, base_args)

def _sparse_kernel_generators(self, dtype, dsize, base_args):
if not self.is_sparse_suitable(self.A):
Contributor

Maybe move these checks up to the _kernel_generators function.

Comment thread gimmik/ptx.py
meta = {
'block': (blkx, 1, 1),
'grid': (-(-self.n // n_per_cta), 1, 1),
'desc': f'{tpl}/nn{nn}-w{w}{"-bs" if bs else ""}',
Contributor

Recent Python allows for single quotes here.
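For context, the nested conditional in the `'desc'` string is why double quotes appear; a sketch with illustrative values (pre-3.12 f-strings cannot reuse the outer quote character, which PEP 701 in Python 3.12+ lifts):

```python
tpl, nn, w, bs = 'cstream', 2, 4, True  # illustrative values, not from the PR

# Before Python 3.12 the inner string literal needs the other quote style
desc = f'{tpl}/nn{nn}-w{w}{"-bs" if bs else ""}'

# Python 3.12+ (PEP 701) also accepts single quotes throughout:
#   f'{tpl}/nn{nn}-w{w}{'-bs' if bs else ''}'
assert desc == 'cstream/nn2-w4-bs'
```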

Comment thread gimmik/ptx.py
yield (tpl, args, meta)

# Warp-specialised dense DMMA
if cc >= (10, 0):
Contributor

Does this gate consumer cards with less shared memory?

Comment thread gimmik/ptx.py
@@ -0,0 +1,276 @@
# -*- coding: utf-8 -*-

import struct
Contributor

PEP8

Comment thread gimmik/ptx.py
return

# Some kernels can optionally steal blocks
bs_default = cc >= (10, 0)
Contributor

Combine with the check below?
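For context, the `cc >= (10, 0)` gate relies on lexicographic tuple comparison, so it composes cleanly with other checks; a minimal sketch (function name is illustrative):

```python
def supports_block_stealing(cc):
    # Tuples compare element-wise, so (9, 0) < (10, 0) < (12, 0)
    return cc >= (10, 0)
```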

Comment thread gimmik/ptx.py
ws_configs = [(1, 4), (2, 4), (4, 4)]
for nn, w in ws_configs:
n_per_cta = 8 * nn * w
if n_per_cta > self.n:
Contributor

Walrus
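That is, the assignment can move into the condition with the walrus operator (Python 3.8+); a sketch with a hypothetical `viable_configs` helper, not code from the PR:

```python
def viable_configs(n, configs=((1, 4), (2, 4), (4, 4))):
    out = []
    for nn, w in configs:
        # := binds n_per_cta inside the condition itself
        if (n_per_cta := 8 * nn * w) > n:
            continue
        out.append((nn, w, n_per_cta))
    return out
```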

Comment thread gimmik/ptx.py
if ws_layout['dynm_total_bytes'] > 200 * 1024:
continue

args = (base_args
Contributor

I think we can reorder to get it to two lines?

Comment thread gimmik/ptx.py
i = m_tile * 8 + lane // 4
j = k_iter * 4 + lane % 4
v = float(a[i, j]) if (i < m and j < k) else 0.0
u = struct.unpack('<Q', struct.pack('<d', v))[0]
Contributor

Can you unpick this for me?
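For the thread above: the pack/unpack round-trip reinterprets the IEEE-754 double as its raw little-endian 64-bit pattern (e.g. so it can be emitted as an immediate in the PTX). A minimal sketch:

```python
import struct

def f64_bits(v):
    # Pack v as a little-endian double, then read the same 8 bytes
    # back as an unsigned 64-bit integer: a bitwise reinterpretation
    return struct.unpack('<Q', struct.pack('<d', v))[0]
```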

Comment thread gimmik/ptx.py

# A in fragment layout: lane l -> A[m_tile*8 + l/4][k_iter*4 + l%4]
a_u64 = []
for m_tile in range(m_tiles):
Contributor

Can 3 arg range work here?
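As a sanity sketch of the layout comment above (`frag_coords` is an illustrative helper, not in the PR):

```python
def frag_coords(lane, m_tile=0, k_iter=0):
    # Lane l of the warp holds A[m_tile*8 + l//4][k_iter*4 + l%4]
    return m_tile * 8 + lane // 4, k_iter * 4 + lane % 4
```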

@FreddieWitherden
Contributor

I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work, on A100/H100/B100?
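A pure-Python sketch of the kind of harness this implies (`verify`, `matmul` and the `kernel` callable are placeholders, not part of this PR; a real run would launch the generated kernel on device and copy the result back):

```python
def matmul(a, b):
    # Plain-Python reference product used to check candidate kernels
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def verify(kernel, a, n_values, tol=1e-12):
    # kernel(b) stands in for running a generated GiMMiK kernel on b
    for n in n_values:
        b = [[(i * n + j) % 7 - 3.0 for j in range(n)]
             for i in range(len(a[0]))]
        c_ref = matmul(a, b)
        c = kernel(b)
        if any(abs(x - y) > tol
               for row, ref in zip(c, c_ref) for x, y in zip(row, ref)):
            return False
    return True
```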

@@ -0,0 +1,4 @@
.version 8.7
.target sm_${cc[0]}${cc[1]}${"a" if cc[0] >= 9 else ""}
Contributor

Single quotes.

.param .u64 _c)
{
% endif
.reg .u32 n, id, tid_x, tid_y;
Contributor

Ensure we throw higher up if n is too big.
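Agreed; a hedged sketch of such a guard (the name `check_n` and the u32 limit are assumptions about where the constraint comes from, based on the `.reg .u32 n` declaration above):

```python
def check_n(n, n_max=2**32 - 1):
    # Hypothetical guard: the PTX indexes columns with .u32 registers,
    # so reject oversized n at generation time, not at run time
    if n > n_max:
        raise ValueError(f'n = {n} exceeds the u32 index range')
    return n
```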

## Async fill of chunk 0
% for idx, kx in enumerate(bchunks[0]):
% if idx % msplit == cid:
% if n is None:
Contributor

See if we can come up with some consistent indentation for Mako. Am open to ideas.

<%
buf_cur = bb % 2
buf_next = (bb + 1) % 2
is_last = (bb == len(bchunks) - 1)
Contributor

There is a Mako var for this.
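For reference, Mako's `% for` blocks expose an implicit `loop` context; assuming the enclosing loop is something like `% for bb, chunk in enumerate(bchunks):`, `is_last` could come straight from `loop.last`:

```mako
% for bb, chunk in enumerate(bchunks):
<%
    buf_cur = bb % 2
    buf_next = (bb + 1) % 2
    is_last = loop.last  # replaces bb == len(bchunks) - 1
%>
...
% endfor
```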

% if afix[row_j] == -1:
% if beta == 0:
{
.reg .${pftype} _tmp;
Contributor

Can this be factored up as appears in both branches?

fma.rn.${pftype} _ctmp, _ctmp, ${float(beta)}, dotp;
st.global.${pftype} [_cptr], _ctmp;
% else:
ld.global.${pftype} _ctmp, [c_base + ${ldc*j*dwidth_i}];
Contributor

Is there scope to lifting these ld's up or does the assembler handle this?

ld.weak.global.cg.${pftype} csub${j}, [_cptr];
}
% else:
ld.weak.global.cg.${pftype} csub${j}, [c_base + ${ldc*j*dwidth_i}];
Contributor

Are we consistent in our use of loads throughout the PTX? Not sure if it makes a huge difference for performance, but code consistency would be good.
