
PTX Backend #18

Open

WillTrojak wants to merge 6 commits into PyFR:master from WillTrojak:feature/ptx

Conversation

@WillTrojak
Member

This adds a PTX backend to GiMMiK. The key features are:

  • Mild optimisation of existing CUDA algorithms.
  • Optional async loads for some sparse kernels.
  • Dense kernel generation for Hopper and above.

Optimisations have focused on FP64; FP32 is future work.

<%inherit file='base'/>

<%
pftype = 'f32' if dtype == 'float' else 'f64'
Contributor

Is it worth factoring any of this into base?

Comment thread gimmik/ptx.py
yield from self._dense_kernel_generators(dtype, dsize, base_args)

def _sparse_kernel_generators(self, dtype, dsize, base_args):
if not self.is_sparse_suitable(self.A):
Contributor

Maybe move these checks up to the _kernel_generators function.

Comment thread gimmik/ptx.py
meta = {
'block': (blkx, 1, 1),
'grid': (-(-self.n // n_per_cta), 1, 1),
'desc': f'{tpl}/nn{nn}-w{w}{"-bs" if bs else ""}',
Contributor

Recent Python allows for single quotes here.
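For context, the nested conditional in the `'desc'` string is why double quotes appear; a sketch with illustrative values (pre-3.12 f-strings cannot reuse the outer quote character, which PEP 701 in Python 3.12+ lifts):

```python
tpl, nn, w, bs = 'cstream', 2, 4, True  # illustrative values, not from the PR

# Before Python 3.12 the inner string literal needs the other quote style
desc = f'{tpl}/nn{nn}-w{w}{"-bs" if bs else ""}'

# Python 3.12+ (PEP 701) also accepts single quotes throughout:
#   f'{tpl}/nn{nn}-w{w}{'-bs' if bs else ''}'
assert desc == 'cstream/nn2-w4-bs'
```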

Comment thread gimmik/ptx.py
yield (tpl, args, meta)

# Warp-specialised dense DMMA
if cc >= (10, 0):
Contributor

Does this gate consumer cards with less shared memory?

Comment thread gimmik/ptx.py
@@ -0,0 +1,276 @@
# -*- coding: utf-8 -*-

import struct
Contributor

PEP8

Comment thread gimmik/ptx.py
return

# Some kernels can optionally steal blocks
bs_default = cc >= (10, 0)
Contributor

Combine with the check below?
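For context, the `cc >= (10, 0)` gate relies on lexicographic tuple comparison, so it composes cleanly with other checks; a minimal sketch (function name is illustrative):

```python
def supports_block_stealing(cc):
    # Tuples compare element-wise, so (9, 0) < (10, 0) < (12, 0)
    return cc >= (10, 0)
```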

Comment thread gimmik/ptx.py
ws_configs = [(1, 4), (2, 4), (4, 4)]
for nn, w in ws_configs:
n_per_cta = 8 * nn * w
if n_per_cta > self.n:
Contributor

Walrus
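That is, the assignment can move into the condition with the walrus operator (Python 3.8+); a sketch with a hypothetical `viable_configs` helper, not code from the PR:

```python
def viable_configs(n, configs=((1, 4), (2, 4), (4, 4))):
    out = []
    for nn, w in configs:
        # := binds n_per_cta inside the condition itself
        if (n_per_cta := 8 * nn * w) > n:
            continue
        out.append((nn, w, n_per_cta))
    return out
```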

Comment thread gimmik/ptx.py
if ws_layout['dynm_total_bytes'] > 200 * 1024:
continue

args = (base_args
Contributor

I think we can reorder to get it to two lines?

Comment thread gimmik/ptx.py
i = m_tile * 8 + lane // 4
j = k_iter * 4 + lane % 4
v = float(a[i, j]) if (i < m and j < k) else 0.0
u = struct.unpack('<Q', struct.pack('<d', v))[0]
Contributor

Can you unpick this for me?
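For the thread above: the pack/unpack round-trip reinterprets the IEEE-754 double as its raw little-endian 64-bit pattern (e.g. so it can be emitted as an immediate in the PTX). A minimal sketch:

```python
import struct

def f64_bits(v):
    # Pack v as a little-endian double, then read the same 8 bytes
    # back as an unsigned 64-bit integer: a bitwise reinterpretation
    return struct.unpack('<Q', struct.pack('<d', v))[0]
```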

Comment thread gimmik/ptx.py

# A in fragment layout: lane l -> A[m_tile*8 + l/4][k_iter*4 + l%4]
a_u64 = []
for m_tile in range(m_tiles):
Contributor

Can 3 arg range work here?
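As a sanity sketch of the layout comment above (`frag_coords` is an illustrative helper, not in the PR):

```python
def frag_coords(lane, m_tile=0, k_iter=0):
    # Lane l of the warp holds A[m_tile*8 + l//4][k_iter*4 + l%4]
    return m_tile * 8 + lane // 4, k_iter * 4 + lane % 4
```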

@FreddieWitherden
Contributor

I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work, on A100/H100/B100?
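A pure-Python sketch of the kind of harness this implies (`verify`, `matmul` and the `kernel` callable are placeholders, not part of this PR; a real run would launch the generated kernel on device and copy the result back):

```python
def matmul(a, b):
    # Plain-Python reference product used to check candidate kernels
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def verify(kernel, a, n_values, tol=1e-12):
    # kernel(b) stands in for running a generated GiMMiK kernel on b
    for n in n_values:
        b = [[(i * n + j) % 7 - 3.0 for j in range(n)]
             for i in range(len(a[0]))]
        c_ref = matmul(a, b)
        c = kernel(b)
        if any(abs(x - y) > tol
               for row, ref in zip(c, c_ref) for x, y in zip(row, ref)):
            return False
    return True
```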

@@ -0,0 +1,4 @@
.version 8.7
.target sm_${cc[0]}${cc[1]}${"a" if cc[0] >= 9 else ""}
Contributor

Single quotes.

.param .u64 _c)
{
% endif
.reg .u32 n, id, tid_x, tid_y;
Contributor

Ensure we throw higher up if n is too big.
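Agreed; a hedged sketch of such a guard (the name `check_n` and the u32 limit are assumptions about where the constraint comes from, based on the `.reg .u32 n` declaration above):

```python
def check_n(n, n_max=2**32 - 1):
    # Hypothetical guard: the PTX indexes columns with .u32 registers,
    # so reject oversized n at generation time, not at run time
    if n > n_max:
        raise ValueError(f'n = {n} exceeds the u32 index range')
    return n
```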

## Async fill of chunk 0
% for idx, kx in enumerate(bchunks[0]):
% if idx % msplit == cid:
% if n is None:
Contributor

See if we can come up with some consistent indentation for Mako. Am open to ideas.

<%
buf_cur = bb % 2
buf_next = (bb + 1) % 2
is_last = (bb == len(bchunks) - 1)
Contributor

There is a Mako var for this.
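For reference, Mako's `% for` blocks expose an implicit `loop` context; assuming the enclosing loop is something like `% for bb, chunk in enumerate(bchunks):`, `is_last` could come straight from `loop.last`:

```mako
% for bb, chunk in enumerate(bchunks):
<%
    buf_cur = bb % 2
    buf_next = (bb + 1) % 2
    is_last = loop.last  # replaces bb == len(bchunks) - 1
%>
...
% endfor
```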

% if afix[row_j] == -1:
% if beta == 0:
{
.reg .${pftype} _tmp;
Contributor

Can this be factored up as appears in both branches?

fma.rn.${pftype} _ctmp, _ctmp, ${float(beta)}, dotp;
st.global.${pftype} [_cptr], _ctmp;
% else:
ld.global.${pftype} _ctmp, [c_base + ${ldc*j*dwidth_i}];
Contributor

Is there scope to lifting these ld's up or does the assembler handle this?

ld.weak.global.cg.${pftype} csub${j}, [_cptr];
}
% else:
ld.weak.global.cg.${pftype} csub${j}, [c_base + ${ldc*j*dwidth_i}];
Contributor

Are we consistent in our use of loads throughout the PTX? Not sure if it makes a huge difference for performance, but code consistency would be good.
