port trczdf to GPU by dindon-sournois · Pull Request #3 · inogs/ogstm

dindon-sournois · 2024-04-15T21:13:26Z

No description provided.

dindon-sournois · 2024-04-16T07:29:14Z

+#ifdef _OPENACC
+      subroutine myalloc_ZDF_gpu()
+        allocate(zwd(jpk, dimen_jvzdf))
+        zwd          = huge(zwd(1,1))
+        allocate(zws(jpk, dimen_jvzdf))
+        zws          = huge(zws(1,1))
+        allocate(zwi(jpk, dimen_jvzdf))
+        zwi          = huge(zwi(1,1))
+        allocate(zwx(jpk, dimen_jvzdf))
+        zwx          = huge(zwx(1,1))
+        allocate(zwy(jpk, dimen_jvzdf))
+        zwy          = huge(zwy(1,1))
+        allocate(zwz(jpk, dimen_jvzdf))
+        zwz          = huge(zwz(1,1))
+        allocate(zwt(jpk, dimen_jvzdf))
+        zwt          = huge(zwt(1,1))
+
+        !$acc enter data create(zwd,zwi,zwx,zws,zwz,zwy,zwt)
+        !$acc update device(zwd,zwi,zwx,zws,zwz,zwy,zwt)
+      END subroutine myalloc_ZDF_gpu
+#endif


We create a new subroutine here that is called once in trczdf, after the value of dimen_jvzdf is known.

We could probably do the same for the CPU version to avoid duplication. The memory counter might also need to be adapted.
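A minimal sketch of what a shared allocation routine could look like, assuming the CPU and GPU versions differ only in the size of the second dimension. The module name `zdf_workspace`, the routine `myalloc_zdf_shared`, the argument `ncol`, and the counter `zdf_bytes` are hypothetical and not part of ogstm; only three of the seven work arrays are shown.

```fortran
module zdf_workspace
  implicit none
  ! Hypothetical stand-ins for three of the work arrays in the PR.
  double precision, allocatable :: zwd(:,:), zws(:,:), zwi(:,:)
  ! Hypothetical memory counter, in bytes.
  integer(kind=8) :: zdf_bytes = 0
contains
  subroutine myalloc_zdf_shared(jpk, ncol)
    ! ncol is assumed to be dimen_jvzdf on GPU and a smaller
    ! workspace count on CPU; the caller decides.
    integer, intent(in) :: jpk, ncol

    allocate(zwd(jpk, ncol), zws(jpk, ncol), zwi(jpk, ncol))
    zwd = huge(zwd(1,1))
    zws = huge(zws(1,1))
    zwi = huge(zwi(1,1))

    ! storage_size returns bits per element, hence the division by 8.
    zdf_bytes = zdf_bytes + 3_8 * size(zwd, kind=8) * (storage_size(zwd) / 8)

    ! Treated as plain comments when OpenACC is not enabled.
    !$acc enter data create(zwd, zws, zwi)
    !$acc update device(zwd, zws, zwi)
  end subroutine myalloc_zdf_shared
end module zdf_workspace

program demo_shared_alloc
  use zdf_workspace
  implicit none
  call myalloc_zdf_shared(4, 3)
  if (any(shape(zwd) /= [4, 3])) error stop "unexpected shape"
  if (any(zwd /= huge(zwd(1,1)))) error stop "unexpected initial value"
  print *, "bytes allocated:", zdf_bytes
end program demo_shared_alloc
```

Because the directives are ignored in a non-OpenACC build, one routine could serve both versions without an `#ifdef`, provided the caller passes the right second dimension.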

dindon-sournois · 2024-04-16T07:32:02Z


-  !$acc enter data create( e1t(1:jpj,1:jpi), e2t(1:jpj,1:jpi), e3t(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
  !$acc enter data create( e1u(1:jpj,1:jpi), e2u(1:jpj,1:jpi), e3u(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
  !$acc enter data create( e1v(1:jpj,1:jpi), e2v(1:jpj,1:jpi), e3v(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
-  !$acc enter data create( e3w(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
  !$acc enter data create( un(1:jpk,1:jpj,1:jpi), vn(1:jpk,1:jpj,1:jpi), wn(1:jpk,1:jpj,1:jpi) ) if(use_gpu)


It's not a good idea to declare these arrays here: they are allocated and deallocated later, which is a waste of time.

GPU allocation should be moved close to the CPU allocate as the port progresses.
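As an illustration of keeping the GPU allocation next to the CPU one, the sketch below pairs `enter data create` with `allocate` and `exit data delete` with `deallocate`. The module and routine names (`mesh_alloc`, `alloc_mesh`, `free_mesh`) are hypothetical; only `e1t`, `e2t`, `jpj`, `jpi`, and `use_gpu` come from the diff.

```fortran
module mesh_alloc
  implicit none
  double precision, allocatable :: e1t(:,:), e2t(:,:)
contains
  subroutine alloc_mesh(jpj, jpi, use_gpu)
    integer, intent(in) :: jpj, jpi
    logical, intent(in) :: use_gpu
    allocate(e1t(jpj, jpi), e2t(jpj, jpi))
    ! Device copies are created right after the host arrays exist.
    !$acc enter data create(e1t, e2t) if(use_gpu)
  end subroutine alloc_mesh

  subroutine free_mesh(use_gpu)
    logical, intent(in) :: use_gpu
    ! Device copies are released before the host arrays go away.
    !$acc exit data delete(e1t, e2t) if(use_gpu)
    deallocate(e1t, e2t)
  end subroutine free_mesh
end module mesh_alloc

program demo_mesh_alloc
  use mesh_alloc
  implicit none
  call alloc_mesh(3, 2, .false.)
  if (.not. allocated(e1t)) error stop "e1t should be allocated"
  call free_mesh(.false.)
  if (allocated(e1t)) error stop "e1t should be deallocated"
  print *, "ok"
end program demo_mesh_alloc
```

With this pairing the device lifetime of an array follows its host lifetime, so a later reallocation does not leave a stale device copy behind.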

dindon-sournois · 2024-04-16T07:33:45Z

+        ! NOTE: kernel is too big, should be split
+        !$acc parallel loop gang vector default(present) async vector_length(32)


We might want to think about clever ways to generate this kernel, as it seems quite big. The best performance on A100 was obtained with a vector length of 32, which isn't very high.
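One possible direction, following the NOTE in the diff, is to split the work into smaller kernels launched on the same async queue, where they execute in order. The sketch below is a toy example and not the trczdf kernel: the arrays `a` and `b`, the loop bodies, and the queue number 1 are all assumptions made for illustration.

```fortran
program split_kernel
  implicit none
  integer, parameter :: n = 1000
  double precision :: a(n), b(n)
  integer :: i

  a = 1.0d0
  b = 0.0d0

  !$acc data copy(a) copyout(b)

  ! First, smaller kernel: update a in place.
  !$acc parallel loop gang vector default(present) async(1) vector_length(32)
  do i = 1, n
    a(i) = 2.0d0 * a(i)
  end do

  ! Second kernel on the same queue: it runs after the first one.
  !$acc parallel loop gang vector default(present) async(1) vector_length(32)
  do i = 1, n
    b(i) = a(i) + 1.0d0
  end do

  ! Wait for queue 1 before the data region copies results back.
  !$acc wait(1)
  !$acc end data

  if (any(b /= 3.0d0)) error stop "unexpected result"
  print *, "ok"
end program split_kernel
```

Whether splitting actually helps here would have to be measured; the vector length for each smaller kernel could then be tuned separately.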

dindon-sournois · 2024-04-16T07:34:19Z

           Aij = e1t(jj,ji) * e2t(jj,ji)

+#ifdef _OPENACC
+           ntx=jv


For the GPU version we parallelize over dimen_jvzdf.
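A small sketch of this pattern, assuming each `jv` iteration works on its own column of a `(jpk, dimen_jvzdf)` work array, as the `ntx=jv` assignment in the diff suggests. The loop body is a placeholder and the array sizes are made up for the example.

```fortran
program jv_parallel
  implicit none
  integer, parameter :: jpk = 5, dimen_jvzdf = 8
  double precision :: zwd(jpk, dimen_jvzdf)
  integer :: jv, jk, ntx

  ! One iteration per jv; each iteration owns column ntx = jv of zwd,
  ! so iterations do not write to the same memory.
  !$acc parallel loop gang vector vector_length(32) copyout(zwd) private(ntx)
  do jv = 1, dimen_jvzdf
    ntx = jv
    zwd(1, ntx) = dble(jv)
    ! The vertical loop carries a dependency on jk-1, so it stays sequential.
    !$acc loop seq
    do jk = 2, jpk
      zwd(jk, ntx) = zwd(jk-1, ntx) + 1.0d0
    end do
  end do

  ! Last column starts at 8 and is incremented 4 times.
  if (abs(zwd(jpk, dimen_jvzdf) - 12.0d0) > 1.0d-12) error stop "unexpected result"
  print *, "ok"
end program jv_parallel
```

The design trades memory for parallelism: the work arrays grow to `dimen_jvzdf` columns so that every column can be processed independently.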

dindon-sournois added 2 commits April 15, 2024 23:10

port trczdf to GPU

444e5f3

reduce vector length

59e50e2

dindon-sournois self-assigned this Apr 16, 2024

dindon-sournois requested a review from stefanocampanella April 16, 2024 07:29

dindon-sournois commented Apr 16, 2024

dindon-sournois mentioned this pull request Apr 24, 2024

Trcadv async + serial kernel fix #4

Open

dindon-sournois marked this pull request as ready for review April 24, 2024 13:39

dindon-sournois wants to merge 2 commits into dev_gpu from trczdf_gpu
