port trczdf to GPU #3

Open
dindon-sournois wants to merge 2 commits into dev_gpu from trczdf_gpu

Conversation

@dindon-sournois

Collaborator

No description provided.

Comment thread src/PHYS/ZDF_mem.f90
Comment on lines +75 to +95
#ifdef _OPENACC
subroutine myalloc_ZDF_gpu()
allocate(zwd(jpk, dimen_jvzdf))
zwd = huge(zwd(1,1))
allocate(zws(jpk, dimen_jvzdf))
zws = huge(zws(1,1))
allocate(zwi(jpk, dimen_jvzdf))
zwi = huge(zwi(1,1))
allocate(zwx(jpk, dimen_jvzdf))
zwx = huge(zwx(1,1))
allocate(zwy(jpk, dimen_jvzdf))
zwy = huge(zwy(1,1))
allocate(zwz(jpk, dimen_jvzdf))
zwz = huge(zwz(1,1))
allocate(zwt(jpk, dimen_jvzdf))
zwt = huge(zwt(1,1))

!$acc enter data create(zwd,zwi,zwx,zws,zwz,zwy,zwt)
!$acc update device(zwd,zwi,zwx,zws,zwz,zwy,zwt)
END subroutine myalloc_ZDF_gpu
#endif

Collaborator Author

This adds a new subroutine that is called once in trczdf, after the value of dimen_jvzdf is known.

We could probably do the same for the CPU version to avoid duplication. The memory counter may also need to be adapted.
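A minimal sketch of what a shared allocation routine could look like, assuming a single routine serves both builds. The module name `zdf_mem_sketch`, the routine name `alloc_zdf_work`, and the counter `mem_bytes` are hypothetical and are not taken from the code base; only two of the seven work arrays are shown. The sketch also uses `copyin` in place of the `create` plus `update device` pair in the diff, which should have the same effect. Without OpenACC enabled, the directive is an ordinary comment.

```fortran
module zdf_mem_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Stand-ins for the real work arrays (only two shown)
  real(dp), allocatable :: zwd(:,:), zws(:,:)
  ! Hypothetical memory counter, in bytes
  real(dp) :: mem_bytes = 0.0_dp
contains
  subroutine alloc_zdf_work(jpk, dimen_jvzdf)
    integer, intent(in) :: jpk, dimen_jvzdf

    allocate(zwd(jpk, dimen_jvzdf), zws(jpk, dimen_jvzdf))
    ! Poison values so that uninitialized reads are easy to spot
    zwd = huge(zwd(1,1))
    zws = huge(zws(1,1))

    ! storage_size returns bits per element; convert to bytes
    mem_bytes = mem_bytes + real(storage_size(zwd) / 8, dp) &
                          * real(size(zwd) + size(zws), dp)

    ! Allocate on the device and copy the poison values in one step
    !$acc enter data copyin(zwd, zws)
  end subroutine alloc_zdf_work
end module zdf_mem_sketch

program demo_alloc
  use zdf_mem_sketch
  implicit none
  ! Small illustrative sizes, not the model's real dimensions
  call alloc_zdf_work(4, 3)
  print *, 'shape:', shape(zwd), ' bytes counted:', mem_bytes
end program demo_alloc
```

Keeping the counter update inside the same routine means the CPU and GPU paths cannot drift apart, at the cost of one more place that has to know about the arrays.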

Comment thread src/PHYS/trcadv.f90
Comment on lines 177 to 179

!$acc enter data create( e1t(1:jpj,1:jpi), e2t(1:jpj,1:jpi), e3t(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e1u(1:jpj,1:jpi), e2u(1:jpj,1:jpi), e3u(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e1v(1:jpj,1:jpi), e2v(1:jpj,1:jpi), e3v(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e3w(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( un(1:jpk,1:jpj,1:jpi), vn(1:jpk,1:jpj,1:jpi), wn(1:jpk,1:jpj,1:jpi) ) if(use_gpu)

Collaborator Author

It's not a good idea to declare these arrays here:

  • they are allocated and then deallocated again later, which wastes time
  • GPU allocation should move next to the CPU allocate as the port progresses
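One way to read the second point is to pair each host `allocate` with its device counterpart, and each `deallocate` with an `exit data delete`, so device memory lives exactly as long as the host array. The sketch below assumes that layout; the module and routine names are hypothetical, the flag `use_gpu` mirrors the `if(use_gpu)` clause in the diff, and only two of the grid arrays are shown.

```fortran
module grid_mem_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Stand-ins for the grid arrays referenced in trcadv.f90
  real(dp), allocatable :: e1t(:,:), e3t(:,:,:)
  ! Assumed runtime switch, named after the clause in the diff
  logical :: use_gpu = .false.
contains
  subroutine alloc_grid(jpk, jpj, jpi)
    integer, intent(in) :: jpk, jpj, jpi
    allocate(e1t(jpj, jpi), e3t(jpk, jpj, jpi))
    ! Device allocation sits right next to the host allocation
    !$acc enter data create(e1t, e3t) if(use_gpu)
  end subroutine alloc_grid

  subroutine free_grid()
    ! Release the device copies before the host arrays disappear
    !$acc exit data delete(e1t, e3t) if(use_gpu)
    deallocate(e1t, e3t)
  end subroutine free_grid
end module grid_mem_sketch

program demo_grid
  use grid_mem_sketch
  implicit none
  ! Small illustrative sizes, not the model's real dimensions
  call alloc_grid(5, 4, 3)
  print *, 'e3t shape:', shape(e3t)
  call free_grid()
  print *, 'still allocated:', allocated(e3t)
end program demo_grid
```

With this arrangement, routines such as trcadv could rely on `default(present)` rather than creating and deleting the arrays on every call.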

Comment thread src/PHYS/trczdf.f90
Comment on lines +136 to +137
! NOTE: kernel is too big, should be split
!$acc parallel loop gang vector default(present) async vector_length(32)

Collaborator Author

We might want to think about clever ways to generate this kernel, as it seems quite big. The best performance on an A100 was obtained with a vector length of 32, which isn't very high.
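The NOTE in the diff suggests splitting the kernel. As a rough illustration of that option only, the sketch below replaces one loop body with two smaller `parallel loop` kernels placed on the same async queue, so they still run in order without a host-side synchronization between them. The arrays `a` and `b` and the arithmetic are placeholders, not the trczdf computation, and whether splitting actually helps would need to be measured.

```fortran
program split_kernel_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024
  real(dp) :: a(n), b(n)
  integer :: i

  a = 1.0_dp
  b = 0.0_dp

  !$acc data copy(a, b)

  ! First half of the work: its own, smaller kernel
  !$acc parallel loop gang vector vector_length(32) default(present) async(1)
  do i = 1, n
     a(i) = 2.0_dp * a(i)
  end do

  ! Second half: same queue, so it starts after the first kernel finishes
  !$acc parallel loop gang vector vector_length(32) default(present) async(1)
  do i = 1, n
     b(i) = a(i) + 1.0_dp
  end do

  ! Wait for queue 1 before the data region copies results back
  !$acc wait(1)
  !$acc end data

  print *, 'b(1) =', b(1)
end program split_kernel_sketch
```

Smaller kernels tend to need fewer registers per thread, which may allow a larger `vector_length` than 32, but that is an assumption to verify with a profiler.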

Comment thread src/PHYS/trczdf.f90
Aij = e1t(jj,ji) * e2t(jj,ji)

#ifdef _OPENACC
ntx=jv

Collaborator Author

For the GPU version, we parallelize over dimen_jvzdf.
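A minimal sketch of this pattern, assuming each of the dimen_jvzdf columns is independent and the vertical loop over jk carries a dependency. The array `zwt` follows the shape used in ZDF_mem.f90, but the sizes and the running-sum loop body are placeholders chosen only to give the inner loop a dependency.

```fortran
program column_parallel_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Illustrative sizes, not the model's real dimensions
  integer, parameter :: jpk = 8, dimen_jvzdf = 1024
  real(dp), allocatable :: zwt(:,:)
  integer :: jv, jk

  allocate(zwt(jpk, dimen_jvzdf))
  zwt = 1.0_dp

  ! One thread per column: columns are spread across gangs and vector lanes
  !$acc parallel loop gang vector vector_length(32) copy(zwt)
  do jv = 1, dimen_jvzdf
     ! The vertical sweep depends on level jk-1, so it stays sequential
     !$acc loop seq
     do jk = 2, jpk
        zwt(jk, jv) = zwt(jk, jv) + zwt(jk - 1, jv)
     end do
  end do

  print *, 'bottom of column 1:', zwt(jpk, 1)
end program column_parallel_sketch
```

Setting `ntx=jv` in the diff appears to serve this purpose, with each column indexing its own slice of the work arrays.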

@dindon-sournois dindon-sournois marked this pull request as ready for review April 24, 2024 13:39