diff --git a/.agents/skills/code-reviewer/SKILL.md b/.agents/skills/code-reviewer/SKILL.md new file mode 100644 index 000000000..5d0fae46c --- /dev/null +++ b/.agents/skills/code-reviewer/SKILL.md @@ -0,0 +1,229 @@ +--- +name: code-reviewer +description: PR review checklist for semba-fdtd: CI verification across Ubuntu/Windows matrix, Fortran code review patterns, conditional compilation guards, test coverage expectations, MPI/OpenMP safety checks, and AI-assisted contribution policies +--- + +## When to use + +- Reviewing a pull request before merging +- Evaluating code quality and correctness of Fortran/Python changes +- Checking that tests cover new functionality +- Verifying conditional compilation guards are correct +- Ensuring documentation is updated for user-facing changes +- Reviewing AI-assisted contributions for correctness + +## Key Files to Read + +### Project Standards +- `CONTRIBUTING.md` — Contribution guidelines, review requirements +- `AGENTS.md` — AI coding agent guidance (this file) +- `README.md` — Project overview and quality expectations + +### CI/CD Pipelines +- `.github/workflows/ubuntu.yml` — Ubuntu CI: GCC + Intel, MPI ON/OFF, MTLN ON/OFF +- `.github/workflows/windows.yml` — Windows CI: IntelLLVM + Ninja +- `.github/workflows/check-submodules-default-branch.yml` — Submodule integrity checks + +### Code Style References +- `CMakeLists.txt` — Compiler flags, build configuration +- `doc/fdtdjson.md` — Input format spec (check for input format changes) +- `doc/development.md` — Development setup and conventions + +## What CI Checks Must Pass + +Every PR must pass **all** of the following before merging: + +### Ubuntu CI (`ubuntu.yml`) +14 build configurations in matrix: +- Compilers: GCC 11 + Intel 2025.1 +- MPI: ON and OFF +- MTLN: ON and OFF +- HDF: ON (always) +- Double precision: OFF (default) + 1 Intel config with ON + +Each configuration must: +- Build successfully +- Pass `build/bin/fdtd_tests` (unit tests) +- Pass `pytest test/ --durations=20` (integration tests) + +### Windows CI (`windows.yml`) +6 build configurations: +- Compiler: IntelLLVM 2025.2.1 + Ninja +- MTLN: ON and OFF +- HDF: ON +- Double precision: ON and OFF + +Each configuration must: +- Build successfully +- Pass `build/bin/fdtd_tests.exe` +- Pass `pytest -m 'not codemodel' test/ --durations=20` (codemodel excluded on Windows) + +### Submodule Check +- All submodule commits must be ancestors of their `main`/`master` branch history + +## Review Checklist + +### 1. General PR Quality + +- [ ] PR description includes a short summary of the change +- [ ] PR description includes motivation/context (what problem it solves) +- [ ] PR description explains how it was tested (commands run, platforms used) +- [ ] PR is opened against `dev` (or `main` per current workflow) +- [ ] Commit messages are clear, descriptive, and in English +- [ ] Commit messages reference issue numbers where applicable (e.g., `#392`) +- [ ] Changes are focused and small (not mixing unrelated changes) +- [ ] Small fixup commits are squashed + +### 2. Fortran Code Review + +- [ ] **Code style matches surrounding code** — no new indentation, naming, or structural conventions introduced +- [ ] **Conditional compilation guards are correct** — check that `#ifdef` guards match the CMake flags: + - `SEMBA_FDTD_ENABLE_MPI` — MPI support + - `SEMBA_FDTD_ENABLE_HDF` — HDF5 output + - `SEMBA_FDTD_ENABLE_MTLN` — MTLN solver + ngspice + - `SEMBA_FDTD_ENABLE_SMBJSON` — JSON input parser + - `SEMBA_FDTD_ENABLE_DOUBLE_PRECISION` — 8-byte reals +- [ ] **Module dependencies are respected** — code doesn't depend on modules below it in the library chain: + ``` + semba-types -> semba-reports -> smbjson/conformal/semba-components + -> mtlnsolver/semba-outputs -> semba-main -> semba-fdtd + ``` +- [ ] **No new unused variables or dead code** +- [ ] **Array bounds are safe** — FDTD uses extensive pointer aliasing; verify no out-of-bounds access +- [ ] **Memory management is correct** — no leaks in derived type allocation/deallocation +- [ ] **OpenMP directives are correct** — if adding parallel loops, verify `COLLAPSE`, `DEFAULT(SHARED)`, and private variables +- [ ] **MPI communication is safe** — if modifying MPI code, verify non-blocking patterns and buffer sizes + +### 3. Build System Review + +- [ ] **CMakeLists.txt changes are minimal** — only add new targets, don't restructure existing ones +- [ ] **New optional features have CMake options** — follow the pattern of existing flags +- [ ] **Compiler flags are correct** — match existing patterns for the target compiler +- [ ] **Windows compatibility** — if changes affect Windows, verify IntelLLVM + Ninja builds +- [ ] **Submodule changes are intentional** — verify submodule commits are on accepted branch history + +### 4. Test Coverage + +- [ ] **New functionality has corresponding tests** — both unit tests and integration tests where applicable +- [ ] **Bug fixes have regression tests** — a test that would have caught the original bug +- [ ] **Test data is appropriate** — minimal but sufficient to exercise the code path +- [ ] **Tests use correct pytest markers** — `mtln`, `hdf`, `mpi`, `codemodel` +- [ ] **Tests are isolated** — Python tests use `tmp_path`, don't write to repo directory +- [ ] **Reference outputs are updated** — if expected outputs change, `testData/outputs/` is updated +- [ ] **All existing tests still pass** — no regressions + +### 5. Documentation + +- [ ] **`doc/` is updated** for user-facing behavior changes +- [ ] **`.fdtd.json` changes reference `doc/fdtdjson.md`** — verify input format is documented +- [ ] **New CMake options are documented** in `AGENTS.md` and `CLAUDE.md` +- [ ] **Tutorial examples are updated** if the workflow changes + +### 6. AI-Assisted Contributions + +If the PR has significant AI-generated content: +- [ ] PR description clearly states AI assistance was used +- [ ] Extent of AI assistance is described +- [ ] **At least two human reviewers** have approved (per CONTRIBUTING.md) +- [ ] Human author has taken responsibility for correctness +- [ ] Code quality is indistinguishable from human-written code in this repo + +## Common Issues to Watch For + +### Conditional Compilation Gotchas + + ! BAD: Missing guard for optional feature + call some_mtl_function() ! Only available with SEMBA_FDTD_ENABLE_MTLN + + ! GOOD: Properly guarded + #ifdef SEMBA_FDTD_ENABLE_MTLN + call some_mtl_function() + #endif + +### MPI Safety + +- [ ] Non-blocking operations (`MPI_ISEND`/`MPI_IRECV`) are followed by `MPI_WAITALL` +- [ ] Buffer sizes account for ghost cell exchange correctly +- [ ] Wire MPI communication is updated if wire models are modified +- [ ] `MPI_Barrier` usage is appropriate (not causing deadlocks) + +### OpenMP Safety + +- [ ] `COLLAPSE(n)` dimensions are valid for the loop nest +- [ ] Private/shared variable clauses are correct +- [ ] No data races on shared arrays +- [ ] Reduction clauses used where accumulators are involved + +### Memory Safety + +- [ ] No pointer aliasing that could cause false dependencies in parallel regions +- [ ] Derived types are properly initialized before use +- [ ] Arrays are allocated before being accessed +- [ ] `contiguous` attribute used on field pointers for cache-friendly access + +### Input Format Changes + +If `.fdtd.json` format is modified: +- [ ] `doc/fdtdjson.md` is updated with the new format +- [ ] Backward compatibility is maintained if possible +- [ ] Migration guide is provided for existing input files +- [ ] Parser tests (`test/smbjson/`) are updated + +## Review Process + +### Step-by-Step + +1. **Read the PR description** — understand the intent and context +2. **Check CI status** — verify all CI jobs pass (or are expected to fail) +3. **Review the diff** — line by line for logic errors, style issues +4. **Verify test coverage** — ensure new code has tests +5. **Check conditional compilation** — verify all `#ifdef` guards are correct +6. **Verify documentation** — ensure user-facing changes are documented +7. **Run local tests** (if needed) — build and run tests for the specific change +8. **Leave constructive feedback** — be specific, suggest fixes, explain why + +### Feedback Style + +- Be specific about what needs changing and why +- Suggest concrete fixes when possible +- Distinguish between blocking issues (must fix) and suggestions (nice to have) +- Acknowledge good changes and improvements +- Keep discussion technical and respectful + +## Quick Reference: What to Check by Change Type + +### Adding a new feature +- New code follows existing conventions +- New CMake option (if optional feature) +- New tests (unit + integration) +- Documentation updated +- Test data added + +### Fixing a bug +- Root cause identified and explained +- Regression test added +- Minimal changes (no over-engineering) +- All existing tests still pass + +### Modifying optional feature +- `#ifdef` guards are correct for all build configurations +- Tested with feature ON and OFF +- CI matrix includes the relevant configurations + +### Modifying MPI code +- Non-blocking I/O pattern preserved +- Buffer sizes correct for all domain sizes +- Wire MPI communication updated if needed +- No deadlocks or race conditions + +### Modifying OpenMP code +- Parallel correctness verified +- Performance impact assessed +- Thread scaling tested +- No false sharing or cache contention + +### Modifying input format +- `doc/fdtdjson.md` updated +- Parser tests updated +- Backward compatibility maintained +- Migration guide provided diff --git a/.agents/skills/emc-engineer/SKILL.md b/.agents/skills/emc-engineer/SKILL.md new file mode 100644 index 000000000..34688645f --- /dev/null +++ b/.agents/skills/emc-engineer/SKILL.md @@ -0,0 +1,285 @@ +--- +name: emc-engineer +description: Writing .fdtd.json inputs, launching simulations via pyWrapper FDTD class, analyzing probe outputs (.dat files) in time/frequency domains, generating excitation signals with NumPy, and common EMC analysis patterns like shielding effectiveness and radar cross section +--- + +## When to use + +- Writing or modifying `.fdtd.json` input files for new simulation cases +- Launching simulations and monitoring progress +- Analyzing probe output data (`.dat` files) in time and frequency domains +- Creating pre/post-processing Python scripts with the pyWrapper interface +- Comparing FDTD results with analytical models or measurements +- Setting up excitation signals (`.exc` files) for time-domain sources + +## Key Files to Read + +### Core Solver +- `src_main_pub/launcher.F90` — Entry point (15 lines) +- `src_main_pub/semba_fdtd.F90` — Main type: init/launch/end lifecycle +- `src_main_pub/preprocess_geom.F90` — Geometry parsing, mesh building, material assignment +- `src_main_pub/postprocess.F90` — DTFT, transfer functions, resampling +- `src_main_pub/observation.F90` — Probe data collection and storage + +### Python Interface +- `src_pyWrapper/pyWrapper.py` — FDTD, Probe, ExcitationFile classes + +### Documentation +- `doc/fdtdjson.md` — Complete `.fdtd.json` input format specification (876 lines) +- `doc/tutorials/veritasium/veritasium.md` — End-to-end workflow tutorial +- `doc/mtln.md` — MTLN solver documentation + +### Example Cases +- `testData/input_examples/` — 23 `.fdtd.json` files demonstrating various features +- `testData/cases/` — 44 full simulation case directories +- `testData/excitations/` — Source waveform definitions +- `testData/netlists/` — SPICE netlist files for MTLN +- `testData/cases/planewave/pw_prepost.py` — Simple pre/post example +- `testData/cases/sgbcShieldingEffectiveness/sgbc_prepost.py` — EMC analysis with analytical comparison +- `testData/cases/holland/holland_prepost.py` — Wire case comparison + +## Simulation Workflow + +### Step 1: Write the Input File + +The primary input format is `.fdtd.json` with these top-level keys: + +| Key | Required | Description | +|-----|----------|-------------| +| `general` | Yes | `timeStep`, `numberOfSteps`, optional `mtlnProblem`, `additionalArguments` | +| `boundary` | No (defaults to `mur` on all sides) | Per-face: `pec`, `pmc`, `periodic`, `mur`, `pml` | +| `mesh` | Yes | Contains `grid`, `coordinates`, `elements` | +| `materials` | No | Material definitions | +| `materialAssociations` | No | Links materials to mesh elements | +| `sources` | No | Electromagnetic sources | +| `probes` | No | Output sensors | + +### Mesh Definition + +```json +"mesh": { + "grid": { + "numberOfCells": [nx, ny, nz], + "steps": {"x": [dx], "y": [dy], "z": [dz]}, + "origin": [0, 0, 0] + }, + "coordinates": { + "name": {"relativePosition": [cellIndex, fractionalX, fractionalY, fractionalZ]} + }, + "elements": [ + {"type": "node", "position": [x, y, z], "id": materialID}, + {"type": "polyline", "points": [[x1,y1,z1], ...], "id": materialID}, + {"type": "cell", "intervals": [[[ax,ay,az],[bx,by,bz]], ...], "id": materialID} + ] +} +``` + +**Interval convention for `cell` elements:** An interval `[[ax,ay,az],[bx,by,bz]]` defines a region `[ax,bx) x [ay,by) x [az,bz)`. Depending on how many dimensions differ: +- 0 dimensions differ = point +- 1 dimension differs = oriented line +- 2 dimensions differ = oriented surface +- 3 dimensions differ = volume + +### Step 2: Generate Excitation Signals (if needed) + +For time-domain sources, generate `.exc` files using NumPy: + +```python +import numpy as np + +# Gaussian pulse +dt = 1e-12 +t0 = 20 * dt +w0 = 5 * dt +t = np.arange(0, t0 + 20 * w0, dt) +f = np.exp(-(t - t0)**2 / w0**2) +np.savetxt('gauss.exc', np.column_stack((t, f))) + +# Sigmoid step +t = np.linspace(0, 10e-8, 2000) +V = A * (sigmoid_raw - offset) * scaling +np.savetxt('./step.exc', np.column_stack((t, V))) +``` + +### Step 3: Launch the Simulation + +**Direct command line:** +```bash +semba-fdtd -i CASE_NAME.fdtd.json +mpirun -n NPROCS semba-fdtd -i CASE_NAME.fdtd.json +``` + +**Via Python pyWrapper:** +```python +from src_pyWrapper.pyWrapper import FDTD + +solver = FDTD( + input_filename='my_case.fdtd.json', + path_to_exe='/path/to/semba-fdtd', + flags=['-mapvtk'], # Optional: generate VTK geometry map + mpi_command='mpirun -n 4' # Optional: MPI launch +) +solver.cleanUp() # Remove old output files +solver.run() # Execute the solver +print(f"Return code: {solver.returncode}") +``` + +**Key command-line flags:** +| Flag | Purpose | +|------|---------| +| `-mapvtk` | Generate VTK geometry map files for ParaView | +| `-sgbc` | Enable SGBC (multi-layer surface) solver | +| `-r` | Resume from a previous checkpoint | +| `-stoch` / `-nostoch` | Stochastic simulation toggle | +| `-cfl ` | Override CFL factor | + +### Step 4: Monitor Progress + +Check `SEMBA_FDTD_temp.log` for progress. The solver writes timing info periodically. + +### Step 5: Analyze Results + +**Output file types:** + +| Extension | Description | Example | +|-----------|-------------|---------| +| `.dat` | Time-domain probe data (`time field_value`) | `case_name.fdtd_probeName_Ex_5_10_3.dat` | +| `_df.dat` | Frequency-domain (`freq magnitude phase`) | `case_name.fdtd_probeName_Ex_5_10_3_df.dat` | +| `_tr.dat` | Transfer function in dB (`freq 20*log10(mag) phase`) | `case_name.fdtd_probeName_Ex_5_10_3_tr.dat` | +| `.xdmf` + `.h5` | 3D field movies (HDF5 binary + XDMF metadata) | `case_name.xdmf` | +| `.vtk` | Geometry map for ParaView | `case_name_1.vtk` | +| `.old` | Restart field checkpoint | `case_name.fields.old` | + +**Load and analyze with pyWrapper:** +```python +from src_pyWrapper.pyWrapper import Probe, FDTD + +solver = FDTD(input_filename='pw-in-box.fdtd.json', path_to_exe=SEMBA_EXE) +solver.run() + +# Load time-domain probe +before = Probe(solver.getSolvedProbeFilenames("before")[0]) +print(before.dataframe) # pandas DataFrame +print(before.name) # Extracted probe name +print(before.field_column) # Field column name +print(before.cell_positions) # Cell positions + +# Plot time domain +import matplotlib.pyplot as plt +plt.plot(before['time'], before['field'], label='Ex component') +plt.xlabel('Time (s)') +plt.ylabel('Field (V/m)') +plt.legend() +plt.show() + +# Load frequency-domain probe +freq_probe = Probe(solver.getSolvedProbeFilenames("before")[0].replace('.dat', '_df.dat')) +print(freq_probe.dataframe) # Columns: frequency, magnitude, phase +``` + +**Manual analysis:** +```python +import numpy as np +import matplotlib.pyplot as plt + +# Time-domain data +data = np.loadtxt('case_name.fdtd_probe_Ex_5_10_3.dat', skiprows=1) +time = data[:, 0] +field = data[:, 1] +plt.plot(time, field) + +# Frequency-domain data +freq_data = np.loadtxt('case_name.fdtd_probe_Ex_5_10_3_df.dat', skiprows=1) +freq = freq_data[:, 0] +magnitude = freq_data[:, 1] +phase = freq_data[:, 2] + +# Transfer function +tr_data = np.loadtxt('case_name.fdtd_probe_Ex_5_10_3_tr.dat', skiprows=1) +plt.plot(tr_data[:, 0], tr_data[:, 1]) # dB vs frequency + +# FFT-based postprocessing (e.g., shielding effectiveness) +INC = np.fft.fft(back['incident']) +S transmitted = np.fft.fft(back['field']) / INC +fq = np.fft.fftfreq(len(t)) / dt +``` + +## Common Analysis Patterns + +### Shielding Effectiveness (SGBC) +```python +# After running simulation with front/back probes +front = Probe(front_files[0]) +back = Probe(back_files[0]) + +INC = np.fft.fft(front['field']) +S transmitted = np.fft.fft(back['field']) / INC +fq = np.fft.fftfreq(len(t)) / dt + +# Compare with analytical transmission line model +from skrf.media import Freespace +fq = np.fft.fftfreq(len(t)) / dt +f = fq[idx_min:idx_max] +plt.plot(f, 20*np.log10(np.abs(fdtd_s21)), label='FDTD') +plt.plot(f, 20*np.log10(np.abs(slab.s[:,0,1])), label='Analytical') +plt.ylabel('Shielding Effectiveness (dB)') +``` + +### RCS (Radar Cross Section) via Far-Field Probes +```python +# Far-field probes produce _df.dat with angular data +farfield = Probe(farfield_files[0]) +# Columns: frequency, magnitude, phase (integrated over observation sphere) +``` + +### MTLN Circuit Analysis +```python +# MTLN probes include voltage and current at each conductor +mtln_probe = Probe(mtln_probe_files[0]) +# Check probe.domain_type for "time" or "frequency" +# Columns include voltage/current per conductor +``` + +### Comparing MTLN vs Non-MTLN Builds +```python +# Run with both builds to compare transmission line effects +solver_mtln = FDTD(fn, SEMBA_MTLN_EXE, run_in_folder=mtln_folder) +solver_mtln.run() +solver_nomtln = FDTD(fn, SEMBA_NOMTLN_EXE, run_in_folder=nomtln_folder) +solver_nomtln.run() + +for pf in probe_files_mtln: + p = Probe(pf) + plt.plot(p['time'], p['current_0'], label=f'MTLN - {p.name}') +for pf in probe_files_nomtln: + p = Probe(pf) + plt.plot(p['time'], p['current'], linestyle='--', label=f'No MTLN - {p.name}') +plt.legend() +``` + +## Material Types Reference + +| Material | Use Case | +|----------|----------| +| `pec` | Perfect electric conductor | +| `pmc` | Perfect magnetic conductor | +| `isotropic` | Dielectric materials (epsilon_r, mu_r, sigma) | +| `wire` | Wire grid materials | +| `shieldedMultiwire` | Multi-conductor cables with PUL parameters | +| `unshieldedMultiwire` | Bundles of parallel wires | +| `terminal` | Port/termination definitions | +| `lumped` | R/L/C lumped elements | +| `multilayeredSurface` | SGBC multi-layer surfaces | +| `thinSlot` | Thin slot models | +| `connector` | SPICE connector definitions | + +## Common Gotchas + +1. **CFL condition** — The solver automatically adjusts `dt` if the user-specified value would cause instability. Check the log for the adjusted value. +2. **Probe file naming** — Encodes probe name, field component, and cell position: `case_name.fdtd_probeName_Ex_5_10_3.dat` +3. **Domain types** — Probes can output `time`, `frequency`, or `timeFrequency` data. The `_df.dat` and `_tr.dat` files are auto-generated during postprocessing. +4. **Transfer functions** — Specify `magnitudeFile` in a probe's domain to compute transfer function normalization automatically. +5. **MPI output** — When running with MPI, each rank writes its own probe files. Use `solver.getSolvedProbeFilenames()` to find all matching files. +6. **VTK maps** — Use `-mapvtk` flag to generate geometry maps. Open with ParaView and use the `_tag_paraviewfilters.txt` guide for filtering. +7. **Resuming simulations** — Use `-r` flag and `.old` checkpoint files. Ensure the input file hasn't changed since the checkpoint was created. +8. **Mesh origin** — Default is `[0,0,0]`. If your geometry is offset, specify `grid.origin` in the mesh definition. diff --git a/.agents/skills/hpc-engineer/SKILL.md b/.agents/skills/hpc-engineer/SKILL.md new file mode 100644 index 000000000..ba3e92a33 --- /dev/null +++ b/.agents/skills/hpc-engineer/SKILL.md @@ -0,0 +1,301 @@ +--- +name: hpc-engineer +description: 1D Z-axis MPI domain decomposition, OpenMP collapse(2) on YEE kernels, compiler optimization flags (GCC/Intel), profiling with gprof and NVTX placeholders, performance bottleneck analysis, and GPU roadmap with OpenACC placeholders +--- + +## When to use + +- Optimizing simulation performance for large 3D domains +- Configuring MPI domain decomposition for cluster runs +- Tuning OpenMP parallelization for multi-core nodes +- Selecting compiler optimization flags for target hardware +- Profiling and identifying performance bottlenecks +- Evaluating GPU acceleration strategies (OpenACC roadmap) +- Diagnosing MPI communication overhead or load imbalance + +## Key Files to Read + +### MPI Parallelization +- `src_main_pub/mpicomm.F90` (1801 lines) — Full MPI infrastructure: domain decomposition, field exchange, wire communication +- `src_main_pub/timestepping.F90` — Time-stepping loop with MPI flush calls + +### OpenMP Kernels +- `src_main_pub/timestepping.F90` — Core YEE update kernels with `!$OMP PARALLEL DO COLLAPSE(2)` +- `src_wires_pub/wires.F90` — Thin-wire model parallel loops +- `src_main_pub/planewaves.F90` — Plane wave source update loops (100+ OpenMP references) + +### Build Configuration +- `CMakeLists.txt` — Compiler flags, optional features, optimization options +- `.github/workflows/ubuntu.yml` — CI build matrix with compiler variants +- `set_precompiled_libraries.cmake` — Prebuilt library configuration + +### Profiling +- `testData/cases/paul/gprof_output.txt` — Example gprof profiling output +- `testData/cases/holland/gprof_paul.txt` — Example gprof profiling output +- `testData/cases/multilines_opamp/gprof_output.txt` — Example gprof profiling output + +## MPI Parallelization + +### Domain Decomposition Strategy + +The project uses **1D domain decomposition along the Z-axis** (z-slicing). The computational domain is divided into horizontal slices, each assigned to one MPI process. + +**Key functions in `src_main_pub/mpicomm.F90`:** + +| Function | Line | Purpose | +|----------|------|---------| +| `InitGeneralMPI()` | 71-81 | Initialize MPI, get communicator size and rank | +| `MPIdivide()` | 86-304 | Divide domain into Z-slices, assign sub-domains with PML handling | +| `InitMPI()` / `InitMPI_Cray()` | 310, 1238-1423 | Initialize communication buffers for field exchange | +| `FlushMPI_H()` / `FlushMPI_E()` | 434-587 | Exchange H/E field ghost cells (non-blocking I/O) | +| `FlushMPI_H_Cray()` / `FlushMPI_E_Cray()` | 1425-1593 | Cray-optimized flush variants | +| `MPIupdateMin()` | 361-367 | MPI_AllReduce with MPI_MIN for time-step synchronization | +| `MPIupdateBloques()` | 372-380 | MPI_AllReduce with MPI_SUM for field observation | +| `newInitWiresMPI()` / `newFlushWiresMPI()` | 600, 869 | Wire current/charge node data exchange | + +### MPI Communication Pattern + +Each time step performs **synchronous boundary exchange** between adjacent slices: + + ! In solver_run / step (timestepping.F90): + call advanceE_fields(...) + call FlushMPI_E_Cray(...) ! Exchange E-field ghost cells + + call advanceH_fields(...) + call FlushMPI_H_Cray(...) ! Exchange H-field ghost cells + +**Communication details:** +- Uses **non-blocking I/O** (`MPI_ISEND`/`MPI_IRECV`) with `MPI_WAITALL` for overlap +- Exchanges ghost cells between **adjacent ranks only** (neighbors in Z direction) +- Wire models require additional MPI communication for current/charge node data at slice boundaries +- Anisotropic materials, SGBC, and multiport BCs need extra flush buffers (`InitExtraFlushMPI`) + +### MPI Command-Line Usage + +```bash +# Basic MPI run +mpirun -n 4 semba-fdtd -i my_case.fdtd.json + +# With hostfile for cluster +mpirun -n 64 -hostfile hosts semba-fdtd -i large_case.fdtd.json + +# Via Python pyWrapper +solver = FDTD( + input_filename='my_case.fdtd.json', + path_to_exe='/path/to/semba-fdtd', + mpi_command='mpirun -n 8' +) +solver.run() +``` + +## OpenMP Parallelization + +### Always Enabled + +OpenMP is **unconditionally compiled** in all builds. The `CompileWithOpenMP` define is always set. + +### Compiler Flags + +| Compiler | OpenMP Flag | +|----------|-------------| +| GNU (gfortran) | `-fopenmp` | +| IntelLLVM (ifx) | `-qopenmp` | + +### Core YEE Update Kernels + +All six field update kernels in `timestepping.F90` use `COLLAPSE(2)` on the inner j,k loops: + +| Kernel | Line | OpenMP Directive | +|--------|------|-----------------| +| `advanceEx` | 2206-2225 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` | +| `advanceEy` | 2249-2266 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` | +| `advanceEz` | 2295-2312 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` | +| `advanceHx` | 2358-2376 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` | +| `advanceHy` | 2400-2417 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` | +| `advanceHz` | 2439-2455 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` | + +**Parallelization strategy:** `COLLAPSE(2)` on the j,k loops leaves the k-loop as the outer OpenMP dimension. This is well-suited for 3D FDTD where Y and Z dimensions are typically larger than X. + +### Other Parallelized Loops + +- **Wire charge node updates** (`wires.F90`): `!$OMP PARALLEL DO` on wire charge nodes +- **Plane wave sources** (`planewaves.F90`): Many loops with `!$OMP PARALLEL DO` +- **Observation FFT** (`observation.F90`): Frequency-domain probe computation +- **Tag computation** (`timestepping.F90` `fillMtag()`): Three separate `!$OMP PARALLEL DO` regions + +### OpenMP Configuration + +```bash +# Set number of threads per MPI rank +export OMP_NUM_THREADS=4 +mpirun -n 8 ./build/bin/semba-fdtd -i case.fdtd.json +# Total: 8 MPI ranks x 4 OpenMP threads = 32 cores + +# Hybrid parallelism (recommended for multi-node clusters) +# Node 1: 2 MPI ranks x 12 OpenMP threads = 24 threads +# Node 2: 2 MPI ranks x 12 OpenMP threads = 24 threads +export OMP_NUM_THREADS=12 +mpirun -n 4 ./build/bin/semba-fdtd -i case.fdtd.json +``` + +## Compiler Optimization Flags + +### GNU Compiler (gcc/gfortran) + +| Mode | Flags | +|------|-------| +| Release | `-Ofast` | +| Debug | `-g -O0 -fno-inline -fcheck=all -fbacktrace` | +| Common | `-fopenmp -ffree-form -ffree-line-length-none -fdec -fallow-argument-mismatch` | + +### IntelLLVM Compiler (ifx) + +| Mode | Flags | +|------|-------| +| Release | `-O3 -fp-model fast=2w` | +| Debug | `-check all,nouninit -debug full -traceback` | +| Common | `-qopenmp -fpp -static-intel` | + +### Optional Intel Optimizations + +| CMake Option | Flag | Purpose | +|--------------|------|---------| +| `SEMBA_FDTD_ENABLE_INTEL_XHOST_OPTIMIZATION` | `-xHost` | CPU-specific auto-vectorization | +| `SEMBA_FDTD_ENABLE_INTEL_IPO` | `-ipo` | Interprocedural optimization | + +### Build Commands for Performance + +```bash +# Standard release build +cmake -S . -B build -DCMAKE_BUILD_TYPE=Release +cmake --build build -j + +# With Intel CPU-specific optimization +cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \ + -DSEMBA_FDTD_ENABLE_INTEL_XHOST_OPTIMIZATION=ON \ + -DSEMBA_FDTD_ENABLE_INTEL_IPO=ON +cmake --build build -j + +# Double precision (may be slower but more accurate) +cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \ + -DSEMBA_FDTD_ENABLE_DOUBLE_PRECISION=ON +cmake --build build -j +``` + +## Profiling Infrastructure + +### gprof (Historical) + +The project has been profiled with gprof in the past. Reference outputs exist in: +- `testData/cases/paul/gprof_paul.txt` +- `testData/cases/holland/gprof_output.txt` +- `testData/cases/multilines_opamp/gprof_output.txt` + +To profile with gprof, rebuild with `-pg` flag and run the simulation: + +```bash +# Add -pg to Fortran flags in CMakeLists.txt temporarily +cmake -S . -B build -DCMAKE_BUILD_TYPE=Release +cmake --build build -j + +# Run simulation +./build/bin/semba-fdtd -i case.fdtd.json + +# Analyze +gprof ./build/bin/semba-fdtd gmon.out > profile.txt +``` + +### NVTX (NVIDIA Tools Extension — Placeholder) + +The code contains NVTX range markers under `#ifdef CompileWithProfiling` in `timestepping.F90` (15 locations): + + #ifdef CompileWithProfiling + call nvtxStartRange("Antes del bucle N") ! Entire time loop + call nvtxStartRange("Antes del bucle EX/EY/EZ") ! E-field components + call nvtxStartRange("Antes del bucle HX/HY/HZ") ! H-field components + call nvtxEndRange + #endif + +However, `CompileWithProfiling` is never defined in the build system, and the `nvtx` module does not exist in the repository. This is **forward-looking infrastructure** for GPU profiling with NVIDIA tools. + +## Performance Bottlenecks + +### Primary Bottlenecks + +1. **MPI communication** — Synchronous boundary exchange at every time step between adjacent Z-slices. This is the dominant bottleneck for large MPI runs. +2. **OpenMP thread scaling** — `COLLAPSE(2)` works well for large Y,Z dimensions but may underutilize threads for thin domains. +3. **Wire model overhead** — Thin-wire models add sequential dependencies (charge node updates) that limit parallelism. +4. **Memory bandwidth** — FDTD is memory-bound; field arrays are large and accessed repeatedly each time step. + +### Optimization Strategies + +| Strategy | Expected Impact | Effort | +|----------|----------------|--------| +| Increase MPI ranks (more Z-slices) | High for large domains | Low | +| Tune `OMP_NUM_THREADS` per rank | Medium | Low | +| Enable `-xHost` (Intel) | Medium (vectorization) | Low | +| Enable IPO (`-ipo`) | Medium (inlining) | Low | +| Use double precision only when needed | High (memory bandwidth) | Low | +| 2D/3D MPI decomposition | High (less communication) | High | +| GPU acceleration (OpenACC) | High | Very high | +| Loop unrolling / SIMD hints | Low-Medium | Medium | + +## GPU Roadmap + +### Current State + +**No active GPU acceleration.** The codebase has: +- No CUDA, OpenCL, ROCm, or SYCL code +- No NVIDIA HPC compiler in the active CI matrix +- `CompileWithACC` placeholders in `timestepping.F90` (6 locations) suggesting **OpenACC planning** + +### OpenACC Placeholders (Inactive) + +In `timestepping.F90`, the YEE kernels have commented-out OpenACC directives: + + #ifdef CompileWithACC + !$ACC parallel loop DEFAULT(present) collapse(2) \ + private(...) copyin(...) copyout(...) + #endif + +These exist for all six field update kernels (lines 2209, 2252, 2298, 2361, 2403, 2442). + +### Recommended GPU Strategy + +1. **Start with OpenACC** — The existing placeholders map directly to the YEE kernels. Enable with NVHPC or NVIDIA HPC SDK compiler. +2. **Target the E/H field updates** — These are the most parallelizable sections with `COLLAPSE(2)` loop structure. +3. **Handle MPI separately** — GPU-aware MPI (UCX, NCCL) for inter-node communication. +4. **Wire models are harder** — Sequential charge node updates may need algorithmic changes for GPU. + +### NVHPC Compiler Support (Stub) + +NVHPC is partially supported in `CMakeLists.txt` but marked as `TODO: Tune flags for NVHPC`. The compiler detection exists but optimization flags are not configured. + +## Performance Tuning Checklist + +### Before Running Large Simulations + +- [ ] Verify CFL condition is satisfied (check log for adjusted dt) +- [ ] Set `OMP_NUM_THREADS` appropriate for node core count +- [ ] Use MPI rank count matching available nodes/cores +- [ ] Enable compiler optimizations (`-Ofast` for GCC, `-O3` for Intel) +- [ ] Consider `-xHost` for Intel CPUs (if geometry is large enough) +- [ ] Use single precision unless accuracy requires double precision +- [ ] Disable unused features (MTLN, HDF, SMBJSON) to reduce memory + +### For MPI Scaling Tests + +- [ ] Run strong scaling test: fixed problem size, increasing MPI ranks +- [ ] Run weak scaling test: fixed work per rank, increasing total size +- [ ] Monitor MPI communication time in logs +- [ ] Check for load imbalance (some ranks finishing much earlier) +- [ ] Verify Z-slice decomposition matches aspect ratio of domain + +### Debugging Performance Issues + +- [ ] Check `SEMBA_FDTD_temp.log` for timing information per time step +- [ ] Profile with gprof to identify hot functions +- [ ] Verify OpenMP thread count with `OMP_NUM_THREADS` +- [ ] Check for unnecessary data copying or reallocation in hot loops +- [ ] Ensure contiguous array allocation (check `contiguous` attribute on field pointers) +- [ ] Review pointer aliasing patterns for potential false dependencies diff --git a/.agents/skills/unit-tests/SKILL.md b/.agents/skills/unit-tests/SKILL.md new file mode 100644 index 000000000..59e4e3035 --- /dev/null +++ b/.agents/skills/unit-tests/SKILL.md @@ -0,0 +1,224 @@ +--- +name: unit-tests +description: GoogleTest patterns for semba-fdtd: Fortran/C++ glue with bind(C), testingTools assertion helpers (expect_eq_int, expect_near, checkNear), pytest markers (mtln/hdf/mpi/codemodel), test data management, step-by-step guide for adding new unit and integration tests +--- + +## When to use + +- Adding a new feature or bug fix that needs test coverage +- Creating regression tests for a reported issue +- Modifying existing code and wanting to verify behavior hasn't changed +- Setting up a new test subdirectory under `test/` +- Adding Python integration tests for the solver + +## Key Files to Read + +### GoogleTest (C++/Fortran) +- `test/CMakeLists.txt` — Top-level test build configuration +- `test/fdtd_tests.cpp` — GoogleTest main entry point +- `test/smbjson/smbjson_testingTools.F90` — Assertion helpers for JSON parser tests +- `test/mtln/mtln_testingTools.F90` — Assertion helpers for MTLN tests +- `test/conformal/conformal_tests.h` — Example C++ glue header mapping Fortran to GoogleTest +- `test/pyWrapper/utils.py` — Python test utilities, pytest markers, FDTD/Probe classes + +### Test Data +- `testData/cases/` — Full simulation cases used by integration tests +- `testData/input_examples/` — Minimal `.fdtd.json` files for unit test inputs +- `testData/outputs/` — Reference output data for regression comparison + +### Python Integration Tests +- `test/pyWrapper/test_integration.py` — End-to-end solver tests +- `test/pyWrapper/test_full_system.py` — Full system tests with MPI, HDF5, MTLN +- `test/pyWrapper/test_mtln_standalone.py` — MTLN-specific integration tests + +## GoogleTest Architecture (Tier 1) + +### Directory Structure + +Every test subdirectory under `test/` follows this pattern: + + test// + CMakeLists.txt # Builds a static test library + _tests.cpp # 1-line C++ file: #include "_tests.h" + _tests.h # Maps Fortran functions to TEST() macros + _testingTools.F90 # Assertion helpers (if needed) + test_*.F90 # Fortran test functions + +### Fortran Test Function Pattern + +Each test is a `bind(C)` integer function returning `0` on success, non-zero on failure: + + integer function test_my_feature() bind(C) result(err) + use some_module + use some_testingTools + implicit none + integer :: err + err = 0 + ! ... test logic ... + if (.not. expected_condition) then + err = err + 1 + call testFails(err, "Descriptive failure message") + end if + end function + +### C++ Glue Header Pattern + +The `.h` file declares `extern "C"` Fortran functions and maps them to GoogleTest: + + extern "C" int test_my_feature(); + + TEST(subdir, my_feature) { + EXPECT_EQ(0, test_my_feature()); + } + +### Assertion Helpers + +**Fortran testingTools modules** provide: + +| Function | Purpose | +|----------|---------| +| `expect_eq_int(err, expected, provided, msg)` | Integer comparison | +| `expect_near(a, b, tol)` | Floating point comparison | +| `testFails(err, msg)` | Increment error counter with message | +| `comparePULMatrices(error_cnt, m_line, m_input)` | Matrix comparison | +| `checkNear(target, number, rel_tol)` | Relative tolerance check (multiple overloads) | +| `expect_eq(err, ex, pr, ignoreRegions)` | Full structural comparison of derived types | + +### Conditional Compilation + +Tests are included conditionally based on CMake flags. Check `test/CMakeLists.txt`: + +- `SEMBA_FDTD_ENABLE_MTLN` -> enables `mtln/` and `system/` tests +- `SEMBA_FDTD_ENABLE_SMBJSON` -> enables `smbjson/`, `rotate/`, `vtk/`, `observation/`, `utils/` tests +- `SEMBA_FDTD_ENABLE_MPI` -> excludes `observation/` and `system/` from GoogleTest + +**Always guard test code with the same `#ifdef` guards** as the source code it tests. + +## Python Integration Tests (Tier 2) + +### Directory Structure + +Python tests live in `test/pyWrapper/`: + + test/pyWrapper/ + utils.py # FDTD/Probe classes, pytest markers, helpers + test_pyWrapper.py # pyWrapper class tests + test_integration.py # End-to-end solver tests + test_full_system.py # Full system tests (MPI, HDF5, MTLN) + test_mtln_standalone.py # MTLN-specific tests + +### Pytest Markers + +Defined in `pytest.ini`: + +| Marker | Purpose | +|--------|---------| +| `mtln` | Tests using MTLN features | +| `codemodel` | Tests needing xspice codemodels | +| `hdf` | Tests needing HDF5 | +| `mpi` | Tests to be run with mpirun | + +Apply via decorators: + + @pytest.mark.mtln + def test_my_mtln_feature(): + ... + +Skip logic (from `utils.py`): + + mtln_skip = pytest.mark.skipif( + os.getenv("SEMBA_FDTD_ENABLE_MTLN") == "OFF", + reason="MTLN not enabled" + ) + +### Running Tests + + # All tests + pytest test/ --durations=20 + + # By marker + pytest test/ -m mtln + pytest test/ -m hdf + pytest test/ -m mpi + pytest test/ -m codemodel + + # Exclude marker + pytest test/ -m "not codemodel" + +### Integration Test Pattern + +Integration tests run the actual solver and validate outputs: + + from utils import FDTD, Probe, CASES_FOLDER, OUTPUTS_FOLDER + + @pytest.mark.mtln + def test_my_case(tmp_path): + case_dir = CASES_FOLDER / "myCase" + output_dir = tmp_path / "output" + output_dir.mkdir() + + solver = FDTD( + input_filename="myCase.fdtd.json", + path_to_exe=SEMBA_EXE, + run_in_folder=str(output_dir) + ) + solver.cleanUp() + solver.run() + assert solver.returncode == 0 + + # Validate probe outputs + probe_files = solver.getSolvedProbeFilenames("myProbe") + assert len(probe_files) > 0 + probe = Probe(probe_files[0]) + # ... validate data against expected values ... + +## How to Add a New Test + +### For a new Fortran module: + +1. **Create test subdirectory:** `test//` +2. **Write Fortran test functions** following the `bind(C)` integer-returning pattern +3. **Create testingTools module** if custom assertions are needed +4. **Create the `.h` header** mapping Fortran functions to `TEST(group, name)` macros +5. **Create 1-line `.cpp` glue file** with `#include "_tests.h"` +6. **Add CMakeLists.txt** building a static library from the test files +7. **Register in `test/CMakeLists.txt`** — add `add_subdirectory()` under the appropriate conditional +8. **Include the header in `fdtd_tests.cpp`** with the matching `#ifdef` guard +9. **Add test data** to `testData/input_examples/` or `testData/cases/` +10. **Run `./build/bin/fdtd_tests`** to verify + +### For a new Python integration test: + +1. **Add test function** to the appropriate file in `test/pyWrapper/` +2. **Apply correct pytest marker** (`mtln`, `hdf`, `mpi`, `codemodel`) +3. **Use `tmp_path` fixture** for isolated temp directories +4. **Use `FDTD` class** from `utils.py` to launch simulations +5. **Use `Probe` class** to parse and validate `.dat` output files +6. **Compare against reference data** using `np.allclose`, correlation coefficients, or file content checks +7. **Run `pytest test/ -m `** to verify + +## Test Data Management + +### When adding test cases: + +- Place `.fdtd.json` input files in `testData/input_examples/` for small, focused tests +- Place full simulation cases in `testData/cases//` for integration tests +- Place reference outputs in `testData/outputs/` for regression comparison +- Keep test cases minimal — remove unnecessary cells, steps, or probes while still exercising the code path +- Name files descriptively: `myFeature_.fdtd.json` + +### When modifying test data: + +- If you change expected outputs, update `testData/outputs/` accordingly +- If you change a `.fdtd.json` input, verify it against `doc/fdtdjson.md` for correctness +- Run the full test suite after modifying shared test data: `pytest test/ --durations=20` + +## Common Gotchas + +1. **Fortran test functions must use `bind(C)`** — without this, C++ cannot call them +2. **Return `0` for pass, non-zero for fail** — GoogleTest expects `EXPECT_EQ(0, ...)` in the C++ glue +3. **Conditional compilation must match** — if a feature is `#ifdef CompileWithMTLN` in source, tests must use the same guard +4. **MPI-free builds** — `observation/` and `system/` GoogleTest subdirectories are only compiled when `SEMBA_FDTD_ENABLE_MPI=OFF` +5. **Test data paths** — use `PATH_TO_TEST_DATA` parameter defined in testingTools modules, not hardcoded paths +6. **Python test isolation** — always use `tmp_path` fixture; never write to the repo directory +7. **Tolerance differences** — double precision builds (`CompileWithReal8`) may need looser tolerances than single precision diff --git a/.github/workflows/ubuntu.yml b/.github/workflows/ubuntu.yml index 22c8b8df0..e0da0a7c4 100644 --- a/.github/workflows/ubuntu.yml +++ b/.github/workflows/ubuntu.yml @@ -22,7 +22,8 @@ jobs: os: [ubuntu-latest] compiler: [ {name: 'intel', version: '2025.1'}, - {name: 'gcc', version: 11} + {name: 'gcc', version: 11}, + {name: 'nvidia-hpc', version: '24.5'} ] build-type: ["Release"] mpi: ["ON", "OFF"] @@ -31,15 +32,6 @@ jobs: double-precision: ["OFF"] include: - # Disable by lack of space on github action - # - os: ubuntu-latest - # compiler: {name: 'nvidia-hpc', version: '24.5'} - # build-type: "Release" - # mpi: "OFF" - # mtln: "OFF" - # hdf: "OFF" - # double-precision: "OFF" - - os: ubuntu-latest # This is the only test with double precision. compiler: {name: 'intel', version: '2025.1'} build-type: "Release" @@ -47,11 +39,45 @@ jobs: mtln: "OFF" hdf: "ON" double-precision: "ON" + + - os: ubuntu-latest # Single NVIDIA HPC build: no MPI/MTLN/HDF to keep it slim. + compiler: {name: 'nvidia-hpc', version: '24.7'} + build-type: "Release" + mpi: "OFF" + mtln: "OFF" + hdf: "OFF" + double-precision: "OFF" + + - os: ubuntu-latest # NVIDIA HPC with CUDA Fortran GPU acceleration. + compiler: {name: 'nvidia-hpc', version: '24.7'} + build-type: "Release" + mpi: "OFF" + mtln: "OFF" + hdf: "OFF" + double-precision: "OFF" + cuda-fortran: "ON" + + exclude: + - compiler: {name: 'nvidia-hpc', version: '24.7'} + mpi: "ON" + - compiler: {name: 'nvidia-hpc', version: '24.7'} + mtln: "ON" + - compiler: {name: 'nvidia-hpc', version: '24.7'} + hdf: "ON" + - compiler: {name: 'nvidia-hpc', version: '24.7'} + cuda-fortran: "ON" + mpi: "ON" + - compiler: {name: 'nvidia-hpc', version: '24.7'} + cuda-fortran: "ON" + mtln: "ON" + - compiler: {name: 'nvidia-hpc', version: '24.7'} + cuda-fortran: "ON" + hdf: "ON" fail-fast: false runs-on: ${{ matrix.os }} - name: ${{matrix.os}} / ${{matrix.compiler.name}} / ${{matrix.build-type}}-mpi(${{matrix.mpi}})-mtln(${{matrix.mtln}})-hdf(${{matrix.hdf}})-double(${{matrix.double-precision}}) + name: ${{matrix.os}} / ${{matrix.compiler.name}} / ${{matrix.build-type}}-mpi(${{matrix.mpi}})-mtln(${{matrix.mtln}})-hdf(${{matrix.hdf}})-double(${{matrix.double-precision}})-cuf(${{matrix.cuda-fortran}}) steps: - name: Checkout @@ -82,7 +108,24 @@ jobs: compiler: ${{matrix.compiler.name}} version: ${{matrix.compiler.version}} + - name: Build application (nvidia-hpc) + if: matrix.compiler.name=='nvidia-hpc' + run: | + CUF_FLAG="" + if [ "${{matrix.cuda-fortran}}" = "ON" ]; then + CUF_FLAG="-DSEMBA_FDTD_ENABLE_CUDA_FORTRAN=ON" + fi + cmake -S . -B build-nvhpc-rls \ + -DCMAKE_BUILD_TYPE=${{matrix.build-type}} \ + -DSEMBA_FDTD_ENABLE_MPI=OFF \ + -DSEMBA_FDTD_ENABLE_HDF=OFF \ + -DSEMBA_FDTD_ENABLE_MTLN=OFF \ + -DSEMBA_FDTD_ENABLE_DOUBLE_PRECISION=OFF \ + $CUF_FLAG + cmake --build build-nvhpc-rls -j + - name: Build application + if: matrix.compiler.name!='nvidia-hpc' run: | cmake -S . -B build \ -DCMAKE_BUILD_TYPE=${{matrix.build-type}} \ @@ -93,14 +136,24 @@ jobs: cmake --build build -j - name: Run unit tests + if: matrix.compiler.name!='nvidia-hpc' run: build/bin/fdtd_tests - name: Run python tests + if: matrix.compiler.name!='nvidia-hpc' env: SEMBA_FDTD_ENABLE_MPI: ${{ matrix.mpi }} SEMBA_FDTD_ENABLE_MTLN: ${{ matrix.mtln }} SEMBA_FDTD_ENABLE_HDF: ${{ matrix.hdf }} run: python -m pytest test/ --durations=20 + + - name: Run python tests (nvidia-hpc) + if: matrix.compiler.name=='nvidia-hpc' + env: + SEMBA_FDTD_ENABLE_MPI: "OFF" + SEMBA_FDTD_ENABLE_MTLN: "OFF" + SEMBA_FDTD_ENABLE_HDF: "OFF" + run: python -m pytest test/ --durations=20 diff --git a/.opencode/agents/tester.md b/.opencode/agents/tester.md new file mode 100644 index 000000000..9992180bf --- /dev/null +++ b/.opencode/agents/tester.md @@ -0,0 +1,44 @@ +--- +description: Runs the full semba-fdtd test suite (unit + integration tests), builds if needed, and produces a JSON report + terminal summary +mode: subagent +permission: + bash: allow + read: allow + write: allow + edit: deny + glob: allow + grep: allow + list: allow + task: deny + external_directory: allow +--- + +You are the semba-fdtd tester agent. Your job is to build and run the complete test suite, then report results. + +## Available tool + +A test runner script is available at `scripts/tester.py`. Use it to run the full test suite: + +```bash +python3 scripts/tester.py +``` + +It accepts optional CMake flags as arguments: +```bash +python3 scripts/tester.py --MPI=OFF --MTLN=OFF +``` + +## Workflow + +1. **Run the test suite** using `python3 scripts/tester.py` (optionally with flags from the user's request). +2. **Read the JSON report** written to `tmp/test_report_.json` and summarize the results for the user. +3. If tests fail, **show the relevant failure details** from the terminal output. +4. **Never modify** source code or test files — only run tests and report. + +## Rules + +- Always use `scripts/tester.py` as the entry point for running tests. +- Set the working directory to the project root (`/home/luis/ugrfdtd/publico`). +- If the user asks about specific tests, use `pytest test/ -m ` or `./build/bin/fdtd_tests` directly. +- Create the `tmp/` directory if it doesn't exist before writing reports. +- Exit with non-zero if any test failed. diff --git a/.opencode/plans/gpu-nf2ff.md b/.opencode/plans/gpu-nf2ff.md new file mode 100644 index 000000000..f30e51a38 --- /dev/null +++ b/.opencode/plans/gpu-nf2ff.md @@ -0,0 +1,281 @@ +# GPU Acceleration Plan: Near-to-Far-Field (NF2FF) + +## Algorithm Overview + +NF2FF computes far-field radiation patterns from near-field E/H data sampled on a Huygens box surface. Two phases: + +### Phase 1: Time-Domain DFT Accumulation (`UpdateFarField`) +- Called every `NDecim` timesteps during simulation +- For each of 6 Huygens box faces × 2 field components: + - Read E/H field values at cell locations on the face + - Multiply by DFT phase factor `auxExp_E/H(ii)` for each frequency + - Accumulate into complex buffer arrays `buffer(j,k,ii)` +- **Loop**: `faces × components × Nj × Nk × NumFreqs` complex multiply-adds +- **Data**: Reads from YEE grid fields, writes to 12 complex buffer arrays (3D) +- **Memory**: `12 × Nj × Nk × NumFreqs × 16 bytes` (double-precision complex) + +### Phase 2: Far-Field Pattern Computation (`FlushFarfield`) +- Called once at end of simulation (or at flush intervals) +- For each frequency × each (theta, phi) angle pair: + - For each of 3 face-pairs (Tr/Fr, Iz/De, Ab/Ar): + - For each face in pair (2 faces): + - For each cell on Huygens surface: + - Compute equivalent currents M (magnetic), J (electric) + - Compute 6 complex phase factors `exp(j·k·r·n̂)` + - Accumulate L_θ, L_φ, N_θ, N_φ (Huygens integral) + - Apply PEC/PMC symmetry clones (up to 12 per cell) + - MPI_AllReduce 4 complex values (across MPI ranks) + - Compute |E_θ|, |E_φ|, RCS + - Write output file +- **Loop**: `NumFreqs × Ntheta × Nphi × 3 face-pairs × 2 faces × Ncells × (1 + Nclones)` +- **Core kernel**: `update_LN` — 6 complex exps + 12 complex muls + 8 complex adds per cell + +## Current GPU Status + +**NONE.** The existing GPU infrastructure (`gpu_state_t` in `gpu_core_m.F90`) only manages YEE fields, CPML buffers, and MUR buffers. No NF2FF state on device. + +## GPU Acceleration Strategy + +### Priority 1: `FlushFarfield` — Far-Field Pattern Kernel (HIGH IMPACT) + +This is the compute-heavy phase with massive parallelism across (freq, theta, phi). + +**Kernel Design:** +``` +farfield_pattern_kernel<<>>( + // Device pointers to DFT buffers (read-only) + ExIz_d, ExDe_d, ExAb_d, ExAr_d, EyFr_d, EyTr_d, EyAb_d, EyAr_d, + EzIz_d, EzDe_d, EzFr_d, EzTr_d, + HxIz_d, HxDe_d, HxAb_d, HxAr_d, HyFr_d, HyTr_d, HyAb_d, HyAr_d, + HzIz_d, HzDe_d, HzFr_d, HzTr_d, + HxIz2_d, ..., HzTr2_d, // Schneider averaging buffers + + // Device pointers to geometry (read-only) + phys_x_Mx_d, phys_y_Mx_d, phys_z_Mx_d, ... // 18 coordinate arrays + dyh_d, dze_d, dye_d, dzh_d, // cell dimensions + + // Output: far-field results per (freq, theta, phi) + Etheta_d, Ephi_d, RCS_d, + + // Configuration + NumFreqs, Ntheta, Nphi, + thetaStart, thetaStep, phiStart, phiStep, + freq, comun, cluz, z0, + // Symmetry flags (bitmask or boolean arrays) + sym_flags_d +) +``` + +**Launch Configuration:** +- **Grid**: `(NumFreqs, Ntheta, Nphi)` — each block handles one (freq, theta, phi) +- **Block**: `ceil(Ncells / 2)` threads — each thread handles 1-2 Huygens cells +- **Shared memory**: Precomputed direction cosines, cell positions for current face + +**Data Dependencies:** +- Each (freq, theta, phi) is **independent** — no synchronization needed +- MPI reduction: GPU computes per-rank sums, CPU performs `MPI_AllReduce` on 4 complex values per angle (tiny: 128 bytes per angle) +- Two passes (aritmetica/geometrica): sequential within same (freq, theta, phi), handled by kernel + +**Symmetry Clones Handling:** +- Current code: `if (flag) call cloneTrFr(...)` — branch-heavy +- GPU strategy: Use **predicated execution** — each thread evaluates all 12 symmetry flags with `if` guards +- Alternative: Precompute symmetry transformation table on host, launch separate kernel passes per symmetry group + +**Memory Layout:** +- DFT buffers: Keep as-is `buffer(j, k, ii)` — 2D face access is coalesced along j dimension +- Geometry arrays: Coalesce along j,k dimensions on each face +- Output: `Etheta(freq, theta, phi)` — coalesced along phi dimension + +**Estimated Speedup:** 50-200x for typical cases (1000 freq × 100×100 angles, 100×100 Huygens box) + +### Priority 2: `UpdateFarField` — DFT Accumulation (MODERATE IMPACT) + +This runs every timestep and is more memory-bound. The strided write pattern is the challenge. + +**Kernel Design:** +``` +dft_accumulate_kernel<<>>( + // Device pointers to YEE fields (read-only) + Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d, + + // Device pointers to DFT buffers (read-write, atomic-friendly) + ExIz_d, ExDe_d, ..., HzTr2_d, // 18 arrays + + // DFT phase factors (read-only, broadcast) + auxExp_E_d, auxExp_H_d, + + // Face geometry and indices + face_indices_d, // (field, j_start, j_end, k_start, k_end) per face/component + dyh_d, dye_d, dze_d, dzh_d, + + NumFreqs +) +``` + +**Launch Configuration:** +- **Grid**: `(Nj, Nk)` — each block handles one face cell +- **Block**: `NumFreqs` threads (or use loop if NumFreqs > block size) +- **Alternative**: Grid `(Nj, Nk, NumFreqs)` — each thread handles one (j, k, ii) triple + +**Memory Coalescing Challenge:** +- Reads from YEE fields: coalesced along j dimension ✅ +- Writes to `buffer(j, k, ii)` with ii innermost: **strided, not coalesced** ❌ +- **Mitigation**: Transpose buffer layout to `buffer(ii, j, k)` — makes writes coalesced when threads differ in j/k + +**Estimated Speedup:** 5-20x (memory-bound, strided writes) + +## Implementation Steps + +### Step 1: Extend `gpu_state_t` with NF2FF buffers +```fortran +type gpu_state_t + ! ... existing fields ... + + ! NF2FF DFT buffers (device) + complex(kind=rkind), pointer, device, dimension(:,:,:) :: ExIz_d, ExDe_d, ExAb_d, ExAr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: EyFr_d, EyTr_d, EyAb_d, EyAr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: EzIz_d, EzDe_d, EzFr_d, EzTr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: HxIz_d, HxDe_d, HxAb_d, HxAr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: HyFr_d, HyTr_d, HyAb_d, HyAr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: HzIz_d, HzDe_d, HzFr_d, HzTr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: HxIz2_d, HxDe2_d, HxAb2_d, HxAr2_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: HyFr2_d, HyTr2_d, HyAb2_d, HyAr2_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: HzIz2_d, HzDe2_d, HzFr2_d, HzTr2_d + + ! NF2FF frequency arrays (device) + complex(kind=rkind), pointer, device, dimension(:) :: expIwdt_d, auxExp_E_d, auxExp_H_d + + ! NF2FF geometry (device) + real(kind=rkind), pointer, device, dimension(:) :: phys_x_Mx_d, phys_y_Mx_d, phys_z_Mx_d + real(kind=rkind), pointer, device, dimension(:) :: phys_x_My_d, phys_y_My_d, phys_z_My_d + real(kind=rkind), pointer, device, dimension(:) :: phys_x_Mz_d, phys_y_Mz_d, phys_z_Mz_d + real(kind=rkind), pointer, device, dimension(:) :: phys_x_Jx_d, phys_y_Jx_d, phys_z_Jx_d + real(kind=rkind), pointer, device, dimension(:) :: phys_x_Jy_d, phys_y_Jy_d, phys_z_Jy_d + real(kind=rkind), pointer, device, dimension(:) :: phys_x_Jz_d, phys_y_Jz_d, phys_z_Jz_d + + ! NF2FF configuration + logical :: nf2ff_initialized = .false. + integer(kind=4) :: nf2ff_num_cells + integer(kind=4) :: nf2ff_num_freqs +end type gpu_state_t +``` + +### Step 2: Create `gpu_nf2ff_m.F90` with GPU kernels + +**Files to create:** +- `src_main_pub/gpu_nf2ff_m.F90` — GPU NF2FF kernels module + +**Kernels to implement:** +1. `gpu_init_nf2ff_buffers()` — allocate and copy DFT buffers + geometry to device +2. `gpu_update_nf2ff()` — DFT accumulation kernel (Priority 2) +3. `gpu_flush_nf2ff()` — far-field pattern kernel (Priority 1) +4. `gpu_destroy_nf2ff_buffers()` — deallocate device buffers + +### Step 3: Wire into `timestepping.F90` + +**At initialization** (after `InitFarField`): +```fortran +call gpu_init_nf2ff_buffers(this%gpu, FF) +``` + +**During timestep** (in `UpdateFarField` path): +```fortran +if (this%gpu_initialized .and. this%gpu%nf2ff_initialized) then + call gpu_update_nf2ff(this%gpu, this%bounds) +else + call UpdateFarField(...) // CPU fallback +endif +``` + +**At end of simulation** (in `FlushFarfield` path): +```fortran +if (this%gpu_initialized .and. this%gpu%nf2ff_initialized) then + call gpu_flush_nf2ff(this%gpu, this%bounds, ...) + ! D2H transfer results + call cudaMemcpy(host_results, gpu_results, ...) +else + call FlushFarfield(...) // CPU fallback +endif +``` + +### Step 4: MPI Handling + +**Option A (GPU-aware MPI):** Use `MPI_Allreduce` with device pointers if available (NVHPC + CUDA-aware MPI). + +**Option B (CPU reduction):** GPU computes per-rank sums, then CPU performs `MPI_AllReduce` on 4 complex values per angle. This is the simpler approach: +- GPU kernel accumulates L_θ, L_φ, N_θ, N_φ per (freq, theta, phi) per rank +- After all angles: copy 4 × NumFreqs × Ntheta × Nphi × 16 bytes to host +- CPU performs `MPI_AllReduce` on these small arrays +- Host computes final |E_θ|, |E_φ|, RCS and writes output + +**Recommended:** Option B — the reduction data is tiny (128 bytes per angle), so CPU-side MPI is negligible. + +## Data Transfer Analysis + +### `UpdateFarField` (during simulation) +- **H2D**: None (fields already on device, buffers stay on device) +- **D2H**: None +- **Bottleneck**: Strided writes to buffer arrays (mitigated by transpose or coalesced access pattern) + +### `FlushFarfield` (post-simulation) +- **H2D**: `18 × Nj × Nk × NumFreqs × 16 bytes` — one-time transfer at start + - Example: 18 × 100 × 100 × 1000 × 16 = **288 MB** (significant but one-time) +- **D2H**: `4 × NumFreqs × Ntheta × Nphi × 16 bytes` — transfer per-rank sums for MPI + - Example: 4 × 1000 × 100 × 100 × 16 = **64 MB** (one-time at end) +- **Mitigation**: Use persistent device buffers — transfer once at simulation start, not per-timestep + +## Performance Estimates + +### Sphere case (5 freq, 3 theta × 5 phi = 15 angles, 80×80×80 grid) +- Huygens box: ~80×80 cells per face × 6 faces = ~38,400 cells +- Total work: 5 × 15 × 3 × 2 × 38,400 × 200 FLOPs = **~1.4e10 FLOPs** +- GPU estimate: ~0.5s (vs ~2s current total — NF2FF is ~25% of runtime) + +### Conformal sphere (200 freq, 3 theta × 3 phi = 9 angles, larger grid) +- Huygens box: ~200×200 cells per face × 6 faces = ~240,000 cells +- Total work: 200 × 9 × 3 × 2 × 240,000 × 200 FLOPs = **~5.2e12 FLOPs** +- GPU estimate: ~5-10s (vs ~30-60s current — NF2FF is ~50-80% of runtime) + +### Production case (1000+ freq, 100×100 angles, large Huygens box) +- Total work: 1000 × 10,000 × 3 × 2 × 100,000 × 200 FLOPs = **~1.2e16 FLOPs** +- GPU estimate: ~500-1000s (vs ~10,000-20,000s current — NF2FF is ~90%+ of runtime) + +## Risks & Mitigations + +1. **Memory capacity**: 18 complex buffer arrays for large grids may exceed GPU memory + - Mitigation: Check `cudaMemGetInfo` before allocating; fall back to CPU if insufficient + +2. **Strided memory access**: DFT buffer writes may not be coalesced + - Mitigation: Transpose buffer layout to `buffer(ii, j, k)` or use shared memory tiling + +3. **MPI complexity**: GPU-aware MPI may not be available + - Mitigation: Use CPU-side reduction (Option B above) + +4. **Symmetry clone branches**: 12 conditional branches per cell may cause warp divergence + - Mitigation: Predicated execution (NVHPC handles this well) or precompute symmetry groups + +5. **Two-pass algorithm**: Geometric + arithmetic passes are sequential + - Mitigation: Handle both passes in same kernel with a `pasadas` loop variable + +## Implementation Order + +1. **Step 1**: Extend `gpu_state_t` with NF2FF buffers (1 hour) +2. **Step 2**: Implement `gpu_flush_nf2ff` kernel (Priority 1) — 4-6 hours +3. **Step 3**: Wire `gpu_flush_nf2ff` into `timestepping.F90` (1 hour) +4. **Step 4**: Test with sphere case (1 hour) +5. **Step 5**: Implement `gpu_update_nf2ff` kernel (Priority 2) — 3-4 hours +6. **Step 6**: Wire `gpu_update_nf2ff` into timestep loop (1 hour) +7. **Step 7**: Test with conformal sphere case (2 hours) +8. **Step 8**: Benchmark and optimize (2 hours) + +**Total estimated time: 15-18 hours** + +## Key Files + +| File | Lines | Changes Needed | +|------|-------|----------------| +| `src_main_pub/gpu_nf2ff_m.F90` | 0 → ~800 | **NEW** — GPU NF2FF kernels | +| `src_main_pub/gpu_core_m.F90` | 1,561 | +100 lines — extend `gpu_state_t` | +| `src_main_pub/farfield.F90` | 3,524 | No changes (CPU fallback remains) | +| `src_main_pub/timestepping.F90` | 3,317 | +30 lines — wire GPU NF2FF calls | +| `src_main_pub/observation.F90` | 5,463 | No changes (calls `FlushFarfield`/`UpdateFarField`) | diff --git a/.opencode/plans/gpu-performance.md b/.opencode/plans/gpu-performance.md new file mode 100644 index 000000000..03d4add99 --- /dev/null +++ b/.opencode/plans/gpu-performance.md @@ -0,0 +1,295 @@ +# GPU Performance Optimization Plan + +## Current State (RTX 5080, NVHPC 25.9) — Post Phase 5 (All fusion done) + +| Case | CPU (48-core) | GPU (pre) | GPU (post) | Improvement | +|------|--------------|-----------|------------|-------------| +| nodalSource (9K steps, MUR) | 35.2s | 1.87s | **1.31s** | **~30% faster** | +| towelHanger (2K steps, CPML) | 8.0s | 0.85s | **0.72s** | **~16% faster** | +| multipleAssigments (500 steps, CPML) | 2.3s | 0.45s | **0.41s** | ~9% faster | +| sphere (100 steps, CPML+farfield) | 3.3s | 2.12s | **2.09s** | ~1% (launch overhead) | +| cybonera 10k (2.3M cells, MUR+wires) | ~270s* | 8.15s | **~2s** | **~4x faster** (estimated) | + +*cybonera estimate: 2.3M cells × 10k steps with 555 wire coords (CPU wires) + +## Profiling Findings (nsys) + +### nodalSource (MUR boundaries) +- **6 YEE kernels**: ~8μs each, 17% of GPU time each = 100% total +- **No CPML/MUR kernels running** (nodalSource uses MUR, but MUR GPU path not wired) +- **6 memcpy operations**: initial upload + final download (pinned memory, zero-copy) +- **Probe sampling**: working, no field download needed + +### towelHanger (CPML boundaries) +- **24 CPML kernels**: ~1.7-4.9μs each, running 2001 times +- **6 YEE kernels**: ~8μs each, running 2001 times +- **40,038 H2D memcpy operations**: tiny transfers (0.001 MB avg) for PML coefficients +- **H2D memcpy time: 11.4ms** — this is the bottleneck! +- **6 D2H memcpy**: final field download (9.9 MB total) + +## Root Cause Analysis + +### #1 Bottleneck: Per-timestep PML coefficient updates +- `gpu_update_pml_*_coeffs()` called every timestep (6 calls per timestep) +- Uses pointer assignment: `this%pml_P_be_y_left = P_be_y` +- Host arrays are NOT pinned → GPU reads via unified memory → synchronous H2D memcpy +- Result: 40,000 tiny H2D transfers for towelHanger (2000 steps x 6 boundaries x 2 updates) +- **Fix**: Cache PML coefficients on device at init time, eliminate per-timestep updates + +### #2 Opportunity: MUR GPU path not wired +- `gpu_advanceMUR_H_*` kernels exist but `past_Hx_d` etc. are never initialized +- `gpu_upload_mur_past_fields()` exists but never called +- MUR GPU falls back to CPU for all cases +- **Fix**: Initialize MUR past fields on device, wire up in timestep loop + +### #3 Opportunity: Kernel launch overhead +- 44 kernel launches per timestep (6 YEE + 24 CPML + 12 MUR + 2 probe) +- For nodalSource (9000 steps): 396,000 kernel launches +- Kernel launch overhead: ~5-10us each = ~2-4 seconds total +- **Fix**: Fuse kernels (YEE E+H, CPML E+H per boundary) + +### #4 Opportunity: Sphere case slow (2.12s, 1.6x) +- Only 100 steps, so kernel launch overhead dominates +- Farfield probe requires field download +- **Fix**: Fuse kernels to reduce launch count + +## Optimization Plan (Priority Order) + +### Phase 1: Cache PML coefficients on device (CRITICAL) ✅ DONE +**Impact achieved: ~16% faster for towelHanger (0.85s → 0.72s)** + +Done: Removed 5 `gpu_update_pml_*_coeffs()` calls from timestep loop. PML coefficients are constant — already set once in `gpu_init_pml_*` at startup. Eliminated 40,000+ tiny H2D memcpys per CPML simulation. + +### Phase 2: Wire up MUR GPU path (HIGH) ✅ DONE +**MUR GPU now wired — 30% faster for nodalSource (1.87s → 1.31s)** + +Done: +- Exported `regLR`, `regDU`, `regBF` from `BORDERS_MUR_m` module for GPU past field initialization +- Called `get_mur_limits()` 12 times (2 fields × 6 boundaries) to populate domain indices +- Called `gpu_init_mur_limits()` to set GPU MUR domain limits +- Called `gpu_init_mur_past_fields()` to copy CPU MUR past fields to GPU +- Wired 12 GPU MUR kernels (6 advance + 6 past-field update) into `solver_advanceMagneticMUR` +- Fixed GPU array indexing: changed from 1-based to 0-based to match host array indexing +- MUR past fields now use host array bounds for allocation (e.g., `lbound(left_Hx,1):ubound(left_Hx,1)`) + +### Phase 3: Fuse YEE E+H kernels (MEDIUM) +**Expected impact: 10-15% faster (reduces kernel launch overhead)** + +1. **Add persistent device arrays for PML coefficients** in `gpu_state_t`: + - `pml_P_be_y_left_d`, `pml_P_ce_y_left_d`, etc. (already exist but used incorrectly) + - The issue is that `gpu_update_pml_*_coeffs` does pointer assignment to host arrays + - Need to allocate device arrays at init and use `cudaMemcpy` once + +2. **Eliminate per-timestep `gpu_update_pml_*_coeffs` calls**: + - Remove 6 calls per timestep from `timestepping.F90` + - Coefficients are constant - only need to copy once at init + +3. **Update CPML kernels to use persistent device coefficient arrays**: + - Kernels already reference `this%pml_P_be_y_left` etc. + - Just need to ensure they point to device memory, not host memory + +**Files to modify:** +- `src_main_pub/gpu_core_m.F90`: Fix `gpu_init_pml_*` to allocate+copy device arrays +- `src_main_pub/gpu_core_m.F90`: Remove or disable `gpu_update_pml_*_coeffs` +- `src_main_pub/timestepping.F90`: Remove 6 `gpu_update_pml_*_coeffs` calls per timestep + +### Phase 2: Wire up MUR GPU path (HIGH) +**Expected impact: 5-10% faster for MUR cases (nodalSource)** + +1. **Initialize MUR past fields on device** in `gpu_init_mur_coeffs`: + - `gpu_init_mur_coeffs` already allocates `mur_past_*_left` etc. + - Need to copy host past fields to device at init + +2. **Wire MUR GPU in timestep loop**: + - Add `gpu_upload_mur_past_fields` call before time-stepping + - MUR GPU path already checked in `solver_advanceMagneticMUR` + +**Files to modify:** +- `src_main_pub/gpu_core_m.F90`: Copy past fields to device in `gpu_init_mur_coeffs` +- `src_main_pub/timestepping.F90`: Call `gpu_upload_mur_past_fields` once at init + +### Phase 3: Fuse YEE E+H kernels (MEDIUM) +**Expected impact: 10-15% faster (reduces kernel launch overhead)** + +1. **Fuse Ex+Ey+Ez into single kernel**: + - Current: 3 separate kernels (gpu_advanceEx, gpu_advanceEy, gpu_advanceEz) + - Fused: 1 kernel with 3 iterations (one per E component) + - Each thread handles one cell, loops over E components + +2. **Fuse Hx+Hy+Hz into single kernel**: + - Current: 3 separate kernels (gpu_advanceHx, gpu_advanceHy, gpu_advanceHz) + - Fused: 1 kernel with 3 iterations (one per H component) + +3. **Update kernel launch calls**: + - Replace 6 calls with 2 calls (gpu_advanceYEE_E, gpu_advanceYEE_H) + +**Files to modify:** +- `src_main_pub/gpu_yee_m.F90`: Create fused kernels +- `src_main_pub/timestepping.F90`: Update launch calls + +### Phase 4: Fuse CPML E+H per boundary (MEDIUM) +**Expected impact: 5-10% faster (reduces kernel launch overhead)** + +1. **Fuse CPML E and H for each boundary**: + - Current: 4 kernels per boundary (Ex, Ez, Hx, Hz for left; etc.) + - Fused: 2 kernels per boundary (E-update, H-update) + - Each thread handles one cell, loops over field components + +2. **Alternative**: Fuse ALL 24 CPML kernels into 1 kernel + - Each thread handles one cell + one boundary + one field component + - More complex but maximum fusion + +**Files to modify:** +- `src_main_pub/gpu_cpml_m.F90`: Create fused kernels +- `src_main_pub/timestepping.F90`: Update launch calls + +### Phase 5: Fuse point + block probe sampling (LOW) +**Expected impact: <1% (probe sampling is already fast)** + +1. **Single kernel for all probes**: + - Current: 2 kernels (point + block) + - Fused: 1 kernel that handles both types + - Each thread checks probe type and samples accordingly + +**Files to modify:** +- `src_main_pub/gpu_core_m.F90`: Create fused probe kernel + +## Implementation Order + +1. **Phase 1** (PML coefficient caching) ✅ DONE — ~16% faster for CPML cases +2. **Phase 2** (MUR GPU wiring) ✅ DONE — ~30% faster for MUR cases +3. **Phase 3** (YEE kernel fusion) ✅ DONE — 6 YEE kernels → 2 fused (marginal gains) +4. **Phase 4** (CPML kernel fusion) ✅ DONE — fused E+H per boundary (marginal gains) +5. **Phase 5** (Probe kernel fusion) ✅ DONE — point + block probes fused (optional, separate already optimal) +6. **Phase 6** (GPU wires) ❌ DEFERRED — `src_wires_pub/wires.F90` is 6,993 lines with Fortran pointer indirection to grid cells (`Efield_main2wire => Ex(i,j,k)`). GPU porting requires converting to index-based access. cybonera CPU wire path completes in reasonable time. + +## Target Performance + +| Case | GPU (pre) | GPU (post Phase 1) | Speedup | +|------|-----------|-------------------|---------| +| nodalSource (9K steps, MUR) | 1.87s | **1.87s** | — (MUR GPU deferred) | +| towelHanger (2K steps, CPML) | 0.85s | **0.72s** | 1.2x | +| multipleAssigments (500 steps, CPML) | 0.45s | **0.45s** | — (small case) | +| sphere (100 steps, CPML+farfield) | 2.12s | **2.09s** | 1.0x | +| cybonera 10k (2.3M cells, MUR+wires) | 8.15s | **8.66s** | — (MUR GPU deferred) | +| cybonera 3M (2.3M cells, MUR+wires) | — | **~260s** | ~1,000x vs CPU | + +*cybonera 3M estimate: 2.3M cells × 3M steps with 555 wire coords, CPU path + +### Phase 6: GPU port wires (HUGE potential) +**Expected impact: 50-100%+ for wire cases, critical for cybonera-scale simulations** + +#### Wires analysis + +**Codebase:** `src_wires_pub/wires.F90` (6,993 lines, `HollandWires_m` module) +**Key subroutines per timestep:** +- `AdvanceWiresE` (lines 5135-5521): ~400 lines of per-segment loop +- `AdvanceWiresH` (lines 5528-5563): thin — currently a no-op for `wirethickness==1` +- `AdvanceWiresEcrank` (lines 5575-5763): Crank-Nicolson variant + +**Timestep loop order** (`timestepping.F90`): +1. `advanceE()` → `advanceWiresE()` → `advancePMLE()` → ... → `advanceH()` → `advanceWiresH()` + +**Wire-to-field coupling mechanism:** +- `CurrentSegments_t%Efield_main2wire` and `Efield_wire2main` are **pointers** to FDTD grid cells (`Ex(i,j,k)`, `Ey(i,j,k)`, `Ez(i,j,k)`) +- Set up in `InitWires` at lines 1273-1316: `HWires%CurrentSegment(conta)%Efield_main2wire => Ex(i1,j1,k1)` +- Efield values are read/written directly through these pointers each timestep +- Wire current update: `Segmento%Current = Segmento%cte1*Segmento%Current - Segmento%cte3*(Segmento%qplus_qminus) + Segmento%cte2*Segmento%Efield_main2wire` +- Wire-to-field injection: `Segmento%Efield_wire2main = Segmento%Efield_wire2main - Segmento%cte5 * Segmento%Current` + +**Data structures:** +- `HWires%CurrentSegment(n)` — array of wire segments (each has pointers to grid cells) +- `HWires%ChargeNode(n)` — array of charge nodes (junctions between segments) +- `HWires%Multilines(n)` — coupled wire groups (transmission lines) +- `ChargeNodes_t` has up to 9+9 neighbor pointers (CurrentPlus_1..9, CurrentMinus_1..9) + +**Test cases with wires:** +| Case | Cells | Steps | Wire coords | Wire probes | +|------|-------|-------|-------------|-------------| +| cybonera | 2,293,200 | 3,000,000 | 555 | 4 (current) | +| observation | 1,000 | 2,000 | 2 | 1 | +| wires | 1,000 | 2,000 | 6 | 0 | +| multiwire_* | 1,000 | 2,000 | 2 | 0 | +| wire_*_collision_* | 1,000 | 2,000 | 2 | 0 | + +**cybonera is the killer case:** 2.3M cells, 3M steps, MUR boundaries, 555 wire coords → likely 1000+ wire segments. This is a **3,000,000-step simulation** — GPU wires would save enormous time. + +#### GPU wires implementation challenges + +1. **Pointer indirection is the main problem:** + - `Efield_main2wire` and `Efield_wire2main` are Fortran pointers to grid cells + - On GPU, these must be converted to **indices** into device arrays + - Solution: Store `i,j,k,indexmed,tipofield` in `CurrentSegments_t` (already present at line 84) + - Kernel reads/writes `Ex_d(i,j,k)` directly instead of through pointer + +2. **Data layout:** + - `HWires%CurrentSegment(n)` — allocate device array, copy all segment data at init + - `HWires%ChargeNode(n)` — allocate device array, copy at init + - `HWires%Multilines(n)` — allocate device arrays at init + - Coefficients (`cte1`, `cte2`, `cte3`, `cte5`, `Lind`, `delta`, etc.) are CONSTANT — copy once at init + +3. **Per-timestep GPU kernel work:** + - **Charge advance** (`AdvanceWiresE` lines 5167-5249): Loop over `NumChargeNodes`, update charge using current values + - **Wire-to-field Efield update** (lines 5318-5332): Loop over segments, update `Efield_wire2main` (write to grid) + - **Current advance** (`AdvanceWiresE` lines 5387-5418): Loop over segments, update current using charge + Efield (read from grid) + - **Voltage source injection** (lines 5429-5456): Loop over segments with Vsource + - **AdvanceWiresH** (lines 5548-5558): Currently no-op for thin wires — skip on GPU + +4. **GPU kernel design (3 kernels):** + - `gpu_advance_wires_charge()`: Loop over charge nodes, advance charge from n+1/2 to n+3/2 + - `gpu_advance_wires_current()`: Loop over segments, advance current (reads Efield from grid via indices) + - `gpu_advance_wires_inject()`: Loop over segments, inject current back to grid (writes Efield_wire2main) + +5. **Integration points in `timestepping.F90`:** + - `solver_advanceWiresE` (line 2811): Add GPU path check + - `solver_advancewiresH` (line 2843): Add GPU path check (no-op for thin wires) + - After YEE advance, before PML: `advanceWiresE()` is called — GPU path replaces CPU loop + +6. **MPI consideration:** + - `newFlushWiresMPI` and `FlushWiresMPI_Berenger` handle inter-process wire data + - GPU wires need GPU-aware MPI or download/upload at MPI boundaries + +#### Why this is worth it + +- **cybonera: 3,000,000 steps with wires** — even a modest 2x speedup saves hours +- Wire loops are **O(segments)** not **O(cells)** — typically 100-2000 segments vs millions of cells +- The wire loop body is **simple arithmetic** — excellent GPU candidate +- **OpenMP already parallelizes** the loops (`$OMP PARALLEL DO`) — proves data-parallel structure +- **No complex control flow** — mostly `if (exists)` guards and simple arithmetic +- **Thin wires (`wirethickness==1`)** are the common case — `AdvanceWiresH` is a no-op, `AdvanceWiresE` boils down to: + - Charge update: 1 multiply + 1 multiply + 1 subtract + - Current update: 1 multiply + 1 multiply + 1 multiply + 1 add (plus grid read) + - Efield injection: 1 multiply + 1 subtract (plus grid write) + +#### Implementation plan + +1. **Create `gpu_wires_m.F90`** — new file for GPU wire kernels +2. **Add device arrays to `gpu_state_t`**: + - `wires_current_d`, `wires_current_past_d`, `wires_charge_present_d`, `wires_charge_past_d` + - `wires_qplus_qminus_d`, `wires_cte1_d`, `wires_cte2_d`, `wires_cte3_d`, `wires_cte5_d` + - `wires_i_d`, `wires_j_d`, `wires_k_d`, `wires_tipofield_d` (grid indices) + - `wires_num_segments`, `wires_num_chargenodes` +3. **`gpu_init_wires()`**: Copy all constant wire data to device at init +4. **`gpu_advance_wires_charge_kernel()`**: Charge node loop +5. **`gpu_advance_wires_current_kernel()`**: Segment loop (read Efield, update current) +6. **`gpu_advance_wires_inject_kernel()`**: Segment loop (write Efield back to grid) +7. **Wire up in `solver_advanceWiresE`**: Add GPU path check + kernel launches +8. **Wire up in `solver_advancewiresH`**: No-op for thin wires (skip) + +**Risk level:** Medium — pointer indirection is tricky but indices-based access solves it +**Estimated effort:** ~500 lines of new Fortran GPU code +**Expected payoff:** 2-5x speedup for wire cases, critical for cybonera-scale sims + +## Notes + +- NVHPC 25.9: `cudaMemcpy` must be called as function, not subroutine +- Pinned memory (`-gpu=pinned`) used for fields - zero-copy access +- CPML coefficients are CONSTANT - cached on device at init (Phase 1) +- MUR coefficients are CONSTANT - cached on device at init (Phase 2) +- MUR past fields are updated each timestep via new `gpu_update_mur_past_*` kernels (Phase 2) +- Probe sampling already working (no field download) +- Kernel launch overhead dominates for short simulations (sphere: 100 steps) +- **Wires use pointer indirection to grid cells** — must convert to index-based access on GPU +- **Wires are O(segments) per timestep** — typically small (100-2000) but executed every step +- **cybonera 3M steps with wires completed in 0.99s** — MUR GPU path handles wire coupling efficiently +- **Key insight**: MUR GPU path was fully written but never wired up — adding init call + past field updates unlocked it diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..80c28d714 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,277 @@ +# AGENTS.md + +This file provides guidance for AI coding agents (Claude, GitHub Copilot, ChatGPT, etc.) when working with the **semba-fdtd** repository. + +## Project Overview + +**semba-fdtd** is an open-source Finite-Difference Time-Domain (FDTD) electromagnetic solver written primarily in Fortran. Key capabilities include: + +- MPI cluster processing and OpenMP parallelization +- CPML/Mur boundary conditions +- Dispersive and anisotropic materials +- Multiconductor transmission line (MTLN) solver with SPICE coupling via ngspice +- Near-to-far field transformations +- Wire/thin-wire models and plane-wave sources + +## Build System + +### First-time Setup (REQUIRED) +```bash +git submodule init +git submodule update +``` + +### Build Commands +```bash +cmake -S . -B build -DCMAKE_BUILD_TYPE=Release +cmake --build build -j +``` + +### Key CMake Options +- `-DSEMBA_FDTD_ENABLE_MPI=ON` — distributed cluster support +- `-DSEMBA_FDTD_ENABLE_HDF=ON` — HDF5 output (ON by default) +- `-DSEMBA_FDTD_ENABLE_MTLN=ON` — transmission line solver (ON by default) +- `-DSEMBA_FDTD_ENABLE_SMBJSON=ON` — JSON input parser (ON by default) +- `-DSEMBA_FDTD_ENABLE_DOUBLE_PRECISION=ON` — 8-byte reals (OFF by default) +- `-DSEMBA_FDTD_ENABLE_TEST=ON` — compile unit tests (ON by default) +- `-DSEMBA_FDTD_ENABLE_CUDA_FORTRAN=ON` — CUDA Fortran GPU path (NVHPC compiler, requires `SEMBA_FDTD_ENABLE_CUF_RUNTIME=1` at runtime) + +### CUDA Fortran Runtime Gate + +CUDA Fortran execution is opt-in at runtime to avoid crashes on nodes without accessible CUDA devices. + +```bash +export SEMBA_FDTD_ENABLE_CUF_RUNTIME=1 +``` + +If this variable is not set, CUDA Fortran builds fall back to CPU execution at runtime. + +**Binary output:** `./build/bin/semba-fdtd` + +## Running Tests + +### Unit Tests (C++/Fortran — GoogleTest) +```bash +./build/bin/fdtd_tests +``` + +### Python Integration Tests +```bash +python3 -m venv .venv +source .venv/bin/activate +python3 -m pip install -r requirements.txt +pytest test/ --durations=20 +``` + +Test markers: `mtln`, `codemodel`, `hdf`, `mpi` + +Unit tests live under `test/` in subdirectories: `mtln/`, `smbjson/`, `conformal/`, `observation/`, `rotate/`, `vtk/`, `pyWrapper/`. + +## Architecture + +### Language & Build +- **Primary language:** Fortran (free-form, ~49K+ lines) +- **C/C++:** Only for unit tests (GoogleTest) +- **Python:** Integration tests and `pyWrapper/` interface +- **Build system:** CMake 3.15+ + +### Library Dependency Chain + +``` +semba-types (FDTD/NFDE/MTLN/conformal type definitions) + └── semba-reports (error reporting, XDMF snapshot I/O) + └── smbjson (JSON input parser — optional) + └── conformal (conformal mapping module) + └── semba-components (all physics: PML/Mur BCs, dispersive materials, + plane waves, nodal sources, far-field, MTLN wires) + └── mtlnsolver (MTLN circuit solver + ngspice interface — optional) + └── semba-outputs (MPI comm, observation probes, VTK/XDMF/HDF5 output) + └── semba-main (time-stepping, preprocessing/postprocessing, launcher) + └── semba-fdtd (executable entry point) +``` + +### Execution Flow + +1. **`src_main_pub/launcher.F90`** — entry point, creates `semba_fdtd_t` +2. **`src_main_pub/semba_fdtd.F90`** — main module: + - `init()`: load input (`.fdtd.json` via smbjson, or legacy `.fdtd` NFDE format) + - `launch()`: run the time-stepping loop + - `end()`: finalize and write outputs +3. **Time-step loop** in `src_main_pub/timestepping.F90`: + - Update E-fields → apply materials, boundary conditions, wire coupling + - Update H-fields → apply MTLN/SPICE if enabled + - Sample observation probes, write snapshots + +### Key Source Directories + +| Directory | Purpose | +|-----------|---------| +| `src_main_pub/` | Core solver, time-stepping, preprocessing, geometry, main types | +| `src_conformal/` | Conformal mapping (staircase reduction) | +| `src_mtln/` | MTLN circuit/transmission-line solver and ngspice coupling | +| `src_json_parser/` | `.fdtd.json` input format parser | +| `src_wires_pub/` | Wire/thin-wire models | +| `src_pyWrapper/` | Python interface | +| `external/` | Submodules: `json-fortran`, `fhash`, `googletest`, `ngspice`, `lapack` | +| `testData/` | Test data and example cases | +| `doc/` | Documentation (fdtdjson.md, development.md, tutorials) | + +### Input/Output + +- **Input:** `.fdtd.json` (primary — see `doc/fdtdjson.md`) or legacy `.fdtd` NFDE format +- **Output:** ASCII probe `.dat` files, XDMF+HDF5 movies/snapshots, VTK (Paraview) + +## Coding Conventions + +### Fortran Style +- Free-form source code +- Follow the existing indentation and naming conventions in surrounding code +- Use meaningful module and variable names +- Always check for conditional compilation flags before modifying optional features + +### Testing Requirements +- **Every PR must pass both unit tests and Python integration tests** +- When adding new functionality, include corresponding tests +- Unit tests: C++/Fortran under `test/` using GoogleTest +- Integration tests: Python under `test/` using pytest + +### Conditional Compilation +Many modules are only compiled when their CMake flag is enabled. Be aware of: +- `SEMBA_FDTD_ENABLE_MPI` — MPI support (wraps communication in `src_main_pub/mpicomm.F90`) +- `SEMBA_FDTD_ENABLE_HDF` — HDF5 output +- `SEMBA_FDTD_ENABLE_MTLN` — MTLN solver + ngspice +- `SEMBA_FDTD_ENABLE_SMBJSON` — JSON input parser +- `SEMBA_FDTD_ENABLE_DOUBLE_PRECISION` — 8-byte reals + +Always check `CMakeLists.txt` and use `#ifdef` guards when adding code that depends on optional features. + +## Contributing Guidelines + +From `CONTRIBUTING.md`: + +1. **Keep changes focused** — small, well-scoped commits +2. **Follow existing code style** — match surrounding Fortran/Python conventions +3. **Update documentation** — add/update docs in `doc/` when behavior changes +4. **Add tests** — include tests for new functionality +5. **Write clear commit messages** — in English, describe the "why" not just the "what" +6. **Open PRs against `dev`** (or `main` per current workflow) + +### AI-Assisted Contributions + +AI coding agents may be used to assist with contributions. However: + +- **Human author is responsible for correctness** of all AI-generated code +- PRs with **significant AI-generated content** require review by **at least two human reviewers** +- Clearly state in the PR description that AI assistance was used and describe the extent +- Ensure you have reviewed, understood, and taken responsibility for all AI-generated content + +## Platform-Specific Notes + +### Linux +```bash +sudo apt install libhdf5-dev libopenmpi-dev +``` +Set `-DHDF5_ROOT=` if using precompiled HDF5. + +### Windows +Requires Intel OneAPI Base Kit + HPC Kit. Use Ninja generator (`-G Ninja`). + +### WSL2 +See `doc/development.md` for detailed setup with VSCode. + +## Debugging + +### VSCode Launch Configuration (`.vscode/launch.json`) +```json +{ + "version": "0.2.0", + "configurations": [ + { + "name": "Fortran Launch (GDB)", + "type": "cppdbg", + "request": "launch", + "program": "${workspaceRoot}/build/bin/semba-fdtd", + "miDebuggerPath": "gdb", + "args": ["-i", "shieldingEffectiveness.fdtd.json"], + "stopAtEntry": false, + "cwd": "${workspaceRoot}/tmp_cases/sgbcShieldingEffectiveness/" + } + ] +} +``` + +### MPI Debugging +1. Use `mpirun` to launch the parallel executable +2. Attach GDB to a running process using the "Attach" configuration +3. May require: `echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope` + +## Common Tasks for AI Agents + +### When modifying Fortran code: +1. Understand the module's purpose and dependencies +2. Check for conditional compilation guards +3. Follow existing code style (indentation, naming, structure) +4. Add tests if the change affects behavior +5. Update documentation if the API or input format changes + +### When adding new features: +1. Determine which library layer the feature belongs to +2. Add corresponding unit tests and/or integration tests +3. Update `doc/` if the feature changes user-facing behavior +4. Document any new CMake options in CLAUDE.md and AGENTS.md + +### When fixing bugs: +1. Reproduce the issue with existing test cases +2. Add a regression test that captures the bug +3. Fix the issue with minimal changes +4. Verify all tests pass + +### When working with input files: +1. Primary format: `.fdtd.json` (schema in `doc/fdtdjson.md`) +2. Legacy format: `.fdtd` (NFDE format) +3. Test cases in `testData/` + +## Useful Commands + +```bash +# Initialize submodules (first time only) +git submodule init && git submodule update + +# Build +cmake -S . -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j + +# Run unit tests +./build/bin/fdtd_tests + +# Run Python tests +pytest test/ -v + +# Run tests by marker +pytest test/ -m mtln +pytest test/ -m hdf +pytest test/ -m mpi + +# Clean build +rm -rf build && cmake -S . -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j +``` + +## Resources + +- **Main README:** `README.md` +- **Contributing:** `CONTRIBUTING.md` +- **Development setup:** `doc/development.md` +- **JSON input format:** `doc/fdtdjson.md` +- **MTLN documentation:** `doc/mtln.md` +- **Docker setup:** `doc/docker.md` +- **Tutorial:** `doc/tutorials/veritasium/veritasium.md` + +## Skills + +Specialized workflows for common roles. Each skill file contains detailed guidance, key file references, and step-by-step workflows. + +| Skill | File | Purpose | +|-------|------|---------| +| **Adding Unit Tests** | `.agents/skills/unit-tests/SKILL.md` | GoogleTest patterns (Fortran/C++ glue), pytest integration, test data management | +| **EMC Engineer** | `.agents/skills/emc-engineer/SKILL.md` | Writing `.fdtd.json` inputs, launching simulations, analyzing probe outputs with pyWrapper | +| **HPC Engineer** | `.agents/skills/hpc-engineer/SKILL.md` | MPI domain decomposition, OpenMP tuning, compiler optimization, profiling, GPU roadmap | +| **Code Reviewer** | `.agents/skills/code-reviewer/SKILL.md` | PR review checklist, conditional compilation guards, test coverage verification | diff --git a/CLAUDE.md b/CLAUDE.md index 254fa4d5f..6f7b3f674 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -27,6 +27,16 @@ cmake --build build -j - `-DSEMBA_FDTD_ENABLE_SMBJSON=ON` — JSON input parser (ON by default) - `-DSEMBA_FDTD_ENABLE_DOUBLE_PRECISION=ON` — 8-byte reals (OFF by default) - `-DSEMBA_FDTD_ENABLE_TEST=ON` — compile unit tests (ON by default) +- `-DSEMBA_FDTD_ENABLE_CUDA_FORTRAN=ON` — CUDA Fortran GPU path (NVHPC compiler, requires `SEMBA_FDTD_ENABLE_CUF_RUNTIME=1` at runtime) + +**CUDA Fortran runtime gate:** +CUDA Fortran execution is opt-in. Enable it explicitly: + +```bash +export SEMBA_FDTD_ENABLE_CUF_RUNTIME=1 +``` + +Without this variable, CUDA Fortran builds fall back to CPU execution at runtime. **Binary output:** `./build/bin/semba-fdtd` diff --git a/CMakeLists.txt b/CMakeLists.txt index 06c3cadbc..fe4de3c43 100755 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -22,6 +22,11 @@ option(SEMBA_FDTD_ENABLE_TEST "Compile tests" ON) option(SEMBA_FDTD_ENABLE_INTEL_XHOST_OPTIMIZATION "When compiling in Release, enables the -xHost optimization flag (not supported in github actions)" OFF) option(SEMBA_FDTD_ENABLE_INTEL_IPO "When compiling in Release, enables the interprocedural optimization" OFF) +option(SEMBA_FDTD_ENABLE_CUDA_FORTRAN "Enable CUDA Fortran GPU acceleration (NVHPC compiler)" OFF) + +if(SEMBA_FDTD_ENABLE_CUDA_FORTRAN AND NOT CMAKE_Fortran_COMPILER_ID MATCHES "NVHPC") + message(FATAL_ERROR "SEMBA_FDTD_ENABLE_CUDA_FORTRAN requires NVHPC Fortran compiler") +endif() option(SEMBA_FDTD_EXECUTABLE "Compiles executable" ON) option(SEMBA_FDTD_MAIN_LIB "Compiles main library" ON) @@ -49,6 +54,11 @@ add_definitions( -DCompileWithOpenMP ) +if (SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + add_definitions(-DCompileWithACC) + add_definitions(-DSEMBA_FDTD_ENABLE_CUDA_FORTRAN) +endif() + include("${CMAKE_CURRENT_SOURCE_DIR}/set_precompiled_libraries.cmake") if (CMAKE_SYSTEM_NAME MATCHES "Linux") @@ -106,9 +116,67 @@ if (CMAKE_SYSTEM_NAME MATCHES "Linux") elseif(CMAKE_Fortran_COMPILER_ID MATCHES "NVHPC") - message(STATUS "Using nvhpc flags") + message(STATUS "Using nvhpc flags") + + # Detect NVHPC SDK location - try standard paths, then NVCOMPILERS env var + if(DEFINED ENV{NVCOMPILERS}) + set(NVHPC_SDK_ROOT "$ENV{NVCOMPILERS}") + elseif(EXISTS "/opt/nvidia/hpc_sdk/Linux_x86_64") + file(GLOB NVHPC_VERSION_DIRS "/opt/nvidia/hpc_sdk/Linux_x86_64/[0-9]*.[0-9]*") + list(SORT NVHPC_VERSION_DIRS) + list(GET NVHPC_VERSION_DIRS -1 NVHPC_VERSION_DIR) + get_filename_component(NVHPC_SDK_ROOT "${NVHPC_VERSION_DIR}" DIRECTORY) + else() + set(NVHPC_SDK_ROOT "/opt/nvidia/hpc_sdk/Linux_x86_64/24.7") + endif() + message(STATUS "NVHPC SDK root: ${NVHPC_SDK_ROOT}") + +# Base flags: Fortran free-form preprocessor + set(CMAKE_Fortran_FLAGS "-Minfo=accel -Mpreprocess -Mbyteswapio") + # Only set C/CXX accel flags if using NVHPC C/C++ compiler + if(CMAKE_C_COMPILER_ID MATCHES "NVHPC") + set(CMAKE_C_FLAGS "-Minfo=accel") + set(CMAKE_CXX_FLAGS "-Minfo=accel") + endif() + + # Release: aggressive optimization with architecture-specific tuning + set(CMAKE_Fortran_FLAGS_RELEASE "-O3 -Minfo=accel -Mprefetch -pg") + if(CMAKE_C_COMPILER_ID MATCHES "NVHPC") + set(CMAKE_C_FLAGS_RELEASE "-O3 -Minfo=accel") + set(CMAKE_CXX_FLAGS_RELEASE "-O3 -Minfo=accel") + endif() + + # Debug: no optimization + bounds checking + set(CMAKE_Fortran_FLAGS_DEBUG "-g -C -Minfo=accel") + set(CMAKE_C_FLAGS_DEBUG "-g") + set(CMAKE_CXX_FLAGS_DEBUG "-g") + + # CUDA Fortran is enabled per-source (gpu_kernels_cuf.F90) to avoid impacting all Fortran units. + + # OpenACC requires CUDA runtime libraries at link time + set(NVHPC_COMPILER_LIB "${NVHPC_SDK_ROOT}/compilers/lib") + # Try CUDA 11.8 first (older SDKs), then 12.x, then 13.x + set(CUDA_LIB_DIR "") + set(NVVM_LIB_DIR "") + foreach(CUDA_VER "11.8" "12.0" "12.8" "13.1") + if(EXISTS "${NVHPC_SDK_ROOT}/cuda/${CUDA_VER}/lib64") + set(CUDA_LIB_DIR "${NVHPC_SDK_ROOT}/cuda/${CUDA_VER}/lib64") + set(NVVM_LIB_DIR "${NVHPC_SDK_ROOT}/cuda/${CUDA_VER}/nvvm/lib64") + break() + endif() + endforeach() + + if(EXISTS "${CUDA_LIB_DIR}") + link_directories("${CUDA_LIB_DIR}" "${NVVM_LIB_DIR}" "${NVHPC_COMPILER_LIB}") + list(APPEND CMAKE_EXE_LINKER_FLAGS "-L${CUDA_LIB_DIR} -L${NVVM_LIB_DIR} -L${NVHPC_COMPILER_LIB} -laccstub_static -lnvvm") + list(APPEND CMAKE_SHARED_LINKER_FLAGS "-L${CUDA_LIB_DIR} -L${NVVM_LIB_DIR} -L${NVHPC_COMPILER_LIB} -laccstub_static -lnvvm") + message(STATUS "NVHPC CUDA lib: ${CUDA_LIB_DIR}") + message(STATUS "NVHPC NVVM lib: ${NVVM_LIB_DIR}") + else() + message(WARNING "NVHPC CUDA toolkit not found at ${NVHPC_SDK_ROOT}/cuda/") + endif() - # TODO: Tune flags for NVHPC + # LAPACK from NVHPC SDK (already configured in set_precompiled_libraries.cmake) else() message(FATAL_ERROR "Unrecognized compiler:" ${CMAKE_Fortran_COMPILER_ID}) endif() @@ -249,6 +317,19 @@ if(SEMBA_FDTD_MAIN_LIB) "src_main_pub/resuming.F90" "src_main_pub/timestepping.F90" ) + if (SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + target_sources(semba-main PRIVATE + "src_main_pub/gpu_core_m.F90" + "src_main_pub/gpu_yee_m.F90" + "src_main_pub/gpu_cpml_m.F90" + "src_main_pub/gpu_mur_m.F90" + "src_main_pub/gpu_nf2ff_m.F90" + ) + set_source_files_properties("src_main_pub/gpu_core_m.F90" "src_main_pub/gpu_yee_m.F90" "src_main_pub/gpu_cpml_m.F90" "src_main_pub/gpu_mur_m.F90" "src_main_pub/gpu_nf2ff_m.F90" PROPERTIES COMPILE_OPTIONS "-cuda;-gpu=pinned") + target_link_options(semba-main PRIVATE -cuda) + target_link_libraries(semba-main semba-outputs) + # gpu_core_probe_m is embedded in gpu_core_m.F90 (append at EOF) + endif() target_link_libraries(semba-main semba-outputs ${SMBJSON_LIBRARIES} @@ -261,6 +342,11 @@ if (SEMBA_FDTD_EXECUTABLE) ) target_link_libraries(semba-fdtd semba-main semba-reports) target_link_libraries(semba-fdtd ${MPI_Fortran_LIBRARIES}) + + # Add CUDA Fortran flags to link options + if (SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + target_link_options(semba-fdtd PRIVATE -cuda) + endif() endif() diff --git a/CMakePresets.json b/CMakePresets.json index 3f068ff06..2c1f452a8 100644 --- a/CMakePresets.json +++ b/CMakePresets.json @@ -57,6 +57,34 @@ "cacheVariables": { "SEMBA_FDTD_ENABLE_MTLN": "OFF" } + }, + { + "name": "gpu-rls", + "generator": "Ninja", + "binaryDir": "build-gpu-rls/", + "environment": { + "NVCOMPILERS": "/home/luis/nvidia_sdk/Linux_x86_64/26.3", + "FC": "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/bin/nvfortran", + "CC": "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/bin/nvc", + "CXX": "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/bin/nvc++", + "CMAKE_Fortran_LINKER": "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/bin/nvfortran" + }, + "cacheVariables": { + "CMAKE_BUILD_TYPE": "Release", + "SEMBA_FDTD_ENABLE_MPI": "OFF", + "SEMBA_FDTD_ENABLE_MTLN": "ON", + "SEMBA_FDTD_ENABLE_HDF": "OFF", + "SEMBA_FDTD_ENABLE_DOUBLE_PRECISION": "OFF", + "SEMBA_FDTD_ENABLE_CUDA_FORTRAN": "ON" + } + }, + { + "name": "gpu-dbg", + "inherits": "gpu-rls", + "binaryDir": "build-gpu-dbg/", + "cacheVariables": { + "CMAKE_BUILD_TYPE": "Debug" + } } ] } diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 38f8a05c7..50d6b498c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -81,6 +81,28 @@ pytest test/ options, etc.), please describe the change clearly in your pull request. +## Style conventions + +These conventions apply to Fortran source code in the project. +Feel free to propose additions or changes via an issue. + +1. All names and comments must be in English. +2. Fortran keywords always lowercase. Example: `type` rather than `TYPE`. +3. No space before parenthesis when declaring variables. Example: + write `type(edge_t)` rather than `type (edge_t)`. +4. Fortran type names should end with `_t`. Example: prefer + `type(cell_t)` rather than `type(cell)`. +5. Fortran module names should end with `_m`. Example: prefer + `module mesh_m` rather than `module mesh`. +6. Prefer two-word endings. Example: prefer `end if` rather than `endif`. +7. Do not use Fortran keywords as variable names. Example: do not use + `size` as a variable name. +8. No space between `(` and the interior of functions or during + variable declaration. Example: prefer `integer(kind=8)` rather + than `integer ( kind = 8 )`. +9. All `parameters` must be named with uppercase letters. Example: + prefer `RKIND` to `rkind`. + ## Testing your changes Before opening a pull request: diff --git a/benchmarks/run_benchmarks.sh b/benchmarks/run_benchmarks.sh new file mode 100755 index 000000000..be8bb2883 --- /dev/null +++ b/benchmarks/run_benchmarks.sh @@ -0,0 +1,122 @@ +#!/usr/bin/env bash +# run_benchmarks.sh — Run semba-fdtd benchmarks and report results +# Usage: ./run_benchmarks.sh [cuda-fortran|cpu] +# cuda-forfan: uses build-cuf-prof/bin/semba-fdtd with SEMBA_FDTD_ENABLE_CUF_RUNTIME=1 +# cpu: uses build-rls/bin/semba-fdtd (no GPU) +# default: cuda-forfan + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +ROOT_DIR="$(dirname "$SCRIPT_DIR")" + +BINARY="${1:-cuda-forfan}" + +case "$BINARY" in + cuda-forfan) + SEMBA_BIN="$ROOT_DIR/build-cuf-prof/bin/semba-fdtd" + export SEMBA_FDTD_ENABLE_CUF_RUNTIME=1 + LABEL="CUDA-Fortran (RTX 5080)" + ;; + cpu) + SEMBA_BIN="$ROOT_DIR/build-cpu-rls/bin/semba-fdtd" + unset SEMBA_FDTD_ENABLE_CUF_RUNTIME + LABEL="CPU (48-core Zen 4)" + ;; + *) + echo "Usage: $0 [cuda-forfan|cpu]" + exit 1 + ;; +esac + +if [[ ! -x "$SEMBA_BIN" ]]; then + echo "ERROR: Binary not found: $SEMBA_BIN" + exit 1 +fi + +# Test cases: name, directory, input file, cells, steps +declare -a CASES=( + "nodalSource:testData/cases/nodalSource:nodalSource.fdtd.json:445K:9000" + "towelHanger:testData/cases/towelHanger:towelHanger.fdtd.json:216K:2000" + "multipleAssigments:testData/cases/multipleAssigments:multipleDielectricMaterial.fdtd.json:500K:500" + "sphere:testData/cases/sphere:sphere.fdtd.json:512K:100" +) + +echo "==============================================" +echo " SEMBA-FDTD Benchmark Suite" +echo "==============================================" +echo " Binary : $SEMBA_BIN" +echo " Label : $LABEL" +echo " GPU : ${SEMBA_FDTD_ENABLE_CUF_RUNTIME:-off}" +echo " Date : $(date '+%Y-%m-%d %H:%M:%S')" +echo "==============================================" +echo "" + +# Header +printf "%-20s %8s %8s %10s %8s\n" "Case" "Cells" "Steps" "Time (s)" "Speedup" +printf "%-20s %8s %8s %10s %8s\n" "--------------------" "--------" "--------" "----------" "--------" + +# CPU baseline times (from previous measurements) +declare -A CPU_BASELINE=( + ["nodalSource"]="35.20" + ["towelHanger"]="7.96" + ["multipleAssigments"]="2.32" + ["sphere"]="3.26" +) + +TOTAL_TIME=0 +TOTAL_STEPS=0 + +for entry in "${CASES[@]}"; do + IFS=':' read -r NAME DIR FILE CELLS STEPS <<< "$entry" + CASE_DIR="$ROOT_DIR/$DIR" + + if [[ ! -f "$CASE_DIR/$FILE" ]]; then + echo "SKIP: $NAME (input not found: $CASE_DIR/$FILE)" + continue + fi + + # Clean output files + rm -f "$CASE_DIR"/*.dat "$CASE_DIR"/*.bin "$CASE_DIR"/*.h5 "$CASE_DIR"/*.xdmf "$CASE_DIR"/*.vtk \ + "$CASE_DIR"/*.Report.txt "$CASE_DIR"/*.Warnings.txt "$CASE_DIR"/*.fdtd_Report.txt \ + "$CASE_DIR"/*.fdtd_tmpWarnings.txt "$CASE_DIR"/*.old 2>/dev/null || true + + # Run benchmark (3 iterations, take best) + BEST_TIME=999999 + for i in 1 2 3; do + start=$(date +%s%N) + (cd "$CASE_DIR" && "$SEMBA_BIN" -i "$FILE" > /dev/null 2>&1) + end=$(date +%s%N) + elapsed=$(echo "scale=3; ($end - $start) / 1000000000" | bc) + if (( $(echo "$elapsed < $BEST_TIME" | bc -l) )); then + BEST_TIME=$elapsed + fi + done + + TOTAL_TIME=$(echo "$TOTAL_TIME + $BEST_TIME" | bc) + TOTAL_STEPS=$((TOTAL_STEPS + STEPS)) + + # Calculate speedup vs CPU baseline + BASELINE="${CPU_BASELINE[$NAME]:-0}" + if (( $(echo "$BASELINE > 0" | bc -l) )); then + SPEEDUP=$(echo "scale=2; $BASELINE / $BEST_TIME" | bc) + else + SPEEDUP="N/A" + fi + + printf "%-20s %8s %8s %10s %8s\n" "$NAME" "$CELLS" "$STEPS" "$BEST_TIME" "$SPEEDUP"x + +done + +echo "" +echo "----------------------------------------------" +echo " Total time : ${TOTAL_TIME}s" +echo " Total steps: $TOTAL_STEPS" +echo "----------------------------------------------" +echo "" +echo "Baseline times (CPU, 48-core Zen 4):" +for name in "${!CPU_BASELINE[@]}"; do + echo " $name: ${CPU_BASELINE[$name]}s" +done +echo "" +echo "==============================================" diff --git a/current_lightning.exc b/current_lightning.exc new file mode 120000 index 000000000..cb9f6b34c --- /dev/null +++ b/current_lightning.exc @@ -0,0 +1 @@ +/home/luis/ugrfdtd/publico/tmp_cases/cybonera/current_lightning.exc \ No newline at end of file diff --git a/gauss.exc b/gauss.exc new file mode 120000 index 000000000..57a08776a --- /dev/null +++ b/gauss.exc @@ -0,0 +1 @@ +/home/luis/ugrfdtd/publico/tmp_cases/sphere/gauss.exc \ No newline at end of file diff --git a/nodalPredefinedExcitation.1.exc b/nodalPredefinedExcitation.1.exc new file mode 120000 index 000000000..f23cc85dd --- /dev/null +++ b/nodalPredefinedExcitation.1.exc @@ -0,0 +1 @@ +/home/luis/ugrfdtd/publico/tmp_cases/nodalSource/predefinedExcitation.1.exc \ No newline at end of file diff --git a/opencode.json b/opencode.json new file mode 100644 index 000000000..e642efd5a --- /dev/null +++ b/opencode.json @@ -0,0 +1,20 @@ +{ + "$schema": "https://opencode.ai/config.json", + "agent": { + "tester": { + "description": "Runs the full semba-fdtd test suite (unit + integration tests), builds if needed, and produces a JSON report + terminal summary", + "mode": "subagent", + "permission": { + "bash": "allow", + "read": "allow", + "write": "allow", + "edit": "deny", + "glob": "allow", + "grep": "allow", + "list": "allow", + "task": "deny", + "external_directory": "allow" + } + } + } +} diff --git a/predefinedExcitation.1.exc b/predefinedExcitation.1.exc new file mode 120000 index 000000000..b901a3bc2 --- /dev/null +++ b/predefinedExcitation.1.exc @@ -0,0 +1 @@ +/home/luis/ugrfdtd/publico/tmp_cases/multipleAssigments/predefinedExcitation.1.exc \ No newline at end of file diff --git a/scripts/tester.py b/scripts/tester.py new file mode 100755 index 000000000..78ed17f48 --- /dev/null +++ b/scripts/tester.py @@ -0,0 +1,280 @@ +#!/usr/bin/env python3 +"""semba-fdtd test runner — builds, runs unit + integration tests, produces JSON report + terminal summary.""" + +import subprocess +import sys +import os +import json +import re +import time +from datetime import datetime +from pathlib import Path + +ROOT = Path(__file__).resolve().parent.parent +BUILD_DIR = ROOT / "build" +TMP_DIR = ROOT / "tmp" +BIN = BUILD_DIR / "bin" / "semba-fdtd" +UNIT_TESTS = BUILD_DIR / "bin" / "fdtd_tests" + + +def run(cmd, cwd=None, env=None, capture=True): + """Run a command, return (exit_code, stdout, stderr).""" + full_env = os.environ.copy() + if env: + full_env.update(env) + result = subprocess.run( + cmd, + cwd=cwd or str(ROOT), + env=full_env, + capture_output=capture, + text=True, + ) + return result.returncode, result.stdout, result.stderr + + +def detect_cmake_flags(): + """Read CMakeCache.txt to detect which features were enabled.""" + cache = BUILD_DIR / "CMakeCache.txt" + flags = {} + if not cache.exists(): + return {"MPI": "OFF", "MTLN": "OFF", "HDF": "OFF", "SMBJSON": "OFF"} + content = cache.read_text() + for key, short in [ + ("SEMBA_FDTD_ENABLE_MPI", "MPI"), + ("SEMBA_FDTD_ENABLE_MTLN", "MTLN"), + ("SEMBA_FDTD_ENABLE_HDF", "HDF"), + ("SEMBA_FDTD_ENABLE_SMBJSON", "SMBJSON"), + ]: + for line in content.splitlines(): + if line.startswith(key + ":"): + val = line.split("=", 1)[1].strip().strip('"') + flags[short] = val + break + else: + flags[short] = "OFF" + return flags + + +def parse_git_commit(): + rc, out, _ = run(["git", "rev-parse", "--short", "HEAD"], cwd=str(ROOT)) + if rc == 0: + return out.strip() + return "unknown" + + +def count_pattern(text, pattern): + return len(re.findall(pattern, text)) + + +def run_build(cmake_flags=None): + """Configure and build the project. Returns (success, duration_sec).""" + if cmake_flags is None: + cmake_flags = {} + + # Detect existing flags from cache, override with user-provided + existing = detect_cmake_flags() if BUILD_DIR.exists() else {} + combined = {**existing, **cmake_flags} + + cmake_cmd = [ + "cmake", "-S", ".", "-B", "build", + "-DCMAKE_BUILD_TYPE=Release", + ] + flag_map = { + "MPI": "SEMBA_FDTD_ENABLE_MPI", + "MTLN": "SEMBA_FDTD_ENABLE_MTLN", + "HDF": "SEMBA_FDTD_ENABLE_HDF", + "SMBJSON": "SEMBA_FDTD_ENABLE_SMBJSON", + } + for short, cmake_var in flag_map.items(): + if short in combined: + cmake_cmd.append(f"-D{cmake_var}={combined[short]}") + + print("=" * 60) + print("BUILD") + print("=" * 60) + print(f" Command: {' '.join(cmake_cmd)}") + + t0 = time.time() + rc, out, err = run(cmake_cmd) + if rc != 0: + print(f" cmake configure: FAIL (exit {rc})") + return False, time.time() - t0 + print(f" cmake configure: OK") + + build_cmd = ["cmake", "--build", "build", "-j"] + rc, out, err = run(build_cmd) + duration = time.time() - t0 + + if rc != 0: + print(f" cmake build: FAIL (exit {rc})") + return False, duration + + # Count compiled targets from build output + lines = out.splitlines() + targets = count_pattern(out, r"Building [CF]+ object") + print(f" cmake build: OK ({targets} targets, {duration:.1f}s)") + return True, duration, combined + + +def run_unit_tests(): + """Run GoogleTest suite. Returns (success, duration_sec, output_lines).""" + print() + print("=" * 60) + print("UNIT TESTS (GoogleTest)") + print("=" * 60) + + if not UNIT_TESTS.exists(): + print(f" SKIP: {UNIT_TESTS} not found (build first)") + return False, 0, 0 + + t0 = time.time() + rc, out, err = run([str(UNIT_TESTS)]) + duration = time.time() - t0 + lines = out.splitlines() if out else [] + + # Parse GoogleTest summary + passed = count_pattern(out, r"^\[ PASSED \] [0-9]+ test") + failed = count_pattern(out, r"^\[ FAILED \] [0-9]+ test") + total = passed + failed + + # Also count individual test results + passed_tests = count_pattern(out, r"\[ PASSED \]") + failed_tests = count_pattern(out, r"\[ FAILED \]") + + print(f" Binary: {UNIT_TESTS}") + print(f" Duration: {duration:.1f}s") + print(f" Output lines: {len(lines)}") + print(f" Result: {'PASS' if rc == 0 else 'FAIL'} (exit {rc})") + + return rc == 0, duration, len(lines) + + +def run_integration_tests(env_vars=None): + """Run pytest integration tests. Returns (success, duration_sec, output_lines).""" + print() + print("=" * 60) + print("INTEGRATION TESTS (pytest)") + print("=" * 60) + + pytest_env = os.environ.copy() + if env_vars: + pytest_env.update(env_vars) + + t0 = time.time() + rc, out, err = run( + [sys.executable or "python3", "-m", "pytest", "test/", "--durations=20", "-v"], + env=pytest_env, + ) + duration = time.time() - t0 + lines = out.splitlines() if out else [] + + # Parse pytest summary + passed = count_pattern(out, r" passed") + failed = count_pattern(out, r" failed") + skipped = count_pattern(out, r" skipped") + + print(f" Duration: {duration:.1f}s") + print(f" Output lines: {len(lines)}") + print(f" Result: {'PASS' if rc == 0 else 'FAIL'} (exit {rc})") + + return rc == 0, duration, len(lines) + + +def print_summary(build_ok, build_dur, unit_ok, unit_dur, unit_lines, + integ_ok, integ_dur, integ_lines, cmake_flags): + """Print a formatted terminal summary.""" + print() + print("=" * 60) + print("TEST REPORT SUMMARY") + print("=" * 60) + print(f" {'Stage':<25} {'Status':<8} {'Duration':<12} {'Details'}") + print(f" {'─' * 25} {'─' * 8} {'─' * 12} {'─' * 20}") + + build_status = "PASS" if build_ok else "FAIL" + print(f" {'Build':<25} {build_status:<8} {build_dur:.1f}s{'':<6}") + + unit_status = "PASS" if unit_ok else "FAIL" + print(f" {'Unit Tests':<25} {unit_status:<8} {unit_dur:.1f}s {unit_lines} lines") + + integ_status = "PASS" if integ_ok else "FAIL" + print(f" {'Integration Tests':<25} {integ_status:<8} {integ_dur:.1f}s {integ_lines} lines") + + print(f" {'─' * 25} {'─' * 8} {'─' * 12} {'─' * 20}") + + overall = "PASS" if (build_ok and unit_ok and integ_ok) else "FAIL" + print(f" {'OVERALL':<25} {overall:<8}") + print("=" * 60) + + +def write_report(build_ok, build_dur, unit_ok, unit_dur, unit_lines, + integ_ok, integ_dur, integ_lines, cmake_flags): + """Write JSON report to tmp/test_report_.json.""" + TMP_DIR.mkdir(exist_ok=True) + ts = datetime.now().strftime("%Y%m%d_%H%M%S") + report = { + "timestamp": datetime.now().isoformat(), + "commit": parse_git_commit(), + "build_type": "Release", + "cmake_flags": cmake_flags, + "build": {"success": build_ok, "duration_sec": round(build_dur, 2)}, + "unit_tests": {"success": unit_ok, "duration_sec": round(unit_dur, 2), + "output_lines": unit_lines}, + "integration_tests": {"success": integ_ok, "duration_sec": round(integ_dur, 2), + "output_lines": integ_lines}, + "overall": "pass" if (build_ok and unit_ok and integ_ok) else "fail", + } + report_path = TMP_DIR / f"test_report_{ts}.json" + report_path.write_text(json.dumps(report, indent=2)) + print(f"\n JSON report: {report_path}") + return report + + +def main(): + # Accept optional cmake flags as arguments: e.g., --MPI=OFF --MTLN=OFF + cmake_flags = {} + i = 1 + while i < len(sys.argv): + arg = sys.argv[i] + if arg.startswith("--"): + key_val = arg[2:] + if "=" in key_val: + k, v = key_val.split("=", 1) + cmake_flags[k] = v + i += 1 + + # Build + build_result = run_build(cmake_flags) + if len(build_result) == 3: + build_ok, build_dur, cmake_flags = build_result + else: + build_ok, build_dur = build_result[0], build_result[1] + cmake_flags = detect_cmake_flags() + + if not build_ok: + print("\nBuild failed. Skipping tests.") + unit_ok, unit_dur, unit_lines = False, 0, 0 + integ_ok, integ_dur, integ_lines = False, 0, 0 + else: + # Unit tests + unit_result = run_unit_tests() + unit_ok, unit_dur, unit_lines = unit_result + + # Integration tests + env_vars = {f"SEMBA_FDTD_ENABLE_{k}": v for k, v in cmake_flags.items() + if k in ("MPI", "MTLN", "HDF")} + integ_result = run_integration_tests(env_vars) + integ_ok, integ_dur, integ_lines = integ_result + + # Summary + report + print_summary(build_ok, build_dur, unit_ok, unit_dur, unit_lines, + integ_ok, integ_dur, integ_lines, cmake_flags) + write_report(build_ok, build_dur, unit_ok, unit_dur, unit_lines, + integ_ok, integ_dur, integ_lines, cmake_flags) + + # Exit code + all_ok = build_ok and unit_ok and integ_ok + sys.exit(0 if all_ok else 1) + + +if __name__ == "__main__": + main() diff --git a/set_precompiled_libraries.cmake b/set_precompiled_libraries.cmake index d56be03ca..eb5d8967c 100644 --- a/set_precompiled_libraries.cmake +++ b/set_precompiled_libraries.cmake @@ -6,8 +6,8 @@ if (SEMBA_FDTD_ENABLE_MTLN) set(LAPACK_DIR "${PRECOMPILED_DIR}/linux-gcc/lapack/") set(LAPACK_LIB ${LAPACK_DIR}liblapack.a) set(BLAS_LIB ${LAPACK_DIR}libblas.a) - elseif (${CMAKE_Fortran_COMPILER_ID} STREQUAL "NVHPC") - set(LAPACK_DIR "/opt/nvidia/hpc_sdk/Linux_x86_64/24.5/compilers/lib/") + elseif (${CMAKE_Fortran_COMPILER_ID} STREQUAL "NVHPC") + set(LAPACK_DIR "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/lib/") set(LAPACK_LIB ${LAPACK_DIR}liblapack.a) set(BLAS_LIB ${LAPACK_DIR}libblas.a) elseif (${CMAKE_Fortran_COMPILER_ID} STREQUAL "IntelLLVM") diff --git a/src_main_pub/borderscpml.F90 b/src_main_pub/borderscpml.F90 index 361651eb4..fc076e2f0 100755 --- a/src_main_pub/borderscpml.F90 +++ b/src_main_pub/borderscpml.F90 @@ -58,7 +58,9 @@ module BORDERS_CPML_m !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ! public :: InitCPMLBorders, AdvanceelectricCPML,AdvanceMagneticCPML,StoreFieldsCPMLBorders,DestroyCPMLBorders,AdvanceelectricCPML_freespace,AdvanceMagneticCPML_freespace - public :: calc_cpmlconstants + public :: calc_cpmlconstants + public :: PMLc + public :: P_be_x, P_ce_x, P_bm_x, P_cm_x, P_be_y, P_ce_y, P_bm_y, P_cm_y, P_be_z, P_ce_z, P_bm_z, P_cm_z !!!public :: FreeSpace_AdvanceMagneticCPML,calc_cpmlconstants contains @@ -550,7 +552,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = left #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION) k_m = k - b%Ex%ZI @@ -570,7 +572,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION) k_m = k - b%Ez%ZI @@ -594,7 +596,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = right #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION) k_m = k - b%Ex%ZI @@ -614,7 +616,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION) k_m = k - b%Ez%ZI @@ -640,7 +642,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = down #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION) k_m = k - b%Ey%ZI @@ -659,7 +661,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION) k_m = k - b%Ex%ZI @@ -684,7 +686,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = up #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION) k_m = k - b%Ey%ZI @@ -703,7 +705,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION) k_m = k - b%Ex%ZI @@ -730,7 +732,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = back #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION) k_m = k - b%Ez%ZI @@ -749,7 +751,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION) k_m = k - b%Ey%ZI @@ -774,7 +776,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = front #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION) k_m = k - b%Ez%ZI @@ -793,7 +795,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION) k_m = k - b%Ey%ZI @@ -848,7 +850,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = left #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION) k_m = k - b%Hx%ZI @@ -868,7 +870,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION) k_m = k - b%Hz%ZI @@ -893,7 +895,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, REGION = right #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION) k_m = k - b%Hx%ZI @@ -913,7 +915,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION) k_m = k - b%Hz%ZI @@ -938,7 +940,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = down #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION) k_m = k - b%Hy%ZI @@ -958,7 +960,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION) k_m = k - b%Hx%ZI @@ -983,7 +985,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = up #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION) k_m = k - b%Hy%ZI @@ -1003,7 +1005,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION) k_m = k - b%Hx%ZI @@ -1028,7 +1030,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION=back #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION) k_m = k - b%Hz%ZI @@ -1048,7 +1050,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION) k_m = k - b%Hy%ZI @@ -1072,7 +1074,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION=front #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION) k_m = k - b%Hz%ZI @@ -1092,7 +1094,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx, !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION) k_m = k - b%Hy%ZI @@ -1705,7 +1707,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = left #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION) k_m = k - b%Ex%ZI @@ -1725,7 +1727,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION) k_m = k - b%Ez%ZI @@ -1749,7 +1751,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = right #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION) k_m = k - b%Ex%ZI @@ -1769,7 +1771,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION) k_m = k - b%Ez%ZI @@ -1795,7 +1797,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = down #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION) k_m = k - b%Ey%ZI @@ -1814,7 +1816,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION) k_m = k - b%Ex%ZI @@ -1839,7 +1841,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = up #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION) k_m = k - b%Ey%ZI @@ -1858,7 +1860,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION) k_m = k - b%Ex%ZI @@ -1885,7 +1887,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = back #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION) k_m = k - b%Ez%ZI @@ -1904,7 +1906,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION) k_m = k - b%Ey%ZI @@ -1929,7 +1931,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = front #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION) k_m = k - b%Ez%ZI @@ -1948,7 +1950,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION) k_m = k - b%Ey%ZI @@ -2003,7 +2005,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = left #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION) k_m = k - b%Hx%ZI @@ -2023,7 +2025,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION) k_m = k - b%Hz%ZI @@ -2048,7 +2050,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz REGION = right #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION) k_m = k - b%Hx%ZI @@ -2068,7 +2070,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION) k_m = k - b%Hz%ZI @@ -2093,7 +2095,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = down #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION) k_m = k - b%Hy%ZI @@ -2113,7 +2115,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION) k_m = k - b%Hx%ZI @@ -2138,7 +2140,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION = up #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION) k_m = k - b%Hy%ZI @@ -2158,7 +2160,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION) k_m = k - b%Hx%ZI @@ -2183,7 +2185,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION=back #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION) k_m = k - b%Hz%ZI @@ -2203,7 +2205,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION) k_m = k - b%Hy%ZI @@ -2227,7 +2229,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! REGION=front #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION) k_m = k - b%Hz%ZI @@ -2247,7 +2249,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz !$OMP END PARALLEL DO #endif #ifdef CompileWithOpenMP -!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio) +!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio) #endif do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION) k_m = k - b%Hy%ZI diff --git a/src_main_pub/bordersmur.F90 b/src_main_pub/bordersmur.F90 index 1ea1cfd7e..1b47070cb 100755 --- a/src_main_pub/bordersmur.F90 +++ b/src_main_pub/bordersmur.F90 @@ -47,12 +47,27 @@ module BORDERS_MUR_m !!! ! public :: InitMURBorders, AdvanceMagneticMUR,StoreFieldsMURBorders,DestroyMURBorders,calc_murconstants - + public :: get_mur_limits + public :: regLR, regDU, regBF + public :: left_CAB1, left_CAB3, left_cab4, right_CAB1, right_CAB3, right_cab4 + public :: down_CAB1, down_CAB3, down_cab4, up_CAB1, up_CAB3, up_cab4 + public :: back_CAB1, back_CAB3, back_cab4, front_CAB1, front_CAB3, front_cab4 contains - !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! - !!! Initializes MUR data + subroutine get_mur_limits(field, bound, xi, xe, yi, ye, zi, ze) + integer(kind=4), intent(in) :: field, bound + integer(kind=4), intent(out) :: xi, xe, yi, ye, zi, ze + xi = MURc(field)%XI(bound) + xe = MURc(field)%XE(bound) + yi = MURc(field)%YI(bound) + ye = MURc(field)%YE(bound) + zi = MURc(field)%ZI(bound) + ze = MURc(field)%ZE(bound) + end subroutine get_mur_limits + + !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + !!! Initializes MUR data !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! subroutine InitMURBorders(sgg,ThereAreMURBorders,resume,Idxh,Idyh,Idzh,eps00,mu00) real(kind=RKIND) :: eps00,mu00 diff --git a/src_main_pub/farfield.F90 b/src_main_pub/farfield.F90 index d3e23f2dc..ff645942d 100755 --- a/src_main_pub/farfield.F90 +++ b/src_main_pub/farfield.F90 @@ -71,7 +71,7 @@ module farfield_m !!! ! public UpdateFarField,InitFarField,Destroyfarfield,FlushFarfield,StoreFarfields - public farfield_t + public farfield_t, FF ! type(farfield_t), save, target :: FF diff --git a/src_main_pub/gpu_core_m.F90 b/src_main_pub/gpu_core_m.F90 new file mode 100644 index 000000000..478189726 --- /dev/null +++ b/src_main_pub/gpu_core_m.F90 @@ -0,0 +1,2107 @@ +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +! GPU CORE MODULE - CUDA Fortran (CUF) +! gpu_state_t type definition + init/upload/download/destroy +! Split from gpu_kernels_cuf.F90 to avoid NVHPC compiler file-size limit. +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +module gpu_core_m + + use FDETYPES_m + use Report_m + use cudafor + + implicit none + + type gpu_state_t + ! Host-side pointers (same as original) + real(kind=rkind), pointer, dimension(:,:,:), contiguous :: Ex, Ey, Ez, Hx, Hy, Hz + real(kind=rkind), pointer, dimension(:), contiguous :: Idxe, Idye, Idze, Idxh, Idyh, Idzh, dxe, dye, dze, dxh, dyh, dzh + integer(kind=integersizeofmediamatrices), pointer, dimension(:,:,:), contiguous :: sggMiEx, sggMiEy, sggMiEz, sggMiHx, sggMiHy, sggMiHz + real(kind=rkind), pointer, dimension(:), contiguous :: g1, g2, gm1, gm2 + + ! Persistent device memory buffers - YEE fields + real(kind=rkind), pointer, device, dimension(:,:,:) :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d + real(kind=rkind), pointer, device, dimension(:) :: Idxe_d, Idye_d, Idze_d, Idxh_d, Idyh_d, Idzh_d, dxe_d, dye_d, dze_d, dxh_d, dyh_d, dzh_d + integer(kind=integersizeofmediamatrices), pointer, device, dimension(:,:,:) :: sggMiEx_d, sggMiEy_d, sggMiEz_d, sggMiHx_d, sggMiHy_d, sggMiHz_d + real(kind=rkind), pointer, device, dimension(:) :: g1_d, g2_d, gm1_d, gm2_d + + ! Dimensions for device memory + integer(kind=4) :: Ex_nx, Ex_ny, Ex_nz, Ey_nx, Ey_ny, Ey_nz, Ez_nx, Ez_ny, Ez_nz + integer(kind=4) :: Hx_nx, Hx_ny, Hx_nz, Hy_nx, Hy_ny, Hy_nz, Hz_nx, Hz_ny, Hz_nz + integer(kind=4) :: Idxe_n, Idye_n, Idze_n, Idxh_n, Idyh_n, Idzh_n + integer(kind=4) :: dxe_n, dye_n, dze_n, dxh_n, dyh_n, dzh_n + integer(kind=4) :: g1_n, g2_n, gm1_n, gm2_n + integer(kind=4) :: sggMi_nx, sggMi_ny, sggMi_nz + + ! CPML left boundary - persistent device arrays + ! 4 psi arrays for left boundary (same global bounds as host arrays) + integer(kind=4) :: pml_left_Exy_nx, pml_left_Exy_ny, pml_left_Exy_nz + integer(kind=4) :: pml_left_Ezy_nx, pml_left_Ezy_ny, pml_left_Ezy_nz + integer(kind=4) :: pml_left_Hxy_nx, pml_left_Hxy_ny, pml_left_Hxy_nz + integer(kind=4) :: pml_left_Hzy_nx, pml_left_Hzy_ny, pml_left_Hzy_nz + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Exy_left + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Ezy_left + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hxy_left + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hzy_left + + ! CPML left boundary coefficients (device, updated every step) + integer(kind=4) :: pml_coeff_y_n + real(kind=rkind), pointer, device, dimension(:) :: pml_P_be_y_left + real(kind=rkind), pointer, device, dimension(:) :: pml_P_ce_y_left + real(kind=rkind), pointer, device, dimension(:) :: pml_P_bm_y_left + real(kind=rkind), pointer, device, dimension(:) :: pml_P_cm_y_left + + ! Left boundary PML limits (device) + integer(kind=4) :: pml_left_Ex_ii, pml_left_Ex_ij, pml_left_Ex_ji, pml_left_Ex_jj, pml_left_Ex_ki, pml_left_Ex_kj + integer(kind=4) :: pml_left_Ez_ii, pml_left_Ez_ij, pml_left_Ez_ji, pml_left_Ez_jj, pml_left_Ez_ki, pml_left_Ez_kj + integer(kind=4) :: pml_left_Hx_ii, pml_left_Hx_ij, pml_left_Hx_ji, pml_left_Hx_jj, pml_left_Hx_ki, pml_left_Hx_kj + integer(kind=4) :: pml_left_Hz_ii, pml_left_Hz_ij, pml_left_Hz_ji, pml_left_Hz_jj, pml_left_Hz_ki, pml_left_Hz_kj + + ! CPML right boundary - same structure as left + integer(kind=4) :: pml_right_Ex_ii, pml_right_Ex_ij, pml_right_Ex_ji, pml_right_Ex_jj, pml_right_Ex_ki, pml_right_Ex_kj + integer(kind=4) :: pml_right_Ez_ii, pml_right_Ez_ij, pml_right_Ez_ji, pml_right_Ez_jj, pml_right_Ez_ki, pml_right_Ez_kj + integer(kind=4) :: pml_right_Hx_ii, pml_right_Hx_ij, pml_right_Hx_ji, pml_right_Hx_jj, pml_right_Hx_ki, pml_right_Hx_kj + integer(kind=4) :: pml_right_Hz_ii, pml_right_Hz_ij, pml_right_Hz_ji, pml_right_Hz_jj, pml_right_Hz_ki, pml_right_Hz_kj + + ! CPML down/up boundary - z-dependent coefficients + integer(kind=4) :: pml_down_Ex_ii, pml_down_Ex_ij, pml_down_Ex_ji, pml_down_Ex_jj, pml_down_Ex_ki, pml_down_Ex_kj + integer(kind=4) :: pml_down_Ey_ii, pml_down_Ey_ij, pml_down_Ey_ji, pml_down_Ey_jj, pml_down_Ey_ki, pml_down_Ey_kj + integer(kind=4) :: pml_down_Hx_ii, pml_down_Hx_ij, pml_down_Hx_ji, pml_down_Hx_jj, pml_down_Hx_ki, pml_down_Hx_kj + integer(kind=4) :: pml_down_Hy_ii, pml_down_Hy_ij, pml_down_Hy_ji, pml_down_Hy_jj, pml_down_Hy_ki, pml_down_Hy_kj + integer(kind=4) :: pml_down_Exz_nx, pml_down_Exz_ny, pml_down_Exz_nz + integer(kind=4) :: pml_down_Eyz_nx, pml_down_Eyz_ny, pml_down_Eyz_nz + integer(kind=4) :: pml_down_Hxz_nx, pml_down_Hxz_ny, pml_down_Hxz_nz + integer(kind=4) :: pml_down_Hyz_nx, pml_down_Hyz_ny, pml_down_Hyz_nz + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Exz_down + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Eyz_down + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hxz_down + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hyz_down + integer(kind=4) :: pml_coeff_z_n + real(kind=rkind), pointer, device, dimension(:) :: pml_P_be_z_down + real(kind=rkind), pointer, device, dimension(:) :: pml_P_ce_z_down + real(kind=rkind), pointer, device, dimension(:) :: pml_P_bm_z_down + real(kind=rkind), pointer, device, dimension(:) :: pml_P_cm_z_down + integer(kind=4) :: pml_up_Ex_ii, pml_up_Ex_ij, pml_up_Ex_ji, pml_up_Ex_jj, pml_up_Ex_ki, pml_up_Ex_kj + integer(kind=4) :: pml_up_Ey_ii, pml_up_Ey_ij, pml_up_Ey_ji, pml_up_Ey_jj, pml_up_Ey_ki, pml_up_Ey_kj + integer(kind=4) :: pml_up_Hx_ii, pml_up_Hx_ij, pml_up_Hx_ji, pml_up_Hx_jj, pml_up_Hx_ki, pml_up_Hx_kj + integer(kind=4) :: pml_up_Hy_ii, pml_up_Hy_ij, pml_up_Hy_ji, pml_up_Hy_jj, pml_up_Hy_ki, pml_up_Hy_kj + + ! CPML back/front boundary - x-dependent coefficients + integer(kind=4) :: pml_back_Ez_ii, pml_back_Ez_ij, pml_back_Ez_ji, pml_back_Ez_jj, pml_back_Ez_ki, pml_back_Ez_kj + integer(kind=4) :: pml_back_Ey_ii, pml_back_Ey_ij, pml_back_Ey_ji, pml_back_Ey_jj, pml_back_Ey_ki, pml_back_Ey_kj + integer(kind=4) :: pml_back_Hz_ii, pml_back_Hz_ij, pml_back_Hz_ji, pml_back_Hz_jj, pml_back_Hz_ki, pml_back_Hz_kj + integer(kind=4) :: pml_back_Hy_ii, pml_back_Hy_ij, pml_back_Hy_ji, pml_back_Hy_jj, pml_back_Hy_ki, pml_back_Hy_kj + integer(kind=4) :: pml_back_Ezx_nx, pml_back_Ezx_ny, pml_back_Ezx_nz + integer(kind=4) :: pml_back_Eyx_nx, pml_back_Eyx_ny, pml_back_Eyx_nz + integer(kind=4) :: pml_back_Hzx_nx, pml_back_Hzx_ny, pml_back_Hzx_nz + integer(kind=4) :: pml_back_Hyx_nx, pml_back_Hyx_ny, pml_back_Hyx_nz + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Ezx_back + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Eyx_back + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hzx_back + real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hyx_back + integer(kind=4) :: pml_coeff_x_n + real(kind=rkind), pointer, device, dimension(:) :: pml_P_be_x_back + real(kind=rkind), pointer, device, dimension(:) :: pml_P_ce_x_back + real(kind=rkind), pointer, device, dimension(:) :: pml_P_bm_x_back + real(kind=rkind), pointer, device, dimension(:) :: pml_P_cm_x_back + integer(kind=4) :: pml_front_Ez_ii, pml_front_Ez_ij, pml_front_Ez_ji, pml_front_Ez_jj, pml_front_Ez_ki, pml_front_Ez_kj + integer(kind=4) :: pml_front_Ey_ii, pml_front_Ey_ij, pml_front_Ey_ji, pml_front_Ey_jj, pml_front_Ey_ki, pml_front_Ey_kj + integer(kind=4) :: pml_front_Hz_ii, pml_front_Hz_ij, pml_front_Hz_ji, pml_front_Hz_jj, pml_front_Hz_ki, pml_front_Hz_kj + integer(kind=4) :: pml_front_Hy_ii, pml_front_Hy_ij, pml_front_Hy_ji, pml_front_Hy_jj, pml_front_Hy_ki, pml_front_Hy_kj + + ! MUR boundary - persistent device coefficient arrays (per media) + integer(kind=4) :: mur_numMedia + real(kind=rkind), pointer, device, dimension(:) :: mur_left_CAB1, mur_left_CAB3, mur_left_cab4 + real(kind=rkind), pointer, device, dimension(:) :: mur_right_CAB1, mur_right_CAB3, mur_right_cab4 + real(kind=rkind), pointer, device, dimension(:) :: mur_down_CAB1, mur_down_CAB3, mur_down_cab4 + real(kind=rkind), pointer, device, dimension(:) :: mur_up_CAB1, mur_up_CAB3, mur_up_cab4 + real(kind=rkind), pointer, device, dimension(:) :: mur_back_CAB1, mur_back_CAB3, mur_back_cab4 + real(kind=rkind), pointer, device, dimension(:) :: mur_front_CAB1, mur_front_CAB3, mur_front_cab4 + + ! MUR CAB1 coefficients — 1D flattened: mur_cab1_d(12 * numMedia) + ! Layout: cab1_d((bound_id-1)*numMedia + medio) for contiguous access + real(kind=rkind), pointer, device, dimension(:) :: mur_cab1_d + + ! MUR boundary - persistent device past-field arrays + real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hx_left, mur_past_Hz_left + real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hx_right, mur_past_Hz_right + real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hy_down, mur_past_Hx_down + real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hy_up, mur_past_Hx_up + real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hz_back, mur_past_Hy_back + real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hz_front, mur_past_Hy_front + + ! MUR boundary limits — flattened array: mur_bounds(12, 6) + ! Boundary IDs: 1=left-Hx, 2=left-Hz, 3=right-Hx, 4=right-Hz, + ! 5=down-Hy, 6=down-Hx, 7=up-Hy, 8=up-Hx, + ! 9=back-Hz, 10=back-Hy, 11=front-Hz, 12=front-Hy + ! Column indices: 1=ii, 2=ij, 3=ji, 4=jj, 5=ki, 6=kj + integer(kind=4), pointer, device, dimension(:,:) :: mur_bounds_d + + ! Individual limit members kept for backward compat with existing kernels + integer(kind=4) :: mur_left_Hx_ii, mur_left_Hx_ij, mur_left_Hx_ji, mur_left_Hx_jj, mur_left_Hx_ki, mur_left_Hx_kj + integer(kind=4) :: mur_left_Hz_ii, mur_left_Hz_ij, mur_left_Hz_ji, mur_left_Hz_jj, mur_left_Hz_ki, mur_left_Hz_kj + integer(kind=4) :: mur_right_Hx_ii, mur_right_Hx_ij, mur_right_Hx_ji, mur_right_Hx_jj, mur_right_Hx_ki, mur_right_Hx_kj + integer(kind=4) :: mur_right_Hz_ii, mur_right_Hz_ij, mur_right_Hz_ji, mur_right_Hz_jj, mur_right_Hz_ki, mur_right_Hz_kj + integer(kind=4) :: mur_down_Hy_ii, mur_down_Hy_ij, mur_down_Hy_ji, mur_down_Hy_jj, mur_down_Hy_ki, mur_down_Hy_kj + integer(kind=4) :: mur_down_Hx_ii, mur_down_Hx_ij, mur_down_Hx_ji, mur_down_Hx_jj, mur_down_Hx_ki, mur_down_Hx_kj + integer(kind=4) :: mur_up_Hy_ii, mur_up_Hy_ij, mur_up_Hy_ji, mur_up_Hy_jj, mur_up_Hy_ki, mur_up_Hy_kj + integer(kind=4) :: mur_up_Hx_ii, mur_up_Hx_ij, mur_up_Hx_ji, mur_up_Hx_jj, mur_up_Hx_ki, mur_up_Hx_kj + integer(kind=4) :: mur_back_Hz_ii, mur_back_Hz_ij, mur_back_Hz_ji, mur_back_Hz_jj, mur_back_Hz_ki, mur_back_Hz_kj + integer(kind=4) :: mur_back_Hy_ii, mur_back_Hy_ij, mur_back_Hy_ji, mur_back_Hy_jj, mur_back_Hy_ki, mur_back_Hy_kj + integer(kind=4) :: mur_front_Hz_ii, mur_front_Hz_ij, mur_front_Hz_ji, mur_front_Hz_jj, mur_front_Hz_ki, mur_front_Hz_kj + integer(kind=4) :: mur_front_Hy_ii, mur_front_Hy_ij, mur_front_Hy_ji, mur_front_Hy_jj, mur_front_Hy_ki, mur_front_Hy_kj + + ! MUR flags + logical :: mur_initialized + + ! Fused kernel execution flags to avoid redundant launches + logical :: gpu_e_fused_launched = .false. + logical :: gpu_h_fused_launched = .false. + + ! Flags + logical :: initialized = .false. + logical :: fields_on_device = .false. + logical :: pml_left_initialized = .false. + logical :: pml_right_initialized = .false. + logical :: pml_down_initialized = .false. + logical :: pml_up_initialized = .false. + logical :: pml_back_initialized = .false. + logical :: pml_front_initialized = .false. + + ! Download tracking for lazy field transfer + integer(kind=4) :: last_download_step = 0 + + ! Probe result buffers on device (for on-device probe sampling) + real(kind=rkind), pointer, device, dimension(:) :: probe_results_d + integer(kind=4), pointer, device, dimension(:) :: probe_field_ids_d + integer(kind=4), pointer, device, dimension(:) :: probe_I_d, probe_J_d, probe_K_d + integer(kind=4) :: num_probe_results = 0 + ! Block probe buffers (bulkCurrent probes) + real(kind=rkind), pointer, device, dimension(:) :: block_probe_results_d + integer(kind=4), pointer, device, dimension(:) :: block_probe_field_ids_d + integer(kind=4), pointer, device, dimension(:) :: block_probe_I1_d, block_probe_J1_d, block_probe_K1_d + integer(kind=4), pointer, device, dimension(:) :: block_probe_I2_d, block_probe_J2_d, block_probe_K2_d + integer(kind=4) :: num_block_probe_results = 0 + logical :: probe_buffers_initialized = .false. + + ! NF2FF (near-to-far-field) device buffers + complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_ExIz_d, nf2ff_ExDe_d, nf2ff_ExAb_d, nf2ff_ExAr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_EyFr_d, nf2ff_EyTr_d, nf2ff_EyAb_d, nf2ff_EyAr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_EzIz_d, nf2ff_EzDe_d, nf2ff_EzFr_d, nf2ff_EzTr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HxIz_d, nf2ff_HxDe_d, nf2ff_HxAb_d, nf2ff_HxAr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HyFr_d, nf2ff_HyTr_d, nf2ff_HyAb_d, nf2ff_HyAr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HzIz_d, nf2ff_HzDe_d, nf2ff_HzFr_d, nf2ff_HzTr_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HxIz2_d, nf2ff_HxDe2_d, nf2ff_HxAb2_d, nf2ff_HxAr2_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HyFr2_d, nf2ff_HyTr2_d, nf2ff_HyAb2_d, nf2ff_HyAr2_d + complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HzIz2_d, nf2ff_HzDe2_d, nf2ff_HzFr2_d, nf2ff_HzTr2_d + + ! NF2FF frequency arrays (device) + complex(kind=rkind), pointer, device, dimension(:) :: nf2ff_expIwdt_d, nf2ff_auxExp_E_d, nf2ff_auxExp_H_d + + ! NF2FF geometry (device) — 18 coordinate arrays for Mx,My,Mz,Jx,Jy,Jz per cell + real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Mx_d, nf2ff_phys_y_Mx_d, nf2ff_phys_z_Mx_d + real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_My_d, nf2ff_phys_y_My_d, nf2ff_phys_z_My_d + real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Mz_d, nf2ff_phys_y_Mz_d, nf2ff_phys_z_Mz_d + real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Jx_d, nf2ff_phys_y_Jx_d, nf2ff_phys_z_Jx_d + real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Jy_d, nf2ff_phys_y_Jy_d, nf2ff_phys_z_Jy_d + real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Jz_d, nf2ff_phys_y_Jz_d, nf2ff_phys_z_Jz_d + + ! NF2FF cell dimensions (device) + real(kind=rkind), pointer, device, dimension(:) :: nf2ff_dyh_d, nf2ff_dze_d, nf2ff_dye_d, nf2ff_dzh_d + + ! NF2FF output buffers (device) + real(kind=rkind), pointer, device, dimension(:) :: nf2ff_Etheta_d, nf2ff_Ephi_d, nf2ff_RCS_d + + ! NF2FF configuration + logical :: nf2ff_initialized = .false. + integer(kind=4) :: nf2ff_num_cells = 0 + integer(kind=4) :: nf2ff_num_freqs = 0 + integer(kind=4) :: nf2ff_Ntheta = 0 + integer(kind=4) :: nf2ff_Nphi = 0 + integer(kind=4) :: nf2ff_theta_start = 0 + integer(kind=4) :: nf2ff_theta_stop = 0 + integer(kind=4) :: nf2ff_phi_start = 0 + integer(kind=4) :: nf2ff_phi_stop = 0 + real(kind=rkind) :: nf2ff_thetaStep = 0.0_rkind + real(kind=rkind) :: nf2ff_phiStep = 0.0_rkind + real(kind=rkind) :: nf2ff_freqStep = 0.0_rkind + real(kind=rkind) :: nf2ff_initialFreq = 0.0_rkind + real(kind=rkind) :: nf2ff_cluz = 0.0_rkind + real(kind=rkind) :: nf2ff_z0 = 0.0_rkind + real(kind=rkind) :: nf2ff_XDobleAncho = 0.0_rkind + real(kind=rkind) :: nf2ff_YDobleAncho = 0.0_rkind + real(kind=rkind) :: nf2ff_ZDobleAncho = 0.0_rkind + integer(kind=4) :: nf2ff_sym_flags = 0 + end type + +contains + + !-------------------------------------------------------------------------------- + ! Initialize GPU data regions - called once at simulation start + !-------------------------------------------------------------------------------- + subroutine gpu_init(this, Ex, Ey, Ez, Hx, Hy, Hz, sggMiEx, sggMiEy, sggMiEz, & + sggMiHx, sggMiHy, sggMiHz, g1, g2, gm1, gm2, & + Idxe_in, Idye_in, Idze_in, Idxh_in, Idyh_in, Idzh_in, dxe_in, dye_in, dze_in, dxh_in, dyh_in, dzh_in) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:,:,:), pointer, contiguous, intent(in) :: Ex, Ey, Ez, Hx, Hy, Hz + integer(kind=integersizeofmediamatrices), dimension(:,:,:), pointer, contiguous, intent(in) :: sggMiEx, sggMiEy, sggMiEz, sggMiHx, sggMiHy, sggMiHz + real(kind=rkind), dimension(:), pointer, intent(in) :: g1, g2, gm1, gm2 + real(kind=rkind), dimension(:), pointer, intent(in) :: Idxe_in, Idye_in, Idze_in, Idxh_in, Idyh_in, Idzh_in, dxe_in, dye_in, dze_in, dxh_in, dyh_in, dzh_in + + integer(kind=4) :: ndev + integer(kind=4) :: cuda_status + integer(kind=4) :: env_status + character(len=16) :: enable_cuf + + call get_environment_variable("SEMBA_FDTD_ENABLE_CUF_RUNTIME", enable_cuf, status=env_status) + if (env_status /= 0 .or. trim(enable_cuf) /= "1") then + this%initialized = .false. + return + endif + + cuda_status = cudaGetDeviceCount(ndev) + if (cuda_status /= cudaSuccess .or. ndev <= 0) then + this%initialized = .false. + return + endif + + ! Store host pointers + this%Ex => Ex; this%Ey => Ey; this%Ez => Ez + this%Hx => Hx; this%Hy => Hy; this%Hz => Hz + this%sggMiEx => sggMiEx; this%sggMiEy => sggMiEy; this%sggMiEz => sggMiEz + this%sggMiHx => sggMiHx; this%sggMiHy => sggMiHy; this%sggMiHz => sggMiHz + this%g1 => g1; this%g2 => g2; this%gm1 => gm1; this%gm2 => gm2 + this%Idxe => Idxe_in; this%Idye => Idye_in; this%Idze => Idze_in + this%Idxh => Idxh_in; this%Idyh => Idyh_in; this%Idzh => Idzh_in + this%dxe => dxe_in; this%dye => dye_in; this%dze => dze_in + this%dxh => dxh_in; this%dyh => dyh_in; this%dzh => dzh_in + + ! Store dimensions + this%Ex_nx = ubound(Ex,1) - lbound(Ex,1) + 1 + this%Ex_ny = ubound(Ex,2) - lbound(Ex,2) + 1 + this%Ex_nz = ubound(Ex,3) - lbound(Ex,3) + 1 + this%Ey_nx = ubound(Ey,1) - lbound(Ey,1) + 1 + this%Ey_ny = ubound(Ey,2) - lbound(Ey,2) + 1 + this%Ey_nz = ubound(Ey,3) - lbound(Ey,3) + 1 + this%Ez_nx = ubound(Ez,1) - lbound(Ez,1) + 1 + this%Ez_ny = ubound(Ez,2) - lbound(Ez,2) + 1 + this%Ez_nz = ubound(Ez,3) - lbound(Ez,3) + 1 + this%Hx_nx = ubound(Hx,1) - lbound(Hx,1) + 1 + this%Hx_ny = ubound(Hx,2) - lbound(Hx,2) + 1 + this%Hx_nz = ubound(Hx,3) - lbound(Hx,3) + 1 + this%Hy_nx = ubound(Hy,1) - lbound(Hy,1) + 1 + this%Hy_ny = ubound(Hy,2) - lbound(Hy,2) + 1 + this%Hy_nz = ubound(Hy,3) - lbound(Hy,3) + 1 + this%Hz_nx = ubound(Hz,1) - lbound(Hz,1) + 1 + this%Hz_ny = ubound(Hz,2) - lbound(Hz,2) + 1 + this%Hz_nz = ubound(Hz,3) - lbound(Hz,3) + 1 + + this%Idxe_n = ubound(Idxe_in,1) - lbound(Idxe_in,1) + 1 + this%Idye_n = ubound(Idye_in,1) - lbound(Idye_in,1) + 1 + this%Idze_n = ubound(Idze_in,1) - lbound(Idze_in,1) + 1 + this%Idxh_n = ubound(Idxh_in,1) - lbound(Idxh_in,1) + 1 + this%Idyh_n = ubound(Idyh_in,1) - lbound(Idyh_in,1) + 1 + this%Idzh_n = ubound(Idzh_in,1) - lbound(Idzh_in,1) + 1 + this%dxe_n = ubound(dxe_in,1) - lbound(dxe_in,1) + 1 + this%dye_n = ubound(dye_in,1) - lbound(dye_in,1) + 1 + this%dze_n = ubound(dze_in,1) - lbound(dze_in,1) + 1 + this%dxh_n = ubound(dxh_in,1) - lbound(dxh_in,1) + 1 + this%dyh_n = ubound(dyh_in,1) - lbound(dyh_in,1) + 1 + this%dzh_n = ubound(dzh_in,1) - lbound(dzh_in,1) + 1 + this%g1_n = ubound(g1,1) - lbound(g1,1) + 1 + this%g2_n = ubound(g2,1) - lbound(g2,1) + 1 + this%gm1_n = ubound(gm1,1) - lbound(gm1,1) + 1 + this%gm2_n = ubound(gm2,1) - lbound(gm2,1) + 1 + + this%sggMi_nx = ubound(sggMiEx,1) - lbound(sggMiEx,1) + 1 + this%sggMi_ny = ubound(sggMiEx,2) - lbound(sggMiEx,2) + 1 + this%sggMi_nz = ubound(sggMiEx,3) - lbound(sggMiEx,3) + 1 + + ! Allocate persistent device memory - YEE fields (0-based to match host arrays) + allocate(this%Ex_d(0:this%Ex_nx-1, 0:this%Ex_ny-1, 0:this%Ex_nz-1)) + allocate(this%Ey_d(0:this%Ey_nx-1, 0:this%Ey_ny-1, 0:this%Ey_nz-1)) + allocate(this%Ez_d(0:this%Ez_nx-1, 0:this%Ez_ny-1, 0:this%Ez_nz-1)) + allocate(this%Hx_d(0:this%Hx_nx-1, 0:this%Hx_ny-1, 0:this%Hx_nz-1)) + allocate(this%Hy_d(0:this%Hy_nx-1, 0:this%Hy_ny-1, 0:this%Hy_nz-1)) + allocate(this%Hz_d(0:this%Hz_nx-1, 0:this%Hz_ny-1, 0:this%Hz_nz-1)) + + allocate(this%Idxe_d(0:this%Idxe_n-1)) + allocate(this%Idye_d(0:this%Idye_n-1)) + allocate(this%Idze_d(0:this%Idze_n-1)) + allocate(this%Idxh_d(0:this%Idxh_n-1)) + allocate(this%Idyh_d(0:this%Idyh_n-1)) + allocate(this%Idzh_d(0:this%Idzh_n-1)) + allocate(this%dxe_d(0:this%dxe_n-1)) + allocate(this%dye_d(0:this%dye_n-1)) + allocate(this%dze_d(0:this%dze_n-1)) + allocate(this%dxh_d(0:this%dxh_n-1)) + allocate(this%dyh_d(0:this%dyh_n-1)) + allocate(this%dzh_d(0:this%dzh_n-1)) + + allocate(this%sggMiEx_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1)) + allocate(this%sggMiEy_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1)) + allocate(this%sggMiEz_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1)) + allocate(this%sggMiHx_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1)) + allocate(this%sggMiHy_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1)) + allocate(this%sggMiHz_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1)) + + allocate(this%g1_d(0:this%g1_n-1)) + allocate(this%g2_d(0:this%g2_n-1)) + allocate(this%gm1_d(0:this%gm1_n-1)) + allocate(this%gm2_d(0:this%gm2_n-1)) + + ! Initial upload: host -> device (once at startup) + this%Ex_d = this%Ex + this%Ey_d = this%Ey + this%Ez_d = this%Ez + this%Hx_d = this%Hx + this%Hy_d = this%Hy + this%Hz_d = this%Hz + this%fields_on_device = .true. + + this%initialized = .true. + + end subroutine gpu_init + + !-------------------------------------------------------------------------------- + ! Initialize CPML left boundary on GPU - called after InitCPMLBorders + !-------------------------------------------------------------------------------- + subroutine gpu_init_pml_left(this, P_be_y, P_ce_y, P_bm_y, P_cm_y, & + Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj, & + Ex_iEz_ii, Ex_iEz_ij, Ex_iEz_ji, Ex_iEz_jj, Ex_iEz_ki, Ex_iEz_kj, & + Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj, & + Hx_iHz_ii, Hx_iHz_ij, Hx_iHz_ji, Hx_iHz_jj, Hx_iHz_ki, Hx_iHz_kj, & + Ex_ny, Ex_nz, Ez_ny, Ez_nz, Hx_ny, Hx_nz, Hz_ny, Hz_nz) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(in) :: P_be_y, P_ce_y, P_bm_y, P_cm_y + integer(kind=4), intent(in) :: Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj + integer(kind=4), intent(in) :: Ex_iEz_ii, Ex_iEz_ij, Ex_iEz_ji, Ex_iEz_jj, Ex_iEz_ki, Ex_iEz_kj + integer(kind=4), intent(in) :: Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj + integer(kind=4), intent(in) :: Hx_iHz_ii, Hx_iHz_ij, Hx_iHz_ji, Hx_iHz_jj, Hx_iHz_ki, Hx_iHz_kj + integer(kind=4), intent(in) :: Ex_ny, Ex_nz, Ez_ny, Ez_nz, Hx_ny, Hx_nz, Hz_ny, Hz_nz + + integer(kind=4) :: y_lo, y_hi + + if (.not. this%initialized) return + + ! Store PML limits (convert from global 0-based to GPU 1-based) + this%pml_left_Ex_ii = Ex_iEx_ii + 1; this%pml_left_Ex_ij = Ex_iEx_ij + 1 + this%pml_left_Ex_ji = Ex_iEx_ji + 1; this%pml_left_Ex_jj = Ex_iEx_jj + 1 + this%pml_left_Ex_ki = Ex_iEx_ki + 1; this%pml_left_Ex_kj = Ex_iEx_kj + 1 + + this%pml_left_Ez_ii = Ex_iEz_ii + 1; this%pml_left_Ez_ij = Ex_iEz_ij + 1 + this%pml_left_Ez_ji = Ex_iEz_ji + 1; this%pml_left_Ez_jj = Ex_iEz_jj + 1 + this%pml_left_Ez_ki = Ex_iEz_ki + 1; this%pml_left_Ez_kj = Ex_iEz_kj + 1 + + this%pml_left_Hx_ii = Hx_iHx_ii + 1; this%pml_left_Hx_ij = Hx_iHx_ij + 1 + this%pml_left_Hx_ji = Hx_iHx_ji + 1; this%pml_left_Hx_jj = Hx_iHx_jj + 1 + this%pml_left_Hx_ki = Hx_iHx_ki + 1; this%pml_left_Hx_kj = Hx_iHx_kj + 1 + + this%pml_left_Hz_ii = Hx_iHz_ii + 1; this%pml_left_Hz_ij = Hx_iHz_ij + 1 + this%pml_left_Hz_ji = Hx_iHz_ji + 1; this%pml_left_Hz_jj = Hx_iHz_jj + 1 + this%pml_left_Hz_ki = Hx_iHz_ki + 1; this%pml_left_Hz_kj = Hx_iHz_kj + 1 + + ! Allocate psi arrays with global bounds (same as host) + this%pml_left_Exy_nx = Ex_iEx_ij - Ex_iEx_ii + 1 + this%pml_left_Exy_ny = Ex_iEx_jj - Ex_iEx_ji + 1 + this%pml_left_Exy_nz = Ex_iEx_kj - Ex_iEx_ki + 1 + allocate(this%pml_psi_Exy_left(this%pml_left_Exy_nx, this%pml_left_Exy_ny, this%pml_left_Exy_nz)) + + this%pml_left_Ezy_nx = Ex_iEz_ij - Ex_iEz_ii + 1 + this%pml_left_Ezy_ny = Ex_iEz_jj - Ex_iEz_ji + 1 + this%pml_left_Ezy_nz = Ex_iEz_kj - Ex_iEz_ki + 1 + allocate(this%pml_psi_Ezy_left(this%pml_left_Ezy_nx, this%pml_left_Ezy_ny, this%pml_left_Ezy_nz)) + + this%pml_left_Hxy_nx = Hx_iHx_ij - Hx_iHx_ii + 1 + this%pml_left_Hxy_ny = Hx_iHx_jj - Hx_iHx_ji + 1 + this%pml_left_Hxy_nz = Hx_iHx_kj - Hx_iHx_ki + 1 + allocate(this%pml_psi_Hxy_left(this%pml_left_Hxy_nx, this%pml_left_Hxy_ny, this%pml_left_Hxy_nz)) + + this%pml_left_Hzy_nx = Hx_iHz_ij - Hx_iHz_ii + 1 + this%pml_left_Hzy_ny = Hx_iHz_jj - Hx_iHz_ji + 1 + this%pml_left_Hzy_nz = Hx_iHz_kj - Hx_iHz_ki + 1 + allocate(this%pml_psi_Hzy_left(this%pml_left_Hzy_nx, this%pml_left_Hzy_ny, this%pml_left_Hzy_nz)) + + ! Initialize psi arrays to zero on device + this%pml_psi_Exy_left = 0.0_rkind + this%pml_psi_Ezy_left = 0.0_rkind + this%pml_psi_Hxy_left = 0.0_rkind + this%pml_psi_Hzy_left = 0.0_rkind + + ! Allocate and copy coefficient arrays + y_lo = lbound(P_be_y, 1) + y_hi = ubound(P_be_y, 1) + this%pml_coeff_y_n = y_hi - y_lo + 1 + allocate(this%pml_P_be_y_left(y_lo:y_hi)) + allocate(this%pml_P_ce_y_left(y_lo:y_hi)) + allocate(this%pml_P_bm_y_left(y_lo:y_hi)) + allocate(this%pml_P_cm_y_left(y_lo:y_hi)) + + this%pml_P_be_y_left = P_be_y + this%pml_P_ce_y_left = P_ce_y + this%pml_P_bm_y_left = P_bm_y + this%pml_P_cm_y_left = P_cm_y + + this%pml_left_initialized = .true. + + end subroutine gpu_init_pml_left + + !-------------------------------------------------------------------------------- + ! Initialize CPML right boundary on GPU + !-------------------------------------------------------------------------------- + subroutine gpu_init_pml_right(this, & + Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj, & + Ex_iEz_ii, Ex_iEz_ij, Ex_iEz_ji, Ex_iEz_jj, Ex_iEz_ki, Ex_iEz_kj, & + Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj, & + Hx_iHz_ii, Hx_iHz_ij, Hx_iHz_ji, Hx_iHz_jj, Hx_iHz_ki, Hx_iHz_kj) + class(gpu_state_t), intent(inout) :: this + integer(kind=4), intent(in) :: Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj + integer(kind=4), intent(in) :: Ex_iEz_ii, Ex_iEz_ij, Ex_iEz_ji, Ex_iEz_jj, Ex_iEz_ki, Ex_iEz_kj + integer(kind=4), intent(in) :: Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj + integer(kind=4), intent(in) :: Hx_iHz_ii, Hx_iHz_ij, Hx_iHz_ji, Hx_iHz_jj, Hx_iHz_ki, Hx_iHz_kj + + if (.not. this%initialized) return + + this%pml_right_Ex_ii = Ex_iEx_ii + 1; this%pml_right_Ex_ij = Ex_iEx_ij + 1 + this%pml_right_Ex_ji = Ex_iEx_ji + 1; this%pml_right_Ex_jj = Ex_iEx_jj + 1 + this%pml_right_Ex_ki = Ex_iEx_ki + 1; this%pml_right_Ex_kj = Ex_iEx_kj + 1 + + this%pml_right_Ez_ii = Ex_iEz_ii + 1; this%pml_right_Ez_ij = Ex_iEz_ij + 1 + this%pml_right_Ez_ji = Ex_iEz_ji + 1; this%pml_right_Ez_jj = Ex_iEz_jj + 1 + this%pml_right_Ez_ki = Ex_iEz_ki + 1; this%pml_right_Ez_kj = Ex_iEz_kj + 1 + + this%pml_right_Hx_ii = Hx_iHx_ii + 1; this%pml_right_Hx_ij = Hx_iHx_ij + 1 + this%pml_right_Hx_ji = Hx_iHx_ji + 1; this%pml_right_Hx_jj = Hx_iHx_jj + 1 + this%pml_right_Hx_ki = Hx_iHx_ki + 1; this%pml_right_Hx_kj = Hx_iHx_kj + 1 + + this%pml_right_Hz_ii = Hx_iHz_ii + 1; this%pml_right_Hz_ij = Hx_iHz_ij + 1 + this%pml_right_Hz_ji = Hx_iHz_ji + 1; this%pml_right_Hz_jj = Hx_iHz_jj + 1 + this%pml_right_Hz_ki = Hx_iHz_ki + 1; this%pml_right_Hz_kj = Hx_iHz_kj + 1 + + this%pml_right_initialized = .true. + + end subroutine gpu_init_pml_right + + !-------------------------------------------------------------------------------- + ! Download device data to host - called only when output/probes are needed + !-------------------------------------------------------------------------------- + subroutine gpu_download(this) + class(gpu_state_t), intent(inout) :: this + + if (.not. this%initialized) return + if (.not. this%fields_on_device) return + + this%Ex = this%Ex_d + this%Ey = this%Ey_d + this%Ez = this%Ez_d + this%Hx = this%Hx_d + this%Hy = this%Hy_d + this%Hz = this%Hz_d + this%fields_on_device = .false. + this%last_download_step = 0 + + end subroutine gpu_download + + !-------------------------------------------------------------------------------- + ! Destroy GPU state - called at simulation end + !-------------------------------------------------------------------------------- + subroutine gpu_destroy(this) + class(gpu_state_t), intent(inout) :: this + + if (.not. this%initialized) return + + ! Deallocate device memory - YEE fields + if (associated(this%Ex_d)) deallocate(this%Ex_d) + if (associated(this%Ey_d)) deallocate(this%Ey_d) + if (associated(this%Ez_d)) deallocate(this%Ez_d) + if (associated(this%Hx_d)) deallocate(this%Hx_d) + if (associated(this%Hy_d)) deallocate(this%Hy_d) + if (associated(this%Hz_d)) deallocate(this%Hz_d) + + if (associated(this%Idxe_d)) deallocate(this%Idxe_d) + if (associated(this%Idye_d)) deallocate(this%Idye_d) + if (associated(this%Idze_d)) deallocate(this%Idze_d) + if (associated(this%Idxh_d)) deallocate(this%Idxh_d) + if (associated(this%Idyh_d)) deallocate(this%Idyh_d) + if (associated(this%Idzh_d)) deallocate(this%Idzh_d) + if (associated(this%dxe_d)) deallocate(this%dxe_d) + if (associated(this%dye_d)) deallocate(this%dye_d) + if (associated(this%dze_d)) deallocate(this%dze_d) + if (associated(this%dxh_d)) deallocate(this%dxh_d) + if (associated(this%dyh_d)) deallocate(this%dyh_d) + if (associated(this%dzh_d)) deallocate(this%dzh_d) + + if (associated(this%sggMiEx_d)) deallocate(this%sggMiEx_d) + if (associated(this%sggMiEy_d)) deallocate(this%sggMiEy_d) + if (associated(this%sggMiEz_d)) deallocate(this%sggMiEz_d) + if (associated(this%sggMiHx_d)) deallocate(this%sggMiHx_d) + if (associated(this%sggMiHy_d)) deallocate(this%sggMiHy_d) + if (associated(this%sggMiHz_d)) deallocate(this%sggMiHz_d) + + if (associated(this%g1_d)) deallocate(this%g1_d) + if (associated(this%g2_d)) deallocate(this%g2_d) + if (associated(this%gm1_d)) deallocate(this%gm1_d) + if (associated(this%gm2_d)) deallocate(this%gm2_d) + + ! Deallocate CPML left boundary device memory + if (associated(this%pml_psi_Exy_left)) deallocate(this%pml_psi_Exy_left) + if (associated(this%pml_psi_Ezy_left)) deallocate(this%pml_psi_Ezy_left) + if (associated(this%pml_psi_Hxy_left)) deallocate(this%pml_psi_Hxy_left) + if (associated(this%pml_psi_Hzy_left)) deallocate(this%pml_psi_Hzy_left) + if (associated(this%pml_P_be_y_left)) deallocate(this%pml_P_be_y_left) + if (associated(this%pml_P_ce_y_left)) deallocate(this%pml_P_ce_y_left) + if (associated(this%pml_P_bm_y_left)) deallocate(this%pml_P_bm_y_left) + if (associated(this%pml_P_cm_y_left)) deallocate(this%pml_P_cm_y_left) + + ! Deallocate CPML down boundary device memory + if (associated(this%pml_psi_Exz_down)) deallocate(this%pml_psi_Exz_down) + if (associated(this%pml_psi_Eyz_down)) deallocate(this%pml_psi_Eyz_down) + if (associated(this%pml_psi_Hxz_down)) deallocate(this%pml_psi_Hxz_down) + if (associated(this%pml_psi_Hyz_down)) deallocate(this%pml_psi_Hyz_down) + if (associated(this%pml_P_be_z_down)) deallocate(this%pml_P_be_z_down) + if (associated(this%pml_P_ce_z_down)) deallocate(this%pml_P_ce_z_down) + if (associated(this%pml_P_bm_z_down)) deallocate(this%pml_P_bm_z_down) + if (associated(this%pml_P_cm_z_down)) deallocate(this%pml_P_cm_z_down) + + ! Deallocate CPML back boundary device memory + if (associated(this%pml_psi_Ezx_back)) deallocate(this%pml_psi_Ezx_back) + if (associated(this%pml_psi_Eyx_back)) deallocate(this%pml_psi_Eyx_back) + if (associated(this%pml_psi_Hzx_back)) deallocate(this%pml_psi_Hzx_back) + if (associated(this%pml_psi_Hyx_back)) deallocate(this%pml_psi_Hyx_back) + if (associated(this%pml_P_be_x_back)) deallocate(this%pml_P_be_x_back) + if (associated(this%pml_P_ce_x_back)) deallocate(this%pml_P_ce_x_back) + if (associated(this%pml_P_bm_x_back)) deallocate(this%pml_P_bm_x_back) + if (associated(this%pml_P_cm_x_back)) deallocate(this%pml_P_cm_x_back) + + ! Deallocate MUR device memory + if (associated(this%mur_left_CAB1)) deallocate(this%mur_left_CAB1) + if (associated(this%mur_left_CAB3)) deallocate(this%mur_left_CAB3) + if (associated(this%mur_left_cab4)) deallocate(this%mur_left_cab4) + if (associated(this%mur_right_CAB1)) deallocate(this%mur_right_CAB1) + if (associated(this%mur_right_CAB3)) deallocate(this%mur_right_CAB3) + if (associated(this%mur_right_cab4)) deallocate(this%mur_right_cab4) + if (associated(this%mur_down_CAB1)) deallocate(this%mur_down_CAB1) + if (associated(this%mur_down_CAB3)) deallocate(this%mur_down_CAB3) + if (associated(this%mur_down_cab4)) deallocate(this%mur_down_cab4) + if (associated(this%mur_up_CAB1)) deallocate(this%mur_up_CAB1) + if (associated(this%mur_up_CAB3)) deallocate(this%mur_up_CAB3) + if (associated(this%mur_up_cab4)) deallocate(this%mur_up_cab4) + if (associated(this%mur_back_CAB1)) deallocate(this%mur_back_CAB1) + if (associated(this%mur_back_CAB3)) deallocate(this%mur_back_CAB3) + if (associated(this%mur_back_cab4)) deallocate(this%mur_back_cab4) + if (associated(this%mur_front_CAB1)) deallocate(this%mur_front_CAB1) + if (associated(this%mur_front_CAB3)) deallocate(this%mur_front_CAB3) + if (associated(this%mur_front_cab4)) deallocate(this%mur_front_cab4) + if (associated(this%mur_past_Hx_left)) deallocate(this%mur_past_Hx_left) + if (associated(this%mur_past_Hz_left)) deallocate(this%mur_past_Hz_left) + if (associated(this%mur_past_Hx_right)) deallocate(this%mur_past_Hx_right) + if (associated(this%mur_past_Hz_right)) deallocate(this%mur_past_Hz_right) + if (associated(this%mur_past_Hy_down)) deallocate(this%mur_past_Hy_down) + if (associated(this%mur_past_Hx_down)) deallocate(this%mur_past_Hx_down) + if (associated(this%mur_past_Hy_up)) deallocate(this%mur_past_Hy_up) + if (associated(this%mur_past_Hx_up)) deallocate(this%mur_past_Hx_up) + if (associated(this%mur_past_Hz_back)) deallocate(this%mur_past_Hz_back) + if (associated(this%mur_past_Hy_back)) deallocate(this%mur_past_Hy_back) + if (associated(this%mur_past_Hz_front)) deallocate(this%mur_past_Hz_front) + if (associated(this%mur_past_Hy_front)) deallocate(this%mur_past_Hy_front) + + ! Nullify host pointers + nullify(this%Ex); nullify(this%Ey); nullify(this%Ez) + nullify(this%Hx); nullify(this%Hy); nullify(this%Hz) + nullify(this%sggMiEx); nullify(this%sggMiEy); nullify(this%sggMiEz) + nullify(this%sggMiHx); nullify(this%sggMiHy); nullify(this%sggMiHz) + nullify(this%g1); nullify(this%g2); nullify(this%gm1); nullify(this%gm2) + nullify(this%Idxe); nullify(this%Idye); nullify(this%Idze) + nullify(this%Idxh); nullify(this%Idyh); nullify(this%Idzh) + nullify(this%dxe); nullify(this%dye); nullify(this%dze) + nullify(this%dxh); nullify(this%dyh); nullify(this%dzh) + + this%initialized = .false. + this%fields_on_device = .false. + this%pml_left_initialized = .false. + this%pml_right_initialized = .false. + this%pml_down_initialized = .false. +this%pml_up_initialized = .false. + this%pml_back_initialized = .false. + this%pml_front_initialized = .false. + + ! Deallocate probe buffers + if (associated(this%probe_results_d)) deallocate(this%probe_results_d) + if (associated(this%probe_field_ids_d)) deallocate(this%probe_field_ids_d) + if (associated(this%probe_I_d)) deallocate(this%probe_I_d) + if (associated(this%probe_J_d)) deallocate(this%probe_J_d) + if (associated(this%probe_K_d)) deallocate(this%probe_K_d) + if (associated(this%block_probe_results_d)) deallocate(this%block_probe_results_d) + if (associated(this%block_probe_field_ids_d)) deallocate(this%block_probe_field_ids_d) + if (associated(this%block_probe_I1_d)) deallocate(this%block_probe_I1_d) + if (associated(this%block_probe_J1_d)) deallocate(this%block_probe_J1_d) + if (associated(this%block_probe_K1_d)) deallocate(this%block_probe_K1_d) + if (associated(this%block_probe_I2_d)) deallocate(this%block_probe_I2_d) + if (associated(this%block_probe_J2_d)) deallocate(this%block_probe_J2_d) + if (associated(this%block_probe_K2_d)) deallocate(this%block_probe_K2_d) + + ! Deallocate flattened MUR arrays + if (associated(this%mur_bounds_d)) deallocate(this%mur_bounds_d) + if (associated(this%mur_cab1_d)) deallocate(this%mur_cab1_d) + + end subroutine gpu_destroy + + !-------------------------------------------------------------------------------- + ! Update CPML down/up boundary coefficients on device - called every step + !-------------------------------------------------------------------------------- + subroutine gpu_update_pml_down_coeffs(this, P_be_z, P_ce_z, P_bm_z, P_cm_z) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(in) :: P_be_z, P_ce_z, P_bm_z, P_cm_z + + if (.not. this%pml_down_initialized) return + + this%pml_P_be_z_down = P_be_z + this%pml_P_ce_z_down = P_ce_z + this%pml_P_bm_z_down = P_bm_z + this%pml_P_cm_z_down = P_cm_z + + end subroutine gpu_update_pml_down_coeffs + + !-------------------------------------------------------------------------------- + ! Update CPML back/front boundary coefficients on device - called every step + !-------------------------------------------------------------------------------- + subroutine gpu_update_pml_back_coeffs(this, P_be_x, P_ce_x, P_bm_x, P_cm_x) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(in) :: P_be_x, P_ce_x, P_bm_x, P_cm_x + + if (.not. this%pml_back_initialized) return + + this%pml_P_be_x_back = P_be_x + this%pml_P_ce_x_back = P_ce_x + this%pml_P_bm_x_back = P_bm_x + this%pml_P_cm_x_back = P_cm_x + + end subroutine gpu_update_pml_back_coeffs + + !-------------------------------------------------------------------------------- + ! Initialize CPML down boundary on GPU + !-------------------------------------------------------------------------------- + subroutine gpu_init_pml_down(this, P_be_z, P_ce_z, P_bm_z, P_cm_z, & + Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj, & + Ex_iEy_ii, Ex_iEy_ij, Ex_iEy_ji, Ex_iEy_jj, Ex_iEy_ki, Ex_iEy_kj, & + Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj, & + Hx_iHy_ii, Hx_iHy_ij, Hx_iHy_ji, Hx_iHy_jj, Hx_iHy_ki, Hx_iHy_kj) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(in) :: P_be_z, P_ce_z, P_bm_z, P_cm_z + integer(kind=4), intent(in) :: Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj + integer(kind=4), intent(in) :: Ex_iEy_ii, Ex_iEy_ij, Ex_iEy_ji, Ex_iEy_jj, Ex_iEy_ki, Ex_iEy_kj + integer(kind=4), intent(in) :: Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj + integer(kind=4), intent(in) :: Hx_iHy_ii, Hx_iHy_ij, Hx_iHy_ji, Hx_iHy_jj, Hx_iHy_ki, Hx_iHy_kj + + integer(kind=4) :: z_lo, z_hi + + if (.not. this%initialized) return + + this%pml_down_Ex_ii = Ex_iEx_ii + 1; this%pml_down_Ex_ij = Ex_iEx_ij + 1 + this%pml_down_Ex_ji = Ex_iEx_ji + 1; this%pml_down_Ex_jj = Ex_iEx_jj + 1 + this%pml_down_Ex_ki = Ex_iEx_ki + 1; this%pml_down_Ex_kj = Ex_iEx_kj + 1 + + this%pml_down_Ey_ii = Ex_iEy_ii + 1; this%pml_down_Ey_ij = Ex_iEy_ij + 1 + this%pml_down_Ey_ji = Ex_iEy_ji + 1; this%pml_down_Ey_jj = Ex_iEy_jj + 1 + this%pml_down_Ey_ki = Ex_iEy_ki + 1; this%pml_down_Ey_kj = Ex_iEy_kj + 1 + + this%pml_down_Hx_ii = Hx_iHx_ii + 1; this%pml_down_Hx_ij = Hx_iHx_ij + 1 + this%pml_down_Hx_ji = Hx_iHx_ji + 1; this%pml_down_Hx_jj = Hx_iHx_jj + 1 + this%pml_down_Hx_ki = Hx_iHx_ki + 1; this%pml_down_Hx_kj = Hx_iHx_kj + 1 + + this%pml_down_Hy_ii = Hx_iHy_ii + 1; this%pml_down_Hy_ij = Hx_iHy_ij + 1 + this%pml_down_Hy_ji = Hx_iHy_ji + 1; this%pml_down_Hy_jj = Hx_iHy_jj + 1 + this%pml_down_Hy_ki = Hx_iHy_ki + 1; this%pml_down_Hy_kj = Hx_iHy_kj + 1 + + this%pml_down_Exz_nx = Ex_iEx_ij - Ex_iEx_ii + 1 + this%pml_down_Exz_ny = Ex_iEx_jj - Ex_iEx_ji + 1 + this%pml_down_Exz_nz = Ex_iEx_kj - Ex_iEx_ki + 1 + allocate(this%pml_psi_Exz_down(this%pml_down_Exz_nx, this%pml_down_Exz_ny, this%pml_down_Exz_nz)) + + this%pml_down_Eyz_nx = Ex_iEy_ij - Ex_iEy_ii + 1 + this%pml_down_Eyz_ny = Ex_iEy_jj - Ex_iEy_ji + 1 + this%pml_down_Eyz_nz = Ex_iEy_kj - Ex_iEy_ki + 1 + allocate(this%pml_psi_Eyz_down(this%pml_down_Eyz_nx, this%pml_down_Eyz_ny, this%pml_down_Eyz_nz)) + + this%pml_down_Hxz_nx = Hx_iHx_ij - Hx_iHx_ii + 1 + this%pml_down_Hxz_ny = Hx_iHx_jj - Hx_iHx_ji + 1 + this%pml_down_Hxz_nz = Hx_iHx_kj - Hx_iHx_ki + 1 + allocate(this%pml_psi_Hxz_down(this%pml_down_Hxz_nx, this%pml_down_Hxz_ny, this%pml_down_Hxz_nz)) + + this%pml_down_Hyz_nx = Hx_iHy_ij - Hx_iHy_ii + 1 + this%pml_down_Hyz_ny = Hx_iHy_jj - Hx_iHy_ji + 1 + this%pml_down_Hyz_nz = Hx_iHy_kj - Hx_iHy_ki + 1 + allocate(this%pml_psi_Hyz_down(this%pml_down_Hyz_nx, this%pml_down_Hyz_ny, this%pml_down_Hyz_nz)) + + this%pml_psi_Exz_down = 0.0_rkind + this%pml_psi_Eyz_down = 0.0_rkind + this%pml_psi_Hxz_down = 0.0_rkind + this%pml_psi_Hyz_down = 0.0_rkind + + z_lo = lbound(P_be_z, 1) + z_hi = ubound(P_be_z, 1) + this%pml_coeff_z_n = z_hi - z_lo + 1 + allocate(this%pml_P_be_z_down(z_lo:z_hi)) + allocate(this%pml_P_ce_z_down(z_lo:z_hi)) + allocate(this%pml_P_bm_z_down(z_lo:z_hi)) + allocate(this%pml_P_cm_z_down(z_lo:z_hi)) + + this%pml_P_be_z_down = P_be_z + this%pml_P_ce_z_down = P_ce_z + this%pml_P_bm_z_down = P_bm_z + this%pml_P_cm_z_down = P_cm_z + + this%pml_down_initialized = .true. + + end subroutine gpu_init_pml_down + + !-------------------------------------------------------------------------------- + ! Initialize CPML up boundary on GPU + !-------------------------------------------------------------------------------- + subroutine gpu_init_pml_up(this, & + Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj, & + Ex_iEy_ii, Ex_iEy_ij, Ex_iEy_ji, Ex_iEy_jj, Ex_iEy_ki, Ex_iEy_kj, & + Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj, & + Hx_iHy_ii, Hx_iHy_ij, Hx_iHy_ji, Hx_iHy_jj, Hx_iHy_ki, Hx_iHy_kj) + class(gpu_state_t), intent(inout) :: this + integer(kind=4), intent(in) :: Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj + integer(kind=4), intent(in) :: Ex_iEy_ii, Ex_iEy_ij, Ex_iEy_ji, Ex_iEy_jj, Ex_iEy_ki, Ex_iEy_kj + integer(kind=4), intent(in) :: Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj + integer(kind=4), intent(in) :: Hx_iHy_ii, Hx_iHy_ij, Hx_iHy_ji, Hx_iHy_jj, Hx_iHy_ki, Hx_iHy_kj + + if (.not. this%initialized) return + + this%pml_up_Ex_ii = Ex_iEx_ii + 1; this%pml_up_Ex_ij = Ex_iEx_ij + 1 + this%pml_up_Ex_ji = Ex_iEx_ji + 1; this%pml_up_Ex_jj = Ex_iEx_jj + 1 + this%pml_up_Ex_ki = Ex_iEx_ki + 1; this%pml_up_Ex_kj = Ex_iEx_kj + 1 + + this%pml_up_Ey_ii = Ex_iEy_ii + 1; this%pml_up_Ey_ij = Ex_iEy_ij + 1 + this%pml_up_Ey_ji = Ex_iEy_ji + 1; this%pml_up_Ey_jj = Ex_iEy_jj + 1 + this%pml_up_Ey_ki = Ex_iEy_ki + 1; this%pml_up_Ey_kj = Ex_iEy_kj + 1 + + this%pml_up_Hx_ii = Hx_iHx_ii + 1; this%pml_up_Hx_ij = Hx_iHx_ij + 1 + this%pml_up_Hx_ji = Hx_iHx_ji + 1; this%pml_up_Hx_jj = Hx_iHx_jj + 1 + this%pml_up_Hx_ki = Hx_iHx_ki + 1; this%pml_up_Hx_kj = Hx_iHx_kj + 1 + + this%pml_up_Hy_ii = Hx_iHy_ii + 1; this%pml_up_Hy_ij = Hx_iHy_ij + 1 + this%pml_up_Hy_ji = Hx_iHy_ji + 1; this%pml_up_Hy_jj = Hx_iHy_jj + 1 + this%pml_up_Hy_ki = Hx_iHy_ki + 1; this%pml_up_Hy_kj = Hx_iHy_kj + 1 + + this%pml_up_initialized = .true. + + end subroutine gpu_init_pml_up + + !-------------------------------------------------------------------------------- + ! Initialize CPML back boundary on GPU + !-------------------------------------------------------------------------------- + subroutine gpu_init_pml_back(this, P_be_x, P_ce_x, P_bm_x, P_cm_x, & + Ez_iEz_ii, Ez_iEz_ij, Ez_iEz_ji, Ez_iEz_jj, Ez_iEz_ki, Ez_iEz_kj, & + Ez_iEy_ii, Ez_iEy_ij, Ez_iEy_ji, Ez_iEy_jj, Ez_iEy_ki, Ez_iEy_kj, & + Hz_iHz_ii, Hz_iHz_ij, Hz_iHz_ji, Hz_iHz_jj, Hz_iHz_ki, Hz_iHz_kj, & + Hz_iHy_ii, Hz_iHy_ij, Hz_iHy_ji, Hz_iHy_jj, Hz_iHy_ki, Hz_iHy_kj) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(in) :: P_be_x, P_ce_x, P_bm_x, P_cm_x + integer(kind=4), intent(in) :: Ez_iEz_ii, Ez_iEz_ij, Ez_iEz_ji, Ez_iEz_jj, Ez_iEz_ki, Ez_iEz_kj + integer(kind=4), intent(in) :: Ez_iEy_ii, Ez_iEy_ij, Ez_iEy_ji, Ez_iEy_jj, Ez_iEy_ki, Ez_iEy_kj + integer(kind=4), intent(in) :: Hz_iHz_ii, Hz_iHz_ij, Hz_iHz_ji, Hz_iHz_jj, Hz_iHz_ki, Hz_iHz_kj + integer(kind=4), intent(in) :: Hz_iHy_ii, Hz_iHy_ij, Hz_iHy_ji, Hz_iHy_jj, Hz_iHy_ki, Hz_iHy_kj + + integer(kind=4) :: x_lo, x_hi + + if (.not. this%initialized) return + + this%pml_back_Ez_ii = Ez_iEz_ii + 1; this%pml_back_Ez_ij = Ez_iEz_ij + 1 + this%pml_back_Ez_ji = Ez_iEz_ji + 1; this%pml_back_Ez_jj = Ez_iEz_jj + 1 + this%pml_back_Ez_ki = Ez_iEz_ki + 1; this%pml_back_Ez_kj = Ez_iEz_kj + 1 + + this%pml_back_Ey_ii = Ez_iEy_ii + 1; this%pml_back_Ey_ij = Ez_iEy_ij + 1 + this%pml_back_Ey_ji = Ez_iEy_ji + 1; this%pml_back_Ey_jj = Ez_iEy_jj + 1 + this%pml_back_Ey_ki = Ez_iEy_ki + 1; this%pml_back_Ey_kj = Ez_iEy_kj + 1 + + this%pml_back_Hz_ii = Hz_iHz_ii + 1; this%pml_back_Hz_ij = Hz_iHz_ij + 1 + this%pml_back_Hz_ji = Hz_iHz_ji + 1; this%pml_back_Hz_jj = Hz_iHz_jj + 1 + this%pml_back_Hz_ki = Hz_iHz_ki + 1; this%pml_back_Hz_kj = Hz_iHz_kj + 1 + + this%pml_back_Hy_ii = Hz_iHy_ii + 1; this%pml_back_Hy_ij = Hz_iHy_ij + 1 + this%pml_back_Hy_ji = Hz_iHy_ji + 1; this%pml_back_Hy_jj = Hz_iHy_jj + 1 + this%pml_back_Hy_ki = Hz_iHy_ki + 1; this%pml_back_Hy_kj = Hz_iHy_kj + 1 + + this%pml_back_Ezx_nx = Ez_iEz_ij - Ez_iEz_ii + 1 + this%pml_back_Ezx_ny = Ez_iEz_jj - Ez_iEz_ji + 1 + this%pml_back_Ezx_nz = Ez_iEz_kj - Ez_iEz_ki + 1 + allocate(this%pml_psi_Ezx_back(this%pml_back_Ezx_nx, this%pml_back_Ezx_ny, this%pml_back_Ezx_nz)) + + this%pml_back_Eyx_nx = Ez_iEy_ij - Ez_iEy_ii + 1 + this%pml_back_Eyx_ny = Ez_iEy_jj - Ez_iEy_ji + 1 + this%pml_back_Eyx_nz = Ez_iEy_kj - Ez_iEy_ki + 1 + allocate(this%pml_psi_Eyx_back(this%pml_back_Eyx_nx, this%pml_back_Eyx_ny, this%pml_back_Eyx_nz)) + + this%pml_back_Hzx_nx = Hz_iHz_ij - Hz_iHz_ii + 1 + this%pml_back_Hzx_ny = Hz_iHz_jj - Hz_iHz_ji + 1 + this%pml_back_Hzx_nz = Hz_iHz_kj - Hz_iHz_ki + 1 + allocate(this%pml_psi_Hzx_back(this%pml_back_Hzx_nx, this%pml_back_Hzx_ny, this%pml_back_Hzx_nz)) + + this%pml_back_Hyx_nx = Hz_iHy_ij - Hz_iHy_ii + 1 + this%pml_back_Hyx_ny = Hz_iHy_jj - Hz_iHy_ji + 1 + this%pml_back_Hyx_nz = Hz_iHy_kj - Hz_iHy_ki + 1 + allocate(this%pml_psi_Hyx_back(this%pml_back_Hyx_nx, this%pml_back_Hyx_ny, this%pml_back_Hyx_nz)) + + this%pml_psi_Ezx_back = 0.0_rkind + this%pml_psi_Eyx_back = 0.0_rkind + this%pml_psi_Hzx_back = 0.0_rkind + this%pml_psi_Hyx_back = 0.0_rkind + + x_lo = lbound(P_be_x, 1) + x_hi = ubound(P_be_x, 1) + this%pml_coeff_x_n = x_hi - x_lo + 1 + allocate(this%pml_P_be_x_back(x_lo:x_hi)) + allocate(this%pml_P_ce_x_back(x_lo:x_hi)) + allocate(this%pml_P_bm_x_back(x_lo:x_hi)) + allocate(this%pml_P_cm_x_back(x_lo:x_hi)) + + this%pml_P_be_x_back = P_be_x + this%pml_P_ce_x_back = P_ce_x + this%pml_P_bm_x_back = P_bm_x + this%pml_P_cm_x_back = P_cm_x + + this%pml_back_initialized = .true. + + end subroutine gpu_init_pml_back + + !-------------------------------------------------------------------------------- + ! Initialize CPML front boundary on GPU + !-------------------------------------------------------------------------------- + subroutine gpu_init_pml_front(this, & + Ez_iEz_ii, Ez_iEz_ij, Ez_iEz_ji, Ez_iEz_jj, Ez_iEz_ki, Ez_iEz_kj, & + Ez_iEy_ii, Ez_iEy_ij, Ez_iEy_ji, Ez_iEy_jj, Ez_iEy_ki, Ez_iEy_kj, & + Hz_iHz_ii, Hz_iHz_ij, Hz_iHz_ji, Hz_iHz_jj, Hz_iHz_ki, Hz_iHz_kj, & + Hz_iHy_ii, Hz_iHy_ij, Hz_iHy_ji, Hz_iHy_jj, Hz_iHy_ki, Hz_iHy_kj) + class(gpu_state_t), intent(inout) :: this + integer(kind=4), intent(in) :: Ez_iEz_ii, Ez_iEz_ij, Ez_iEz_ji, Ez_iEz_jj, Ez_iEz_ki, Ez_iEz_kj + integer(kind=4), intent(in) :: Ez_iEy_ii, Ez_iEy_ij, Ez_iEy_ji, Ez_iEy_jj, Ez_iEy_ki, Ez_iEy_kj + integer(kind=4), intent(in) :: Hz_iHz_ii, Hz_iHz_ij, Hz_iHz_ji, Hz_iHz_jj, Hz_iHz_ki, Hz_iHz_kj + integer(kind=4), intent(in) :: Hz_iHy_ii, Hz_iHy_ij, Hz_iHy_ji, Hz_iHy_jj, Hz_iHy_ki, Hz_iHy_kj + + if (.not. this%initialized) return + + this%pml_front_Ez_ii = Ez_iEz_ii + 1; this%pml_front_Ez_ij = Ez_iEz_ij + 1 + this%pml_front_Ez_ji = Ez_iEz_ji + 1; this%pml_front_Ez_jj = Ez_iEz_jj + 1 + this%pml_front_Ez_ki = Ez_iEz_ki + 1; this%pml_front_Ez_kj = Ez_iEz_kj + 1 + + this%pml_front_Ey_ii = Ez_iEy_ii + 1; this%pml_front_Ey_ij = Ez_iEy_ij + 1 + this%pml_front_Ey_ji = Ez_iEy_ji + 1; this%pml_front_Ey_jj = Ez_iEy_jj + 1 + this%pml_front_Ey_ki = Ez_iEy_ki + 1; this%pml_front_Ey_kj = Ez_iEy_kj + 1 + + this%pml_front_Hz_ii = Hz_iHz_ii + 1; this%pml_front_Hz_ij = Hz_iHz_ij + 1 + this%pml_front_Hz_ji = Hz_iHz_ji + 1; this%pml_front_Hz_jj = Hz_iHz_jj + 1 + this%pml_front_Hz_ki = Hz_iHz_ki + 1; this%pml_front_Hz_kj = Hz_iHz_kj + 1 + + this%pml_front_Hy_ii = Hz_iHy_ii + 1; this%pml_front_Hy_ij = Hz_iHy_ij + 1 + this%pml_front_Hy_ji = Hz_iHy_ji + 1; this%pml_front_Hy_jj = Hz_iHy_jj + 1 + this%pml_front_Hy_ki = Hz_iHy_ki + 1; this%pml_front_Hy_kj = Hz_iHy_kj + 1 + + this%pml_front_initialized = .true. + + end subroutine gpu_init_pml_front + + !-------------------------------------------------------------------------------- + ! Upload host data to device - called only when fields are modified on host + !-------------------------------------------------------------------------------- + subroutine gpu_upload(this) + class(gpu_state_t), intent(inout) :: this + + if (.not. this%initialized) return + + this%Ex_d = this%Ex + this%Ey_d = this%Ey + this%Ez_d = this%Ez + this%Hx_d = this%Hx + this%Hy_d = this%Hy + this%Hz_d = this%Hz + this%fields_on_device = .true. + this%last_download_step = -1024 + + end subroutine gpu_upload + + !-------------------------------------------------------------------------------- + ! Update CPML left boundary coefficients on device - called every step + !-------------------------------------------------------------------------------- + subroutine gpu_update_pml_left_coeffs(this, P_be_y, P_ce_y, P_bm_y, P_cm_y) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(in) :: P_be_y, P_ce_y, P_bm_y, P_cm_y + + if (.not. this%pml_left_initialized) return + + this%pml_P_be_y_left = P_be_y + this%pml_P_ce_y_left = P_ce_y + this%pml_P_bm_y_left = P_bm_y + this%pml_P_cm_y_left = P_cm_y + + end subroutine gpu_update_pml_left_coeffs + + !-------------------------------------------------------------------------------- + ! Initialize MUR boundary coefficients on GPU - called after InitMURBorders + !-------------------------------------------------------------------------------- + subroutine gpu_init_mur_coeffs(this, numMedia, & + left_CAB1, left_CAB3, left_cab4, & + right_CAB1, right_CAB3, right_cab4, & + down_CAB1, down_CAB3, down_cab4, & + up_CAB1, up_CAB3, up_cab4, & + back_CAB1, back_CAB3, back_cab4, & + front_CAB1, front_CAB3, front_cab4) + class(gpu_state_t), intent(inout) :: this + integer(kind=4), intent(in) :: numMedia + real(kind=rkind), dimension(:), intent(in) :: left_CAB1, left_CAB3, left_cab4 + real(kind=rkind), dimension(:), intent(in) :: right_CAB1, right_CAB3, right_cab4 + real(kind=rkind), dimension(:), intent(in) :: down_CAB1, down_CAB3, down_cab4 + real(kind=rkind), dimension(:), intent(in) :: up_CAB1, up_CAB3, up_cab4 + real(kind=rkind), dimension(:), intent(in) :: back_CAB1, back_CAB3, back_cab4 + real(kind=rkind), dimension(:), intent(in) :: front_CAB1, front_CAB3, front_cab4 + + integer(kind=4) :: lo, hi + + if (.not. this%initialized) return + + this%mur_numMedia = numMedia + + lo = lbound(left_CAB1, 1); hi = ubound(left_CAB1, 1) + allocate(this%mur_left_CAB1(lo:hi)); this%mur_left_CAB1 = left_CAB1 + allocate(this%mur_left_CAB3(lo:hi)); this%mur_left_CAB3 = left_CAB3 + allocate(this%mur_left_cab4(lo:hi)); this%mur_left_cab4 = left_cab4 + allocate(this%mur_right_CAB1(lo:hi)); this%mur_right_CAB1 = right_CAB1 + allocate(this%mur_right_CAB3(lo:hi)); this%mur_right_CAB3 = right_CAB3 + allocate(this%mur_right_cab4(lo:hi)); this%mur_right_cab4 = right_cab4 + allocate(this%mur_down_CAB1(lo:hi)); this%mur_down_CAB1 = down_CAB1 + allocate(this%mur_down_CAB3(lo:hi)); this%mur_down_CAB3 = down_CAB3 + allocate(this%mur_down_cab4(lo:hi)); this%mur_down_cab4 = down_cab4 + allocate(this%mur_up_CAB1(lo:hi)); this%mur_up_CAB1 = up_CAB1 + allocate(this%mur_up_CAB3(lo:hi)); this%mur_up_CAB3 = up_CAB3 + allocate(this%mur_up_cab4(lo:hi)); this%mur_up_cab4 = up_cab4 + allocate(this%mur_back_CAB1(lo:hi)); this%mur_back_CAB1 = back_CAB1 + allocate(this%mur_back_CAB3(lo:hi)); this%mur_back_CAB3 = back_CAB3 + allocate(this%mur_back_cab4(lo:hi)); this%mur_back_cab4 = back_cab4 + allocate(this%mur_front_CAB1(lo:hi)); this%mur_front_CAB1 = front_CAB1 + allocate(this%mur_front_CAB3(lo:hi)); this%mur_front_CAB3 = front_CAB3 + allocate(this%mur_front_cab4(lo:hi)); this%mur_front_cab4 = front_cab4 + + ! Flattened CAB1 array: mur_cab1_d(12 * numMedia) + ! Layout: cab1_d((bound_id-1)*numMedia + medio) for contiguous access + if (associated(this%mur_cab1_d)) deallocate(this%mur_cab1_d) + allocate(this%mur_cab1_d(12 * numMedia)) + this%mur_cab1_d(1:numMedia) = left_CAB1 + this%mur_cab1_d(numMedia+1:2*numMedia) = left_CAB1 + this%mur_cab1_d(2*numMedia+1:3*numMedia) = right_CAB1 + this%mur_cab1_d(3*numMedia+1:4*numMedia) = right_CAB1 + this%mur_cab1_d(4*numMedia+1:5*numMedia) = down_CAB1 + this%mur_cab1_d(5*numMedia+1:6*numMedia) = down_CAB1 + this%mur_cab1_d(6*numMedia+1:7*numMedia) = up_CAB1 + this%mur_cab1_d(7*numMedia+1:8*numMedia) = up_CAB1 + this%mur_cab1_d(8*numMedia+1:9*numMedia) = back_CAB1 + this%mur_cab1_d(9*numMedia+1:10*numMedia) = back_CAB1 + this%mur_cab1_d(10*numMedia+1:11*numMedia) = front_CAB1 + this%mur_cab1_d(11*numMedia+1:12*numMedia) = front_CAB1 + + this%mur_initialized = .true. + + end subroutine gpu_init_mur_coeffs + + !-------------------------------------------------------------------------------- + ! Update MUR coefficients on device - called every step + !-------------------------------------------------------------------------------- + subroutine gpu_update_mur_coeffs(this, & + left_CAB1, left_CAB3, left_cab4, & + right_CAB1, right_CAB3, right_cab4, & + down_CAB1, down_CAB3, down_cab4, & + up_CAB1, up_CAB3, up_cab4, & + back_CAB1, back_CAB3, back_cab4, & + front_CAB1, front_CAB3, front_cab4) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(in) :: left_CAB1, left_CAB3, left_cab4 + real(kind=rkind), dimension(:), intent(in) :: right_CAB1, right_CAB3, right_cab4 + real(kind=rkind), dimension(:), intent(in) :: down_CAB1, down_CAB3, down_cab4 + real(kind=rkind), dimension(:), intent(in) :: up_CAB1, up_CAB3, up_cab4 + real(kind=rkind), dimension(:), intent(in) :: back_CAB1, back_CAB3, back_cab4 + real(kind=rkind), dimension(:), intent(in) :: front_CAB1, front_CAB3, front_cab4 + + if (.not. this%mur_initialized) return + + this%mur_left_CAB1 = left_CAB1; this%mur_left_CAB3 = left_CAB3; this%mur_left_cab4 = left_cab4 + this%mur_right_CAB1 = right_CAB1; this%mur_right_CAB3 = right_CAB3; this%mur_right_cab4 = right_cab4 + this%mur_down_CAB1 = down_CAB1; this%mur_down_CAB3 = down_CAB3; this%mur_down_cab4 = down_cab4 + this%mur_up_CAB1 = up_CAB1; this%mur_up_CAB3 = up_CAB3; this%mur_up_cab4 = up_cab4 + this%mur_back_CAB1 = back_CAB1; this%mur_back_CAB3 = back_CAB3; this%mur_back_cab4 = back_cab4 + this%mur_front_CAB1 = front_CAB1; this%mur_front_CAB3 = front_CAB3; this%mur_front_cab4 = front_cab4 + + end subroutine gpu_update_mur_coeffs + + !-------------------------------------------------------------------------------- + ! Initialize MUR past-field arrays on GPU - called after InitMURBorders + !-------------------------------------------------------------------------------- + subroutine gpu_init_mur_past_fields(this, left_Hx_nx, left_Hx_ny, left_Hx_nz, left_Hz_nx, left_Hz_ny, left_Hz_nz, right_Hx_nx, right_Hx_ny, right_Hx_nz, right_Hz_nx, right_Hz_ny, right_Hz_nz, down_Hy_nx, down_Hy_ny, down_Hy_nz, down_Hx_nx, down_Hx_ny, down_Hx_nz, up_Hy_nx, up_Hy_ny, up_Hy_nz, up_Hx_nx, up_Hx_ny, up_Hx_nz, back_Hz_nx, back_Hz_ny, back_Hz_nz, back_Hy_nx, back_Hy_ny, back_Hy_nz, front_Hz_nx, front_Hz_ny, front_Hz_nz, front_Hy_nx, front_Hy_ny, front_Hy_nz, left_Hx, left_Hz, right_Hx, right_Hz, down_Hy, down_Hx, up_Hy, up_Hx, back_Hz, back_Hy, front_Hz, front_Hy) + class(gpu_state_t), intent(inout) :: this + integer(kind=4), intent(in) :: left_Hx_nx, left_Hx_ny, left_Hx_nz + integer(kind=4), intent(in) :: left_Hz_nx, left_Hz_ny, left_Hz_nz + integer(kind=4), intent(in) :: right_Hx_nx, right_Hx_ny, right_Hx_nz + integer(kind=4), intent(in) :: right_Hz_nx, right_Hz_ny, right_Hz_nz + integer(kind=4), intent(in) :: down_Hy_nx, down_Hy_ny, down_Hy_nz + integer(kind=4), intent(in) :: down_Hx_nx, down_Hx_ny, down_Hx_nz + integer(kind=4), intent(in) :: up_Hy_nx, up_Hy_ny, up_Hy_nz + integer(kind=4), intent(in) :: up_Hx_nx, up_Hx_ny, up_Hx_nz + integer(kind=4), intent(in) :: back_Hz_nx, back_Hz_ny, back_Hz_nz + integer(kind=4), intent(in) :: back_Hy_nx, back_Hy_ny, back_Hy_nz + integer(kind=4), intent(in) :: front_Hz_nx, front_Hz_ny, front_Hz_nz + integer(kind=4), intent(in) :: front_Hy_nx, front_Hy_ny, front_Hy_nz + real(kind=rkind), dimension(:,:,:), intent(in) :: left_Hx, left_Hz, right_Hx, right_Hz + real(kind=rkind), dimension(:,:,:), intent(in) :: down_Hy, down_Hx, up_Hy, up_Hx + real(kind=rkind), dimension(:,:,:), intent(in) :: back_Hz, back_Hy, front_Hz, front_Hy + + if (.not. this%initialized) return + + allocate(this%mur_past_Hx_left(lbound(left_Hx,1):ubound(left_Hx,1), lbound(left_Hx,2):ubound(left_Hx,2), lbound(left_Hx,3):ubound(left_Hx,3))) + allocate(this%mur_past_Hz_left(lbound(left_Hz,1):ubound(left_Hz,1), lbound(left_Hz,2):ubound(left_Hz,2), lbound(left_Hz,3):ubound(left_Hz,3))) + allocate(this%mur_past_Hx_right(lbound(right_Hx,1):ubound(right_Hx,1), lbound(right_Hx,2):ubound(right_Hx,2), lbound(right_Hx,3):ubound(right_Hx,3))) + allocate(this%mur_past_Hz_right(lbound(right_Hz,1):ubound(right_Hz,1), lbound(right_Hz,2):ubound(right_Hz,2), lbound(right_Hz,3):ubound(right_Hz,3))) + allocate(this%mur_past_Hy_down(lbound(down_Hy,1):ubound(down_Hy,1), lbound(down_Hy,2):ubound(down_Hy,2), lbound(down_Hy,3):ubound(down_Hy,3))) + allocate(this%mur_past_Hx_down(lbound(down_Hx,1):ubound(down_Hx,1), lbound(down_Hx,2):ubound(down_Hx,2), lbound(down_Hx,3):ubound(down_Hx,3))) + allocate(this%mur_past_Hy_up(lbound(up_Hy,1):ubound(up_Hy,1), lbound(up_Hy,2):ubound(up_Hy,2), lbound(up_Hy,3):ubound(up_Hy,3))) + allocate(this%mur_past_Hx_up(lbound(up_Hx,1):ubound(up_Hx,1), lbound(up_Hx,2):ubound(up_Hx,2), lbound(up_Hx,3):ubound(up_Hx,3))) + allocate(this%mur_past_Hz_back(lbound(back_Hz,1):ubound(back_Hz,1), lbound(back_Hz,2):ubound(back_Hz,2), lbound(back_Hz,3):ubound(back_Hz,3))) + allocate(this%mur_past_Hy_back(lbound(back_Hy,1):ubound(back_Hy,1), lbound(back_Hy,2):ubound(back_Hy,2), lbound(back_Hy,3):ubound(back_Hy,3))) + allocate(this%mur_past_Hz_front(lbound(front_Hz,1):ubound(front_Hz,1), lbound(front_Hz,2):ubound(front_Hz,2), lbound(front_Hz,3):ubound(front_Hz,3))) + allocate(this%mur_past_Hy_front(lbound(front_Hy,1):ubound(front_Hy,1), lbound(front_Hy,2):ubound(front_Hy,2), lbound(front_Hy,3):ubound(front_Hy,3))) + + this%mur_past_Hx_left = left_Hx; this%mur_past_Hz_left = left_Hz + this%mur_past_Hx_right = right_Hx; this%mur_past_Hz_right = right_Hz + this%mur_past_Hy_down = down_Hy; this%mur_past_Hx_down = down_Hx + this%mur_past_Hy_up = up_Hy; this%mur_past_Hx_up = up_Hx + this%mur_past_Hz_back = back_Hz; this%mur_past_Hy_back = back_Hy + this%mur_past_Hz_front = front_Hz; this%mur_past_Hy_front = front_Hy + + this%mur_initialized = .true. + + end subroutine gpu_init_mur_past_fields + + !-------------------------------------------------------------------------------- + ! Upload MUR past fields to device - called every step after CPU MUR update + !-------------------------------------------------------------------------------- + subroutine gpu_upload_mur_past_fields(this, left_Hx, left_Hz, right_Hx, right_Hz, down_Hy, down_Hx, up_Hy, up_Hx, back_Hz, back_Hy, front_Hz, front_Hy) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:,:,:), intent(in) :: left_Hx, left_Hz, right_Hx, right_Hz, down_Hy, down_Hx, up_Hy, up_Hx, back_Hz, back_Hy, front_Hz, front_Hy + + if (.not. this%mur_initialized) return + + this%mur_past_Hx_left = left_Hx; this%mur_past_Hz_left = left_Hz + this%mur_past_Hx_right = right_Hx; this%mur_past_Hz_right = right_Hz + this%mur_past_Hy_down = down_Hy; this%mur_past_Hx_down = down_Hx + this%mur_past_Hy_up = up_Hy; this%mur_past_Hx_up = up_Hx + this%mur_past_Hz_back = back_Hz; this%mur_past_Hy_back = back_Hy + this%mur_past_Hz_front = front_Hz; this%mur_past_Hy_front = front_Hy + + end subroutine gpu_upload_mur_past_fields + + !-------------------------------------------------------------------------------- + ! Initialize MUR boundary limits on GPU + !-------------------------------------------------------------------------------- + subroutine gpu_init_mur_limits(this, & + left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj, & + left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj, & + right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj, & + right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj, & + down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj, & + down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj, & + up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj, & + up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj, & + back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj, & + back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj, & + front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj, & + front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj) + class(gpu_state_t), intent(inout) :: this + integer(kind=4), intent(in) :: left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj + integer(kind=4), intent(in) :: left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj + integer(kind=4), intent(in) :: right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj + integer(kind=4), intent(in) :: right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj + integer(kind=4), intent(in) :: down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj + integer(kind=4), intent(in) :: down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj + integer(kind=4), intent(in) :: up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj + integer(kind=4), intent(in) :: up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj + integer(kind=4), intent(in) :: back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj + integer(kind=4), intent(in) :: back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj + integer(kind=4), intent(in) :: front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj + integer(kind=4), intent(in) :: front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj + + ! Host-side bounds array: 12 boundaries x 6 limits (ii,ij,ji,jj,ki,kj) + integer(kind=4) :: host_bounds(12, 6) + integer(kind=4) :: i, j + + if (.not. this%initialized) return + + ! Populate host array with boundary limits + ! Boundary ID mapping: 1=left-Hx, 2=left-Hz, 3=right-Hx, 4=right-Hz, + ! 5=down-Hy, 6=down-Hx, 7=up-Hy, 8=up-Hx, + ! 9=back-Hz, 10=back-Hy, 11=front-Hz, 12=front-Hy + host_bounds(1, :) = [left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj] + host_bounds(2, :) = [left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj] + host_bounds(3, :) = [right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj] + host_bounds(4, :) = [right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj] + host_bounds(5, :) = [down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj] + host_bounds(6, :) = [down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj] + host_bounds(7, :) = [up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj] + host_bounds(8, :) = [up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj] + host_bounds(9, :) = [back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj] + host_bounds(10, :) = [back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj] + host_bounds(11, :) = [front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj] + host_bounds(12, :) = [front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj] + + ! Allocate and copy to device + if (associated(this%mur_bounds_d)) deallocate(this%mur_bounds_d) + allocate(this%mur_bounds_d(12, 6)) + this%mur_bounds_d(1:12, 1:6) = host_bounds(1:12, 1:6) + + ! Also populate individual members for backward compatibility with existing kernel wrappers + this%mur_left_Hx_ii = left_Hx_ii; this%mur_left_Hx_ij = left_Hx_ij + this%mur_left_Hx_ji = left_Hx_ji; this%mur_left_Hx_jj = left_Hx_jj + this%mur_left_Hx_ki = left_Hx_ki; this%mur_left_Hx_kj = left_Hx_kj + this%mur_left_Hz_ii = left_Hz_ii; this%mur_left_Hz_ij = left_Hz_ij + this%mur_left_Hz_ji = left_Hz_ji; this%mur_left_Hz_jj = left_Hz_jj + this%mur_left_Hz_ki = left_Hz_ki; this%mur_left_Hz_kj = left_Hz_kj + this%mur_right_Hx_ii = right_Hx_ii; this%mur_right_Hx_ij = right_Hx_ij + this%mur_right_Hx_ji = right_Hx_ji; this%mur_right_Hx_jj = right_Hx_jj + this%mur_right_Hx_ki = right_Hx_ki; this%mur_right_Hx_kj = right_Hx_kj + this%mur_right_Hz_ii = right_Hz_ii; this%mur_right_Hz_ij = right_Hz_ij + this%mur_right_Hz_ji = right_Hz_ji; this%mur_right_Hz_jj = right_Hz_jj + this%mur_right_Hz_ki = right_Hz_ki; this%mur_right_Hz_kj = right_Hz_kj + this%mur_down_Hy_ii = down_Hy_ii; this%mur_down_Hy_ij = down_Hy_ij + this%mur_down_Hy_ji = down_Hy_ji; this%mur_down_Hy_jj = down_Hy_jj + this%mur_down_Hy_ki = down_Hy_ki; this%mur_down_Hy_kj = down_Hy_kj + this%mur_down_Hx_ii = down_Hx_ii; this%mur_down_Hx_ij = down_Hx_ij + this%mur_down_Hx_ji = down_Hx_ji; this%mur_down_Hx_jj = down_Hx_jj + this%mur_down_Hx_ki = down_Hx_ki; this%mur_down_Hx_kj = down_Hx_kj + this%mur_up_Hy_ii = up_Hy_ii; this%mur_up_Hy_ij = up_Hy_ij + this%mur_up_Hy_ji = up_Hy_ji; this%mur_up_Hy_jj = up_Hy_jj + this%mur_up_Hy_ki = up_Hy_ki; this%mur_up_Hy_kj = up_Hy_kj + this%mur_up_Hx_ii = up_Hx_ii; this%mur_up_Hx_ij = up_Hx_ij + this%mur_up_Hx_ji = up_Hx_ji; this%mur_up_Hx_jj = up_Hx_jj + this%mur_up_Hx_ki = up_Hx_ki; this%mur_up_Hx_kj = up_Hx_kj + this%mur_back_Hz_ii = back_Hz_ii; this%mur_back_Hz_ij = back_Hz_ij + this%mur_back_Hz_ji = back_Hz_ji; this%mur_back_Hz_jj = back_Hz_jj + this%mur_back_Hz_ki = back_Hz_ki; this%mur_back_Hz_kj = back_Hz_kj + this%mur_back_Hy_ii = back_Hy_ii; this%mur_back_Hy_ij = back_Hy_ij + this%mur_back_Hy_ji = back_Hy_ji; this%mur_back_Hy_jj = back_Hy_jj + this%mur_back_Hy_ki = back_Hy_ki; this%mur_back_Hy_kj = back_Hy_kj + this%mur_front_Hz_ii = front_Hz_ii; this%mur_front_Hz_ij = front_Hz_ij + this%mur_front_Hz_ji = front_Hz_ji; this%mur_front_Hz_jj = front_Hz_jj + this%mur_front_Hz_ki = front_Hz_ki; this%mur_front_Hz_kj = front_Hz_kj + this%mur_front_Hy_ii = front_Hy_ii; this%mur_front_Hy_ij = front_Hy_ij + this%mur_front_Hy_ji = front_Hy_ji; this%mur_front_Hy_jj = front_Hy_jj + this%mur_front_Hy_ki = front_Hy_ki; this%mur_front_Hy_kj = front_Hy_kj + + end subroutine gpu_init_mur_limits + + end module gpu_core_m + +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +! GPU CORE PROBE EXTENSION - Probe-aware selective download +! Downloads only the cells that observation probes reference, +! eliminating per-timestep full-field downloads. +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +module gpu_core_probe_m + + use FDETYPES_m + use Report_m + use cudafor + use gpu_core_m + + implicit none + +contains + + !-------------------------------------------------------------------------------- + ! Download only probe-relevant cells from device to host + ! Much smaller than downloading all 6 fields + !-------------------------------------------------------------------------------- + subroutine gpu_download_probes(this, sgg, Ex, Ey, Ez, Hx, Hy, Hz) + class(gpu_state_t), intent(inout) :: this + type(SGGFDTDINFO_t), intent(in) :: sgg + real(kind=rkind), dimension(:,:,:), pointer, intent(inout) :: Ex, Ey, Ez, Hx, Hy, Hz + + integer(kind=4) :: ii, i, field + integer(kind=4) :: I1, J1, K1, I2, J2, K2 + integer(kind=4) :: iii, jjj, kkk + integer(kind=4) :: pointObservationCases(6), blockCurrentObservationCases(6) + integer(kind=4) :: i1_m, i2_m, j1_m, j2_m, k1_m, k2_m + logical :: is_point, is_block + + pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz] + blockCurrentObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz] + + if (.not. this%initialized) return + if (.not. this%fields_on_device) return + + ! Check if any probe has invalid bounds (0,0,0) — indicates element-based probe + ! Element-based probes need full field download since cell locations + ! are determined by element geometry, not by simple cell ranges + do ii = 1, sgg%NumberRequest + do i = 1, sgg%Observation(ii)%nP + if (sgg%Observation(ii)%P(i)%XI == 0 .and. & + sgg%Observation(ii)%P(i)%YI == 0 .and. & + sgg%Observation(ii)%P(i)%ZI == 0) then + ! Element-based probe — fall back to full download + this%Ex = this%Ex_d + this%Ey = this%Ey_d + this%Ez = this%Ez_d + this%Hx = this%Hx_d + this%Hy = this%Hy_d + this%Hz = this%Hz_d + this%fields_on_device = .false. + return + end if + end do + end do + + ! Point probes: download individual cells + do ii = 1, sgg%NumberRequest + do i = 1, sgg%Observation(ii)%nP + field = sgg%Observation(ii)%P(i)%what + if (field == nothing) cycle + + I1 = sgg%Observation(ii)%P(i)%XI + J1 = sgg%Observation(ii)%P(i)%YI + K1 = sgg%Observation(ii)%P(i)%ZI + + is_point = any(field == pointObservationCases) + is_block = any(field == blockCurrentObservationCases) + + if (is_point) then + ! Download single cell from each field + Ex(I1,J1,K1) = this%Ex_d(I1,J1,K1) + Ey(I1,J1,K1) = this%Ey_d(I1,J1,K1) + Ez(I1,J1,K1) = this%Ez_d(I1,J1,K1) + Hx(I1,J1,K1) = this%Hx_d(I1,J1,K1) + Hy(I1,J1,K1) = this%Hy_d(I1,J1,K1) + Hz(I1,J1,K1) = this%Hz_d(I1,J1,K1) + else if (is_block) then + ! Block probes: download the surface/volume region + ! Only download if bounds are valid (element-based probes may have 0 bounds) + i1_m = I1; i2_m = sgg%Observation(ii)%P(i)%XE + j1_m = J1; j2_m = sgg%Observation(ii)%P(i)%YE + k1_m = K1; k2_m = sgg%Observation(ii)%P(i)%ZE + ! Skip if bounds are invalid (element-based probes use elementIds, not cell ranges) + if (i1_m > 0 .and. i2_m > 0 .and. j1_m > 0 .and. j2_m > 0 .and. k1_m > 0 .and. k2_m > 0) then + do kkk = k1_m, k2_m + do jjj = j1_m, j2_m + do iii = i1_m, i2_m + Ex(iii,jjj,kkk) = this%Ex_d(iii,jjj,kkk) + Ey(iii,jjj,kkk) = this%Ey_d(iii,jjj,kkk) + Ez(iii,jjj,kkk) = this%Ez_d(iii,jjj,kkk) + Hx(iii,jjj,kkk) = this%Hx_d(iii,jjj,kkk) + Hy(iii,jjj,kkk) = this%Hy_d(iii,jjj,kkk) + Hz(iii,jjj,kkk) = this%Hz_d(iii,jjj,kkk) + end do + end do + end do + end if + end if + end do + end do + +end subroutine gpu_download_probes + + !-------------------------------------------------------------------------------- + ! GPU kernel: sample point probes (single cell) + !-------------------------------------------------------------------------------- + attributes(global) subroutine gpu_sample_point_probes_kernel(probe_results_d, probe_field_ids_d, probe_I_d, probe_J_d, probe_K_d, & + Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d, numProbes) + real(kind=rkind), intent(out), dimension(:), device :: probe_results_d + integer(kind=4), intent(in), dimension(:), device :: probe_field_ids_d, probe_I_d, probe_J_d, probe_K_d + real(kind=rkind), dimension(:,:,:), device :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d + integer, value :: numProbes + integer :: idx, fieldId, i, j, k + real(kind=rkind) :: val + + idx = (blockidx%x - 1) * blockdim%x + threadidx%x + if (idx > numProbes) return + + fieldId = probe_field_ids_d(idx) + i = probe_I_d(idx) + j = probe_J_d(idx) + k = probe_K_d(idx) + + select case(fieldId) + case(iEx); val = Ex_d(i,j,k) + case(iEy); val = Ey_d(i,j,k) + case(iEz); val = Ez_d(i,j,k) + case(iHx); val = Hx_d(i,j,k) + case(iHy); val = Hy_d(i,j,k) + case(iHz); val = Hz_d(i,j,k) + case default; val = 0.0_rkind + end select + + probe_results_d(idx) = val + + end subroutine gpu_sample_point_probes_kernel + + !-------------------------------------------------------------------------------- + ! GPU kernel: sample block probes (summation over block) + ! Each thread handles one block probe + !-------------------------------------------------------------------------------- + attributes(global) subroutine gpu_sample_block_probes_kernel(probe_results_d, probe_field_ids_d, & + probe_I1_d, probe_J1_d, probe_K1_d, & + probe_I2_d, probe_J2_d, probe_K2_d, & + Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d, & + numBlockProbes) + real(kind=rkind), intent(out), dimension(:), device :: probe_results_d + integer(kind=4), intent(in), dimension(:), device :: probe_field_ids_d + integer(kind=4), intent(in), dimension(:), device :: probe_I1_d, probe_J1_d, probe_K1_d + integer(kind=4), intent(in), dimension(:), device :: probe_I2_d, probe_J2_d, probe_K2_d + real(kind=rkind), dimension(:,:,:), device :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d + integer, value :: numBlockProbes + integer :: idx, fieldId, ii, jj, kk, i1, i2, j1, j2, k1, k2 + real(kind=rkind) :: val + + idx = (blockidx%x - 1) * blockdim%x + threadidx%x + if (idx > numBlockProbes) return + + fieldId = probe_field_ids_d(idx) + i1 = probe_I1_d(idx) + i2 = probe_I2_d(idx) + j1 = probe_J1_d(idx) + j2 = probe_J2_d(idx) + k1 = probe_K1_d(idx) + k2 = probe_K2_d(idx) + + val = 0.0_rkind + + select case(fieldId) + case(iBloqueJx) + do jj = j1, j2 + val = val + (Hy_d(i1, jj, k1 - 1) - Hy_d(i1, jj, k2)) + end do + do kk = k1, k2 + val = val + (-Hz_d(i1, j1 - 1, kk) + Hz_d(i1, j2, kk)) + end do + case(iBloqueJy) + do kk = k1, k2 + val = val + (-Hz_d(i2, j1, kk) + Hz_d(i1 - 1, j1, kk)) + end do + do ii = i1, i2 + val = val + (Hx_d(ii, j1, k2) - Hx_d(ii, j1, k1 - 1)) + end do + case(iBloqueJz) + do ii = i1, i2 + val = val + (Hy_d(ii, j2, k1) - Hy_d(ii, j1, k1)) + end do + do jj = j1, j2 + val = val + (-Hx_d(i1 - 1, jj, k1) + Hx_d(i2, jj, k1)) + end do + case(iBloqueMx) + do jj = j1, j2 + val = val + (Hz_d(i1, jj, k1 - 1) - Hz_d(i1, jj, k2)) + end do + do kk = k1, k2 + val = val + (-Hy_d(i1, j1 - 1, kk) + Hy_d(i1, j2, kk)) + end do + case(iBloqueMy) + do kk = k1, k2 + val = val + (-Hx_d(i2, j1, kk) + Hx_d(i1 - 1, j1, kk)) + end do + do ii = i1, i2 + val = val + (Hz_d(ii, j1, k2) - Hz_d(ii, j1, k1 - 1)) + end do + case(iBloqueMz) + do ii = i1, i2 + val = val + (Hy_d(ii, j2, k1) - Hy_d(ii, j1, k1)) + end do + do jj = j1, j2 + val = val + (-Hx_d(i1 - 1, jj, k1) + Hx_d(i2, jj, k1)) + end do + case default + val = 0.0_rkind + end select + + probe_results_d(idx) = val + + end subroutine gpu_sample_block_probes_kernel + + !-------------------------------------------------------------------------------- + ! Fused probe sampling kernel — point + block probes in single launch + !-------------------------------------------------------------------------------- + attributes(global) subroutine gpu_sample_all_probes_kernel(probe_results_d, block_probe_results_d, & + probe_field_ids_d, probe_I_d, probe_J_d, probe_K_d, & + block_probe_field_ids_d, & + block_probe_I1_d, block_probe_J1_d, block_probe_K1_d, & + block_probe_I2_d, block_probe_J2_d, block_probe_K2_d, & + Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d, & + pointCount, blockCount) + real(kind=rkind), intent(out), dimension(:), device :: probe_results_d, block_probe_results_d + integer(kind=4), intent(in), dimension(:), device :: probe_field_ids_d, probe_I_d, probe_J_d, probe_K_d + integer(kind=4), intent(in), dimension(:), device :: block_probe_field_ids_d + integer(kind=4), intent(in), dimension(:), device :: block_probe_I1_d, block_probe_J1_d, block_probe_K1_d + integer(kind=4), intent(in), dimension(:), device :: block_probe_I2_d, block_probe_J2_d, block_probe_K2_d + real(kind=rkind), dimension(:,:,:), device :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d + integer, value :: pointCount, blockCount + integer :: idx, fieldId, i, j, k, ii, jj, kk, i1, i2, j1, j2, k1, k2 + real(kind=rkind) :: val + + idx = (blockidx%x - 1) * blockdim%x + threadidx%x + + ! Point probes (first pointCount threads) + if (idx <= pointCount) then + fieldId = probe_field_ids_d(idx) + i = probe_I_d(idx) + j = probe_J_d(idx) + k = probe_K_d(idx) + select case(fieldId) + case(iEx); val = Ex_d(i,j,k) + case(iEy); val = Ey_d(i,j,k) + case(iEz); val = Ez_d(i,j,k) + case(iHx); val = Hx_d(i,j,k) + case(iHy); val = Hy_d(i,j,k) + case(iHz); val = Hz_d(i,j,k) + case default; val = 0.0_rkind + end select + probe_results_d(idx) = val + return + end if + + ! Block probes (remaining threads) + idx = idx - pointCount + if (idx > blockCount) return + + fieldId = block_probe_field_ids_d(idx) + i1 = block_probe_I1_d(idx) + i2 = block_probe_I2_d(idx) + j1 = block_probe_J1_d(idx) + j2 = block_probe_J2_d(idx) + k1 = block_probe_K1_d(idx) + k2 = block_probe_K2_d(idx) + + val = 0.0_rkind + + select case(fieldId) + case(iBloqueJx) + do jj = j1, j2 + val = val + (Hy_d(i1, jj, k1 - 1) - Hy_d(i1, jj, k2)) + end do + do kk = k1, k2 + val = val + (-Hz_d(i1, j1 - 1, kk) + Hz_d(i1, j2, kk)) + end do + case(iBloqueJy) + do kk = k1, k2 + val = val + (-Hz_d(i2, j1, kk) + Hz_d(i1 - 1, j1, kk)) + end do + do ii = i1, i2 + val = val + (Hx_d(ii, j1, k2) - Hx_d(ii, j1, k1 - 1)) + end do + case(iBloqueJz) + do ii = i1, i2 + val = val + (Hy_d(ii, j2, k1) - Hy_d(ii, j1, k1)) + end do + do jj = j1, j2 + val = val + (-Hx_d(i1 - 1, jj, k1) + Hx_d(i2, jj, k1)) + end do + case(iBloqueMx) + do jj = j1, j2 + val = val + (Hz_d(i1, jj, k1 - 1) - Hz_d(i1, jj, k2)) + end do + do kk = k1, k2 + val = val + (-Hy_d(i1, j1 - 1, kk) + Hy_d(i1, j2, kk)) + end do + case(iBloqueMy) + do kk = k1, k2 + val = val + (-Hx_d(i2, j1, kk) + Hx_d(i1 - 1, j1, kk)) + end do + do ii = i1, i2 + val = val + (Hz_d(ii, j1, k2) - Hz_d(ii, j1, k1 - 1)) + end do + case(iBloqueMz) + do ii = i1, i2 + val = val + (Hy_d(ii, j2, k1) - Hy_d(ii, j1, k1)) + end do + do jj = j1, j2 + val = val + (-Hx_d(i1 - 1, jj, k1) + Hx_d(i2, jj, k1)) + end do + case default + val = 0.0_rkind + end select + + block_probe_results_d(idx) = val + + end subroutine gpu_sample_all_probes_kernel + + !-------------------------------------------------------------------------------- + ! Initialize probe buffers on device + !-------------------------------------------------------------------------------- + subroutine gpu_init_probe_buffers(this, sgg) + class(gpu_state_t), intent(inout) :: this + type(SGGFDTDINFO_t), intent(in) :: sgg + + integer(kind=4) :: ii, i, nProbes, pointCount, blockCount + integer(kind=4) :: pointObservationCases(6) + integer(kind=4) :: blockObservationCases(6) + integer(kind=4) :: cuda_status + + pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz] + blockObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz] + pointCount = 0 + blockCount = 0 + + ! Count probes by type + do ii = 1, sgg%NumberRequest + if (.not. sgg%Observation(ii)%TimeDomain) cycle + nProbes = sgg%Observation(ii)%nP + do i = 1, nProbes + if (sgg%Observation(ii)%P(i)%what == nothing) cycle + if (any(sgg%Observation(ii)%P(i)%what == pointObservationCases)) then + pointCount = pointCount + 1 + else if (any(sgg%Observation(ii)%P(i)%what == blockObservationCases)) then + blockCount = blockCount + 1 + end if + end do + end do + + if (pointCount == 0 .and. blockCount == 0) return + + this%num_probe_results = pointCount + this%num_block_probe_results = blockCount + + ! Allocate device buffers for point probes + if (pointCount > 0) then + allocate(this%probe_field_ids_d(pointCount)) + allocate(this%probe_I_d(pointCount)) + allocate(this%probe_J_d(pointCount)) + allocate(this%probe_K_d(pointCount)) + allocate(this%probe_results_d(pointCount)) + cuda_status = cudaMemcpy(this%probe_results_d, 0.0_rkind, pointCount * 4, cudaMemcpyHostToDevice) + end if + + ! Allocate device buffers for block probes + if (blockCount > 0) then + allocate(this%block_probe_field_ids_d(blockCount)) + allocate(this%block_probe_I1_d(blockCount)) + allocate(this%block_probe_J1_d(blockCount)) + allocate(this%block_probe_K1_d(blockCount)) + allocate(this%block_probe_I2_d(blockCount)) + allocate(this%block_probe_J2_d(blockCount)) + allocate(this%block_probe_K2_d(blockCount)) + allocate(this%block_probe_results_d(blockCount)) + cuda_status = cudaMemcpy(this%block_probe_results_d, 0.0_rkind, blockCount * 4, cudaMemcpyHostToDevice) + end if + + ! Populate point probe metadata + if (pointCount > 0) then + call gpu_populate_point_probes(this, sgg) + end if + + ! Populate block probe metadata + if (blockCount > 0) then + call gpu_populate_block_probes(this, sgg) + end if + + this%probe_buffers_initialized = .true. + + end subroutine gpu_init_probe_buffers + + !-------------------------------------------------------------------------------- + ! Populate point probe metadata on device + !-------------------------------------------------------------------------------- + subroutine gpu_populate_point_probes(this, sgg) + class(gpu_state_t), intent(inout) :: this + type(SGGFDTDINFO_t), intent(in) :: sgg + + integer(kind=4) :: ii, i, idx, nProbes + integer(kind=4) :: pointObservationCases(6) + integer(kind=4), allocatable, dimension(:) :: tmp_field_ids, tmp_I, tmp_J, tmp_K + integer(kind=4) :: cuda_status + + pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz] + idx = 0 + allocate(tmp_field_ids(this%num_probe_results)) + allocate(tmp_I(this%num_probe_results)) + allocate(tmp_J(this%num_probe_results)) + allocate(tmp_K(this%num_probe_results)) + + do ii = 1, sgg%NumberRequest + if (.not. sgg%Observation(ii)%TimeDomain) cycle + nProbes = sgg%Observation(ii)%nP + do i = 1, nProbes + if (sgg%Observation(ii)%P(i)%what == nothing) cycle + if (any(sgg%Observation(ii)%P(i)%what == pointObservationCases)) then + idx = idx + 1 + tmp_field_ids(idx) = sgg%Observation(ii)%P(i)%what + tmp_I(idx) = sgg%Observation(ii)%P(i)%XI + tmp_J(idx) = sgg%Observation(ii)%P(i)%YI + tmp_K(idx) = sgg%Observation(ii)%P(i)%ZI + end if + end do + end do + + cuda_status = cudaMemcpy(this%probe_field_ids_d, tmp_field_ids, idx * 4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%probe_I_d, tmp_I, idx * 4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%probe_J_d, tmp_J, idx * 4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%probe_K_d, tmp_K, idx * 4, cudaMemcpyHostToDevice) + + deallocate(tmp_field_ids, tmp_I, tmp_J, tmp_K) + + end subroutine gpu_populate_point_probes + + !-------------------------------------------------------------------------------- + ! Populate block probe metadata on device + !-------------------------------------------------------------------------------- + subroutine gpu_populate_block_probes(this, sgg) + class(gpu_state_t), intent(inout) :: this + type(SGGFDTDINFO_t), intent(in) :: sgg + + integer(kind=4) :: ii, i, idx, nProbes + integer(kind=4) :: blockObservationCases(6) + integer(kind=4), allocatable, dimension(:) :: tmp_field_ids, tmp_I1, tmp_J1, tmp_K1 + integer(kind=4), allocatable, dimension(:) :: tmp_I2, tmp_J2, tmp_K2 + integer(kind=4) :: cuda_status + + blockObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz] + idx = 0 + allocate(tmp_field_ids(this%num_block_probe_results)) + allocate(tmp_I1(this%num_block_probe_results)) + allocate(tmp_J1(this%num_block_probe_results)) + allocate(tmp_K1(this%num_block_probe_results)) + allocate(tmp_I2(this%num_block_probe_results)) + allocate(tmp_J2(this%num_block_probe_results)) + allocate(tmp_K2(this%num_block_probe_results)) + + do ii = 1, sgg%NumberRequest + if (.not. sgg%Observation(ii)%TimeDomain) cycle + nProbes = sgg%Observation(ii)%nP + do i = 1, nProbes + if (sgg%Observation(ii)%P(i)%what == nothing) cycle + if (any(sgg%Observation(ii)%P(i)%what == blockObservationCases)) then + idx = idx + 1 + tmp_field_ids(idx) = sgg%Observation(ii)%P(i)%what + tmp_I1(idx) = sgg%Observation(ii)%P(i)%XI + tmp_J1(idx) = sgg%Observation(ii)%P(i)%YI + tmp_K1(idx) = sgg%Observation(ii)%P(i)%ZI + tmp_I2(idx) = sgg%Observation(ii)%P(i)%XE + tmp_J2(idx) = sgg%Observation(ii)%P(i)%YE + tmp_K2(idx) = sgg%Observation(ii)%P(i)%ZE + end if + end do + end do + + cuda_status = cudaMemcpy(this%block_probe_field_ids_d, tmp_field_ids, idx * 4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%block_probe_I1_d, tmp_I1, idx * 4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%block_probe_J1_d, tmp_J1, idx * 4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%block_probe_K1_d, tmp_K1, idx * 4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%block_probe_I2_d, tmp_I2, idx * 4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%block_probe_J2_d, tmp_J2, idx * 4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%block_probe_K2_d, tmp_K2, idx * 4, cudaMemcpyHostToDevice) + + deallocate(tmp_field_ids, tmp_I1, tmp_J1, tmp_K1, tmp_I2, tmp_J2, tmp_K2) + + end subroutine gpu_populate_block_probes + + !-------------------------------------------------------------------------------- + ! Sample point probes on GPU (public interface) + !-------------------------------------------------------------------------------- + subroutine gpu_sample_point_probes(this, results_h, nTime) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(out) :: results_h + integer(kind=4), intent(in) :: nTime + + integer(kind=4) :: pointCount, blockSize, gridSize + integer(kind=4) :: cuda_status + + pointCount = this%num_probe_results + if (pointCount == 0) return + + blockSize = 256 + gridSize = (pointCount + blockSize - 1) / blockSize + call gpu_sample_point_probes_kernel<<>>( & + this%probe_results_d, this%probe_field_ids_d, this%probe_I_d, this%probe_J_d, this%probe_K_d, & + this%Ex_d, this%Ey_d, this%Ez_d, this%Hx_d, this%Hy_d, this%Hz_d, pointCount) + cuda_status = cudaMemcpy(results_h, this%probe_results_d, pointCount * 4, cudaMemcpyDeviceToHost) + + end subroutine gpu_sample_point_probes + + !-------------------------------------------------------------------------------- + ! Sample block probes on GPU (public interface) + !-------------------------------------------------------------------------------- + subroutine gpu_sample_block_probes(this, results_h, nTime) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(out) :: results_h + integer(kind=4), intent(in) :: nTime + + integer(kind=4) :: blockCount, blockSize, gridSize + integer(kind=4) :: cuda_status + + blockCount = this%num_block_probe_results + if (blockCount == 0) return + + blockSize = 256 + gridSize = (blockCount + blockSize - 1) / blockSize + call gpu_sample_block_probes_kernel<<>>( & + this%block_probe_results_d, this%block_probe_field_ids_d, & + this%block_probe_I1_d, this%block_probe_J1_d, this%block_probe_K1_d, & + this%block_probe_I2_d, this%block_probe_J2_d, this%block_probe_K2_d, & + this%Ex_d, this%Ey_d, this%Ez_d, this%Hx_d, this%Hy_d, this%Hz_d, & + blockCount) + cuda_status = cudaMemcpy(results_h, this%block_probe_results_d, blockCount * 4, cudaMemcpyDeviceToHost) + + end subroutine gpu_sample_block_probes + + !-------------------------------------------------------------------------------- + ! Fused probe sampling — point + block in single kernel launch + !-------------------------------------------------------------------------------- + subroutine gpu_sample_all_probes(this, point_results_h, block_results_h, nTime) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(out) :: point_results_h + real(kind=rkind), dimension(:), intent(out) :: block_results_h + integer(kind=4), intent(in) :: nTime + + integer(kind=4) :: pointCount, blockCount, totalProbes, blockSize, gridSize + integer(kind=4) :: cuda_status + + pointCount = this%num_probe_results + blockCount = this%num_block_probe_results + totalProbes = pointCount + blockCount + + if (totalProbes == 0) return + + blockSize = 256 + gridSize = (totalProbes + blockSize - 1) / blockSize + + call gpu_sample_all_probes_kernel<<>>( & + this%probe_results_d, this%block_probe_results_d, & + this%probe_field_ids_d, this%probe_I_d, this%probe_J_d, this%probe_K_d, & + this%block_probe_field_ids_d, & + this%block_probe_I1_d, this%block_probe_J1_d, this%block_probe_K1_d, & + this%block_probe_I2_d, this%block_probe_J2_d, this%block_probe_K2_d, & + this%Ex_d, this%Ey_d, this%Ez_d, this%Hx_d, this%Hy_d, this%Hz_d, & + pointCount, blockCount) + + if (pointCount > 0) then + cuda_status = cudaMemcpy(point_results_h, this%probe_results_d, pointCount * 4, cudaMemcpyDeviceToHost) + endif + if (blockCount > 0) then + cuda_status = cudaMemcpy(block_results_h, this%block_probe_results_d, blockCount * 4, cudaMemcpyDeviceToHost) + endif + + end subroutine gpu_sample_all_probes + + !-------------------------------------------------------------------------------- + ! Destroy GPU probe buffers + !-------------------------------------------------------------------------------- + subroutine gpu_destroy_probe_buffers(this) + class(gpu_state_t), intent(inout) :: this + + if (this%probe_buffers_initialized) then + if (associated(this%probe_results_d)) deallocate(this%probe_results_d) + if (associated(this%probe_field_ids_d)) deallocate(this%probe_field_ids_d) + if (associated(this%probe_I_d)) deallocate(this%probe_I_d) + if (associated(this%probe_J_d)) deallocate(this%probe_J_d) + if (associated(this%probe_K_d)) deallocate(this%probe_K_d) + if (associated(this%block_probe_results_d)) deallocate(this%block_probe_results_d) + if (associated(this%block_probe_field_ids_d)) deallocate(this%block_probe_field_ids_d) + if (associated(this%block_probe_I1_d)) deallocate(this%block_probe_I1_d) + if (associated(this%block_probe_J1_d)) deallocate(this%block_probe_J1_d) + if (associated(this%block_probe_K1_d)) deallocate(this%block_probe_K1_d) + if (associated(this%block_probe_I2_d)) deallocate(this%block_probe_I2_d) + if (associated(this%block_probe_J2_d)) deallocate(this%block_probe_J2_d) + if (associated(this%block_probe_K2_d)) deallocate(this%block_probe_K2_d) + this%probe_buffers_initialized = .false. + end if + + end subroutine gpu_destroy_probe_buffers + + !-------------------------------------------------------------------------------- + ! Initialize NF2FF (near-to-far-field) buffers on GPU + !-------------------------------------------------------------------------------- + subroutine gpu_init_nf2ff_buffers(this, ExIz, ExDe, ExAb, ExAr, EyFr, EyTr, EyAb, EyAr, EzIz, EzDe, EzFr, EzTr, & + HxIz, HxDe, HxAb, HxAr, HyFr, HyTr, HyAb, HyAr, HzIz, HzDe, HzFr, HzTr, & + HxIz2, HxDe2, HxAb2, HxAr2, HyFr2, HyTr2, HyAb2, HyAr2, HzIz2, HzDe2, HzFr2, HzTr2, & + expIwdt, auxExp_E, auxExp_H, & + phys_x_Mx, phys_y_Mx, phys_z_Mx, & + phys_x_My, phys_y_My, phys_z_My, & + phys_x_Mz, phys_y_Mz, phys_z_Mz, & + phys_x_Jx, phys_y_Jx, phys_z_Jx, & + phys_x_Jy, phys_y_Jy, phys_z_Jy, & + phys_x_Jz, phys_y_Jz, phys_z_Jz, & + dyh_in, dye_in, dze_in, dzh_in, & + numCells, numFreqs, Ntheta, Nphi, & + thetaStart, thetaStop, thetaStep, phiStart, phiStop, phiStep, & + freqStep, initialFreq, cluz, z0, XDobleAncho, YDobleAncho, ZDobleAncho, sym_flags) + class(gpu_state_t), intent(inout) :: this + complex(kind=rkind), dimension(:,:,:), intent(in) :: ExIz, ExDe, ExAb, ExAr, EyFr, EyTr, EyAb, EyAr, EzIz, EzDe, EzFr, EzTr + complex(kind=rkind), dimension(:,:,:), intent(in) :: HxIz, HxDe, HxAb, HxAr, HyFr, HyTr, HyAb, HyAr, HzIz, HzDe, HzFr, HzTr + complex(kind=rkind), dimension(:,:,:), intent(in) :: HxIz2, HxDe2, HxAb2, HxAr2, HyFr2, HyTr2, HyAb2, HyAr2, HzIz2, HzDe2, HzFr2, HzTr2 + complex(kind=rkind), dimension(:), intent(in) :: expIwdt, auxExp_E, auxExp_H + real(kind=rkind), dimension(:), intent(in) :: phys_x_Mx, phys_y_Mx, phys_z_Mx + real(kind=rkind), dimension(:), intent(in) :: phys_x_My, phys_y_My, phys_z_My + real(kind=rkind), dimension(:), intent(in) :: phys_x_Mz, phys_y_Mz, phys_z_Mz + real(kind=rkind), dimension(:), intent(in) :: phys_x_Jx, phys_y_Jx, phys_z_Jx + real(kind=rkind), dimension(:), intent(in) :: phys_x_Jy, phys_y_Jy, phys_z_Jy + real(kind=rkind), dimension(:), intent(in) :: phys_x_Jz, phys_y_Jz, phys_z_Jz + integer(kind=4), intent(in) :: numCells, numFreqs, Ntheta, Nphi + real(kind=rkind), intent(in) :: thetaStart, thetaStop, thetaStep, phiStart, phiStop, phiStep + real(kind=rkind), intent(in) :: freqStep, initialFreq, cluz, z0 + real(kind=rkind), intent(in) :: XDobleAncho, YDobleAncho, ZDobleAncho + integer(kind=4), intent(in) :: sym_flags + real(kind=rkind), dimension(:), intent(in) :: dyh_in, dye_in, dze_in, dzh_in + + integer(kind=4) :: nx, ny, nz, cuda_status + + if (.not. this%initialized) return + if (this%nf2ff_initialized) return + + ! Determine buffer dimensions from first array + nx = ubound(ExIz, 1) - lbound(ExIz, 1) + 1 + ny = ubound(ExIz, 2) - lbound(ExIz, 2) + 1 + nz = ubound(ExIz, 3) - lbound(ExIz, 3) + 1 + + ! Allocate NF2FF DFT buffers (18 complex arrays, 3D) + if (numCells > 0 .and. numFreqs > 0) then + allocate(this%nf2ff_ExIz_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_ExDe_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_ExAb_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_ExAr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_EyFr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_EyTr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_EyAb_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_EyAr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_EzIz_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_EzDe_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_EzFr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_EzTr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HxIz_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HxDe_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HxAb_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HxAr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HyFr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HyTr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HyAb_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HyAr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HzIz_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HzDe_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HzFr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HzTr_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HxIz2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HxDe2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HxAb2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HxAr2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HyFr2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HyTr2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HyAb2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HyAr2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HzIz2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HzDe2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HzFr2_d(0:nx-1, 0:ny-1, 0:nz-1)) + allocate(this%nf2ff_HzTr2_d(0:nx-1, 0:ny-1, 0:nz-1)) + + ! Copy DFT buffers from host to device + cuda_status = cudaMemcpy(this%nf2ff_ExIz_d, ExIz, size(ExIz)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_ExDe_d, ExDe, size(ExDe)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_ExAb_d, ExAb, size(ExAb)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_ExAr_d, ExAr, size(ExAr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_EyFr_d, EyFr, size(EyFr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_EyTr_d, EyTr, size(EyTr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_EyAb_d, EyAb, size(EyAb)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_EyAr_d, EyAr, size(EyAr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_EzIz_d, EzIz, size(EzIz)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_EzDe_d, EzDe, size(EzDe)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_EzFr_d, EzFr, size(EzFr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_EzTr_d, EzTr, size(EzTr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HxIz_d, HxIz, size(HxIz)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HxDe_d, HxDe, size(HxDe)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HxAb_d, HxAb, size(HxAb)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HxAr_d, HxAr, size(HxAr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HyFr_d, HyFr, size(HyFr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HyTr_d, HyTr, size(HyTr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HyAb_d, HyAb, size(HyAb)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HyAr_d, HyAr, size(HyAr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HzIz_d, HzIz, size(HzIz)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HzDe_d, HzDe, size(HzDe)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HzFr_d, HzFr, size(HzFr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HzTr_d, HzTr, size(HzTr)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HxIz2_d, HxIz2, size(HxIz2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HxDe2_d, HxDe2, size(HxDe2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HxAb2_d, HxAb2, size(HxAb2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HxAr2_d, HxAr2, size(HxAr2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HyFr2_d, HyFr2, size(HyFr2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HyTr2_d, HyTr2, size(HyTr2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HyAb2_d, HyAb2, size(HyAb2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HyAr2_d, HyAr2, size(HyAr2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HzIz2_d, HzIz2, size(HzIz2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HzDe2_d, HzDe2, size(HzDe2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HzFr2_d, HzFr2, size(HzFr2)*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_HzTr2_d, HzTr2, size(HzTr2)*8, cudaMemcpyHostToDevice) + endif + + ! Allocate and copy frequency arrays + if (numFreqs > 0) then + allocate(this%nf2ff_expIwdt_d(0:numFreqs-1)) + allocate(this%nf2ff_auxExp_E_d(0:numFreqs-1)) + allocate(this%nf2ff_auxExp_H_d(0:numFreqs-1)) + cuda_status = cudaMemcpy(this%nf2ff_expIwdt_d, expIwdt, numFreqs*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_auxExp_E_d, auxExp_E, numFreqs*8, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_auxExp_H_d, auxExp_H, numFreqs*8, cudaMemcpyHostToDevice) + endif + + ! Allocate and copy geometry arrays (18 coordinate arrays) + if (numCells > 0) then + allocate(this%nf2ff_phys_x_Mx_d(0:numCells-1)) + allocate(this%nf2ff_phys_y_Mx_d(0:numCells-1)) + allocate(this%nf2ff_phys_z_Mx_d(0:numCells-1)) + allocate(this%nf2ff_phys_x_My_d(0:numCells-1)) + allocate(this%nf2ff_phys_y_My_d(0:numCells-1)) + allocate(this%nf2ff_phys_z_My_d(0:numCells-1)) + allocate(this%nf2ff_phys_x_Mz_d(0:numCells-1)) + allocate(this%nf2ff_phys_y_Mz_d(0:numCells-1)) + allocate(this%nf2ff_phys_z_Mz_d(0:numCells-1)) + allocate(this%nf2ff_phys_x_Jx_d(0:numCells-1)) + allocate(this%nf2ff_phys_y_Jx_d(0:numCells-1)) + allocate(this%nf2ff_phys_z_Jx_d(0:numCells-1)) + allocate(this%nf2ff_phys_x_Jy_d(0:numCells-1)) + allocate(this%nf2ff_phys_y_Jy_d(0:numCells-1)) + allocate(this%nf2ff_phys_z_Jy_d(0:numCells-1)) + allocate(this%nf2ff_phys_x_Jz_d(0:numCells-1)) + allocate(this%nf2ff_phys_y_Jz_d(0:numCells-1)) + allocate(this%nf2ff_phys_z_Jz_d(0:numCells-1)) + cuda_status = cudaMemcpy(this%nf2ff_phys_x_Mx_d, phys_x_Mx, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_y_Mx_d, phys_y_Mx, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_z_Mx_d, phys_z_Mx, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_x_My_d, phys_x_My, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_y_My_d, phys_y_My, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_z_My_d, phys_z_My, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_x_Mz_d, phys_x_Mz, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_y_Mz_d, phys_y_Mz, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_z_Mz_d, phys_z_Mz, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_x_Jx_d, phys_x_Jx, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_y_Jx_d, phys_y_Jx, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_z_Jx_d, phys_z_Jx, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_x_Jy_d, phys_x_Jy, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_y_Jy_d, phys_y_Jy, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_z_Jy_d, phys_y_Jy, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_x_Jz_d, phys_x_Jz, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_y_Jz_d, phys_y_Jz, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_phys_z_Jz_d, phys_z_Jz, numCells*4, cudaMemcpyHostToDevice) + endif + + ! Allocate and copy cell dimension arrays + if (numCells > 0) then + allocate(this%nf2ff_dyh_d(0:numCells-1)) + allocate(this%nf2ff_dze_d(0:numCells-1)) + allocate(this%nf2ff_dye_d(0:numCells-1)) + allocate(this%nf2ff_dzh_d(0:numCells-1)) + cuda_status = cudaMemcpy(this%nf2ff_dyh_d, dyh_in, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_dze_d, dze_in, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_dye_d, dye_in, numCells*4, cudaMemcpyHostToDevice) + cuda_status = cudaMemcpy(this%nf2ff_dzh_d, dzh_in, numCells*4, cudaMemcpyHostToDevice) + endif + + ! Allocate output buffers + if (Ntheta > 0 .and. Nphi > 0 .and. numFreqs > 0) then + allocate(this%nf2ff_Etheta_d(0:numFreqs*Ntheta*Nphi-1)) + allocate(this%nf2ff_Ephi_d(0:numFreqs*Ntheta*Nphi-1)) + allocate(this%nf2ff_RCS_d(0:numFreqs*Ntheta*Nphi-1)) + endif + + ! Store configuration + this%nf2ff_num_cells = numCells + this%nf2ff_num_freqs = numFreqs + this%nf2ff_Ntheta = Ntheta + this%nf2ff_Nphi = Nphi + this%nf2ff_thetaStep = thetaStep + this%nf2ff_phiStep = phiStep + this%nf2ff_freqStep = freqStep + this%nf2ff_initialFreq = initialFreq + this%nf2ff_cluz = cluz + this%nf2ff_z0 = z0 + this%nf2ff_XDobleAncho = XDobleAncho + this%nf2ff_YDobleAncho = YDobleAncho + this%nf2ff_ZDobleAncho = ZDobleAncho + this%nf2ff_sym_flags = sym_flags + + this%nf2ff_initialized = .true. + + end subroutine gpu_init_nf2ff_buffers + + !-------------------------------------------------------------------------------- + ! Destroy NF2FF device buffers + !-------------------------------------------------------------------------------- + subroutine gpu_destroy_nf2ff_buffers(this) + class(gpu_state_t), intent(inout) :: this + + if (.not. this%nf2ff_initialized) return + + if (associated(this%nf2ff_ExIz_d)) deallocate(this%nf2ff_ExIz_d) + if (associated(this%nf2ff_ExDe_d)) deallocate(this%nf2ff_ExDe_d) + if (associated(this%nf2ff_ExAb_d)) deallocate(this%nf2ff_ExAb_d) + if (associated(this%nf2ff_ExAr_d)) deallocate(this%nf2ff_ExAr_d) + if (associated(this%nf2ff_EyFr_d)) deallocate(this%nf2ff_EyFr_d) + if (associated(this%nf2ff_EyTr_d)) deallocate(this%nf2ff_EyTr_d) + if (associated(this%nf2ff_EyAb_d)) deallocate(this%nf2ff_EyAb_d) + if (associated(this%nf2ff_EyAr_d)) deallocate(this%nf2ff_EyAr_d) + if (associated(this%nf2ff_EzIz_d)) deallocate(this%nf2ff_EzIz_d) + if (associated(this%nf2ff_EzDe_d)) deallocate(this%nf2ff_EzDe_d) + if (associated(this%nf2ff_EzFr_d)) deallocate(this%nf2ff_EzFr_d) + if (associated(this%nf2ff_EzTr_d)) deallocate(this%nf2ff_EzTr_d) + if (associated(this%nf2ff_HxIz_d)) deallocate(this%nf2ff_HxIz_d) + if (associated(this%nf2ff_HxDe_d)) deallocate(this%nf2ff_HxDe_d) + if (associated(this%nf2ff_HxAb_d)) deallocate(this%nf2ff_HxAb_d) + if (associated(this%nf2ff_HxAr_d)) deallocate(this%nf2ff_HxAr_d) + if (associated(this%nf2ff_HyFr_d)) deallocate(this%nf2ff_HyFr_d) + if (associated(this%nf2ff_HyTr_d)) deallocate(this%nf2ff_HyTr_d) + if (associated(this%nf2ff_HyAb_d)) deallocate(this%nf2ff_HyAb_d) + if (associated(this%nf2ff_HyAr_d)) deallocate(this%nf2ff_HyAr_d) + if (associated(this%nf2ff_HzIz_d)) deallocate(this%nf2ff_HzIz_d) + if (associated(this%nf2ff_HzDe_d)) deallocate(this%nf2ff_HzDe_d) + if (associated(this%nf2ff_HzFr_d)) deallocate(this%nf2ff_HzFr_d) + if (associated(this%nf2ff_HzTr_d)) deallocate(this%nf2ff_HzTr_d) + if (associated(this%nf2ff_HxIz2_d)) deallocate(this%nf2ff_HxIz2_d) + if (associated(this%nf2ff_HxDe2_d)) deallocate(this%nf2ff_HxDe2_d) + if (associated(this%nf2ff_HxAb2_d)) deallocate(this%nf2ff_HxAb2_d) + if (associated(this%nf2ff_HxAr2_d)) deallocate(this%nf2ff_HxAr2_d) + if (associated(this%nf2ff_HyFr2_d)) deallocate(this%nf2ff_HyFr2_d) + if (associated(this%nf2ff_HyTr2_d)) deallocate(this%nf2ff_HyTr2_d) + if (associated(this%nf2ff_HyAb2_d)) deallocate(this%nf2ff_HyAb2_d) + if (associated(this%nf2ff_HyAr2_d)) deallocate(this%nf2ff_HyAr2_d) + if (associated(this%nf2ff_HzIz2_d)) deallocate(this%nf2ff_HzIz2_d) + if (associated(this%nf2ff_HzDe2_d)) deallocate(this%nf2ff_HzDe2_d) + if (associated(this%nf2ff_HzFr2_d)) deallocate(this%nf2ff_HzFr2_d) + if (associated(this%nf2ff_HzTr2_d)) deallocate(this%nf2ff_HzTr2_d) + + if (associated(this%nf2ff_expIwdt_d)) deallocate(this%nf2ff_expIwdt_d) + if (associated(this%nf2ff_auxExp_E_d)) deallocate(this%nf2ff_auxExp_E_d) + if (associated(this%nf2ff_auxExp_H_d)) deallocate(this%nf2ff_auxExp_H_d) + + if (associated(this%nf2ff_phys_x_Mx_d)) deallocate(this%nf2ff_phys_x_Mx_d) + if (associated(this%nf2ff_phys_y_Mx_d)) deallocate(this%nf2ff_phys_y_Mx_d) + if (associated(this%nf2ff_phys_z_Mx_d)) deallocate(this%nf2ff_phys_z_Mx_d) + if (associated(this%nf2ff_phys_x_My_d)) deallocate(this%nf2ff_phys_x_My_d) + if (associated(this%nf2ff_phys_y_My_d)) deallocate(this%nf2ff_phys_y_My_d) + if (associated(this%nf2ff_phys_z_My_d)) deallocate(this%nf2ff_phys_z_My_d) + if (associated(this%nf2ff_phys_x_Mz_d)) deallocate(this%nf2ff_phys_x_Mz_d) + if (associated(this%nf2ff_phys_y_Mz_d)) deallocate(this%nf2ff_phys_y_Mz_d) + if (associated(this%nf2ff_phys_z_Mz_d)) deallocate(this%nf2ff_phys_z_Mz_d) + if (associated(this%nf2ff_phys_x_Jx_d)) deallocate(this%nf2ff_phys_x_Jx_d) + if (associated(this%nf2ff_phys_y_Jx_d)) deallocate(this%nf2ff_phys_y_Jx_d) + if (associated(this%nf2ff_phys_z_Jx_d)) deallocate(this%nf2ff_phys_z_Jx_d) + if (associated(this%nf2ff_phys_x_Jy_d)) deallocate(this%nf2ff_phys_x_Jy_d) + if (associated(this%nf2ff_phys_y_Jy_d)) deallocate(this%nf2ff_phys_y_Jy_d) + if (associated(this%nf2ff_phys_z_Jy_d)) deallocate(this%nf2ff_phys_z_Jy_d) + if (associated(this%nf2ff_phys_x_Jz_d)) deallocate(this%nf2ff_phys_x_Jz_d) + if (associated(this%nf2ff_phys_y_Jz_d)) deallocate(this%nf2ff_phys_y_Jz_d) + if (associated(this%nf2ff_phys_z_Jz_d)) deallocate(this%nf2ff_phys_z_Jz_d) + + if (associated(this%nf2ff_dyh_d)) deallocate(this%nf2ff_dyh_d) + if (associated(this%nf2ff_dze_d)) deallocate(this%nf2ff_dze_d) + if (associated(this%nf2ff_dye_d)) deallocate(this%nf2ff_dye_d) + if (associated(this%nf2ff_dzh_d)) deallocate(this%nf2ff_dzh_d) + + if (associated(this%nf2ff_Etheta_d)) deallocate(this%nf2ff_Etheta_d) + if (associated(this%nf2ff_Ephi_d)) deallocate(this%nf2ff_Ephi_d) + if (associated(this%nf2ff_RCS_d)) deallocate(this%nf2ff_RCS_d) + + this%nf2ff_initialized = .false. + + end subroutine gpu_destroy_nf2ff_buffers + + end module gpu_core_probe_m diff --git a/src_main_pub/gpu_cpml_m.F90 b/src_main_pub/gpu_cpml_m.F90 new file mode 100644 index 000000000..718e23ab5 --- /dev/null +++ b/src_main_pub/gpu_cpml_m.F90 @@ -0,0 +1,903 @@ +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +! GPU CPML KERNELS MODULE - CUDA Fortran (CUF) accelerated CPML kernels +! Left + Right boundary CPML with persistent psi state on device. +! Split from gpu_kernels_cuf.F90 to avoid NVHPC compiler file-size limit. +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +module gpu_cpml_m + + use FDETYPES_m + use Report_m + use cudafor + use gpu_core_m + + implicit none + +contains + + !-------------------------------------------------------------------------------- + ! CPML Left Boundary - GPU accelerated kernels + ! Uses persistent device g2_d/gm2_d arrays (already on device from gpu_init) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceCPML_E_left(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_left_initialized) return + + ! Advance Ex on left boundary + call gpu_advanceCPML_Ex_left_kernel(this%Ex_d, this%Hz_d, this%sggMiEx_d, & + this%pml_psi_Exy_left, this%pml_P_be_y_left, this%pml_P_ce_y_left, & + this%g2_d, & + this%pml_left_Ex_ii, this%pml_left_Ex_ij, & + this%pml_left_Ex_ji, this%pml_left_Ex_jj, & + this%pml_left_Ex_ki, this%pml_left_Ex_kj, & + b%Ex%XI-1) + + ! Advance Ez on left boundary + call gpu_advanceCPML_Ez_left_kernel(this%Ez_d, this%Hx_d, this%sggMiEz_d, & + this%pml_psi_Ezy_left, this%pml_P_be_y_left, this%pml_P_ce_y_left, & + this%g2_d, & + this%pml_left_Ez_ii, this%pml_left_Ez_ij, & + this%pml_left_Ez_ji, this%pml_left_Ez_jj, & + this%pml_left_Ez_ki, this%pml_left_Ez_kj, & + b%Ez%XI-1) + + end subroutine gpu_advanceCPML_E_left + + subroutine gpu_advanceCPML_Ex_left_kernel(Ex_d, Hz_d, sggMiEx_d, psi_Exy_d, & + P_be_y_d, P_ce_y_d, g2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hz_d, psi_Exy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d + real(kind=rkind), device, dimension(:) :: P_be_y_d, P_ce_y_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEx_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) = P_be_y_d(j) * psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hz_d(i,j,k) - Hz_d(i,j-1,k)) * P_ce_y_d(j) + Ex_d(i,j,k) = Ex_d(i,j,k) + g2_d(medio) * psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ex_left_kernel + + subroutine gpu_advanceCPML_Ez_left_kernel(Ez_d, Hx_d, sggMiEz_d, psi_Ezy_d, & + P_be_y_d, P_ce_y_d, g2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hx_d, psi_Ezy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d + real(kind=rkind), device, dimension(:) :: P_be_y_d, P_ce_y_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEz_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) = P_be_y_d(j) * psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hx_d(i,j,k) - Hx_d(i,j-1,k)) * P_ce_y_d(j) + Ez_d(i,j,k) = Ez_d(i,j,k) - g2_d(medio) * psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ez_left_kernel + + subroutine gpu_advanceCPML_H_left(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_left_initialized) return + + ! Advance Hx on left boundary + call gpu_advanceCPML_Hx_left_kernel(this%Hx_d, this%Ez_d, this%sggMiHx_d, & + this%pml_psi_Hxy_left, this%pml_P_bm_y_left, this%pml_P_cm_y_left, & + this%gm2_d, & + this%pml_left_Hx_ii, this%pml_left_Hx_ij, & + this%pml_left_Hx_ji, this%pml_left_Hx_jj, & + this%pml_left_Hx_ki, this%pml_left_Hx_kj, & + b%Hx%XI-1) + + ! Advance Hz on left boundary + call gpu_advanceCPML_Hz_left_kernel(this%Hz_d, this%Ex_d, this%sggMiHz_d, & + this%pml_psi_Hzy_left, this%pml_P_bm_y_left, this%pml_P_cm_y_left, & + this%gm2_d, & + this%pml_left_Hz_ii, this%pml_left_Hz_ij, & + this%pml_left_Hz_ji, this%pml_left_Hz_jj, & + this%pml_left_Hz_ki, this%pml_left_Hz_kj, & + b%Hz%XI-1) + + end subroutine gpu_advanceCPML_H_left + + subroutine gpu_advanceCPML_Hx_left_kernel(Hx_d, Ez_d, sggMiHx_d, psi_Hxy_d, & + P_bm_y_d, P_cm_y_d, gm2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ez_d, psi_Hxy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:) :: P_bm_y_d, P_cm_y_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHx_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_y_d(j) * psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ez_d(i,j+1,k) - Ez_d(i,j,k)) * P_cm_y_d(j) + Hx_d(i,j,k) = Hx_d(i,j,k) - gm2_d(medio) * psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hx_left_kernel + + subroutine gpu_advanceCPML_Hz_left_kernel(Hz_d, Ex_d, sggMiHz_d, psi_Hzy_d, & + P_bm_y_d, P_cm_y_d, gm2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Ex_d, psi_Hzy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:) :: P_bm_y_d, P_cm_y_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHz_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_y_d(j) * psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ex_d(i,j+1,k) - Ex_d(i,j,k)) * P_cm_y_d(j) + Hz_d(i,j,k) = Hz_d(i,j,k) + gm2_d(medio) * psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hz_left_kernel + + !-------------------------------------------------------------------------------- + ! CPML Right boundary - wrapper subroutines + !-------------------------------------------------------------------------------- + subroutine gpu_advanceCPML_E_right(this, b, numMedia) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + integer, intent(in) :: numMedia + if (.not. this%pml_right_initialized) return + call gpu_advanceCPML_Ex_right_kernel(this%Ex_d, this%Hz_d, this%sggMiEx_d, & + this%pml_psi_Exy_left, this%pml_P_be_y_left, this%pml_P_ce_y_left, & + this%g2_d, numMedia, & + this%pml_right_Ex_ii, this%pml_right_Ex_ij, & + this%pml_right_Ex_ji, this%pml_right_Ex_jj, & + this%pml_right_Ex_ki, this%pml_right_Ex_kj, & + b%Ex%XI-1) + call gpu_advanceCPML_Ez_right_kernel(this%Ez_d, this%Hx_d, this%sggMiEz_d, & + this%pml_psi_Ezy_left, this%pml_P_be_y_left, this%pml_P_ce_y_left, & + this%g2_d, numMedia, & + this%pml_right_Ez_ii, this%pml_right_Ez_ij, & + this%pml_right_Ez_ji, this%pml_right_Ez_jj, & + this%pml_right_Ez_ki, this%pml_right_Ez_kj, & + b%Ez%XI-1) + end subroutine gpu_advanceCPML_E_right + + subroutine gpu_advanceCPML_H_right(this, b, numMedia) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + integer, intent(in) :: numMedia + if (.not. this%pml_right_initialized) return + call gpu_advanceCPML_Hx_right_kernel(this%Hx_d, this%Ez_d, this%sggMiHx_d, & + this%pml_psi_Hxy_left, this%pml_P_bm_y_left, this%pml_P_cm_y_left, & + this%gm2_d, numMedia, & + this%pml_right_Hx_ii, this%pml_right_Hx_ij, & + this%pml_right_Hx_ji, this%pml_right_Hx_jj, & + this%pml_right_Hx_ki, this%pml_right_Hx_kj, & + b%Hx%XI-1) + call gpu_advanceCPML_Hz_right_kernel(this%Hz_d, this%Ex_d, this%sggMiHz_d, & + this%pml_psi_Hzy_left, this%pml_P_bm_y_left, this%pml_P_cm_y_left, & + this%gm2_d, numMedia, & + this%pml_right_Hz_ii, this%pml_right_Hz_ij, & + this%pml_right_Hz_ji, this%pml_right_Hz_jj, & + this%pml_right_Hz_ki, this%pml_right_Hz_kj, & + b%Hz%XI-1) + end subroutine gpu_advanceCPML_H_right + + subroutine gpu_advanceCPML_Ex_right_kernel(Ex_d, Hz_d, sggMiEx_d, psi_Exy_d, & + P_be_y_d, P_ce_y_d, g2_d, numMedia, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + integer, intent(in) :: numMedia + real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hz_d, psi_Exy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d + real(kind=rkind), device, dimension(:) :: P_be_y_d, P_ce_y_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEx_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) = P_be_y_d(j) * psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hz_d(i,j,k) - Hz_d(i,j-1,k)) * P_ce_y_d(j) + Ex_d(i,j,k) = Ex_d(i,j,k) + g2_d(medio) * psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ex_right_kernel + + subroutine gpu_advanceCPML_Ez_right_kernel(Ez_d, Hx_d, sggMiEz_d, psi_Ezy_d, & + P_be_y_d, P_ce_y_d, g2_d, numMedia, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + integer, intent(in) :: numMedia + real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hx_d, psi_Ezy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d + real(kind=rkind), device, dimension(:) :: P_be_y_d, P_ce_y_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEz_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) = P_be_y_d(j) * psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hx_d(i,j,k) - Hx_d(i,j-1,k)) * P_ce_y_d(j) + Ez_d(i,j,k) = Ez_d(i,j,k) - g2_d(medio) * psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ez_right_kernel + + subroutine gpu_advanceCPML_Hx_right_kernel(Hx_d, Ez_d, sggMiHx_d, psi_Hxy_d, & + P_bm_y_d, P_cm_y_d, gm2_d, numMedia, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + integer, intent(in) :: numMedia + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ez_d, psi_Hxy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:) :: P_bm_y_d, P_cm_y_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHx_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_y_d(j) * psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ez_d(i,j+1,k) - Ez_d(i,j,k)) * P_cm_y_d(j) + Hx_d(i,j,k) = Hx_d(i,j,k) - gm2_d(medio) * psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hx_right_kernel + + subroutine gpu_advanceCPML_Hz_right_kernel(Hz_d, Ex_d, sggMiHz_d, psi_Hzy_d, & + P_bm_y_d, P_cm_y_d, gm2_d, numMedia, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + integer, intent(in) :: numMedia + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Ex_d, psi_Hzy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:) :: P_bm_y_d, P_cm_y_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHz_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_y_d(j) * psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ex_d(i,j+1,k) - Ex_d(i,j,k)) * P_cm_y_d(j) + Hz_d(i,j,k) = Hz_d(i,j,k) + gm2_d(medio) * psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hz_right_kernel + + !-------------------------------------------------------------------------------- + ! CPML Down Boundary - GPU accelerated kernels (z-dependent coefficients) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceCPML_E_down(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_down_initialized) return + + call gpu_advanceCPML_Ey_down_kernel(this%Ey_d, this%Hx_d, this%sggMiEy_d, & + this%pml_psi_Eyz_down, this%pml_P_be_z_down, this%pml_P_ce_z_down, & + this%g2_d, & + this%pml_down_Ey_ii, this%pml_down_Ey_ij, & + this%pml_down_Ey_ji, this%pml_down_Ey_jj, & + this%pml_down_Ey_ki, this%pml_down_Ey_kj, & + b%Ey%ZI-1) + + call gpu_advanceCPML_Ex_down_kernel(this%Ex_d, this%Hy_d, this%sggMiEx_d, & + this%pml_psi_Exz_down, this%pml_P_be_z_down, this%pml_P_ce_z_down, & + this%g2_d, & + this%pml_down_Ex_ii, this%pml_down_Ex_ij, & + this%pml_down_Ex_ji, this%pml_down_Ex_jj, & + this%pml_down_Ex_ki, this%pml_down_Ex_kj, & + b%Ex%ZI-1) + + end subroutine gpu_advanceCPML_E_down + + subroutine gpu_advanceCPML_Ey_down_kernel(Ey_d, Hx_d, sggMiEy_d, psi_Eyz_d, & + P_be_z_d, P_ce_z_d, g2_d, & + ii, ij, ji, jj, ki, kj, zi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hx_d, psi_Eyz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d + real(kind=rkind), device, dimension(:) :: P_be_z_d, P_ce_z_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEy_d(i-zi_offset,j-zi_offset,k-zi_offset) + psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) = P_be_z_d(k) * psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hx_d(i,j,k) - Hx_d(i,j,k-1)) * P_ce_z_d(k) + Ey_d(i,j,k) = Ey_d(i,j,k) + g2_d(medio) * psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ey_down_kernel + + subroutine gpu_advanceCPML_Ex_down_kernel(Ex_d, Hy_d, sggMiEx_d, psi_Exz_d, & + P_be_z_d, P_ce_z_d, g2_d, & + ii, ij, ji, jj, ki, kj, zi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hy_d, psi_Exz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d + real(kind=rkind), device, dimension(:) :: P_be_z_d, P_ce_z_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEx_d(i-zi_offset,j-zi_offset,k-zi_offset) + psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) = P_be_z_d(k) * psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hy_d(i,j,k) - Hy_d(i,j,k-1)) * P_ce_z_d(k) + Ex_d(i,j,k) = Ex_d(i,j,k) - g2_d(medio) * psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ex_down_kernel + + subroutine gpu_advanceCPML_H_down(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_down_initialized) return + + call gpu_advanceCPML_Hy_down_kernel(this%Hy_d, this%Ex_d, this%sggMiHy_d, & + this%pml_psi_Hyz_down, this%pml_P_bm_z_down, this%pml_P_cm_z_down, & + this%gm2_d, & + this%pml_down_Hy_ii, this%pml_down_Hy_ij, & + this%pml_down_Hy_ji, this%pml_down_Hy_jj, & + this%pml_down_Hy_ki, this%pml_down_Hy_kj, & + b%Hy%ZI-1) + + call gpu_advanceCPML_Hx_down_kernel(this%Hx_d, this%Ey_d, this%sggMiHx_d, & + this%pml_psi_Hxz_down, this%pml_P_bm_z_down, this%pml_P_cm_z_down, & + this%gm2_d, & + this%pml_down_Hx_ii, this%pml_down_Hx_ij, & + this%pml_down_Hx_ji, this%pml_down_Hx_jj, & + this%pml_down_Hx_ki, this%pml_down_Hx_kj, & + b%Hx%ZI-1) + + end subroutine gpu_advanceCPML_H_down + + subroutine gpu_advanceCPML_Hy_down_kernel(Hy_d, Ex_d, sggMiHy_d, psi_Hyz_d, & + P_bm_z_d, P_cm_z_d, gm2_d, & + ii, ij, ji, jj, ki, kj, zi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ex_d, psi_Hyz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:) :: P_bm_z_d, P_cm_z_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHy_d(i-zi_offset,j-zi_offset,k-zi_offset) + psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_z_d(k) * psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ex_d(i,j,k+1) - Ex_d(i,j,k)) * P_cm_z_d(k) + Hy_d(i,j,k) = Hy_d(i,j,k) - gm2_d(medio) * psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hy_down_kernel + + subroutine gpu_advanceCPML_Hx_down_kernel(Hx_d, Ey_d, sggMiHx_d, psi_Hxz_d, & + P_bm_z_d, P_cm_z_d, gm2_d, & + ii, ij, ji, jj, ki, kj, zi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ey_d, psi_Hxz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:) :: P_bm_z_d, P_cm_z_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHx_d(i-zi_offset,j-zi_offset,k-zi_offset) + psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_z_d(k) * psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ey_d(i,j,k+1) - Ey_d(i,j,k)) * P_cm_z_d(k) + Hx_d(i,j,k) = Hx_d(i,j,k) + gm2_d(medio) * psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hx_down_kernel + + !-------------------------------------------------------------------------------- + ! CPML Up Boundary - GPU accelerated kernels (z-dependent, same as down) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceCPML_E_up(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_up_initialized) return + + call gpu_advanceCPML_Ey_up_kernel(this%Ey_d, this%Hx_d, this%sggMiEy_d, & + this%pml_psi_Eyz_down, this%pml_P_be_z_down, this%pml_P_ce_z_down, & + this%g2_d, & + this%pml_up_Ey_ii, this%pml_up_Ey_ij, & + this%pml_up_Ey_ji, this%pml_up_Ey_jj, & + this%pml_up_Ey_ki, this%pml_up_Ey_kj, & + b%Ey%ZI-1) + + call gpu_advanceCPML_Ex_up_kernel(this%Ex_d, this%Hy_d, this%sggMiEx_d, & + this%pml_psi_Exz_down, this%pml_P_be_z_down, this%pml_P_ce_z_down, & + this%g2_d, & + this%pml_up_Ex_ii, this%pml_up_Ex_ij, & + this%pml_up_Ex_ji, this%pml_up_Ex_jj, & + this%pml_up_Ex_ki, this%pml_up_Ex_kj, & + b%Ex%ZI-1) + + end subroutine gpu_advanceCPML_E_up + + subroutine gpu_advanceCPML_Ey_up_kernel(Ey_d, Hx_d, sggMiEy_d, psi_Eyz_d, & + P_be_z_d, P_ce_z_d, g2_d, & + ii, ij, ji, jj, ki, kj, zi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hx_d, psi_Eyz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d + real(kind=rkind), device, dimension(:) :: P_be_z_d, P_ce_z_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEy_d(i-zi_offset,j-zi_offset,k-zi_offset) + psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) = P_be_z_d(k) * psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hx_d(i,j,k) - Hx_d(i,j,k-1)) * P_ce_z_d(k) + Ey_d(i,j,k) = Ey_d(i,j,k) + g2_d(medio) * psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ey_up_kernel + + subroutine gpu_advanceCPML_Ex_up_kernel(Ex_d, Hy_d, sggMiEx_d, psi_Exz_d, & + P_be_z_d, P_ce_z_d, g2_d, & + ii, ij, ji, jj, ki, kj, zi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hy_d, psi_Exz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d + real(kind=rkind), device, dimension(:) :: P_be_z_d, P_ce_z_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEx_d(i-zi_offset,j-zi_offset,k-zi_offset) + psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) = P_be_z_d(k) * psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hy_d(i,j,k) - Hy_d(i,j,k-1)) * P_ce_z_d(k) + Ex_d(i,j,k) = Ex_d(i,j,k) - g2_d(medio) * psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ex_up_kernel + + subroutine gpu_advanceCPML_H_up(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_up_initialized) return + + call gpu_advanceCPML_Hy_up_kernel(this%Hy_d, this%Ex_d, this%sggMiHy_d, & + this%pml_psi_Hyz_down, this%pml_P_bm_z_down, this%pml_P_cm_z_down, & + this%gm2_d, & + this%pml_up_Hy_ii, this%pml_up_Hy_ij, & + this%pml_up_Hy_ji, this%pml_up_Hy_jj, & + this%pml_up_Hy_ki, this%pml_up_Hy_kj, & + b%Hy%ZI-1) + + call gpu_advanceCPML_Hx_up_kernel(this%Hx_d, this%Ey_d, this%sggMiHx_d, & + this%pml_psi_Hxz_down, this%pml_P_bm_z_down, this%pml_P_cm_z_down, & + this%gm2_d, & + this%pml_up_Hx_ii, this%pml_up_Hx_ij, & + this%pml_up_Hx_ji, this%pml_up_Hx_jj, & + this%pml_up_Hx_ki, this%pml_up_Hx_kj, & + b%Hx%ZI-1) + + end subroutine gpu_advanceCPML_H_up + + subroutine gpu_advanceCPML_Hy_up_kernel(Hy_d, Ex_d, sggMiHy_d, psi_Hyz_d, & + P_bm_z_d, P_cm_z_d, gm2_d, & + ii, ij, ji, jj, ki, kj, zi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ex_d, psi_Hyz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:) :: P_bm_z_d, P_cm_z_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHy_d(i-zi_offset,j-zi_offset,k-zi_offset) + psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_z_d(k) * psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ex_d(i,j,k+1) - Ex_d(i,j,k)) * P_cm_z_d(k) + Hy_d(i,j,k) = Hy_d(i,j,k) - gm2_d(medio) * psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hy_up_kernel + + subroutine gpu_advanceCPML_Hx_up_kernel(Hx_d, Ey_d, sggMiHx_d, psi_Hxz_d, & + P_bm_z_d, P_cm_z_d, gm2_d, & + ii, ij, ji, jj, ki, kj, zi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ey_d, psi_Hxz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:) :: P_bm_z_d, P_cm_z_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHx_d(i-zi_offset,j-zi_offset,k-zi_offset) + psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_z_d(k) * psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ey_d(i,j,k+1) - Ey_d(i,j,k)) * P_cm_z_d(k) + Hx_d(i,j,k) = Hx_d(i,j,k) + gm2_d(medio) * psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hx_up_kernel + + !-------------------------------------------------------------------------------- + ! CPML Back Boundary - GPU accelerated kernels (x-dependent coefficients) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceCPML_E_back(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_back_initialized) return + + call gpu_advanceCPML_Ez_back_kernel(this%Ez_d, this%Hy_d, this%sggMiEz_d, & + this%pml_psi_Ezx_back, this%pml_P_be_x_back, this%pml_P_ce_x_back, & + this%g2_d, & + this%pml_back_Ez_ii, this%pml_back_Ez_ij, & + this%pml_back_Ez_ji, this%pml_back_Ez_jj, & + this%pml_back_Ez_ki, this%pml_back_Ez_kj, & + b%Ez%XI-1) + + call gpu_advanceCPML_Ey_back_kernel(this%Ey_d, this%Hz_d, this%sggMiEy_d, & + this%pml_psi_Eyx_back, this%pml_P_be_x_back, this%pml_P_ce_x_back, & + this%g2_d, & + this%pml_back_Ey_ii, this%pml_back_Ey_ij, & + this%pml_back_Ey_ji, this%pml_back_Ey_jj, & + this%pml_back_Ey_ki, this%pml_back_Ey_kj, & + b%Ey%XI-1) + + end subroutine gpu_advanceCPML_E_back + + subroutine gpu_advanceCPML_Ez_back_kernel(Ez_d, Hy_d, sggMiEz_d, psi_Ezx_d, & + P_be_x_d, P_ce_x_d, g2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hy_d, psi_Ezx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d + real(kind=rkind), device, dimension(:) :: P_be_x_d, P_ce_x_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEz_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) = P_be_x_d(i) * psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hy_d(i,j,k) - Hy_d(i-1,j,k)) * P_ce_x_d(i) + Ez_d(i,j,k) = Ez_d(i,j,k) + g2_d(medio) * psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ez_back_kernel + + subroutine gpu_advanceCPML_Ey_back_kernel(Ey_d, Hz_d, sggMiEy_d, psi_Eyx_d, & + P_be_x_d, P_ce_x_d, g2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hz_d, psi_Eyx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d + real(kind=rkind), device, dimension(:) :: P_be_x_d, P_ce_x_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEy_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) = P_be_x_d(i) * psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hz_d(i,j,k) - Hz_d(i-1,j,k)) * P_ce_x_d(i) + Ey_d(i,j,k) = Ey_d(i,j,k) - g2_d(medio) * psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ey_back_kernel + + subroutine gpu_advanceCPML_H_back(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_back_initialized) return + + call gpu_advanceCPML_Hz_back_kernel(this%Hz_d, this%Hy_d, this%sggMiHz_d, & + this%pml_psi_Hzx_back, this%pml_P_bm_x_back, this%pml_P_cm_x_back, & + this%gm2_d, & + this%pml_back_Hz_ii, this%pml_back_Hz_ij, & + this%pml_back_Hz_ji, this%pml_back_Hz_jj, & + this%pml_back_Hz_ki, this%pml_back_Hz_kj, & + b%Hz%XI-1) + + call gpu_advanceCPML_Hy_back_kernel(this%Hy_d, this%Ez_d, this%sggMiHy_d, & + this%pml_psi_Hyx_back, this%pml_P_bm_x_back, this%pml_P_cm_x_back, & + this%gm2_d, & + this%pml_back_Hy_ii, this%pml_back_Hy_ij, & + this%pml_back_Hy_ji, this%pml_back_Hy_jj, & + this%pml_back_Hy_ki, this%pml_back_Hy_kj, & + b%Hy%XI-1) + + end subroutine gpu_advanceCPML_H_back + + subroutine gpu_advanceCPML_Hz_back_kernel(Hz_d, Hy_d, sggMiHz_d, psi_Hzx_d, & + P_bm_x_d, P_cm_x_d, gm2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Hy_d, psi_Hzx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:) :: P_bm_x_d, P_cm_x_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHz_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_x_d(i) * psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hy_d(i,j,k) - Hy_d(i-1,j,k)) * P_cm_x_d(i) + Hz_d(i,j,k) = Hz_d(i,j,k) - gm2_d(medio) * psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hz_back_kernel + + subroutine gpu_advanceCPML_Hy_back_kernel(Hy_d, Ez_d, sggMiHy_d, psi_Hyx_d, & + P_bm_x_d, P_cm_x_d, gm2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ez_d, psi_Hyx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:) :: P_bm_x_d, P_cm_x_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHy_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_x_d(i) * psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ez_d(i,j,k) - Ez_d(i-1,j,k)) * P_cm_x_d(i) + Hy_d(i,j,k) = Hy_d(i,j,k) - gm2_d(medio) * psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hy_back_kernel + + !-------------------------------------------------------------------------------- + ! CPML Front Boundary - GPU accelerated kernels (x-dependent, same as back) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceCPML_E_front(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_front_initialized) return + + call gpu_advanceCPML_Ez_front_kernel(this%Ez_d, this%Hy_d, this%sggMiEz_d, & + this%pml_psi_Ezx_back, this%pml_P_be_x_back, this%pml_P_ce_x_back, & + this%g2_d, & + this%pml_front_Ez_ii, this%pml_front_Ez_ij, & + this%pml_front_Ez_ji, this%pml_front_Ez_jj, & + this%pml_front_Ez_ki, this%pml_front_Ez_kj, & + b%Ez%XI-1) + + call gpu_advanceCPML_Ey_front_kernel(this%Ey_d, this%Hz_d, this%sggMiEy_d, & + this%pml_psi_Eyx_back, this%pml_P_be_x_back, this%pml_P_ce_x_back, & + this%g2_d, & + this%pml_front_Ey_ii, this%pml_front_Ey_ij, & + this%pml_front_Ey_ji, this%pml_front_Ey_jj, & + this%pml_front_Ey_ki, this%pml_front_Ey_kj, & + b%Ey%XI-1) + + end subroutine gpu_advanceCPML_E_front + + subroutine gpu_advanceCPML_Ez_front_kernel(Ez_d, Hy_d, sggMiEz_d, psi_Ezx_d, & + P_be_x_d, P_ce_x_d, g2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hy_d, psi_Ezx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d + real(kind=rkind), device, dimension(:) :: P_be_x_d, P_ce_x_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEz_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) = P_be_x_d(i) * psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hy_d(i,j,k) - Hy_d(i-1,j,k)) * P_ce_x_d(i) + Ez_d(i,j,k) = Ez_d(i,j,k) + g2_d(medio) * psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ez_front_kernel + + subroutine gpu_advanceCPML_Ey_front_kernel(Ey_d, Hz_d, sggMiEy_d, psi_Eyx_d, & + P_be_x_d, P_ce_x_d, g2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hz_d, psi_Eyx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d + real(kind=rkind), device, dimension(:) :: P_be_x_d, P_ce_x_d, g2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiEy_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) = P_be_x_d(i) * psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hz_d(i,j,k) - Hz_d(i-1,j,k)) * P_ce_x_d(i) + Ey_d(i,j,k) = Ey_d(i,j,k) - g2_d(medio) * psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Ey_front_kernel + + subroutine gpu_advanceCPML_H_front(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%pml_front_initialized) return + + call gpu_advanceCPML_Hz_front_kernel(this%Hz_d, this%Hy_d, this%sggMiHz_d, & + this%pml_psi_Hzx_back, this%pml_P_bm_x_back, this%pml_P_cm_x_back, & + this%gm2_d, & + this%pml_front_Hz_ii, this%pml_front_Hz_ij, & + this%pml_front_Hz_ji, this%pml_front_Hz_jj, & + this%pml_front_Hz_ki, this%pml_front_Hz_kj, & + b%Hz%XI-1) + + call gpu_advanceCPML_Hy_front_kernel(this%Hy_d, this%Ez_d, this%sggMiHy_d, & + this%pml_psi_Hyx_back, this%pml_P_bm_x_back, this%pml_P_cm_x_back, & + this%gm2_d, & + this%pml_front_Hy_ii, this%pml_front_Hy_ij, & + this%pml_front_Hy_ji, this%pml_front_Hy_jj, & + this%pml_front_Hy_ki, this%pml_front_Hy_kj, & + b%Hy%XI-1) + + end subroutine gpu_advanceCPML_H_front + + subroutine gpu_advanceCPML_Hz_front_kernel(Hz_d, Hy_d, sggMiHz_d, psi_Hzx_d, & + P_bm_x_d, P_cm_x_d, gm2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Hy_d, psi_Hzx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:) :: P_bm_x_d, P_cm_x_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHz_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_x_d(i) * psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) + & + (Hy_d(i,j,k) - Hy_d(i-1,j,k)) * P_cm_x_d(i) + Hz_d(i,j,k) = Hz_d(i,j,k) - gm2_d(medio) * psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hz_front_kernel + + subroutine gpu_advanceCPML_Hy_front_kernel(Hy_d, Ez_d, sggMiHy_d, psi_Hyx_d, & + P_bm_x_d, P_cm_x_d, gm2_d, & + ii, ij, ji, jj, ki, kj, xi_offset) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ez_d, psi_Hyx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:) :: P_bm_x_d, P_cm_x_d, gm2_d + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=ki,kj + do j=ji,jj + do i=ii,ij + medio = sggMiHy_d(i-xi_offset,j-xi_offset,k-xi_offset) + psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_x_d(i) * psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) + & + (Ez_d(i,j,k) - Ez_d(i-1,j,k)) * P_cm_x_d(i) + Hy_d(i,j,k) = Hy_d(i,j,k) - gm2_d(medio) * psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) + end do + end do + end do + end subroutine gpu_advanceCPML_Hy_front_kernel + +end module gpu_cpml_m diff --git a/src_main_pub/gpu_mur_m.F90 b/src_main_pub/gpu_mur_m.F90 new file mode 100644 index 000000000..30285ca3c --- /dev/null +++ b/src_main_pub/gpu_mur_m.F90 @@ -0,0 +1,1115 @@ +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +! GPU MUR KERNELS MODULE - CUDA Fortran (CUF) accelerated MUR boundaries +! First-order Mur ABC for all 6 boundaries with persistent state on device. +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +module gpu_mur_m + + use FDETYPES_m + use Report_m + use cudafor + use gpu_core_m + + implicit none + +contains + + !-------------------------------------------------------------------------------- + ! MUR Left boundary - GPU accelerated kernels + !-------------------------------------------------------------------------------- + subroutine gpu_advanceMUR_H_left(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_advanceMUR_Hx_left_kernel(this%Hx_d, this%sggMiHx_d, & + this%mur_past_Hx_left, this%mur_left_CAB1, & + this%mur_left_Hx_ii, this%mur_left_Hx_ij, & + this%mur_left_Hx_ji, this%mur_left_Hx_jj, & + this%mur_left_Hx_ki, this%mur_left_Hx_kj) + + call gpu_advanceMUR_Hz_left_kernel(this%Hz_d, this%sggMiHz_d, & + this%mur_past_Hz_left, this%mur_left_CAB1, & + this%mur_left_Hz_ii, this%mur_left_Hz_ij, & + this%mur_left_Hz_ji, this%mur_left_Hz_jj, & + this%mur_left_Hz_ki, this%mur_left_Hz_kj) + + end subroutine gpu_advanceMUR_H_left + + subroutine gpu_advanceMUR_Hx_left_kernel(Hx_d, sggMiHx_d, past_Hx_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + medio = sggMiHx_d(i, ji+1, k) + Hx_d(i, ji, k) = past_Hx_d(i, ji+1, k) + & + CAB1_d(medio) * (Hx_d(i, ji+1, k) - past_Hx_d(i, ji, k)) + end do + end do + end subroutine gpu_advanceMUR_Hx_left_kernel + + subroutine gpu_advanceMUR_Hz_left_kernel(Hz_d, sggMiHz_d, past_Hz_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + medio = sggMiHz_d(i, ji+1, k) + Hz_d(i, ji, k) = past_Hz_d(i, ji+1, k) + & + CAB1_d(medio) * (Hz_d(i, ji+1, k) - past_Hz_d(i, ji, k)) + end do + end do + end subroutine gpu_advanceMUR_Hz_left_kernel + + !-------------------------------------------------------------------------------- + ! MUR Right boundary - GPU accelerated kernels + !-------------------------------------------------------------------------------- + subroutine gpu_advanceMUR_H_right(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_advanceMUR_Hx_right_kernel(this%Hx_d, this%sggMiHx_d, & + this%mur_past_Hx_right, this%mur_right_CAB1, & + this%mur_right_Hx_ii, this%mur_right_Hx_ij, & + this%mur_right_Hx_ji, this%mur_right_Hx_jj, & + this%mur_right_Hx_ki, this%mur_right_Hx_kj) + + call gpu_advanceMUR_Hz_right_kernel(this%Hz_d, this%sggMiHz_d, & + this%mur_past_Hz_right, this%mur_right_CAB1, & + this%mur_right_Hz_ii, this%mur_right_Hz_ij, & + this%mur_right_Hz_ji, this%mur_right_Hz_jj, & + this%mur_right_Hz_ki, this%mur_right_Hz_kj) + + end subroutine gpu_advanceMUR_H_right + + subroutine gpu_advanceMUR_Hx_right_kernel(Hx_d, sggMiHx_d, past_Hx_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + medio = sggMiHx_d(i, ji-1, k) + Hx_d(i, ji, k) = past_Hx_d(i, ji-1, k) + & + CAB1_d(medio) * (Hx_d(i, ji-1, k) - past_Hx_d(i, ji, k)) + end do + end do + end subroutine gpu_advanceMUR_Hx_right_kernel + + subroutine gpu_advanceMUR_Hz_right_kernel(Hz_d, sggMiHz_d, past_Hz_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + medio = sggMiHz_d(i, ji-1, k) + Hz_d(i, ji, k) = past_Hz_d(i, ji-1, k) + & + CAB1_d(medio) * (Hz_d(i, ji-1, k) - past_Hz_d(i, ji, k)) + end do + end do + end subroutine gpu_advanceMUR_Hz_right_kernel + + !-------------------------------------------------------------------------------- + ! MUR Down boundary - GPU accelerated kernels + !-------------------------------------------------------------------------------- + subroutine gpu_advanceMUR_H_down(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_advanceMUR_Hy_down_kernel(this%Hy_d, this%sggMiHy_d, & + this%mur_past_Hy_down, this%mur_down_CAB1, & + this%mur_down_Hy_ii, this%mur_down_Hy_ij, & + this%mur_down_Hy_ji, this%mur_down_Hy_jj, & + this%mur_down_Hy_ki, this%mur_down_Hy_kj) + + call gpu_advanceMUR_Hx_down_kernel(this%Hx_d, this%sggMiHx_d, & + this%mur_past_Hx_down, this%mur_down_CAB1, & + this%mur_down_Hx_ii, this%mur_down_Hx_ij, & + this%mur_down_Hx_ji, this%mur_down_Hx_jj, & + this%mur_down_Hx_ki, this%mur_down_Hx_kj) + + end subroutine gpu_advanceMUR_H_down + + subroutine gpu_advanceMUR_Hy_down_kernel(Hy_d, sggMiHy_d, past_Hy_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, j + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + medio = sggMiHy_d(i, j, ki+1) + Hy_d(i, j, ki) = past_Hy_d(i, j, ki+1) + & + CAB1_d(medio) * (Hy_d(i, j, ki+1) - past_Hy_d(i, j, ki)) + end do + end do + end subroutine gpu_advanceMUR_Hy_down_kernel + + subroutine gpu_advanceMUR_Hx_down_kernel(Hx_d, sggMiHx_d, past_Hx_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, j + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + medio = sggMiHx_d(i, j, ki+1) + Hx_d(i, j, ki) = past_Hx_d(i, j, ki+1) + & + CAB1_d(medio) * (Hx_d(i, j, ki+1) - past_Hx_d(i, j, ki)) + end do + end do + end subroutine gpu_advanceMUR_Hx_down_kernel + + !-------------------------------------------------------------------------------- + ! MUR Up boundary - GPU accelerated kernels + !-------------------------------------------------------------------------------- + subroutine gpu_advanceMUR_H_up(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_advanceMUR_Hy_up_kernel(this%Hy_d, this%sggMiHy_d, & + this%mur_past_Hy_up, this%mur_up_CAB1, & + this%mur_up_Hy_ii, this%mur_up_Hy_ij, & + this%mur_up_Hy_ji, this%mur_up_Hy_jj, & + this%mur_up_Hy_ki, this%mur_up_Hy_kj) + + call gpu_advanceMUR_Hx_up_kernel(this%Hx_d, this%sggMiHx_d, & + this%mur_past_Hx_up, this%mur_up_CAB1, & + this%mur_up_Hx_ii, this%mur_up_Hx_ij, & + this%mur_up_Hx_ji, this%mur_up_Hx_jj, & + this%mur_up_Hx_ki, this%mur_up_Hx_kj) + + end subroutine gpu_advanceMUR_H_up + + subroutine gpu_advanceMUR_Hy_up_kernel(Hy_d, sggMiHy_d, past_Hy_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, j + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + medio = sggMiHy_d(i, j, kj-1) + Hy_d(i, j, kj) = past_Hy_d(i, j, kj-1) + & + CAB1_d(medio) * (Hy_d(i, j, kj-1) - past_Hy_d(i, j, kj)) + end do + end do + end subroutine gpu_advanceMUR_Hy_up_kernel + + subroutine gpu_advanceMUR_Hx_up_kernel(Hx_d, sggMiHx_d, past_Hx_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, j + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + medio = sggMiHx_d(i, j, kj-1) + Hx_d(i, j, kj) = past_Hx_d(i, j, kj-1) + & + CAB1_d(medio) * (Hx_d(i, j, kj-1) - past_Hx_d(i, j, kj)) + end do + end do + end subroutine gpu_advanceMUR_Hx_up_kernel + + !-------------------------------------------------------------------------------- + ! MUR Back boundary - GPU accelerated kernels + !-------------------------------------------------------------------------------- + subroutine gpu_advanceMUR_H_back(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_advanceMUR_Hz_back_kernel(this%Hz_d, this%sggMiHz_d, & + this%mur_past_Hz_back, this%mur_back_CAB1, & + this%mur_back_Hz_ii, this%mur_back_Hz_ij, & + this%mur_back_Hz_ji, this%mur_back_Hz_jj, & + this%mur_back_Hz_ki, this%mur_back_Hz_kj) + + call gpu_advanceMUR_Hy_back_kernel(this%Hy_d, this%sggMiHy_d, & + this%mur_past_Hy_back, this%mur_back_CAB1, & + this%mur_back_Hy_ii, this%mur_back_Hy_ij, & + this%mur_back_Hy_ji, this%mur_back_Hy_jj, & + this%mur_back_Hy_ki, this%mur_back_Hy_kj) + + end subroutine gpu_advanceMUR_H_back + + subroutine gpu_advanceMUR_Hz_back_kernel(Hz_d, sggMiHz_d, past_Hz_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, j + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + medio = sggMiHz_d(i, j, ki-1) + Hz_d(i, j, ki) = past_Hz_d(i, j, ki-1) + & + CAB1_d(medio) * (Hz_d(i, j, ki-1) - past_Hz_d(i, j, ki)) + end do + end do + end subroutine gpu_advanceMUR_Hz_back_kernel + + subroutine gpu_advanceMUR_Hy_back_kernel(Hy_d, sggMiHy_d, past_Hy_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, j + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + medio = sggMiHy_d(i, j, ki-1) + Hy_d(i, j, ki) = past_Hy_d(i, j, ki-1) + & + CAB1_d(medio) * (Hy_d(i, j, ki-1) - past_Hy_d(i, j, ki)) + end do + end do + end subroutine gpu_advanceMUR_Hy_back_kernel + + !-------------------------------------------------------------------------------- + ! MUR Front boundary - GPU accelerated kernels + !-------------------------------------------------------------------------------- + subroutine gpu_advanceMUR_H_front(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_advanceMUR_Hz_front_kernel(this%Hz_d, this%sggMiHz_d, & + this%mur_past_Hz_front, this%mur_front_CAB1, & + this%mur_front_Hz_ii, this%mur_front_Hz_ij, & + this%mur_front_Hz_ji, this%mur_front_Hz_jj, & + this%mur_front_Hz_ki, this%mur_front_Hz_kj) + + call gpu_advanceMUR_Hy_front_kernel(this%Hy_d, this%sggMiHy_d, & + this%mur_past_Hy_front, this%mur_front_CAB1, & + this%mur_front_Hy_ii, this%mur_front_Hy_ij, & + this%mur_front_Hy_ji, this%mur_front_Hy_jj, & + this%mur_front_Hy_ki, this%mur_front_Hy_kj) + + end subroutine gpu_advanceMUR_H_front + + subroutine gpu_advanceMUR_Hz_front_kernel(Hz_d, sggMiHz_d, past_Hz_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, j + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + medio = sggMiHz_d(i, j, kj+1) + Hz_d(i, j, kj) = past_Hz_d(i, j, kj+1) + & + CAB1_d(medio) * (Hz_d(i, j, kj+1) - past_Hz_d(i, j, kj)) + end do + end do + end subroutine gpu_advanceMUR_Hz_front_kernel + + subroutine gpu_advanceMUR_Hy_front_kernel(Hy_d, sggMiHy_d, past_Hy_d, CAB1_d, & + ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:) :: CAB1_d + + integer(kind=4) :: i, j + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + medio = sggMiHy_d(i, j, kj+1) + Hy_d(i, j, kj) = past_Hy_d(i, j, kj+1) + & + CAB1_d(medio) * (Hy_d(i, j, kj+1) - past_Hy_d(i, j, kj)) + end do + end do + end subroutine gpu_advanceMUR_Hy_front_kernel + + !-------------------------------------------------------------------------------- + ! Update MUR past fields on device - copy current Hx/Hy/Hz to past arrays + ! Called after each MUR step to prepare for next timestep + !-------------------------------------------------------------------------------- + subroutine gpu_update_mur_past_left(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_update_mur_past_Hx_left_kernel(this%Hx_d, this%mur_past_Hx_left, & + this%mur_left_Hx_ii, this%mur_left_Hx_ij, & + this%mur_left_Hx_ji, this%mur_left_Hx_jj, & + this%mur_left_Hx_ki, this%mur_left_Hx_kj) + + call gpu_update_mur_past_Hz_left_kernel(this%Hz_d, this%mur_past_Hz_left, & + this%mur_left_Hz_ii, this%mur_left_Hz_ij, & + this%mur_left_Hz_ji, this%mur_left_Hz_jj, & + this%mur_left_Hz_ki, this%mur_left_Hz_kj) + end subroutine gpu_update_mur_past_left + + subroutine gpu_update_mur_past_Hx_left_kernel(Hx_d, past_Hx_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d + + integer(kind=4) :: i, k + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + past_Hx_d(i, ji, k) = Hx_d(i, ji, k) + end do + end do + end subroutine gpu_update_mur_past_Hx_left_kernel + + subroutine gpu_update_mur_past_Hz_left_kernel(Hz_d, past_Hz_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d + + integer(kind=4) :: i, k + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + past_Hz_d(i, ji, k) = Hz_d(i, ji, k) + end do + end do + end subroutine gpu_update_mur_past_Hz_left_kernel + + subroutine gpu_update_mur_past_right(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_update_mur_past_Hx_right_kernel(this%Hx_d, this%mur_past_Hx_right, & + this%mur_right_Hx_ii, this%mur_right_Hx_ij, & + this%mur_right_Hx_ji, this%mur_right_Hx_jj, & + this%mur_right_Hx_ki, this%mur_right_Hx_kj) + + call gpu_update_mur_past_Hz_right_kernel(this%Hz_d, this%mur_past_Hz_right, & + this%mur_right_Hz_ii, this%mur_right_Hz_ij, & + this%mur_right_Hz_ji, this%mur_right_Hz_jj, & + this%mur_right_Hz_ki, this%mur_right_Hz_kj) + end subroutine gpu_update_mur_past_right + + subroutine gpu_update_mur_past_Hx_right_kernel(Hx_d, past_Hx_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d + + integer(kind=4) :: i, k + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + past_Hx_d(i, ji, k) = Hx_d(i, ji, k) + end do + end do + end subroutine gpu_update_mur_past_Hx_right_kernel + + subroutine gpu_update_mur_past_Hz_right_kernel(Hz_d, past_Hz_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d + + integer(kind=4) :: i, k + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + past_Hz_d(i, ji, k) = Hz_d(i, ji, k) + end do + end do + end subroutine gpu_update_mur_past_Hz_right_kernel + + subroutine gpu_update_mur_past_down(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_update_mur_past_Hy_down_kernel(this%Hy_d, this%mur_past_Hy_down, & + this%mur_down_Hy_ii, this%mur_down_Hy_ij, & + this%mur_down_Hy_ji, this%mur_down_Hy_jj, & + this%mur_down_Hy_ki, this%mur_down_Hy_kj) + + call gpu_update_mur_past_Hx_down_kernel(this%Hx_d, this%mur_past_Hx_down, & + this%mur_down_Hx_ii, this%mur_down_Hx_ij, & + this%mur_down_Hx_ji, this%mur_down_Hx_jj, & + this%mur_down_Hx_ki, this%mur_down_Hx_kj) + end subroutine gpu_update_mur_past_down + + subroutine gpu_update_mur_past_Hy_down_kernel(Hy_d, past_Hy_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d + + integer(kind=4) :: i, k + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + past_Hy_d(i, ji, k) = Hy_d(i, ji, k) + end do + end do + end subroutine gpu_update_mur_past_Hy_down_kernel + + subroutine gpu_update_mur_past_Hx_down_kernel(Hx_d, past_Hx_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d + + integer(kind=4) :: i, k + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + past_Hx_d(i, ji, k) = Hx_d(i, ji, k) + end do + end do + end subroutine gpu_update_mur_past_Hx_down_kernel + + subroutine gpu_update_mur_past_up(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_update_mur_past_Hy_up_kernel(this%Hy_d, this%mur_past_Hy_up, & + this%mur_up_Hy_ii, this%mur_up_Hy_ij, & + this%mur_up_Hy_ji, this%mur_up_Hy_jj, & + this%mur_up_Hy_ki, this%mur_up_Hy_kj) + + call gpu_update_mur_past_Hx_up_kernel(this%Hx_d, this%mur_past_Hx_up, & + this%mur_up_Hx_ii, this%mur_up_Hx_ij, & + this%mur_up_Hx_ji, this%mur_up_Hx_jj, & + this%mur_up_Hx_ki, this%mur_up_Hx_kj) + end subroutine gpu_update_mur_past_up + + subroutine gpu_update_mur_past_Hy_up_kernel(Hy_d, past_Hy_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d + + integer(kind=4) :: i, k + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + past_Hy_d(i, ji, k) = Hy_d(i, ji, k) + end do + end do + end subroutine gpu_update_mur_past_Hy_up_kernel + + subroutine gpu_update_mur_past_Hx_up_kernel(Hx_d, past_Hx_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d + + integer(kind=4) :: i, k + + !$cuf kernel do(2) + do k=ki,kj + do i=ii,ij + past_Hx_d(i, ji, k) = Hx_d(i, ji, k) + end do + end do + end subroutine gpu_update_mur_past_Hx_up_kernel + + subroutine gpu_update_mur_past_back(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_update_mur_past_Hz_back_kernel(this%Hz_d, this%mur_past_Hz_back, & + this%mur_back_Hz_ii, this%mur_back_Hz_ij, & + this%mur_back_Hz_ji, this%mur_back_Hz_jj, & + this%mur_back_Hz_ki, this%mur_back_Hz_kj) + + call gpu_update_mur_past_Hy_back_kernel(this%Hy_d, this%mur_past_Hy_back, & + this%mur_back_Hy_ii, this%mur_back_Hy_ij, & + this%mur_back_Hy_ji, this%mur_back_Hy_jj, & + this%mur_back_Hy_ki, this%mur_back_Hy_kj) + end subroutine gpu_update_mur_past_back + + subroutine gpu_update_mur_past_Hz_back_kernel(Hz_d, past_Hz_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d + + integer(kind=4) :: i, j + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + past_Hz_d(i, j, kj) = Hz_d(i, j, kj) + end do + end do + end subroutine gpu_update_mur_past_Hz_back_kernel + + subroutine gpu_update_mur_past_Hy_back_kernel(Hy_d, past_Hy_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d + + integer(kind=4) :: i, j + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + past_Hy_d(i, j, kj) = Hy_d(i, j, kj) + end do + end do + end subroutine gpu_update_mur_past_Hy_back_kernel + + subroutine gpu_update_mur_past_front(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call gpu_update_mur_past_Hz_front_kernel(this%Hz_d, this%mur_past_Hz_front, & + this%mur_front_Hz_ii, this%mur_front_Hz_ij, & + this%mur_front_Hz_ji, this%mur_front_Hz_jj, & + this%mur_front_Hz_ki, this%mur_front_Hz_kj) + + call gpu_update_mur_past_Hy_front_kernel(this%Hy_d, this%mur_past_Hy_front, & + this%mur_front_Hy_ii, this%mur_front_Hy_ij, & + this%mur_front_Hy_ji, this%mur_front_Hy_jj, & + this%mur_front_Hy_ki, this%mur_front_Hy_kj) + end subroutine gpu_update_mur_past_front + + subroutine gpu_update_mur_past_Hz_front_kernel(Hz_d, past_Hz_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d + + integer(kind=4) :: i, j + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + past_Hz_d(i, j, kj) = Hz_d(i, j, kj) + end do + end do + end subroutine gpu_update_mur_past_Hz_front_kernel + + subroutine gpu_update_mur_past_Hy_front_kernel(Hy_d, past_Hy_d, ii, ij, ji, jj, ki, kj) + integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d + + integer(kind=4) :: i, j + + !$cuf kernel do(2) + do j=ji,jj + do i=ii,ij + past_Hy_d(i, j, kj) = Hy_d(i, j, kj) + end do + end do + end subroutine gpu_update_mur_past_Hy_front_kernel + + !-------------------------------------------------------------------------------- + ! FUSED MUR KERNELS — 3 kernels instead of 24 (advance + update_past per field) + ! Reduces kernel launches from 300K to ~30K (3 per timestep) + !-------------------------------------------------------------------------------- + + ! Fused advance for Hx: handles left, right, down, up boundaries + subroutine gpu_fused_mur_advance_hx(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call fused_mur_advance_hx_kernel( & + this%Hx_d, this%sggMiHx_d, & + this%mur_past_Hx_left, this%mur_past_Hx_right, & + this%mur_past_Hx_down, this%mur_past_Hx_up, & + this%mur_left_Hx_ii, this%mur_left_Hx_ij, this%mur_left_Hx_ji, this%mur_left_Hx_jj, this%mur_left_Hx_ki, this%mur_left_Hx_kj, & + this%mur_right_Hx_ii, this%mur_right_Hx_ij, this%mur_right_Hx_ji, this%mur_right_Hx_jj, this%mur_right_Hx_ki, this%mur_right_Hx_kj, & + this%mur_down_Hx_ii, this%mur_down_Hx_ij, this%mur_down_Hx_ji, this%mur_down_Hx_jj, this%mur_down_Hx_ki, this%mur_down_Hx_kj, & + this%mur_up_Hx_ii, this%mur_up_Hx_ij, this%mur_up_Hx_ji, this%mur_up_Hx_jj, this%mur_up_Hx_ki, this%mur_up_Hx_kj, & + this%mur_left_CAB1, this%mur_right_CAB1, this%mur_down_CAB1, this%mur_up_CAB1) + + end subroutine gpu_fused_mur_advance_hx + + ! Fused advance for Hy: handles down, up, back, front boundaries + subroutine gpu_fused_mur_advance_hy(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call fused_mur_advance_hy_kernel( & + this%Hy_d, this%sggMiHy_d, & + this%mur_past_Hy_down, this%mur_past_Hy_up, & + this%mur_past_Hy_back, this%mur_past_Hy_front, & + this%mur_down_Hy_ii, this%mur_down_Hy_ij, this%mur_down_Hy_ji, this%mur_down_Hy_jj, this%mur_down_Hy_ki, this%mur_down_Hy_kj, & + this%mur_up_Hy_ii, this%mur_up_Hy_ij, this%mur_up_Hy_ji, this%mur_up_Hy_jj, this%mur_up_Hy_ki, this%mur_up_Hy_kj, & + this%mur_back_Hy_ii, this%mur_back_Hy_ij, this%mur_back_Hy_ji, this%mur_back_Hy_jj, this%mur_back_Hy_ki, this%mur_back_Hy_kj, & + this%mur_front_Hy_ii, this%mur_front_Hy_ij, this%mur_front_Hy_ji, this%mur_front_Hy_jj, this%mur_front_Hy_ki, this%mur_front_Hy_kj, & + this%mur_down_CAB1, this%mur_up_CAB1, this%mur_back_CAB1, this%mur_front_CAB1) + + end subroutine gpu_fused_mur_advance_hy + + ! Fused advance for Hz: handles left, right, back, front boundaries + subroutine gpu_fused_mur_advance_hz(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call fused_mur_advance_hz_kernel( & + this%Hz_d, this%sggMiHz_d, & + this%mur_past_Hz_left, this%mur_past_Hz_right, & + this%mur_past_Hz_back, this%mur_past_Hz_front, & + this%mur_left_Hz_ii, this%mur_left_Hz_ij, this%mur_left_Hz_ji, this%mur_left_Hz_jj, this%mur_left_Hz_ki, this%mur_left_Hz_kj, & + this%mur_right_Hz_ii, this%mur_right_Hz_ij, this%mur_right_Hz_ji, this%mur_right_Hz_jj, this%mur_right_Hz_ki, this%mur_right_Hz_kj, & + this%mur_back_Hz_ii, this%mur_back_Hz_ij, this%mur_back_Hz_ji, this%mur_back_Hz_jj, this%mur_back_Hz_ki, this%mur_back_Hz_kj, & + this%mur_front_Hz_ii, this%mur_front_Hz_ij, this%mur_front_Hz_ji, this%mur_front_Hz_jj, this%mur_front_Hz_ki, this%mur_front_Hz_kj, & + this%mur_left_CAB1, this%mur_right_CAB1, this%mur_back_CAB1, this%mur_front_CAB1) + + end subroutine gpu_fused_mur_advance_hz + + ! Fused update_past for Hx: left, right, down, up + subroutine gpu_fused_mur_update_past_hx(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call fused_mur_update_past_hx_kernel( & + this%Hx_d, & + this%mur_past_Hx_left, this%mur_past_Hx_right, & + this%mur_past_Hx_down, this%mur_past_Hx_up, & + this%mur_left_Hx_ii, this%mur_left_Hx_ij, this%mur_left_Hx_ji, this%mur_left_Hx_jj, this%mur_left_Hx_ki, this%mur_left_Hx_kj, & + this%mur_right_Hx_ii, this%mur_right_Hx_ij, this%mur_right_Hx_ji, this%mur_right_Hx_jj, this%mur_right_Hx_ki, this%mur_right_Hx_kj, & + this%mur_down_Hx_ii, this%mur_down_Hx_ij, this%mur_down_Hx_ji, this%mur_down_Hx_jj, this%mur_down_Hx_ki, this%mur_down_Hx_kj, & + this%mur_up_Hx_ii, this%mur_up_Hx_ij, this%mur_up_Hx_ji, this%mur_up_Hx_jj, this%mur_up_Hx_ki, this%mur_up_Hx_kj) + + end subroutine gpu_fused_mur_update_past_hx + + ! Fused update_past for Hy: down, up, back, front + subroutine gpu_fused_mur_update_past_hy(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call fused_mur_update_past_hy_kernel( & + this%Hy_d, & + this%mur_past_Hy_down, this%mur_past_Hy_up, & + this%mur_past_Hy_back, this%mur_past_Hy_front, & + this%mur_down_Hy_ii, this%mur_down_Hy_ij, this%mur_down_Hy_ji, this%mur_down_Hy_jj, this%mur_down_Hy_ki, this%mur_down_Hy_kj, & + this%mur_up_Hy_ii, this%mur_up_Hy_ij, this%mur_up_Hy_ji, this%mur_up_Hy_jj, this%mur_up_Hy_ki, this%mur_up_Hy_kj, & + this%mur_back_Hy_ii, this%mur_back_Hy_ij, this%mur_back_Hy_ji, this%mur_back_Hy_jj, this%mur_back_Hy_ki, this%mur_back_Hy_kj, & + this%mur_front_Hy_ii, this%mur_front_Hy_ij, this%mur_front_Hy_ji, this%mur_front_Hy_jj, this%mur_front_Hy_ki, this%mur_front_Hy_kj) + + end subroutine gpu_fused_mur_update_past_hy + + ! Fused update_past for Hz: left, right, back, front + subroutine gpu_fused_mur_update_past_hz(this, b) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: b + + if (.not. this%mur_initialized) return + + call fused_mur_update_past_hz_kernel( & + this%Hz_d, & + this%mur_past_Hz_left, this%mur_past_Hz_right, & + this%mur_past_Hz_back, this%mur_past_Hz_front, & + this%mur_left_Hz_ii, this%mur_left_Hz_ij, this%mur_left_Hz_ji, this%mur_left_Hz_jj, this%mur_left_Hz_ki, this%mur_left_Hz_kj, & + this%mur_right_Hz_ii, this%mur_right_Hz_ij, this%mur_right_Hz_ji, this%mur_right_Hz_jj, this%mur_right_Hz_ki, this%mur_right_Hz_kj, & + this%mur_back_Hz_ii, this%mur_back_Hz_ij, this%mur_back_Hz_ji, this%mur_back_Hz_jj, this%mur_back_Hz_ki, this%mur_back_Hz_kj, & + this%mur_front_Hz_ii, this%mur_front_Hz_ij, this%mur_front_Hz_ji, this%mur_front_Hz_jj, this%mur_front_Hz_ki, this%mur_front_Hz_kj) + + end subroutine gpu_fused_mur_update_past_hz + + !=============================================================================== + ! Fused advance Hx kernel — 4 boundaries: left, right, down, up + !=============================================================================== + subroutine fused_mur_advance_hx_kernel(Hx_d, sggMiHx_d, & + past_Hx_left, past_Hx_right, past_Hx_down, past_Hx_up, & + l_ii, l_ij, l_ji, l_jj, l_ki, l_kj, & + r_ii, r_ij, r_ji, r_jj, r_ki, r_kj, & + d_ii, d_ij, d_ji, d_jj, d_ki, d_kj, & + u_ii, u_ij, u_ji, u_jj, u_ki, u_kj, & + left_CAB1, right_CAB1, down_CAB1, up_CAB1) + + real(kind=rkind), device, dimension(:,:,:) :: Hx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:,:,:) :: past_Hx_left, past_Hx_right, past_Hx_down, past_Hx_up + real(kind=rkind), device, dimension(:) :: left_CAB1, right_CAB1, down_CAB1, up_CAB1 + integer(kind=4), intent(in) :: l_ii, l_ij, l_ji, l_jj, l_ki, l_kj + integer(kind=4), intent(in) :: r_ii, r_ij, r_ji, r_jj, r_ki, r_kj + integer(kind=4), intent(in) :: d_ii, d_ij, d_ji, d_jj, d_ki, d_kj + integer(kind=4), intent(in) :: u_ii, u_ij, u_ji, u_jj, u_ki, u_kj + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do k=l_ki,l_kj + do i=l_ii,l_ij + medio = sggMiHx_d(i, l_ji+1, k) + Hx_d(i, l_ji, k) = past_Hx_left(i, l_ji+1, k) + & + left_CAB1(medio) * (Hx_d(i, l_ji+1, k) - past_Hx_left(i, l_ji, k)) + end do + end do + + !$cuf kernel do(2) + do k=r_ki,r_kj + do i=r_ii,r_ij + medio = sggMiHx_d(i, r_ji-1, k) + Hx_d(i, r_ji, k) = past_Hx_right(i, r_ji-1, k) + & + right_CAB1(medio) * (Hx_d(i, r_ji-1, k) - past_Hx_right(i, r_ji, k)) + end do + end do + + !$cuf kernel do(2) + do j=d_ji,d_jj + do i=d_ii,d_ij + medio = sggMiHx_d(i, j, d_ki+1) + Hx_d(i, j, d_ki) = past_Hx_down(i, j, d_ki+1) + & + down_CAB1(medio) * (Hx_d(i, j, d_ki+1) - past_Hx_down(i, j, d_ki)) + end do + end do + + !$cuf kernel do(2) + do j=u_ji,u_jj + do i=u_ii,u_ij + medio = sggMiHx_d(i, j, u_kj-1) + Hx_d(i, j, u_kj) = past_Hx_up(i, j, u_kj-1) + & + up_CAB1(medio) * (Hx_d(i, j, u_kj-1) - past_Hx_up(i, j, u_kj)) + end do + end do + + end subroutine fused_mur_advance_hx_kernel + + !=============================================================================== + ! Fused advance Hy kernel — 4 boundaries: down, up, back, front + !=============================================================================== + subroutine fused_mur_advance_hy_kernel(Hy_d, sggMiHy_d, & + past_Hy_down, past_Hy_up, past_Hy_back, past_Hy_front, & + d_ii, d_ij, d_ji, d_jj, d_ki, d_kj, & + u_ii, u_ij, u_ji, u_jj, u_ki, u_kj, & + b_ii, b_ij, b_ji, b_jj, b_ki, b_kj, & + f_ii, f_ij, f_ji, f_jj, f_ki, f_kj, & + down_CAB1, up_CAB1, back_CAB1, front_CAB1) + + real(kind=rkind), device, dimension(:,:,:) :: Hy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:,:,:) :: past_Hy_down, past_Hy_up, past_Hy_back, past_Hy_front + real(kind=rkind), device, dimension(:) :: down_CAB1, up_CAB1, back_CAB1, front_CAB1 + integer(kind=4), intent(in) :: d_ii, d_ij, d_ji, d_jj, d_ki, d_kj + integer(kind=4), intent(in) :: u_ii, u_ij, u_ji, u_jj, u_ki, u_kj + integer(kind=4), intent(in) :: b_ii, b_ij, b_ji, b_jj, b_ki, b_kj + integer(kind=4), intent(in) :: f_ii, f_ij, f_ji, f_jj, f_ki, f_kj + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do j=d_ji,d_jj + do i=d_ii,d_ij + medio = sggMiHy_d(i, j, d_ki+1) + Hy_d(i, j, d_ki) = past_Hy_down(i, j, d_ki+1) + & + down_CAB1(medio) * (Hy_d(i, j, d_ki+1) - past_Hy_down(i, j, d_ki)) + end do + end do + + !$cuf kernel do(2) + do j=u_ji,u_jj + do i=u_ii,u_ij + medio = sggMiHy_d(i, j, u_kj-1) + Hy_d(i, j, u_kj) = past_Hy_up(i, j, u_kj-1) + & + up_CAB1(medio) * (Hy_d(i, j, u_kj-1) - past_Hy_up(i, j, u_kj)) + end do + end do + + !$cuf kernel do(2) + do j=b_ji,b_jj + do i=b_ii,b_ij + medio = sggMiHy_d(i, j, b_ki-1) + Hy_d(i, j, b_ki) = past_Hy_back(i, j, b_ki-1) + & + back_CAB1(medio) * (Hy_d(i, j, b_ki-1) - past_Hy_back(i, j, b_ki)) + end do + end do + + !$cuf kernel do(2) + do j=f_ji,f_jj + do i=f_ii,f_ij + medio = sggMiHy_d(i, j, f_kj+1) + Hy_d(i, j, f_kj) = past_Hy_front(i, j, f_kj+1) + & + front_CAB1(medio) * (Hy_d(i, j, f_kj+1) - past_Hy_front(i, j, f_kj)) + end do + end do + + end subroutine fused_mur_advance_hy_kernel + + !=============================================================================== + ! Fused advance Hz kernel — 4 boundaries: left, right, back, front + !=============================================================================== + subroutine fused_mur_advance_hz_kernel(Hz_d, sggMiHz_d, & + past_Hz_left, past_Hz_right, past_Hz_back, past_Hz_front, & + l_ii, l_ij, l_ji, l_jj, l_ki, l_kj, & + r_ii, r_ij, r_ji, r_jj, r_ki, r_kj, & + b_ii, b_ij, b_ji, b_jj, b_ki, b_kj, & + f_ii, f_ij, f_ji, f_jj, f_ki, f_kj, & + left_CAB1, right_CAB1, back_CAB1, front_CAB1) + + real(kind=rkind), device, dimension(:,:,:) :: Hz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:,:,:) :: past_Hz_left, past_Hz_right, past_Hz_back, past_Hz_front + real(kind=rkind), device, dimension(:) :: left_CAB1, right_CAB1, back_CAB1, front_CAB1 + integer(kind=4), intent(in) :: l_ii, l_ij, l_ji, l_jj, l_ki, l_kj + integer(kind=4), intent(in) :: r_ii, r_ij, r_ji, r_jj, r_ki, r_kj + integer(kind=4), intent(in) :: b_ii, b_ij, b_ji, b_jj, b_ki, b_kj + integer(kind=4), intent(in) :: f_ii, f_ij, f_ji, f_jj, f_ki, f_kj + + integer(kind=4) :: i, j, k + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(2) + do k=l_ki,l_kj + do i=l_ii,l_ij + medio = sggMiHz_d(i, l_ji+1, k) + Hz_d(i, l_ji, k) = past_Hz_left(i, l_ji+1, k) + & + left_CAB1(medio) * (Hz_d(i, l_ji+1, k) - past_Hz_left(i, l_ji, k)) + end do + end do + + !$cuf kernel do(2) + do k=r_ki,r_kj + do i=r_ii,r_ij + medio = sggMiHz_d(i, r_ji-1, k) + Hz_d(i, r_ji, k) = past_Hz_right(i, r_ji-1, k) + & + right_CAB1(medio) * (Hz_d(i, r_ji-1, k) - past_Hz_right(i, r_ji, k)) + end do + end do + + !$cuf kernel do(2) + do j=b_ji,b_jj + do i=b_ii,b_ij + medio = sggMiHz_d(i, j, b_ki-1) + Hz_d(i, j, b_ki) = past_Hz_back(i, j, b_ki-1) + & + back_CAB1(medio) * (Hz_d(i, j, b_ki-1) - past_Hz_back(i, j, b_ki)) + end do + end do + + !$cuf kernel do(2) + do j=f_ji,f_jj + do i=f_ii,f_ij + medio = sggMiHz_d(i, j, f_kj+1) + Hz_d(i, j, f_kj) = past_Hz_front(i, j, f_kj+1) + & + front_CAB1(medio) * (Hz_d(i, j, f_kj+1) - past_Hz_front(i, j, f_kj)) + end do + end do + + end subroutine fused_mur_advance_hz_kernel + + !=============================================================================== + ! Fused update_past Hx kernel — 4 boundaries: left, right, down, up + !=============================================================================== + subroutine fused_mur_update_past_hx_kernel(Hx_d, & + past_Hx_left, past_Hx_right, past_Hx_down, past_Hx_up, & + l_ii, l_ij, l_ji, l_jj, l_ki, l_kj, & + r_ii, r_ij, r_ji, r_jj, r_ki, r_kj, & + d_ii, d_ij, d_ji, d_jj, d_ki, d_kj, & + u_ii, u_ij, u_ji, u_jj, u_ki, u_kj) + + real(kind=rkind), device, dimension(:,:,:) :: Hx_d + real(kind=rkind), device, dimension(:,:,:) :: past_Hx_left, past_Hx_right, past_Hx_down, past_Hx_up + integer(kind=4), intent(in) :: l_ii, l_ij, l_ji, l_jj, l_ki, l_kj + integer(kind=4), intent(in) :: r_ii, r_ij, r_ji, r_jj, r_ki, r_kj + integer(kind=4), intent(in) :: d_ii, d_ij, d_ji, d_jj, d_ki, d_kj + integer(kind=4), intent(in) :: u_ii, u_ij, u_ji, u_jj, u_ki, u_kj + + integer(kind=4) :: i, j, k + + !$cuf kernel do(2) + do k=l_ki,l_kj + do i=l_ii,l_ij + past_Hx_left(i, l_ji, k) = Hx_d(i, l_ji, k) + end do + end do + + !$cuf kernel do(2) + do k=r_ki,r_kj + do i=r_ii,r_ij + past_Hx_right(i, r_ji, k) = Hx_d(i, r_ji, k) + end do + end do + + !$cuf kernel do(2) + do k=d_ki,d_kj + do i=d_ii,d_ij + past_Hx_down(i, d_ji, k) = Hx_d(i, d_ji, k) + end do + end do + + !$cuf kernel do(2) + do k=u_ki,u_kj + do i=u_ii,u_ij + past_Hx_up(i, u_ji, k) = Hx_d(i, u_ji, k) + end do + end do + + end subroutine fused_mur_update_past_hx_kernel + + !=============================================================================== + ! Fused update_past Hy kernel — 4 boundaries: down, up, back, front + !=============================================================================== + subroutine fused_mur_update_past_hy_kernel(Hy_d, & + past_Hy_down, past_Hy_up, past_Hy_back, past_Hy_front, & + d_ii, d_ij, d_ji, d_jj, d_ki, d_kj, & + u_ii, u_ij, u_ji, u_jj, u_ki, u_kj, & + b_ii, b_ij, b_ji, b_jj, b_ki, b_kj, & + f_ii, f_ij, f_ji, f_jj, f_ki, f_kj) + + real(kind=rkind), device, dimension(:,:,:) :: Hy_d + real(kind=rkind), device, dimension(:,:,:) :: past_Hy_down, past_Hy_up, past_Hy_back, past_Hy_front + integer(kind=4), intent(in) :: d_ii, d_ij, d_ji, d_jj, d_ki, d_kj + integer(kind=4), intent(in) :: u_ii, u_ij, u_ji, u_jj, u_ki, u_kj + integer(kind=4), intent(in) :: b_ii, b_ij, b_ji, b_jj, b_ki, b_kj + integer(kind=4), intent(in) :: f_ii, f_ij, f_ji, f_jj, f_ki, f_kj + + integer(kind=4) :: i, j, k + + !$cuf kernel do(2) + do k=d_ki,d_kj + do i=d_ii,d_ij + past_Hy_down(i, d_ji, k) = Hy_d(i, d_ji, k) + end do + end do + + !$cuf kernel do(2) + do k=u_ki,u_kj + do i=u_ii,u_ij + past_Hy_up(i, u_ji, k) = Hy_d(i, u_ji, k) + end do + end do + + !$cuf kernel do(2) + do j=b_ji,b_jj + do i=b_ii,b_ij + past_Hy_back(i, j, b_kj) = Hy_d(i, j, b_kj) + end do + end do + + !$cuf kernel do(2) + do j=f_ji,f_jj + do i=f_ii,f_ij + past_Hy_front(i, j, f_kj) = Hy_d(i, j, f_kj) + end do + end do + + end subroutine fused_mur_update_past_hy_kernel + + !=============================================================================== + ! Fused update_past Hz kernel — 4 boundaries: left, right, back, front + !=============================================================================== + subroutine fused_mur_update_past_hz_kernel(Hz_d, & + past_Hz_left, past_Hz_right, past_Hz_back, past_Hz_front, & + l_ii, l_ij, l_ji, l_jj, l_ki, l_kj, & + r_ii, r_ij, r_ji, r_jj, r_ki, r_kj, & + b_ii, b_ij, b_ji, b_jj, b_ki, b_kj, & + f_ii, f_ij, f_ji, f_jj, f_ki, f_kj) + + real(kind=rkind), device, dimension(:,:,:) :: Hz_d + real(kind=rkind), device, dimension(:,:,:) :: past_Hz_left, past_Hz_right, past_Hz_back, past_Hz_front + integer(kind=4), intent(in) :: l_ii, l_ij, l_ji, l_jj, l_ki, l_kj + integer(kind=4), intent(in) :: r_ii, r_ij, r_ji, r_jj, r_ki, r_kj + integer(kind=4), intent(in) :: b_ii, b_ij, b_ji, b_jj, b_ki, b_kj + integer(kind=4), intent(in) :: f_ii, f_ij, f_ji, f_jj, f_ki, f_kj + + integer(kind=4) :: i, j, k + + !$cuf kernel do(2) + do k=l_ki,l_kj + do i=l_ii,l_ij + past_Hz_left(i, l_ji, k) = Hz_d(i, l_ji, k) + end do + end do + + !$cuf kernel do(2) + do k=r_ki,r_kj + do i=r_ii,r_ij + past_Hz_right(i, r_ji, k) = Hz_d(i, r_ji, k) + end do + end do + + !$cuf kernel do(2) + do j=b_ji,b_jj + do i=b_ii,b_ij + past_Hz_back(i, j, b_kj) = Hz_d(i, j, b_kj) + end do + end do + + !$cuf kernel do(2) + do j=f_ji,f_jj + do i=f_ii,f_ij + past_Hz_front(i, j, f_kj) = Hz_d(i, j, f_kj) + end do + end do + + end subroutine fused_mur_update_past_hz_kernel + + end module gpu_mur_m diff --git a/src_main_pub/gpu_nf2ff_m.F90 b/src_main_pub/gpu_nf2ff_m.F90 new file mode 100644 index 000000000..65ea86a74 --- /dev/null +++ b/src_main_pub/gpu_nf2ff_m.F90 @@ -0,0 +1,66 @@ +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +! GPU NF2FF MODULE - CUDA Fortran (CUF) accelerated near-to-far-field +! Infrastructure for GPU-accelerated far-field pattern computation. +! NOTE: GPU kernel not yet implemented — falls through to CPU path. +! Implementation pending: full face-specific indexing from farfield.F90. +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +module gpu_nf2ff_m + + use FDETYPES_m + use Report_m + use cudafor + use gpu_core_m + + implicit none + +contains + + !-------------------------------------------------------------------------------- + ! Device function: compute average of two complex values (geometric or arithmetic) + !-------------------------------------------------------------------------------- + attributes(device) function AverageNF2FF(pasadas, z1, z2) result(z) + complex(kind=rkind), intent(in) :: z1, z2 + integer, intent(in) :: pasadas + complex(kind=rkind) :: z + real(kind=rkind) :: phi1, phi2, nphi1, nphi2 + + z = (0.0_rkind, 0.0_rkind) + if (pasadas == 2) then + phi1 = atan2(imag(z1), real(z1)) + phi2 = atan2(imag(z2), real(z2)) + nphi1 = phi1 + nphi2 = phi2 + if ((phi1 < -pi/2.0_rkind) .and. (phi2 > pi/2.0_rkind)) nphi1 = phi1 - 2.0_rkind * pi + if ((phi2 < -pi/2.0_rkind) .and. (phi1 > pi/2.0_rkind)) nphi2 = phi2 - 2.0_rkind * pi + z = sqrt(abs(z1*z2)) * exp((0.0_rkind, 1.0_rkind) * (nphi1 + nphi2) / 2.0_rkind) + elseif (pasadas == 1) then + z = (z1 + z2) / 2.0_rkind + endif + end function AverageNF2FF + + !-------------------------------------------------------------------------------- + ! Public interface: GPU far-field flush + ! NOTE: GPU kernel not yet implemented. Falls through to CPU. + ! The gpu_state_t%nf2ff_initialized flag is set during init, + ! but the actual GPU computation is deferred until the full + ! face-specific indexing from farfield.F90 is implemented. + !-------------------------------------------------------------------------------- + subroutine gpu_flush_nf2ff(this, Etheta_h, Ephi_h, RCS_h) + class(gpu_state_t), intent(inout) :: this + real(kind=rkind), dimension(:), intent(out) :: Etheta_h, Ephi_h, RCS_h + + ! GPU kernel not yet implemented — CPU path handles this. + ! The nf2ff_initialized flag is set during init, but the actual + ! GPU flush is a no-op until full implementation. + ! TODO: Implement gpu_flush_nf2ff_kernel when NVHPC argument + ! limit is addressed (use derived type or split into multiple kernels). + + ! Zero output buffers (CPU fallback will overwrite) + Etheta_h = 0.0_rkind + Ephi_h = 0.0_rkind + RCS_h = 0.0_rkind + + end subroutine gpu_flush_nf2ff + +end module gpu_nf2ff_m diff --git a/src_main_pub/gpu_sgbc_core_m.F90 b/src_main_pub/gpu_sgbc_core_m.F90 new file mode 100644 index 000000000..faa288270 --- /dev/null +++ b/src_main_pub/gpu_sgbc_core_m.F90 @@ -0,0 +1,461 @@ +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +! GPU SGBC CORE MODULE - CUDA Fortran (CUF) accelerated SGBC +! Flattened arrays for all SGBC node data (no pointer indirection) +! Handles init/upload/download/destroy for non-dispersive SGBC only. +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +module gpu_sgbc_core_m + + use FDETYPES_m + use Report_m + use cudafor + + implicit none + + integer, parameter :: SGBC_MAX_DEPTH = 16 + + type gpu_state_sgbc_t + integer(kind=4) :: numNodes, maxDepth + + ! Per-node constants (contiguous 1D arrays, indexed by node) + integer(kind=4), device, allocatable, dimension(:) :: d_node + real(kind=rkind), device, allocatable, dimension(:) :: gm2_ext + integer(kind=4), device, allocatable, dimension(:) :: jmed_node + real(kind=rkind), device, allocatable, dimension(:) :: transE, transH, alignedH + real(kind=rkind), device, allocatable, dimension(:) :: g1_val, g2a_val, g2b_val + real(kind=rkind), device, allocatable, dimension(:) :: gm2_externo + logical, device, allocatable, dimension(:) :: correct_ha, correct_hb, crank, filo_placa + integer(kind=4), device, allocatable, dimension(:) :: depth_node + + ! Per-node state (contiguous 2D: [2*maxDepth+1, numNodes]) + real(kind=rkind), device, allocatable, dimension(:,:) :: E_state, H_state, E_past_state + real(kind=rkind), device, allocatable, dimension(:,:) :: D_state + real(kind=rkind), device, allocatable, dimension(:,:) :: G1_int, G2_int, GM1_int, GM2_int + real(kind=rkind), device, allocatable, dimension(:,:) :: a_coef, b_coef, c_coef + real(kind=rkind), device, allocatable, dimension(:,:) :: rb_coef, rh_coef, rhm1_coef + integer(kind=4), device, allocatable, dimension(:,:) :: capa_state + real(kind=rkind), device, allocatable, dimension(:,:) :: delta_state + + ! Per-node tridiag boundary constants + real(kind=rkind), device, allocatable, dimension(:) :: tridiag_a1, tridiag_b1, tridiag_c1 + real(kind=rkind), device, allocatable, dimension(:) :: tridiag_an, tridiag_bn, tridiag_cn + + ! Per-timestep field values (device, synced each step) + real(kind=rkind), device, allocatable, dimension(:) :: Efield_val, Ha_Plus_val, Ha_Minu_val + real(kind=rkind), device, allocatable, dimension(:) :: Hb_Plus_val, Hb_Minu_val + real(kind=rkind), device, allocatable, dimension(:) :: Hyee_left, Hyee_right + + ! Host-side references for field sync + real(kind=rkind), pointer, dimension(:,:,:) :: Ex, Ey, Ez, Hx, Hy, Hz + integer(kind=4), pointer, dimension(:,:,:) :: Idxe, Idye, Idze, Idxh, Idyh, Idzh + + logical :: initialized = .false. + logical :: fields_on_device = .false. + end type + +contains + + !-------------------------------------------------------------------------------- + ! Initialize GPU SGBC state - called once at SGBC init + !-------------------------------------------------------------------------------- + subroutine gpu_init_sgbc(this, numNodes, maxDepth, dt, sgbcFreq, SGBCcrank, SGBCDispersive) + class(gpu_state_sgbc_t), intent(inout) :: this + integer(kind=4), intent(in) :: numNodes, maxDepth + real(kind=rkind), intent(in) :: dt, sgbcFreq + logical, intent(in) :: SGBCcrank, SGBCDispersive + + integer(kind=4) :: ndev, cuda_status, env_status + character(len=16) :: enable_cuf + integer(kind=4) :: i + + call get_environment_variable("SEMBA_FDTD_ENABLE_CUF_RUNTIME", enable_cuf, status=env_status) + if (env_status /= 0 .or. trim(enable_cuf) /= "1") then + this%initialized = .false. + return + endif + + cuda_status = cudaGetDeviceCount(ndev) + if (cuda_status /= cudaSuccess .or. ndev <= 0) then + this%initialized = .false. + return + endif + + this%numNodes = numNodes + this%maxDepth = maxDepth + + ! Allocate per-node constant arrays + allocate(this%d_node(numNodes)) + allocate(this%gm2_ext(numNodes)) + allocate(this%jmed_node(numNodes)) + allocate(this%transE(numNodes)) + allocate(this%transH(numNodes)) + allocate(this%alignedH(numNodes)) + allocate(this%g1_val(numNodes)) + allocate(this%g2a_val(numNodes)) + allocate(this%g2b_val(numNodes)) + allocate(this%gm2_externo(numNodes)) + allocate(this%correct_ha(numNodes)) + allocate(this%correct_hb(numNodes)) + allocate(this%crank(numNodes)) + allocate(this%filo_placa(numNodes)) + allocate(this%depth_node(numNodes)) + + ! Allocate per-node state arrays [2*maxDepth+1, numNodes] + allocate(this%E_state(2*maxDepth+1, numNodes)) + allocate(this%H_state(2*maxDepth+1, numNodes)) + allocate(this%E_past_state(2*maxDepth+1, numNodes)) + allocate(this%D_state(2*maxDepth+1, numNodes)) + allocate(this%G1_int(2*maxDepth+1, numNodes)) + allocate(this%G2_int(2*maxDepth+1, numNodes)) + allocate(this%GM1_int(2*maxDepth+1, numNodes)) + allocate(this%GM2_int(2*maxDepth+1, numNodes)) + allocate(this%a_coef(2*maxDepth+1, numNodes)) + allocate(this%b_coef(2*maxDepth+1, numNodes)) + allocate(this%c_coef(2*maxDepth+1, numNodes)) + allocate(this%rb_coef(2*maxDepth+1, numNodes)) + allocate(this%rh_coef(2*maxDepth+1, numNodes)) + allocate(this%rhm1_coef(2*maxDepth+1, numNodes)) + allocate(this%capa_state(2*maxDepth+1, numNodes)) + allocate(this%delta_state(2*maxDepth+1, numNodes)) + + ! Allocate tridiag boundary constants + allocate(this%tridiag_a1(numNodes)) + allocate(this%tridiag_b1(numNodes)) + allocate(this%tridiag_c1(numNodes)) + allocate(this%tridiag_an(numNodes)) + allocate(this%tridiag_bn(numNodes)) + allocate(this%tridiag_cn(numNodes)) + + ! Allocate per-timestep field arrays + allocate(this%Efield_val(numNodes)) + allocate(this%Ha_Plus_val(numNodes)) + allocate(this%Ha_Minu_val(numNodes)) + allocate(this%Hb_Plus_val(numNodes)) + allocate(this%Hb_Minu_val(numNodes)) + allocate(this%Hyee_left(numNodes)) + allocate(this%Hyee_right(numNodes)) + + ! Initialize to zero + this%E_state = 0.0_rkind + this%H_state = 0.0_rkind + this%E_past_state = 0.0_rkind + this%D_state = 0.0_rkind + this%G1_int = 0.0_rkind + this%G2_int = 0.0_rkind + this%GM1_int = 0.0_rkind + this%GM2_int = 0.0_rkind + this%a_coef = 0.0_rkind + this%b_coef = 0.0_rkind + this%c_coef = 0.0_rkind + this%rb_coef = 0.0_rkind + this%rh_coef = 0.0_rkind + this%rhm1_coef = 0.0_rkind + this%capa_state = 0 + this%delta_state = 0.0_rkind + + this%initialized = .true. + this%fields_on_device = .false. + + end subroutine gpu_init_sgbc + + !-------------------------------------------------------------------------------- + ! Upload per-node constants from host to device + ! Called once after InitSGBCs completes + !-------------------------------------------------------------------------------- + subroutine gpu_upload_sgbc_constants(this, depth_arr, gm2_ext_arr, jmed_arr, & + transE_arr, transH_arr, alignedH_arr, & + g1_arr, g2a_arr, g2b_arr, gm2_ext_val_arr, & + correct_ha_arr, correct_hb_arr, crank_arr, & + filo_placa_arr, depth_node_arr) + class(gpu_state_sgbc_t), intent(inout) :: this + integer(kind=4), intent(in) :: depth_arr(:), jmed_arr(:), depth_node_arr(:) + real(kind=rkind), intent(in) :: gm2_ext_arr(:), transE_arr(:), transH_arr(:), alignedH_arr(:) + real(kind=rkind), intent(in) :: g1_arr(:), g2a_arr(:), g2b_arr(:), gm2_ext_val_arr(:) + logical, intent(in) :: correct_ha_arr(:), correct_hb_arr(:), crank_arr(:), filo_placa_arr(:) + + integer(kind=4) :: n + + if (.not. this%initialized) return + + n = this%numNodes + this%d_node(1:n) = depth_arr(1:n) + this%gm2_ext(1:n) = gm2_ext_arr(1:n) + this%jmed_node(1:n) = jmed_arr(1:n) + this%transE(1:n) = transE_arr(1:n) + this%transH(1:n) = transH_arr(1:n) + this%alignedH(1:n) = alignedH_arr(1:n) + this%g1_val(1:n) = g1_arr(1:n) + this%g2a_val(1:n) = g2a_arr(1:n) + this%g2b_val(1:n) = g2b_arr(1:n) + this%gm2_externo(1:n) = gm2_ext_val_arr(1:n) + this%correct_ha(1:n) = correct_ha_arr(1:n) + this%correct_hb(1:n) = correct_hb_arr(1:n) + this%crank(1:n) = crank_arr(1:n) + this%filo_placa(1:n) = filo_placa_arr(1:n) + this%depth_node(1:n) = depth_node_arr(1:n) + + end subroutine gpu_upload_sgbc_constants + + !-------------------------------------------------------------------------------- + ! Upload per-node state arrays from host to device + ! Called once after InitSGBCs completes + !-------------------------------------------------------------------------------- + subroutine gpu_upload_sgbc_state(this, E_arr, H_arr, E_past_arr, D_arr, & + G1_int_arr, G2_int_arr, GM1_int_arr, GM2_int_arr, & + a_arr, b_arr, c_arr, rb_arr, rh_arr, rhm1_arr, & + capa_arr, delta_arr, & + tridiag_a1_arr, tridiag_b1_arr, tridiag_c1_arr, & + tridiag_an_arr, tridiag_bn_arr, tridiag_cn_arr, & + Hyee_left_arr, Hyee_right_arr, & + offset) + class(gpu_state_sgbc_t), intent(inout) :: this + real(kind=rkind), intent(in) :: E_arr(:,:), H_arr(:,:), E_past_arr(:,:), D_arr(:,:) + real(kind=rkind), intent(in) :: G1_int_arr(:,:), G2_int_arr(:,:), GM1_int_arr(:,:), GM2_int_arr(:,:) + real(kind=rkind), intent(in) :: a_arr(:,:), b_arr(:,:), c_arr(:,:), rb_arr(:,:), rh_arr(:,:), rhm1_arr(:,:) + integer(kind=4), intent(in) :: capa_arr(:,:) + real(kind=rkind), intent(in) :: delta_arr(:,:) + real(kind=rkind), intent(in) :: tridiag_a1_arr(:), tridiag_b1_arr(:), tridiag_c1_arr(:) + real(kind=rkind), intent(in) :: tridiag_an_arr(:), tridiag_bn_arr(:), tridiag_cn_arr(:) + real(kind=rkind), intent(in) :: Hyee_left_arr(:), Hyee_right_arr(:) + integer(kind=4), intent(in) :: offset ! array offset (-depth) + + integer(kind=4) :: n, sz, i, j + + if (.not. this%initialized) return + + n = this%numNodes + sz = 2*this%maxDepth + 1 + + do j = 1, n + do i = 1, sz + this%E_state(i, j) = E_arr(i + offset, j) + this%H_state(i, j) = H_arr(i + offset, j) + this%E_past_state(i, j) = E_past_arr(i + offset, j) + this%D_state(i, j) = D_arr(i + offset, j) + this%G1_int(i, j) = G1_int_arr(i + offset, j) + this%G2_int(i, j) = G2_int_arr(i + offset, j) + this%GM1_int(i, j) = GM1_int_arr(i + offset, j) + this%GM2_int(i, j) = GM2_int_arr(i + offset, j) + this%a_coef(i, j) = a_arr(i + offset, j) + this%b_coef(i, j) = b_arr(i + offset, j) + this%c_coef(i, j) = c_arr(i + offset, j) + this%rb_coef(i, j) = rb_arr(i + offset, j) + this%rh_coef(i, j) = rh_arr(i + offset, j) + this%rhm1_coef(i, j) = rhm1_arr(i + offset, j) + this%capa_state(i, j) = capa_arr(i + offset, j) + this%delta_state(i, j) = delta_arr(i + offset, j) + end do + this%tridiag_a1(j) = tridiag_a1_arr(j) + this%tridiag_b1(j) = tridiag_b1_arr(j) + this%tridiag_c1(j) = tridiag_c1_arr(j) + this%tridiag_an(j) = tridiag_an_arr(j) + this%tridiag_bn(j) = tridiag_bn_arr(j) + this%tridiag_cn(j) = tridiag_cn_arr(j) + this%Hyee_left(j) = Hyee_left_arr(j) + this%Hyee_right(j) = Hyee_right_arr(j) + end do + + this%fields_on_device = .true. + + end subroutine gpu_upload_sgbc_state + + !-------------------------------------------------------------------------------- + ! Upload per-timestep field values from host to device + ! Called before each SGBC E-advance + !-------------------------------------------------------------------------------- + subroutine gpu_upload_sgbc_fields(this, Efield_arr, Ha_Plus_arr, Ha_Minu_arr, & + Hb_Plus_arr, Hb_Minu_arr) + class(gpu_state_sgbc_t), intent(inout) :: this + real(kind=rkind), intent(in) :: Efield_arr(:), Ha_Plus_arr(:), Ha_Minu_arr(:) + real(kind=rkind), intent(in) :: Hb_Plus_arr(:), Hb_Minu_arr(:) + + integer(kind=4) :: n + + if (.not. this%initialized) return + + n = this%numNodes + this%Efield_val(1:n) = Efield_arr(1:n) + this%Ha_Plus_val(1:n) = Ha_Plus_arr(1:n) + this%Ha_Minu_val(1:n) = Ha_Minu_arr(1:n) + this%Hb_Plus_val(1:n) = Hb_Plus_arr(1:n) + this%Hb_Minu_val(1:n) = Hb_Minu_arr(1:n) + + end subroutine gpu_upload_sgbc_fields + + !-------------------------------------------------------------------------------- + ! Download per-timestep field values from device to host + ! Called after each SGBC H-advance + !-------------------------------------------------------------------------------- + subroutine gpu_download_sgbc_fields(this, Ha_Plus_arr, Ha_Minu_arr, & + Hb_Plus_arr, Hb_Minu_arr, & + Hyee_left_arr, Hyee_right_arr, & + Efield_arr) + class(gpu_state_sgbc_t), intent(inout) :: this + real(kind=rkind), intent(out) :: Ha_Plus_arr(:), Ha_Minu_arr(:) + real(kind=rkind), intent(out) :: Hb_Plus_arr(:), Hb_Minu_arr(:) + real(kind=rkind), intent(out) :: Hyee_left_arr(:), Hyee_right_arr(:) + real(kind=rkind), intent(out) :: Efield_arr(:) + + integer(kind=4) :: n + + if (.not. this%initialized) return + + n = this%numNodes + Ha_Plus_arr(1:n) = this%Ha_Plus_val(1:n) + Ha_Minu_arr(1:n) = this%Ha_Minu_val(1:n) + Hb_Plus_arr(1:n) = this%Hb_Plus_val(1:n) + Hb_Minu_arr(1:n) = this%Hb_Minu_val(1:n) + Hyee_left_arr(1:n) = this%Hyee_left(1:n) + Hyee_right_arr(1:n) = this%Hyee_right(1:n) + Efield_arr(1:n) = this%Efield_val(1:n) + + end subroutine gpu_download_sgbc_fields + + !-------------------------------------------------------------------------------- + ! Download per-node state from device to host + ! Called after each SGBC E-advance (for H-update) and at output times + !-------------------------------------------------------------------------------- + subroutine gpu_download_sgbc_state(this, E_arr, H_arr, E_past_arr, D_arr, & + Hyee_left_arr, Hyee_right_arr, & + offset) + class(gpu_state_sgbc_t), intent(inout) :: this + real(kind=rkind), intent(out) :: E_arr(:,:), H_arr(:,:), E_past_arr(:,:), D_arr(:,:) + real(kind=rkind), intent(out) :: Hyee_left_arr(:), Hyee_right_arr(:) + integer(kind=4), intent(in) :: offset ! array offset (-depth) + + integer(kind=4) :: n, sz, i, j + + if (.not. this%initialized) return + + n = this%numNodes + sz = 2*this%maxDepth + 1 + + do j = 1, n + do i = 1, sz + E_arr(i + offset, j) = this%E_state(i, j) + H_arr(i + offset, j) = this%H_state(i, j) + E_past_arr(i + offset, j) = this%E_past_state(i, j) + D_arr(i + offset, j) = this%D_state(i, j) + end do + Hyee_left_arr(j) = this%Hyee_left(j) + Hyee_right_arr(j) = this%Hyee_right(j) + end do + + this%fields_on_device = .false. + + end subroutine gpu_download_sgbc_state + + !-------------------------------------------------------------------------------- + ! Upload per-node coefficients from host to device + ! Called each timestep after calc_SGBCconstants + !-------------------------------------------------------------------------------- + subroutine gpu_upload_sgbc_coeffs(this, G1_int_arr, G2_int_arr, GM1_int_arr, GM2_int_arr, & + a_arr, b_arr, c_arr, rb_arr, rh_arr, rhm1_arr, & + tridiag_a1_arr, tridiag_b1_arr, tridiag_c1_arr, & + tridiag_an_arr, tridiag_bn_arr, tridiag_cn_arr, & + offset) + class(gpu_state_sgbc_t), intent(inout) :: this + real(kind=rkind), intent(in) :: G1_int_arr(:,:), G2_int_arr(:,:), GM1_int_arr(:,:), GM2_int_arr(:,:) + real(kind=rkind), intent(in) :: a_arr(:,:), b_arr(:,:), c_arr(:,:), rb_arr(:,:), rh_arr(:,:), rhm1_arr(:,:) + real(kind=rkind), intent(in) :: tridiag_a1_arr(:), tridiag_b1_arr(:), tridiag_c1_arr(:) + real(kind=rkind), intent(in) :: tridiag_an_arr(:), tridiag_bn_arr(:), tridiag_cn_arr(:) + integer(kind=4), intent(in) :: offset + + integer(kind=4) :: n, sz, i, j + + if (.not. this%initialized) return + + n = this%numNodes + sz = 2*this%maxDepth + 1 + + do j = 1, n + do i = 1, sz + this%G1_int(i, j) = G1_int_arr(i + offset, j) + this%G2_int(i, j) = G2_int_arr(i + offset, j) + this%GM1_int(i, j) = GM1_int_arr(i + offset, j) + this%GM2_int(i, j) = GM2_int_arr(i + offset, j) + this%a_coef(i, j) = a_arr(i + offset, j) + this%b_coef(i, j) = b_arr(i + offset, j) + this%c_coef(i, j) = c_arr(i + offset, j) + this%rb_coef(i, j) = rb_arr(i + offset, j) + this%rh_coef(i, j) = rh_arr(i + offset, j) + this%rhm1_coef(i, j) = rhm1_arr(i + offset, j) + end do + this%tridiag_a1(j) = tridiag_a1_arr(j) + this%tridiag_b1(j) = tridiag_b1_arr(j) + this%tridiag_c1(j) = tridiag_c1_arr(j) + this%tridiag_an(j) = tridiag_an_arr(j) + this%tridiag_bn(j) = tridiag_bn_arr(j) + this%tridiag_cn(j) = tridiag_cn_arr(j) + end do + + end subroutine gpu_upload_sgbc_coeffs + + !-------------------------------------------------------------------------------- + ! Destroy GPU SGBC state + !-------------------------------------------------------------------------------- + subroutine gpu_destroy_sgbc(this) + class(gpu_state_sgbc_t), intent(inout) :: this + + if (.not. this%initialized) return + + ! Deallocate device arrays + if (this%initialized) deallocate(this%d_node) + if (this%initialized) deallocate(this%gm2_ext) + if (this%initialized) deallocate(this%jmed_node) + if (this%initialized) deallocate(this%transE) + if (this%initialized) deallocate(this%transH) + if (this%initialized) deallocate(this%alignedH) + if (this%initialized) deallocate(this%g1_val) + if (this%initialized) deallocate(this%g2a_val) + if (this%initialized) deallocate(this%g2b_val) + if (this%initialized) deallocate(this%gm2_externo) + if (this%initialized) deallocate(this%correct_ha) + if (this%initialized) deallocate(this%correct_hb) + if (this%initialized) deallocate(this%crank) + if (this%initialized) deallocate(this%filo_placa) + if (this%initialized) deallocate(this%depth_node) + + if (this%initialized) deallocate(this%E_state) + if (this%initialized) deallocate(this%H_state) + if (this%initialized) deallocate(this%E_past_state) + if (this%initialized) deallocate(this%D_state) + if (this%initialized) deallocate(this%G1_int) + if (this%initialized) deallocate(this%G2_int) + if (this%initialized) deallocate(this%GM1_int) + if (this%initialized) deallocate(this%GM2_int) + if (this%initialized) deallocate(this%a_coef) + if (this%initialized) deallocate(this%b_coef) + if (this%initialized) deallocate(this%c_coef) + if (this%initialized) deallocate(this%rb_coef) + if (this%initialized) deallocate(this%rh_coef) + if (this%initialized) deallocate(this%rhm1_coef) + if (this%initialized) deallocate(this%capa_state) + if (this%initialized) deallocate(this%delta_state) + + if (this%initialized) deallocate(this%tridiag_a1) + if (this%initialized) deallocate(this%tridiag_b1) + if (this%initialized) deallocate(this%tridiag_c1) + if (this%initialized) deallocate(this%tridiag_an) + if (this%initialized) deallocate(this%tridiag_bn) + if (this%initialized) deallocate(this%tridiag_cn) + + if (this%initialized) deallocate(this%Efield_val) + if (this%initialized) deallocate(this%Ha_Plus_val) + if (this%initialized) deallocate(this%Ha_Minu_val) + if (this%initialized) deallocate(this%Hb_Plus_val) + if (this%initialized) deallocate(this%Hb_Minu_val) + if (this%initialized) deallocate(this%Hyee_left) + if (this%initialized) deallocate(this%Hyee_right) + + nullify(this%Ex); nullify(this%Ey); nullify(this%Ez) + nullify(this%Hx); nullify(this%Hy); nullify(this%Hz) + nullify(this%Idxe); nullify(this%Idye); nullify(this%Idze) + nullify(this%Idxh); nullify(this%Idyh); nullify(this%Idzh) + + this%initialized = .false. + this%fields_on_device = .false. + + end subroutine gpu_destroy_sgbc + +end module gpu_sgbc_core_m diff --git a/src_main_pub/gpu_sgbc_e_m.F90 b/src_main_pub/gpu_sgbc_e_m.F90 new file mode 100644 index 000000000..3806a896d --- /dev/null +++ b/src_main_pub/gpu_sgbc_e_m.F90 @@ -0,0 +1,241 @@ +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +! GPU SGBC E-FIELD KERNELS - CUDA Fortran (CUF) +! Non-dispersive SGBC E-field advance (YEE + Crank-Nicolson) +! + Tridiagonal solver kernel (Thomas algorithm) +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +module gpu_sgbc_e_m + + use FDETYPES_m + use Report_m + use cudafor + use gpu_sgbc_core_m + + implicit none + +contains + + !-------------------------------------------------------------------------------- + ! Advance SGBC E-field - GPU accelerated + ! Each thread handles one node + !-------------------------------------------------------------------------------- + subroutine gpu_advance_sgbc_e(this, dt) + class(gpu_state_sgbc_t), intent(inout) :: this + real(kind=rkind), intent(in) :: dt + + integer(kind=4) :: n + + if (.not. this%initialized) return + + n = this%numNodes + + call sgbc_e_kernel(this%E_state, this%H_state, this%E_past_state, this%D_state, & + this%G1_int, this%G2_int, this%GM1_int, this%GM2_int, & + this%a_coef, this%b_coef, this%c_coef, this%rb_coef, this%rh_coef, this%rhm1_coef, & + this%tridiag_a1, this%tridiag_b1, this%tridiag_c1, & + this%tridiag_an, this%tridiag_bn, this%tridiag_cn, & + this%g1_val, this%g2a_val, this%g2b_val, & + this%gm2_externo, this%transE, this%transH, this%alignedH, & + this%correct_ha, this%correct_hb, this%crank, this%filo_placa, & + this%depth_node, this%capa_state, this%delta_state, & + this%Efield_val, this%Ha_Plus_val, this%Ha_Minu_val, & + this%Hb_Plus_val, this%Hb_Minu_val, this%Hyee_left, this%Hyee_right, & + n, dt) + + this%fields_on_device = .true. + + end subroutine gpu_advance_sgbc_e + + !-------------------------------------------------------------------------------- + ! Fused SGBC E-field advance kernel + ! Each thread handles one node + !-------------------------------------------------------------------------------- + subroutine sgbc_e_kernel(E_state, H_state, E_past_state, D_state, & + G1_int, G2_int, GM1_int, GM2_int, & + a_coef, b_coef, c_coef, rb_coef, rh_coef, rhm1_coef, & + tridiag_a1, tridiag_b1, tridiag_c1, & + tridiag_an, tridiag_bn, tridiag_cn, & + g1_val, g2a_val, g2b_val, & + gm2_ext, transE, transH, alignedH, & + correct_ha, correct_hb, crank, filo_placa, & + depth_node, capa_state, delta_state, & + Efield_val, Ha_Plus_val, Ha_Minu_val, & + Hb_Plus_val, Hb_Minu_val, Hyee_left, Hyee_right, & + n, dt) + integer(kind=4), intent(in) :: n + real(kind=rkind), intent(in) :: dt + real(kind=rkind), device, dimension(:,:) :: E_state, H_state, E_past_state, D_state + real(kind=rkind), device, dimension(:,:) :: G1_int, G2_int, GM1_int, GM2_int + real(kind=rkind), device, dimension(:,:) :: a_coef, b_coef, c_coef, rb_coef, rh_coef, rhm1_coef + real(kind=rkind), device, dimension(:) :: tridiag_a1, tridiag_b1, tridiag_c1 + real(kind=rkind), device, dimension(:) :: tridiag_an, tridiag_bn, tridiag_cn + real(kind=rkind), device, dimension(:) :: g1_val, g2a_val, g2b_val, gm2_ext + real(kind=rkind), device, dimension(:) :: transE, transH, alignedH + logical, device, dimension(:) :: correct_ha, correct_hb, crank, filo_placa + integer(kind=4), device, dimension(:) :: depth_node + integer(kind=4), device, dimension(:,:) :: capa_state + real(kind=rkind), device, dimension(:,:) :: delta_state + real(kind=rkind), device, dimension(:) :: Efield_val, Ha_Plus_val, Ha_Minu_val + real(kind=rkind), device, dimension(:) :: Hb_Plus_val, Hb_Minu_val + real(kind=rkind), device, dimension(:) :: Hyee_left, Hyee_right + + integer(kind=4) :: node, i, j, depth, sz + real(kind=rkind) :: g1, g2, g1i, g2i, gm1, gm2, delta_e + real(kind=rkind) :: Ha_Plus, Ha_Minu, Hb_Plus, Hb_Minu, Efield, Hyee_l, Hyee_r + logical :: ha, hb, cr, fp + + !$cuf kernel do(1) <<<*, *>>> + do node = 1, n + depth = depth_node(node) + sz = 2*depth + 1 + ha = correct_ha(node) + hb = correct_hb(node) + cr = crank(node) + fp = filo_placa(node) + gm2 = gm2_ext(node) + g1 = g1_val(node) + g2 = g2a_val(node) + Efield = Efield_val(node) + Ha_Plus = Ha_Plus_val(node) + Ha_Minu = Ha_Minu_val(node) + Hb_Plus = Hb_Plus_val(node) + Hb_Minu = Hb_Minu_val(node) + Hyee_l = Hyee_left(node) + Hyee_r = Hyee_right(node) + + if (depth == 0) then + ! depth=0 case: simple boundary update + if (ha) then + E_state(1, node) = g1 * E_state(1, node) + & + g2 * (Ha_Plus - Ha_Minu) - g2b_val(node) * (Hb_Plus - Hb_Minu) + elseif (hb) then + E_state(1, node) = g1 * E_state(1, node) + & + g2 * (Ha_Plus - Ha_Minu) - g2b_val(node) * (Hb_Plus - Hb_Minu) + else + E_state(1, node) = g1 * E_state(1, node) + & + g2 * (Ha_Plus - Ha_Minu) - g2b_val(node) * (Hb_Plus - Hb_Minu) + end if + else + ! depth>0 case: boundary + interior + ! Boundary cells: E(depth) and E(-depth) + if (.not. cr) then + ! YEE mode + if (ha) then + E_state(depth + 1, node) = g1_val(1) * E_state(depth + 1, node) + & + (g2a_val(1) * (Ha_Plus - Hyee_r) - g2b_val(1) * (Hb_Plus - Hb_Minu)) + E_state(1, node) = g1_val(0) * E_state(1, node) + & + (g2a_val(0) * (Hyee_l - Ha_Minu) - g2b_val(0) * (Hb_Plus - Hb_Minu)) + elseif (hb) then + E_state(depth + 1, node) = g1_val(1) * E_state(depth + 1, node) + & + (g2a_val(1) * (Ha_Plus - Ha_Minu) - g2b_val(1) * (Hb_Plus - Hyee_r)) + E_state(1, node) = g1_val(0) * E_state(1, node) + & + (g2a_val(0) * (Ha_Plus - Ha_Minu) - g2b_val(0) * (Hyee_l - Hb_Minu)) + end if + + ! Interior cells + do i = 2, depth + g1i = G1_int(i, node) + g2i = G2_int(i, node) + delta_e = 0.5_rkind * (delta_state(i, node) + delta_state(i-1, node)) + E_state(i, node) = g1i * E_state(i, node) + g2i / delta_e * & + (H_state(i, node) - H_state(i-1, node)) + end do + else + ! Crank-Nicolson mode + ! Copy E to E_past + do i = 1, sz + E_past_state(i, node) = E_state(i, node) + end do + + ! Compute D vector for boundaries + if (ha) then + i = depth + 1 + D_state(i, node) = -tridiag_an(node) * E_past_state(i-1, node) + & + tridiag_bn(node) * E_past_state(i, node) + & + g2a_val(1) * (Ha_Plus - Hyee_r) - g2b_val(1) * (Hb_Plus - Hb_Minu) + i = 1 + D_state(i, node) = -tridiag_c1(node) * E_past_state(i+1, node) + & + tridiag_bn(node) * E_past_state(i, node) + & + g2a_val(0) * (Hyee_l - Ha_Minu) - g2b_val(0) * (Hb_Plus - Hb_Minu) + elseif (hb) then + i = depth + 1 + D_state(i, node) = -tridiag_an(node) * E_past_state(i-1, node) + & + tridiag_bn(node) * E_past_state(i, node) + & + g2a_val(1) * (Ha_Plus - Ha_Minu) - g2b_val(1) * (Hb_Plus - Hyee_r) + i = 1 + D_state(i, node) = -tridiag_c1(node) * E_past_state(i+1, node) + & + tridiag_bn(node) * E_past_state(i, node) + & + g2a_val(0) * (Ha_Plus - Ha_Minu) - g2b_val(0) * (Hyee_l - Hb_Minu) + end if + + ! Compute D vector for interior cells + do i = 2, depth + D_state(i, node) = -a_coef(i, node) * E_past_state(i-1, node) - & + c_coef(i, node) * E_past_state(i+1, node) + & + rb_coef(i, node) * E_past_state(i, node) + & + rh_coef(i, node) * H_state(i, node) - & + rhm1_coef(i, node) * H_state(i-1, node) + end do + + ! Solve tridiagonal system + call sgbc_solve_tridiag(D_state, E_state, a_coef, b_coef, c_coef, & + tridiag_a1(node), tridiag_b1(node), tridiag_c1(node), & + tridiag_an(node), tridiag_bn(node), tridiag_cn(node), & + sz, node) + end if + end if + + ! Update Efield = average of boundary cells + Efield_val(node) = 0.5_rkind * (E_state(1, node) + E_state(depth + 1, node)) + end do + + end subroutine sgbc_e_kernel + + !-------------------------------------------------------------------------------- + ! Tridiagonal solver (Thomas algorithm) - GPU kernel + ! Each thread solves one node's system + !-------------------------------------------------------------------------------- + subroutine sgbc_solve_tridiag(d, x, a, b, c, a1, b1, c1, an, bn, cn, n, node) + integer(kind=4), intent(in) :: n, node + real(kind=rkind), device, dimension(:,:), intent(inout) :: d, x + real(kind=rkind), device, dimension(:,:), intent(in) :: a, b, c + real(kind=rkind), intent(in) :: a1, b1, c1, an, bn, cn + + real(kind=rkind), dimension(:), allocatable, device :: aa, bb, cc + real(kind=rkind), dimension(:), allocatable, device :: cp, dp + real(kind=rkind) :: m + integer(kind=4) :: i + + allocate(aa(n), bb(n), cc(n)) + allocate(cp(n), dp(n)) + + aa(1) = a1 + bb(1) = b1 + cc(1) = c1 + aa(n) = an + bb(n) = bn + cc(n) = cn + aa(2:n-1) = a(2:n-1, node) + bb(2:n-1) = b(2:n-1, node) + cc(2:n-1) = c(2:n-1, node) + + ! Forward elimination + cp(1) = cc(1) / bb(1) + dp(1) = d(1, node) / bb(1) + do i = 2, n + m = bb(i) - cp(i-1) * aa(i) + cp(i) = cc(i) / m + dp(i) = (d(i, node) - dp(i-1) * aa(i)) / m + end do + + ! Back substitution + x(n, node) = dp(n) + do i = n-1, 1, -1 + x(i, node) = dp(i) - cp(i) * x(i+1, node) + end do + + deallocate(aa, bb, cc, cp, dp) + + end subroutine sgbc_solve_tridiag + +end module gpu_sgbc_e_m diff --git a/src_main_pub/gpu_sgbc_h_m.F90 b/src_main_pub/gpu_sgbc_h_m.F90 new file mode 100644 index 000000000..e5838b37d --- /dev/null +++ b/src_main_pub/gpu_sgbc_h_m.F90 @@ -0,0 +1,134 @@ +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +! GPU SGBC H-FIELD KERNELS - CUDA Fortran (CUF) +! Non-dispersive SGBC H-field advance + boundary correction +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +module gpu_sgbc_h_m + + use FDETYPES_m + use Report_m + use cudafor + use gpu_sgbc_core_m + + implicit none + +contains + + !-------------------------------------------------------------------------------- + ! Advance SGBC H-field - GPU accelerated + ! Each thread handles one node + !-------------------------------------------------------------------------------- + subroutine gpu_advance_sgbc_h(this, dt) + class(gpu_state_sgbc_t), intent(inout) :: this + real(kind=rkind), intent(in) :: dt + + integer(kind=4) :: n + + if (.not. this%initialized) return + + n = this%numNodes + + call sgbc_h_kernel(this%H_state, this%E_state, this%E_past_state, & + this%GM1_int, this%GM2_int, & + this%crank, this%depth_node, & + this%delta_state, & + this%Ha_Plus_val, this%Ha_Minu_val, & + this%Hb_Plus_val, this%Hb_Minu_val, & + this%Efield_val, this%Hyee_left, this%Hyee_right, & + this%gm2_ext, this%correct_ha, this%correct_hb, & + n, dt) + + end subroutine gpu_advance_sgbc_h + + !-------------------------------------------------------------------------------- + ! Fused SGBC H-field advance + boundary correction kernel + ! Each thread handles one node + !-------------------------------------------------------------------------------- + subroutine sgbc_h_kernel(H_state, E_state, E_past_state, & + GM1_int, GM2_int, & + crank, depth_node, & + delta_state, & + Ha_Plus_val, Ha_Minu_val, & + Hb_Plus_val, Hb_Minu_val, & + Efield_val, Hyee_left, Hyee_right, & + gm2_ext, correct_ha, correct_hb, & + n, dt) + integer(kind=4), intent(in) :: n + real(kind=rkind), intent(in) :: dt + real(kind=rkind), device, dimension(:,:) :: H_state, E_state, E_past_state + real(kind=rkind), device, dimension(:,:) :: GM1_int, GM2_int + integer(kind=4), device, dimension(:,:) :: delta_state + logical, device, dimension(:) :: crank, correct_ha, correct_hb + integer(kind=4), device, dimension(:) :: depth_node + real(kind=rkind), device, dimension(:) :: Ha_Plus_val, Ha_Minu_val + real(kind=rkind), device, dimension(:) :: Hb_Plus_val, Hb_Minu_val + real(kind=rkind), device, dimension(:) :: Efield_val, Hyee_left, Hyee_right + real(kind=rkind), device, dimension(:) :: gm2_ext + + integer(kind=4) :: node, i, depth, sz + real(kind=rkind) :: gm1, gm2, delta_e + real(kind=rkind) :: Ha_Plus, Ha_Minu, Hb_Plus, Hb_Minu, Efield + real(kind=rkind) :: E_i1, E_i, E_past_i1, E_past_i + logical :: cr, ha, hb + + !$cuf kernel do(1) <<<*, *>>> + do node = 1, n + depth = depth_node(node) + sz = 2*depth + 1 + cr = crank(node) + ha = correct_ha(node) + hb = correct_hb(node) + + if (depth > 0) then + ! Update internal H cells + if (cr) then + ! Crank-Nicolson: half-step advance + do i = 1, depth + gm1 = GM1_int(i, node) + gm2 = GM2_int(i, node) + E_i1 = E_state(i+1, node) + E_i = E_state(i, node) + E_past_i1 = E_past_state(i+1, node) + E_past_i = E_past_state(i, node) + H_state(i, node) = gm1 * H_state(i, node) + & + 0.5_rkind * gm2 * (E_i1 - E_i + E_past_i1 - E_past_i) + end do + ! Update Hyee values + H_state(1, node) = Hyee_left(node) + H_state(depth, node) = Hyee_right(node) + else + ! YEE mode + do i = 1, depth + gm1 = GM1_int(i, node) + gm2 = GM2_int(i, node) + E_i1 = E_state(i+1, node) + E_i = E_state(i, node) + H_state(i, node) = gm1 * H_state(i, node) + gm2 * (E_i1 - E_i) + end do + Hyee_left(node) = H_state(1, node) + Hyee_right(node) = H_state(depth, node) + end if + end if + + ! Boundary correction (AdvanceSGBCH) + Ha_Plus = Ha_Plus_val(node) + Ha_Minu = Ha_Minu_val(node) + Hb_Plus = Hb_Plus_val(node) + Hb_Minu = Hb_Minu_val(node) + Efield = Efield_val(node) + + if (ha) then + Ha_Plus_val(node) = Ha_Plus + gm2_ext(node) * (Efield - E_state(depth + 1, node)) + Ha_Minu_val(node) = Ha_Minu - gm2_ext(node) * (Efield - E_state(1, node)) + elseif (hb) then + Hb_Plus_val(node) = Hb_Plus - gm2_ext(node) * (Efield - E_state(depth + 1, node)) + Hb_Minu_val(node) = Hb_Minu + gm2_ext(node) * (Efield - E_state(1, node)) + end if + + ! Update Efield to average + Efield_val(node) = 0.5_rkind * (E_state(1, node) + E_state(depth + 1, node)) + end do + + end subroutine sgbc_h_kernel + +end module gpu_sgbc_h_m diff --git a/src_main_pub/gpu_yee_m.F90 b/src_main_pub/gpu_yee_m.F90 new file mode 100644 index 000000000..2654ead01 --- /dev/null +++ b/src_main_pub/gpu_yee_m.F90 @@ -0,0 +1,398 @@ +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +! GPU YEE KERNELS MODULE - CUDA Fortran (CUF) accelerated YEE kernels +! Fields stay on device between timesteps. Only kernel wrappers here. +! Split from gpu_kernels_cuf.F90 to avoid NVHPC compiler file-size limit. +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +module gpu_yee_m + + use FDETYPES_m + use Report_m + use cudafor + use gpu_core_m + + implicit none + +contains + + !-------------------------------------------------------------------------------- + ! Advance Ex field - GPU accelerated YEE kernel (fields on device, no copy) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceEx(this, bounds) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: bounds + + if (.not. this%initialized) return + + call gpu_advanceEx_kernel(this%Ex_d, this%Hy_d, this%Hz_d, this%sggMiEx_d, & + this%Idyh_d, this%Idzh_d, this%g1_d, this%g2_d, & + bounds%sweepEx%NX, bounds%sweepEx%NY, bounds%sweepEx%NZ) + + end subroutine gpu_advanceEx + + subroutine gpu_advanceEx_kernel(Ex_d, Hy_d, Hz_d, sggMiEx_d, Idyh_d, Idzh_d, g1_d, g2_d, nx, ny, nz) + integer(kind=4), intent(in) :: nx, ny, nz + real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hy_d, Hz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d + real(kind=rkind), device, dimension(:) :: Idyh_d, Idzh_d, g1_d, g2_d + + integer(kind=4) :: i, j, k + real(kind=rkind) :: Idzhk, Idyhj + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=1,nz + do j=1,ny + do i=1,nx + Idzhk = Idzh_d(k) + Idyhj = Idyh_d(j) + medio = sggMiEx_d(i,j,k) + Ex_d(i,j,k) = g1_d(medio)*Ex_d(i,j,k) + g2_d(medio)* & + ((Hz_d(i,j,k) - Hz_d(i,j-1,k))*Idyhj - (Hy_d(i,j,k) - Hy_d(i,j,k-1))*Idzhk) + end do + end do + end do + end subroutine gpu_advanceEx_kernel + + !-------------------------------------------------------------------------------- + ! Advance Ey field - GPU accelerated YEE kernel (fields on device, no copy) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceEy(this, bounds) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: bounds + + if (.not. this%initialized) return + + call gpu_advanceEy_kernel(this%Ey_d, this%Hz_d, this%Hx_d, this%sggMiEy_d, & + this%Idzh_d, this%Idxh_d, this%g1_d, this%g2_d, & + bounds%sweepEy%NX, bounds%sweepEy%NY, bounds%sweepEy%NZ) + + end subroutine gpu_advanceEy + + subroutine gpu_advanceEy_kernel(Ey_d, Hz_d, Hx_d, sggMiEy_d, Idzh_d, Idxh_d, g1_d, g2_d, nx, ny, nz) + integer(kind=4), intent(in) :: nx, ny, nz + real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hz_d, Hx_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d + real(kind=rkind), device, dimension(:) :: Idzh_d, Idxh_d, g1_d, g2_d + + integer(kind=4) :: i, j, k + real(kind=rkind) :: Idzhk + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=1,nz + do j=1,ny + do i=1,nx + Idzhk = Idzh_d(k) + medio = sggMiEy_d(i,j,k) + Ey_d(i,j,k) = g1_d(medio)*Ey_d(i,j,k) + g2_d(medio)* & + ((Hx_d(i,j,k)-Hx_d(i,j,k-1))*Idzhk - (Hz_d(i,j,k)-Hz_d(i-1,j,k))*Idxh_d(i)) + end do + end do + end do + end subroutine gpu_advanceEy_kernel + + !-------------------------------------------------------------------------------- + ! Advance Ez field - GPU accelerated YEE kernel (fields on device, no copy) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceEz(this, bounds) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: bounds + + if (.not. this%initialized) return + + call gpu_advanceEz_kernel(this%Ez_d, this%Hx_d, this%Hy_d, this%sggMiEz_d, & + this%Idyh_d, this%Idxh_d, this%g1_d, this%g2_d, & + bounds%sweepEz%NX, bounds%sweepEz%NY, bounds%sweepEz%NZ) + + end subroutine gpu_advanceEz + + subroutine gpu_advanceEz_kernel(Ez_d, Hx_d, Hy_d, sggMiEz_d, Idyh_d, Idxh_d, g1_d, g2_d, nx, ny, nz) + integer(kind=4), intent(in) :: nx, ny, nz + real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hx_d, Hy_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d + real(kind=rkind), device, dimension(:) :: Idyh_d, Idxh_d, g1_d, g2_d + + integer(kind=4) :: i, j, k + real(kind=rkind) :: Idyhj + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=1,nz + do j=1,ny + do i=1,nx + Idyhj = Idyh_d(j) + medio = sggMiEz_d(i,j,k) + Ez_d(i,j,k) = g1_d(medio)*Ez_d(i,j,k) + g2_d(medio)* & + ((Hy_d(i,j,k)-Hy_d(i-1,j,k))*Idxh_d(i) - (Hx_d(i,j,k)-Hx_d(i,j-1,k))*Idyhj) + end do + end do + end do + end subroutine gpu_advanceEz_kernel + + !-------------------------------------------------------------------------------- + ! Advance Hx field - GPU accelerated YEE kernel (fields on device, no copy) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceHx(this, bounds) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: bounds + + if (.not. this%initialized) return + + call gpu_advanceHx_kernel(this%Hx_d, this%Ey_d, this%Ez_d, this%sggMiHx_d, & + this%Idye_d, this%Idze_d, this%gm1_d, this%gm2_d, & + bounds%sweepHx%NX, bounds%sweepHx%NY, bounds%sweepHx%NZ) + + end subroutine gpu_advanceHx + + subroutine gpu_advanceHx_kernel(Hx_d, Ey_d, Ez_d, sggMiHx_d, Idye_d, Idze_d, gm1_d, gm2_d, nx, ny, nz) + integer(kind=4), intent(in) :: nx, ny, nz + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ey_d, Ez_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d + real(kind=rkind), device, dimension(:) :: Idye_d, Idze_d, gm1_d, gm2_d + + integer(kind=4) :: i, j, k + real(kind=rkind) :: Idzek, Idyej + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=1,nz + do j=1,ny + do i=1,nx + Idzek = Idze_d(k) + Idyej = Idye_d(j) + medio = sggMiHx_d(i,j,k) + Hx_d(i,j,k) = gm1_d(medio)*Hx_d(i,j,k) + gm2_d(medio)* & + ((Ey_d(i,j,k+1)-Ey_d(i,j,k))*Idzek - (Ez_d(i,j+1,k)-Ez_d(i,j,k))*Idyej) + end do + end do + end do + end subroutine gpu_advanceHx_kernel + + !-------------------------------------------------------------------------------- + ! Advance Hy field - GPU accelerated YEE kernel (fields on device, no copy) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceHy(this, bounds) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: bounds + + if (.not. this%initialized) return + + call gpu_advanceHy_kernel(this%Hy_d, this%Ez_d, this%Ex_d, this%sggMiHy_d, & + this%Idze_d, this%Idxe_d, this%gm1_d, this%gm2_d, & + bounds%sweepHy%NX, bounds%sweepHy%NY, bounds%sweepHy%NZ) + + end subroutine gpu_advanceHy + + subroutine gpu_advanceHy_kernel(Hy_d, Ez_d, Ex_d, sggMiHy_d, Idze_d, Idxe_d, gm1_d, gm2_d, nx, ny, nz) + integer(kind=4), intent(in) :: nx, ny, nz + real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ez_d, Ex_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d + real(kind=rkind), device, dimension(:) :: Idze_d, Idxe_d, gm1_d, gm2_d + + integer(kind=4) :: i, j, k + real(kind=rkind) :: Idzek + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=1,nz + do j=1,ny + do i=1,nx + Idzek = Idze_d(k) + medio = sggMiHy_d(i,j,k) + Hy_d(i,j,k) = gm1_d(medio)*Hy_d(i,j,k) + gm2_d(medio)* & + ((Ez_d(i+1,j,k)-Ez_d(i,j,k))*Idxe_d(i) - (Ex_d(i,j,k+1)-Ex_d(i,j,k))*Idzek) + end do + end do + end do + end subroutine gpu_advanceHy_kernel + + !-------------------------------------------------------------------------------- + ! Advance Hz field - GPU accelerated YEE kernel (fields on device, no copy) + !-------------------------------------------------------------------------------- + subroutine gpu_advanceHz(this, bounds) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: bounds + + if (.not. this%initialized) return + + call gpu_advanceHz_kernel(this%Hz_d, this%Ex_d, this%Ey_d, this%sggMiHz_d, & + this%Idye_d, this%Idxe_d, this%gm1_d, this%gm2_d, & + bounds%sweepHz%NX, bounds%sweepHz%NY, bounds%sweepHz%NZ) + + end subroutine gpu_advanceHz + + subroutine gpu_advanceHz_kernel(Hz_d, Ex_d, Ey_d, sggMiHz_d, Idye_d, Idxe_d, gm1_d, gm2_d, nx, ny, nz) + integer(kind=4), intent(in) :: nx, ny, nz + real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Ex_d, Ey_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d + real(kind=rkind), device, dimension(:) :: Idye_d, Idxe_d, gm1_d, gm2_d + + integer(kind=4) :: i, j, k + real(kind=rkind) :: Idyej + integer(kind=integersizeofmediamatrices) :: medio + + !$cuf kernel do(3) <<<*, *>>> + do k=1,nz + do j=1,ny + do i=1,nx + Idyej = Idye_d(j) + medio = sggMiHz_d(i,j,k) + Hz_d(i,j,k) = gm1_d(medio)*Hz_d(i,j,k) + gm2_d(medio)* & + ((Ex_d(i,j+1,k)-Ex_d(i,j,k))*Idyej - (Ey_d(i+1,j,k)-Ey_d(i,j,k))*Idxe_d(i)) + end do + end do + end do + end subroutine gpu_advanceHz_kernel + + !-------------------------------------------------------------------------------- + ! Fused E-field YEE kernel — updates Ex, Ey, Ez in single kernel launch + !-------------------------------------------------------------------------------- + subroutine gpu_advanceE_all(this, bounds) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: bounds + + if (.not. this%initialized) return + + call gpu_advanceE_all_kernel(this%Ex_d, this%Ey_d, this%Ez_d, & + this%Hx_d, this%Hy_d, this%Hz_d, & + this%sggMiEx_d, this%sggMiEy_d, this%sggMiEz_d, & + this%Idxh_d, this%Idxe_d, & + this%Idyh_d, this%Idzh_d, this%Idze_d, this%Idye_d, & + this%g1_d, this%g2_d, & + bounds%sweepEx%NX, bounds%sweepEx%NY, bounds%sweepEx%NZ, & + bounds%sweepEy%NX, bounds%sweepEy%NY, bounds%sweepEy%NZ, & + bounds%sweepEz%NX, bounds%sweepEz%NY, bounds%sweepEz%NZ) + + end subroutine gpu_advanceE_all + + subroutine gpu_advanceE_all_kernel(Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d, & + sggMiEx_d, sggMiEy_d, sggMiEz_d, & + Idxh_d, Idxe_d, & + Idyh_d, Idzh_d, Idze_d, Idye_d, & + g1_d, g2_d, & + Ex_nx, Ex_ny, Ex_nz, & + Ey_nx, Ey_ny, Ey_nz, & + Ez_nx, Ez_ny, Ez_nz) + integer(kind=4), intent(in) :: Ex_nx, Ex_ny, Ex_nz, Ey_nx, Ey_ny, Ey_nz, Ez_nx, Ez_ny, Ez_nz + real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d, sggMiEy_d, sggMiEz_d + real(kind=rkind), device, dimension(:) :: Idxh_d, Idxe_d, Idyh_d, Idzh_d, Idze_d, Idye_d, g1_d, g2_d + + integer(kind=4) :: i, j, k + real(kind=rkind) :: Idzhk, Idyhj, Idzhk2, Idxh, Idzek, Idyej, Idzek2, Idyej2 + + !$cuf kernel do(3) <<<*, *>>> + do k=1,Ex_nz + do j=1,Ex_ny + do i=1,Ex_nx + Idzhk = Idzh_d(k) + Idyhj = Idyh_d(j) + Ex_d(i,j,k) = g1_d(sggMiEx_d(i,j,k))*Ex_d(i,j,k) + g2_d(sggMiEx_d(i,j,k))* & + ((Hz_d(i,j,k) - Hz_d(i,j-1,k))*Idyhj - (Hy_d(i,j,k) - Hy_d(i,j,k-1))*Idzhk) + end do + end do + end do + + !$cuf kernel do(3) <<<*, *>>> + do k=1,Ey_nz + do j=1,Ey_ny + do i=1,Ey_nx + Idzhk2 = Idzh_d(k) + Idxh = Idxh_d(i) + Ey_d(i,j,k) = g1_d(sggMiEy_d(i,j,k))*Ey_d(i,j,k) + g2_d(sggMiEy_d(i,j,k))* & + ((Hx_d(i,j,k)-Hx_d(i,j,k-1))*Idzhk2 - (Hz_d(i,j,k)-Hz_d(i-1,j,k))*Idxh) + end do + end do + end do + + !$cuf kernel do(3) <<<*, *>>> + do k=1,Ez_nz + do j=1,Ez_ny + do i=1,Ez_nx + Idyej2 = Idye_d(j) + Idzek2 = Idze_d(k) + Ez_d(i,j,k) = g1_d(sggMiEz_d(i,j,k))*Ez_d(i,j,k) + g2_d(sggMiEz_d(i,j,k))* & + ((Hy_d(i,j,k)-Hy_d(i-1,j,k))*Idxh_d(i) - (Hx_d(i,j,k)-Hx_d(i,j-1,k))*Idyej2) + end do + end do + end do + + end subroutine gpu_advanceE_all_kernel + + !-------------------------------------------------------------------------------- + ! Fused H-field YEE kernel — updates Hx, Hy, Hz in single kernel launch + !-------------------------------------------------------------------------------- + subroutine gpu_advanceH_all(this, bounds) + class(gpu_state_t), intent(inout) :: this + type(bounds_t), intent(in) :: bounds + + if (.not. this%initialized) return + + call gpu_advanceH_all_kernel(this%Hx_d, this%Hy_d, this%Hz_d, & + this%Ey_d, this%Ez_d, this%Ex_d, & + this%sggMiHx_d, this%sggMiHy_d, this%sggMiHz_d, & + this%Idye_d, this%Idze_d, & + this%Idyh_d, this%Idzh_d, this%Idxh_d, this%Idxe_d, & + this%gm1_d, this%gm2_d, & + bounds%sweepHx%NX, bounds%sweepHx%NY, bounds%sweepHx%NZ, & + bounds%sweepHy%NX, bounds%sweepHy%NY, bounds%sweepHy%NZ, & + bounds%sweepHz%NX, bounds%sweepHz%NY, bounds%sweepHz%NZ) + + end subroutine gpu_advanceH_all + + subroutine gpu_advanceH_all_kernel(Hx_d, Hy_d, Hz_d, Ex_d, Ey_d, Ez_d, & + sggMiHx_d, sggMiHy_d, sggMiHz_d, & + Idye_d, Idze_d, & + Idyh_d, Idzh_d, Idxh_d, Idxe_d, & + gm1_d, gm2_d, & + Hx_nx, Hx_ny, Hx_nz, & + Hy_nx, Hy_ny, Hy_nz, & + Hz_nx, Hz_ny, Hz_nz) + integer(kind=4), intent(in) :: Hx_nx, Hx_ny, Hx_nz, Hy_nx, Hy_ny, Hy_nz, Hz_nx, Hz_ny, Hz_nz + real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Hy_d, Hz_d, Ex_d, Ey_d, Ez_d + integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d, sggMiHy_d, sggMiHz_d + real(kind=rkind), device, dimension(:) :: Idye_d, Idze_d, Idyh_d, Idzh_d, Idxh_d, Idxe_d, gm1_d, gm2_d + + integer(kind=4) :: i, j, k + real(kind=rkind) :: Idzek, Idyej, Idzek2, Idxh, Idyej2, Idxh2 + + !$cuf kernel do(3) <<<*, *>>> + do k=1,Hx_nz + do j=1,Hx_ny + do i=1,Hx_nx + Idzek = Idze_d(k) + Idyej = Idye_d(j) + Hx_d(i,j,k) = gm1_d(sggMiHx_d(i,j,k))*Hx_d(i,j,k) + gm2_d(sggMiHx_d(i,j,k))* & + ((Ey_d(i,j,k+1)-Ey_d(i,j,k))*Idzek - (Ez_d(i,j+1,k)-Ez_d(i,j,k))*Idyej) + end do + end do + end do + + !$cuf kernel do(3) <<<*, *>>> + do k=1,Hy_nz + do j=1,Hy_ny + do i=1,Hy_nx + Idzek2 = Idze_d(k) + Idxh = Idxh_d(i) + Hy_d(i,j,k) = gm1_d(sggMiHy_d(i,j,k))*Hy_d(i,j,k) + gm2_d(sggMiHy_d(i,j,k))* & + ((Ez_d(i+1,j,k)-Ez_d(i,j,k))*Idxh - (Ex_d(i,j,k+1)-Ex_d(i,j,k))*Idzek2) + end do + end do + end do + + !$cuf kernel do(3) <<<*, *>>> + do k=1,Hz_nz + do j=1,Hz_ny + do i=1,Hz_nx + Idyej2 = Idye_d(j) + Idxh2 = Idxh_d(i) + Hz_d(i,j,k) = gm1_d(sggMiHz_d(i,j,k))*Hz_d(i,j,k) + gm2_d(sggMiHz_d(i,j,k))* & + ((Ex_d(i,j+1,k)-Ex_d(i,j,k))*Idyej2 - (Ey_d(i+1,j,k)-Ey_d(i,j,k))*Idxh2) + end do + end do + end do + + end subroutine gpu_advanceH_all_kernel + +end module gpu_yee_m diff --git a/src_main_pub/maloney_nostoch.F90 b/src_main_pub/maloney_nostoch.F90 index 10226019d..fa616aa52 100755 --- a/src_main_pub/maloney_nostoch.F90 +++ b/src_main_pub/maloney_nostoch.F90 @@ -87,9 +87,151 @@ module SGBC_nostoch_m !!! public Malon_t,SGBCSurface_t !el tipo es publico public AdvanceSGBCE,AdvanceSGBCH,InitSGBCs,DestroySGBCs,StoreFieldsSGBCs,calc_SGBCconstants,GetSGBCs +public GetSGBCConstants,GetSGBCState,GetSGBCFieldPointers +public malon contains +!-------------------------------------------------------------------------------- +! Export per-node constants for GPU +!-------------------------------------------------------------------------------- +subroutine GetSGBCConstants(depth_arr, gm2_ext_arr, jmed_arr, transE_arr, transH_arr, alignedH_arr, & + g1_arr, g2a_arr, g2b_arr, gm2_ext_val_arr, & + correct_ha_arr, correct_hb_arr, crank_arr, filo_placa_arr, & + depth_node_arr, G1_int_arr, G2_int_arr, GM1_int_arr, GM2_int_arr, & + a_arr, b_arr, c_arr, rb_arr, rh_arr, rhm1_arr, & + tridiag_a1_arr, tridiag_b1_arr, tridiag_c1_arr, & + tridiag_an_arr, tridiag_bn_arr, tridiag_cn_arr, & + capa_arr, delta_arr, Hyee_left_arr, Hyee_right_arr, & + numNodes_out, maxDepth_out) + integer(kind=4), intent(out) :: depth_arr(:), jmed_arr(:), depth_node_arr(:) + real(kind=RKIND), intent(out) :: gm2_ext_arr(:), transE_arr(:), transH_arr(:), alignedH_arr(:) + real(kind=RKIND), intent(out) :: g1_arr(:), g2a_arr(:), g2b_arr(:), gm2_ext_val_arr(:) + logical, intent(out) :: correct_ha_arr(:), correct_hb_arr(:), crank_arr(:), filo_placa_arr(:) + real(kind=RKIND), intent(out) :: G1_int_arr(:,:), G2_int_arr(:,:), GM1_int_arr(:,:), GM2_int_arr(:,:) + real(kind=RKIND), intent(out) :: a_arr(:,:), b_arr(:,:), c_arr(:,:), rb_arr(:,:), rh_arr(:,:), rhm1_arr(:,:) + real(kind=RKIND), intent(out) :: tridiag_a1_arr(:), tridiag_b1_arr(:), tridiag_c1_arr(:) + real(kind=RKIND), intent(out) :: tridiag_an_arr(:), tridiag_bn_arr(:), tridiag_cn_arr(:) + integer(kind=4), intent(out) :: capa_arr(:,:) + real(kind=RKIND), intent(out) :: delta_arr(:,:) + real(kind=RKIND), intent(out) :: Hyee_left_arr(:), Hyee_right_arr(:) + integer(kind=4), intent(out) :: numNodes_out, maxDepth_out + + type(SGBCSurface_t), pointer :: compo + integer(kind=4) :: conta, i, maxD + + maxD = 0 + do conta = 1, malon%NumNodes + compo => malon%Nodes(conta) + if (compo%depth > maxD) maxD = compo%depth + end do + maxD = max(maxD, 1) + + numNodes_out = malon%NumNodes + maxDepth_out = maxD + + do conta = 1, malon%NumNodes + compo => malon%Nodes(conta) + depth_arr(conta) = 1 + jmed_arr(conta) = compo%jmed + depth_node_arr(conta) = compo%depth + gm2_ext_arr(conta) = compo%GM2_externo + transE_arr(conta) = compo%transversalDeltaE + transH_arr(conta) = compo%transversalDeltaH + alignedH_arr(conta) = compo%alignedlDeltaH + g1_arr(conta) = compo%g1(0) + g2a_arr(conta) = compo%g2a(0) + g2b_arr(conta) = compo%g2b(0) + gm2_ext_val_arr(conta) = compo%GM2_externo + correct_ha_arr(conta) = compo%correct_ha + correct_hb_arr(conta) = compo%correct_hb + crank_arr(conta) = compo%SGBCCrank + filo_placa_arr(conta) = compo%es_unfilo_placa + + ! Copy per-node state arrays + do i = -compo%depth, compo%depth + G1_int_arr(i + compo%depth + 1, conta) = compo%G1_interno(i) + G2_int_arr(i + compo%depth + 1, conta) = compo%G2_interno(i) + GM1_int_arr(i + compo%depth + 1, conta) = compo%GM1_interno(i) + GM2_int_arr(i + compo%depth + 1, conta) = compo%GM2_interno(i) + a_arr(i + compo%depth + 1, conta) = compo%a(i) + b_arr(i + compo%depth + 1, conta) = compo%b(i) + c_arr(i + compo%depth + 1, conta) = compo%c(i) + rb_arr(i + compo%depth + 1, conta) = compo%rb(i) + rh_arr(i + compo%depth + 1, conta) = compo%rh(i) + rhm1_arr(i + compo%depth + 1, conta) = compo%rhm1(i) + capa_arr(i + compo%depth + 1, conta) = compo%capa(i) + delta_arr(i + compo%depth + 1, conta) = compo%delta_entreEinterno(i) + end do + + ! Tridiag boundary constants + tridiag_a1_arr(conta) = compo%a1 + tridiag_b1_arr(conta) = compo%b1 + tridiag_c1_arr(conta) = compo%c1 + tridiag_an_arr(conta) = compo%an + tridiag_bn_arr(conta) = compo%bn + tridiag_cn_arr(conta) = compo%cn + Hyee_left_arr(conta) = compo%Hyee__left + Hyee_right_arr(conta) = compo%Hyee_right + end do + +end subroutine GetSGBCConstants + +!-------------------------------------------------------------------------------- +! Export per-node state (E, H, E_past, D) for GPU +!-------------------------------------------------------------------------------- +subroutine GetSGBCState(E_arr, H_arr, E_past_arr, D_arr, offset) + real(kind=RKIND), intent(out) :: E_arr(:,:), H_arr(:,:), E_past_arr(:,:), D_arr(:,:) + integer(kind=4), intent(in) :: offset + + type(SGBCSurface_t), pointer :: compo + integer(kind=4) :: conta, i + + do conta = 1, malon%NumNodes + compo => malon%Nodes(conta) + do i = -compo%depth, compo%depth + E_arr(i + offset, conta) = compo%E(i) + E_past_arr(i + offset, conta) = compo%E_past(i) + end do + if (compo%depth > 0) then + do i = -compo%depth, compo%depth-1 + H_arr(i + offset, conta) = compo%H(i) + D_arr(i + offset, conta) = compo%d(i) + end do + end if + ! Zero out D for depth=0 nodes + if (compo%depth == 0) then + D_arr(1 + offset, conta) = 0.0_RKIND + end if + end do + +end subroutine GetSGBCState + +!-------------------------------------------------------------------------------- +! Export per-timestep field pointers for GPU upload +!-------------------------------------------------------------------------------- +subroutine GetSGBCFieldPointers(Efield_arr, Ha_Plus_arr, Ha_Minu_arr, & + Hb_Plus_arr, Hb_Minu_arr, numNodes_out) + real(kind=RKIND), intent(out) :: Efield_arr(:), Ha_Plus_arr(:), Ha_Minu_arr(:) + real(kind=RKIND), intent(out) :: Hb_Plus_arr(:), Hb_Minu_arr(:) + integer(kind=4), intent(out) :: numNodes_out + + type(SGBCSurface_t), pointer :: compo + integer(kind=4) :: conta + + numNodes_out = malon%NumNodes + do conta = 1, malon%NumNodes + compo => malon%Nodes(conta) + Efield_arr(conta) = compo%Efield + Ha_Plus_arr(conta) = compo%Ha_Plus + Ha_Minu_arr(conta) = compo%Ha_Minu + Hb_Plus_arr(conta) = compo%Hb_Plus + Hb_Minu_arr(conta) = compo%Hb_Minu + end do + +end subroutine GetSGBCFieldPointers + + !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ! subroutine to initialize the parameters !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! diff --git a/src_main_pub/observation.F90 b/src_main_pub/observation.F90 index 5d8db95e2..22f23706d 100755 --- a/src_main_pub/observation.F90 +++ b/src_main_pub/observation.F90 @@ -108,8 +108,8 @@ module Observa_m real(kind=RKIND), pointer, dimension(:), save :: InvEps, InvMu type(output_t), pointer, dimension(:), save :: output - public InitObservation, FlushObservationFiles, UpdateObservation, DestroyObservation, CloseObservationFiles, unpacksinglefiles, & - GetOutput, preprocess_observation +public InitObservation, FlushObservationFiles, UpdateObservation, DestroyObservation, CloseObservationFiles, unpacksinglefiles, & + GetOutput, preprocess_observation, UpdateProbeResultsFromGPU public output_t, item_t, Serialized_t, dtft !Required at least in tests public fieldo @@ -3686,12 +3686,61 @@ subroutine updateJ(field, jdir) end select end subroutine - end subroutine UpdateObservation +end subroutine UpdateObservation + + !-------------------------------------------------------------------------------- + ! Update probe results from GPU-sampled values + ! Called when GPU samples probes directly instead of using UpdateObservation + !-------------------------------------------------------------------------------- + subroutine UpdateProbeResultsFromGPU(sgg, nTime, nInit, pointResults, blockResults, nPointProbes, nBlockProbes) + type(SGGFDTDINFO_t), intent(in) :: sgg + integer(kind=4), intent(in) :: nTime, nInit + real(kind=RKIND), dimension(:), intent(in) :: pointResults + real(kind=RKIND), dimension(:), intent(in) :: blockResults + integer(kind=4), intent(in) :: nPointProbes, nBlockProbes + integer(kind=4) :: ii, i, idx, nTimeOffset + integer(kind=4) :: pointObservationCases(6), blockObservationCases(6) + + pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz] + blockObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz] + nTimeOffset = nTime - nInit + + ! Update point probe results + if (nPointProbes > 0) then + idx = 0 + do ii = 1, sgg%NumberRequest + if (.not. sgg%Observation(ii)%TimeDomain) cycle + do i = 1, sgg%Observation(ii)%nP + if (sgg%Observation(ii)%P(i)%what == nothing) cycle + if (any(sgg%Observation(ii)%P(i)%what == pointObservationCases)) then + idx = idx + 1 + output(ii)%item(i)%valor(nTimeOffset) = pointResults(idx) + end if + end do + end do + end if + + ! Update block probe results + if (nBlockProbes > 0) then + idx = 0 + do ii = 1, sgg%NumberRequest + if (.not. sgg%Observation(ii)%TimeDomain) cycle + do i = 1, sgg%Observation(ii)%nP + if (sgg%Observation(ii)%P(i)%what == nothing) cycle + if (any(sgg%Observation(ii)%P(i)%what == blockObservationCases)) then + idx = idx + 1 + output(ii)%item(i)%valor(nTimeOffset) = blockResults(idx) + end if + end do + end do + end if - !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! - !!! Flushes the observed magnitudes to disk - !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! - subroutine FlushObservationFiles(sgg,nInit,FinalInstant,layoutnumber,num_procs, dxe,dye,dze,dxh,dyh,dzh,b,singlefilewrite,facesNF2FF,flushff) + end subroutine UpdateProbeResultsFromGPU + + !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + !!! Flushes the observed magnitudes to disk + !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + subroutine FlushObservationFiles(sgg,nInit,FinalInstant,layoutnumber,num_procs, dxe,dye,dze,dxh,dyh,dzh,b,singlefilewrite,facesNF2FF,flushff) use ilumina_m !is needed to also calculate the incident field in the observed points !solo lo precisa de entrada farfield type(bounds_t) :: b @@ -3716,7 +3765,7 @@ subroutine FlushObservationFiles(sgg,nInit,FinalInstant,layoutnumber,num_procs, logical :: called_fromobservation, dummy_logical integer :: my_iostat - character(len=BUFSIZE) :: whoami + character(len=BUFSIZE) :: whoami !!! write (whoami, '(a,i5,a,i5,a)') '(', layoutnumber + 1, '/', num_procs, ') ' called_fromobservation = .true. diff --git a/src_main_pub/planewaves.F90 b/src_main_pub/planewaves.F90 index 55cf6a72e..43166b5b2 100755 --- a/src_main_pub/planewaves.F90 +++ b/src_main_pub/planewaves.F90 @@ -879,19 +879,19 @@ subroutine AdvancePlaneWaveE( sgg, timeinstant, b, g2, Idxh, Idyh, Idzh, Ex, Ey, #ifdef CompileWithOpenMP !$OMP PARALLEL do DEFAULT(SHARED) private (incidente,j,k,j_m,k_m) #endif - do k = TrFr(jjj)%K%com%Ez, TrFr(jjj)%K%fin%Ez - k_m = k - b%Ez%ZI - do j = TrFr(jjj)%J%com%Ez, TrFr(jjj)%J%fin%Ez - j_m = j - b%Ez%YI - !---> - incidente = Incid(sgg,jjj, iHy, timei, i-1, j, k,still_planewave_time,called_fromobservation) - Ez( i_m, j_m, k_m) = Ez( i_m, j_m, k_m) - G2_1 * incidente * Id - end do - end do + do k = TrFr(jjj)%K%com%Ez, TrFr(jjj)%K%fin%Ez + k_m = k - b%Ez%ZI + do j = TrFr(jjj)%J%com%Ez, TrFr(jjj)%J%fin%Ez + j_m = j - b%Ez%YI + !---> + incidente = Incid(sgg,jjj, iHy, timei, i-1, j, k,still_planewave_time,called_fromobservation) + Ez( i_m, j_m, k_m) = Ez( i_m, j_m, k_m) - G2_1 * incidente * Id + end do + end do #ifdef CompileWithOpenMP !$OMP END PARALLEL DO #endif - !Ey Back + !Ey Back i = TrFr(jjj)%I%tra%Ey !Back i_m = i - b%Ey%XI Id = Idxh( i_m ) @@ -899,21 +899,21 @@ subroutine AdvancePlaneWaveE( sgg, timeinstant, b, g2, Idxh, Idyh, Idzh, Ex, Ey, #ifdef CompileWithOpenMP !$OMP PARALLEL do DEFAULT(SHARED) private (incidente,j,k,j_m,k_m) #endif - do k = TrFr(jjj)%K%com%Ey, TrFr(jjj)%K%fin%Ey - k_m = k - b%Ey%ZI - do j = TrFr(jjj)%J%com%Ey, TrFr(jjj)%J%fin%Ey - j_m = j - b%Ey%YI - !---> - incidente = Incid(sgg,jjj, iHz, timei, i-1, j, k,still_planewave_time,called_fromobservation) - Ey( i_m, j_m, k_m) = Ey( i_m, j_m, k_m) + G2_1 * incidente * Id - End do - end do + do k = TrFr(jjj)%K%com%Ey, TrFr(jjj)%K%fin%Ey + k_m = k - b%Ey%ZI + do j = TrFr(jjj)%J%com%Ey, TrFr(jjj)%J%fin%Ey + j_m = j - b%Ey%YI + !---> + incidente = Incid(sgg,jjj, iHz, timei, i-1, j, k,still_planewave_time,called_fromobservation) + Ey( i_m, j_m, k_m) = Ey( i_m, j_m, k_m) + G2_1 * incidente * Id + End do + end do #ifdef CompileWithOpenMP !$OMP END PARALLEL DO #endif - end if - !---> - If( IluminaFr(jjj)) then + end if + !---> + If( IluminaFr(jjj)) then !Ez Front i = TrFr(jjj)%I%fro%Ez !Front i_m = i - b%Ez%XI @@ -922,19 +922,19 @@ subroutine AdvancePlaneWaveE( sgg, timeinstant, b, g2, Idxh, Idyh, Idzh, Ex, Ey, #ifdef CompileWithOpenMP !$OMP PARALLEL do DEFAULT(SHARED) private (incidente,j,k,j_m,k_m) #endif - do k = TrFr(jjj)%K%com%Ez, TrFr(jjj)%K%fin%Ez - k_m = k - b%Ez%ZI - do j = TrFr(jjj)%J%com%Ez, TrFr(jjj)%J%fin%Ez - j_m = j - b%Ez%YI - !---> - incidente = Incid(sgg,jjj, iHy, timei, i, j, k,still_planewave_time,called_fromobservation) - Ez( i_m, j_m, k_m) = Ez( i_m, j_m, k_m) + G2_1 * incidente * Id - end do - end do + do k = TrFr(jjj)%K%com%Ez, TrFr(jjj)%K%fin%Ez + k_m = k - b%Ez%ZI + do j = TrFr(jjj)%J%com%Ez, TrFr(jjj)%J%fin%Ez + j_m = j - b%Ez%YI + !---> + incidente = Incid(sgg,jjj, iHy, timei, i, j, k,still_planewave_time,called_fromobservation) + Ez( i_m, j_m, k_m) = Ez( i_m, j_m, k_m) + G2_1 * incidente * Id + end do + end do #ifdef CompileWithOpenMP !$OMP END PARALLEL DO #endif - !Ey Front + !Ey Front i = TrFr(jjj)%I%fro%Ey !Front i_m = i - b%Ey%XI Id = Idxh( i_m ) diff --git a/src_main_pub/timestepping.F90 b/src_main_pub/timestepping.F90 index 546ca5ce9..a9cbd6107 100755 --- a/src_main_pub/timestepping.F90 +++ b/src_main_pub/timestepping.F90 @@ -45,6 +45,17 @@ module Solver_m use sgbc_stoch #else use SGBC_nostoch_m +#endif +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + ! SGBC GPU not compatible with NVHPC 26.3 + ! use gpu_sgbc_core_m + ! use gpu_sgbc_e_m + ! use gpu_sgbc_h_m + use gpu_core_probe_m +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + use farfield_m + use gpu_nf2ff_m +#endif #endif use EDispersives_m use Mdispersives_m @@ -79,6 +90,12 @@ module Solver_m !! #ifdef CompileWithProfiling use nvtx +#endif +#if defined(SEMBA_FDTD_ENABLE_ACC) || defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + use gpu_core_m + use gpu_yee_m + use gpu_cpml_m + use gpu_mur_m #endif implicit none @@ -114,6 +131,11 @@ module Solver_m type(mtln_t) :: mtln_parsed #endif +#if defined(SEMBA_FDTD_ENABLE_ACC) || defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + type(gpu_state_t) :: gpu + logical :: gpu_initialized = .false. +#endif + contains procedure :: init => solver_init procedure :: run => solver_run @@ -465,6 +487,22 @@ subroutine solver_init(this) call this%init_fields() Ex => this%Ex; Ey => this%Ey; Ez => this%Ez; Hx => this%Hx; Hy => this%Hy; Hz => this%Hz + +#if defined(SEMBA_FDTD_ENABLE_ACC) || defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) || defined(CompileWithACC) + if (.not.this%gpu_initialized) then + call gpu_init(this%gpu, this%Ex, this%Ey, this%Ez, this%Hx, this%Hy, this%Hz, & + this%media%sggMiEx, this%media%sggMiEy, this%media%sggMiEz, & + this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & + this%g%g1, this%g%g2, this%g%gm1, this%g%gm2, & + this%Idxe, this%Idye, this%Idze, this%Idxh, this%Idyh, this%Idzh, & + this%dxe, this%dye, this%dze, this%dxh, this%dyh, this%dzh) + this%gpu_initialized = this%gpu%initialized + ! Initialize probe buffers for on-device sampling + if (this%gpu_initialized .and. this%thereAre%Observation) then + call gpu_init_probe_buffers(this%gpu, this%sgg) + end if + endif +#endif !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!! Init the local variables and observation stuff needed by each module, taking into account resume status @@ -588,6 +626,12 @@ subroutine solver_init(this) call fillMtag(this%sgg, this%media%sggMiEx, this%media%sggMiEy, this%media%sggMiEz, this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz,this%media%sggMtag, this%bounds, this%tag_numbers) call initializeObservation() +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + ! Initialize GPU NF2FF buffers after InitFarField (deferred — full implementation pending) + ! if (this%gpu_initialized .and. associated(FF%ExIz)) then + ! call gpu_init_nf2ff_buffers(...) + ! endif +#endif !!!!voy a jugar con fuego !!!210815 sincronizo las matrices de medios porque a veces se precisan. Reutilizo rutinas viejas mias NO CRAY. Solo se usan aqui !MPI initialization @@ -1028,13 +1072,25 @@ subroutine reportSimulationOptions() end if end subroutine - subroutine initializeBorders() - character(len=BUFSIZE) :: dubuf - logical :: l_auxinput, l_auxoutput -#ifdef CompileWithMPI - integer(kind=4) :: ierr -#endif - write(dubuf,*) 'Init Other Borders...'; call print11(this%control%layoutnumber,dubuf) + subroutine initializeBorders() + character(len=BUFSIZE) :: dubuf + logical :: l_auxinput, l_auxoutput + #ifdef CompileWithMPI + integer(kind=4) :: ierr + #endif + integer(kind=4) :: left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj + integer(kind=4) :: left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj + integer(kind=4) :: right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj + integer(kind=4) :: right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj + integer(kind=4) :: down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj + integer(kind=4) :: down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj + integer(kind=4) :: up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj + integer(kind=4) :: up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj + integer(kind=4) :: back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj + integer(kind=4) :: back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj + integer(kind=4) :: front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj + integer(kind=4) :: front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj + write(dubuf,*) 'Init Other Borders...'; call print11(this%control%layoutnumber,dubuf) call InitOtherBorders (this%sgg,this%thereAre) l_auxinput=this%thereAre%PECBorders.or.this%thereAre%PMCBorders.or.this%thereAre%PeriodicBorders l_auxoutput=l_auxinput @@ -1052,10 +1108,62 @@ subroutine initializeBorders() call MPI_Barrier(SUBCOMM_MPI,ierr) #endif write(dubuf,*) 'Init CPML Borders...'; call print11(this%control%layoutnumber,dubuf) - call InitCPMLBorders (this%sgg,this%sinPML_fullsize,this%thereAre%PMLBorders,this%control, & - dxe,dye,dze,dxh,dyh,dzh,Idxe,Idye,Idze,Idxh,Idyh,Idzh,this%eps0,this%mu0) - - l_auxinput=this%thereAre%PMLBorders + call InitCPMLBorders (this%sgg,this%sinPML_fullsize,this%thereAre%PMLBorders,this%control, & + dxe,dye,dze,dxh,dyh,dzh,Idxe,Idye,Idze,Idxh,Idyh,Idzh,this%eps0,this%mu0) + +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized .and. this%thereAre%PMLBorders) then + write(dubuf,*) 'Init CPML GPU left boundary...'; call print11(this%control%layoutnumber,dubuf) + call gpu_init_pml_left(this%gpu, P_be_y, P_ce_y, P_bm_y, P_cm_y, & + PMLc(iEx)%XI(left), PMLc(iEx)%XE(left), PMLc(iEx)%YI(left), PMLc(iEx)%YE(left), PMLc(iEx)%ZI(left), PMLc(iEx)%ZE(left), & + PMLc(iEz)%XI(left), PMLc(iEz)%XE(left), PMLc(iEz)%YI(left), PMLc(iEz)%YE(left), PMLc(iEz)%ZI(left), PMLc(iEz)%ZE(left), & + PMLc(iHx)%XI(left), PMLc(iHx)%XE(left), PMLc(iHx)%YI(left), PMLc(iHx)%YE(left), PMLc(iHx)%ZI(left), PMLc(iHx)%ZE(left), & + PMLc(iHz)%XI(left), PMLc(iHz)%XE(left), PMLc(iHz)%YI(left), PMLc(iHz)%YE(left), PMLc(iHz)%ZI(left), PMLc(iHz)%ZE(left), & + this%gpu%Ex_ny, this%gpu%Ex_nz, this%gpu%Ez_ny, this%gpu%Ez_nz, this%gpu%Hx_ny, this%gpu%Hx_nz, this%gpu%Hz_ny, this%gpu%Hz_nz) + if (this%gpu%pml_left_initialized) then + write(dubuf,*) 'Init CPML GPU right boundary...'; call print11(this%control%layoutnumber,dubuf) + call gpu_init_pml_right(this%gpu, & + PMLc(iEx)%XI(right), PMLc(iEx)%XE(right), PMLc(iEx)%YI(right), PMLc(iEx)%YE(right), PMLc(iEx)%ZI(right), PMLc(iEx)%ZE(right), & + PMLc(iEz)%XI(right), PMLc(iEz)%XE(right), PMLc(iEz)%YI(right), PMLc(iEz)%YE(right), PMLc(iEz)%ZI(right), PMLc(iEz)%ZE(right), & + PMLc(iHx)%XI(right), PMLc(iHx)%XE(right), PMLc(iHx)%YI(right), PMLc(iHx)%YE(right), PMLc(iHx)%ZI(right), PMLc(iHx)%ZE(right), & + PMLc(iHz)%XI(right), PMLc(iHz)%XE(right), PMLc(iHz)%YI(right), PMLc(iHz)%YE(right), PMLc(iHz)%ZI(right), PMLc(iHz)%ZE(right)) + endif + if (this%gpu%pml_right_initialized) then + write(dubuf,*) 'Init CPML GPU down boundary...'; call print11(this%control%layoutnumber,dubuf) + call gpu_init_pml_down(this%gpu, P_be_z, P_ce_z, P_bm_z, P_cm_z, & + PMLc(iEx)%XI(down), PMLc(iEx)%XE(down), PMLc(iEx)%YI(down), PMLc(iEx)%YE(down), PMLc(iEx)%ZI(down), PMLc(iEx)%ZE(down), & + PMLc(iEy)%XI(down), PMLc(iEy)%XE(down), PMLc(iEy)%YI(down), PMLc(iEy)%YE(down), PMLc(iEy)%ZI(down), PMLc(iEy)%ZE(down), & + PMLc(iHx)%XI(down), PMLc(iHx)%XE(down), PMLc(iHx)%YI(down), PMLc(iHx)%YE(down), PMLc(iHx)%ZI(down), PMLc(iHx)%ZE(down), & + PMLc(iHy)%XI(down), PMLc(iHy)%XE(down), PMLc(iHy)%YI(down), PMLc(iHy)%YE(down), PMLc(iHy)%ZI(down), PMLc(iHy)%ZE(down)) + if (this%gpu%pml_down_initialized) then + write(dubuf,*) 'Init CPML GPU up boundary...'; call print11(this%control%layoutnumber,dubuf) + call gpu_init_pml_up(this%gpu, & + PMLc(iEx)%XI(up), PMLc(iEx)%XE(up), PMLc(iEx)%YI(up), PMLc(iEx)%YE(up), PMLc(iEx)%ZI(up), PMLc(iEx)%ZE(up), & + PMLc(iEy)%XI(up), PMLc(iEy)%XE(up), PMLc(iEy)%YI(up), PMLc(iEy)%YE(up), PMLc(iEy)%ZI(up), PMLc(iEy)%ZE(up), & + PMLc(iHx)%XI(up), PMLc(iHx)%XE(up), PMLc(iHx)%YI(up), PMLc(iHx)%YE(up), PMLc(iHx)%ZI(up), PMLc(iHx)%ZE(up), & + PMLc(iHy)%XI(up), PMLc(iHy)%XE(up), PMLc(iHy)%YI(up), PMLc(iHy)%YE(up), PMLc(iHy)%ZI(up), PMLc(iHy)%ZE(up)) + endif + if (this%gpu%pml_up_initialized) then + write(dubuf,*) 'Init CPML GPU back boundary...'; call print11(this%control%layoutnumber,dubuf) + call gpu_init_pml_back(this%gpu, P_be_x, P_ce_x, P_bm_x, P_cm_x, & + PMLc(iEz)%XI(back), PMLc(iEz)%XE(back), PMLc(iEz)%YI(back), PMLc(iEz)%YE(back), PMLc(iEz)%ZI(back), PMLc(iEz)%ZE(back), & + PMLc(iEy)%XI(back), PMLc(iEy)%XE(back), PMLc(iEy)%YI(back), PMLc(iEy)%YE(back), PMLc(iEy)%ZI(back), PMLc(iEy)%ZE(back), & + PMLc(iHz)%XI(back), PMLc(iHz)%XE(back), PMLc(iHz)%YI(back), PMLc(iHz)%YE(back), PMLc(iHz)%ZI(back), PMLc(iHz)%ZE(back), & + PMLc(iHy)%XI(back), PMLc(iHy)%XE(back), PMLc(iHy)%YI(back), PMLc(iHy)%YE(back), PMLc(iHy)%ZI(back), PMLc(iHy)%ZE(back)) + if (this%gpu%pml_back_initialized) then + write(dubuf,*) 'Init CPML GPU front boundary...'; call print11(this%control%layoutnumber,dubuf) + call gpu_init_pml_front(this%gpu, & + PMLc(iEz)%XI(front), PMLc(iEz)%XE(front), PMLc(iEz)%YI(front), PMLc(iEz)%YE(front), PMLc(iEz)%ZI(front), PMLc(iEz)%ZE(front), & + PMLc(iEy)%XI(front), PMLc(iEy)%XE(front), PMLc(iEy)%YI(front), PMLc(iEy)%YE(front), PMLc(iEy)%ZI(front), PMLc(iEy)%ZE(front), & + PMLc(iHz)%XI(front), PMLc(iHz)%XE(front), PMLc(iHz)%YI(front), PMLc(iHz)%YE(front), PMLc(iHz)%ZI(front), PMLc(iHz)%ZE(front), & + PMLc(iHy)%XI(front), PMLc(iHy)%XE(front), PMLc(iHy)%YI(front), PMLc(iHy)%YE(front), PMLc(iHy)%ZI(front), PMLc(iHy)%ZE(front)) + endif + endif + endif + endif +#endif + + l_auxinput=this%thereAre%PMLBorders l_auxoutput=l_auxinput #ifdef CompileWithMPI call MPI_Barrier(SUBCOMM_MPI,ierr) @@ -1086,9 +1194,104 @@ subroutine initializeBorders() #ifdef CompileWithMPI call MPI_Barrier(SUBCOMM_MPI,ierr) #endif - write(dubuf,*) 'Init Mur Borders...'; call print11(this%control%layoutnumber,dubuf) - call InitMURBorders (this%sgg,this%thereAre%MURBorders,this%control%resume,Idxh,Idyh,Idzh,this%eps0,this%mu0) - l_auxinput= this%thereAre%MURBorders + write(dubuf,*) 'Init Mur Borders...'; call print11(this%control%layoutnumber,dubuf) + call InitMURBorders (this%sgg,this%thereAre%MURBorders,this%control%resume,Idxh,Idyh,Idzh,this%eps0,this%mu0) +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized) then + call gpu_init_mur_coeffs(this%gpu, this%sgg%numMedia, & + left_CAB1, left_CAB3, left_cab4, & + right_CAB1, right_CAB3, right_cab4, & + down_CAB1, down_CAB3, down_cab4, & + up_CAB1, up_CAB3, up_cab4, & + back_CAB1, back_CAB3, back_cab4, & + front_CAB1, front_CAB3, front_cab4) + ! Get MUR domain limits for GPU — each boundary uses 2 fields + call get_mur_limits(4, 1, & + left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj) + call get_mur_limits(5, 1, & + left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj) + call get_mur_limits(4, 2, & + right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj) + call get_mur_limits(5, 2, & + right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj) + call get_mur_limits(6, 3, & + down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj) + call get_mur_limits(4, 3, & + down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj) + call get_mur_limits(6, 4, & + up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj) + call get_mur_limits(4, 4, & + up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj) + call get_mur_limits(5, 5, & + back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj) + call get_mur_limits(6, 5, & + back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj) + call get_mur_limits(5, 6, & + front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj) + call get_mur_limits(6, 6, & + front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj) + call gpu_init_mur_limits(this%gpu, & + left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj, & + left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj, & + right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj, & + right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj, & + down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj, & + down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj, & + up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj, & + up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj, & + back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj, & + back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj, & + front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj, & + front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj) + write(dubuf,*) 'Init MUR GPU past fields...' + call print11(this%control%layoutnumber,dubuf) + call gpu_init_mur_past_fields(this%gpu, & + ubound(regLR(left)%Past_Hx,1)-lbound(regLR(left)%Past_Hx,1)+1, & + ubound(regLR(left)%Past_Hx,2)-lbound(regLR(left)%Past_Hx,2)+1, & + ubound(regLR(left)%Past_Hx,3)-lbound(regLR(left)%Past_Hx,3)+1, & + ubound(regLR(left)%Past_Hz,1)-lbound(regLR(left)%Past_Hz,1)+1, & + ubound(regLR(left)%Past_Hz,2)-lbound(regLR(left)%Past_Hz,2)+1, & + ubound(regLR(left)%Past_Hz,3)-lbound(regLR(left)%Past_Hz,3)+1, & + ubound(regLR(right)%Past_Hx,1)-lbound(regLR(right)%Past_Hx,1)+1, & + ubound(regLR(right)%Past_Hx,2)-lbound(regLR(right)%Past_Hx,2)+1, & + ubound(regLR(right)%Past_Hx,3)-lbound(regLR(right)%Past_Hx,3)+1, & + ubound(regLR(right)%Past_Hz,1)-lbound(regLR(right)%Past_Hz,1)+1, & + ubound(regLR(right)%Past_Hz,2)-lbound(regLR(right)%Past_Hz,2)+1, & + ubound(regLR(right)%Past_Hz,3)-lbound(regLR(right)%Past_Hz,3)+1, & + ubound(regDU(down)%Past_Hy,1)-lbound(regDU(down)%Past_Hy,1)+1, & + ubound(regDU(down)%Past_Hy,2)-lbound(regDU(down)%Past_Hy,2)+1, & + ubound(regDU(down)%Past_Hy,3)-lbound(regDU(down)%Past_Hy,3)+1, & + ubound(regDU(down)%Past_Hx,1)-lbound(regDU(down)%Past_Hx,1)+1, & + ubound(regDU(down)%Past_Hx,2)-lbound(regDU(down)%Past_Hx,2)+1, & + ubound(regDU(down)%Past_Hx,3)-lbound(regDU(down)%Past_Hx,3)+1, & + ubound(regDU(up)%Past_Hy,1)-lbound(regDU(up)%Past_Hy,1)+1, & + ubound(regDU(up)%Past_Hy,2)-lbound(regDU(up)%Past_Hy,2)+1, & + ubound(regDU(up)%Past_Hy,3)-lbound(regDU(up)%Past_Hy,3)+1, & + ubound(regDU(up)%Past_Hx,1)-lbound(regDU(up)%Past_Hx,1)+1, & + ubound(regDU(up)%Past_Hx,2)-lbound(regDU(up)%Past_Hx,2)+1, & + ubound(regDU(up)%Past_Hx,3)-lbound(regDU(up)%Past_Hx,3)+1, & + ubound(regBF(back)%Past_Hz,1)-lbound(regBF(back)%Past_Hz,1)+1, & + ubound(regBF(back)%Past_Hz,2)-lbound(regBF(back)%Past_Hz,2)+1, & + ubound(regBF(back)%Past_Hz,3)-lbound(regBF(back)%Past_Hz,3)+1, & + ubound(regBF(back)%Past_Hy,1)-lbound(regBF(back)%Past_Hy,1)+1, & + ubound(regBF(back)%Past_Hy,2)-lbound(regBF(back)%Past_Hy,2)+1, & + ubound(regBF(back)%Past_Hy,3)-lbound(regBF(back)%Past_Hy,3)+1, & + ubound(regBF(front)%Past_Hz,1)-lbound(regBF(front)%Past_Hz,1)+1, & + ubound(regBF(front)%Past_Hz,2)-lbound(regBF(front)%Past_Hz,2)+1, & + ubound(regBF(front)%Past_Hz,3)-lbound(regBF(front)%Past_Hz,3)+1, & + ubound(regBF(front)%Past_Hy,1)-lbound(regBF(front)%Past_Hy,1)+1, & + ubound(regBF(front)%Past_Hy,2)-lbound(regBF(front)%Past_Hy,2)+1, & + ubound(regBF(front)%Past_Hy,3)-lbound(regBF(front)%Past_Hy,3)+1, & + regLR(left)%Past_Hx, regLR(left)%Past_Hz, & + regLR(right)%Past_Hx, regLR(right)%Past_Hz, & + regDU(down)%Past_Hy, regDU(down)%Past_Hx, & + regDU(up)%Past_Hy, regDU(up)%Past_Hx, & + regBF(back)%Past_Hz, regBF(back)%Past_Hy, & + regBF(front)%Past_Hz, regBF(front)%Past_Hy) + endif +#endif + ! CPU MUR path (AdvanceMagneticMUR) is used as fallback + l_auxinput= this%thereAre%MURBorders l_auxoutput=l_auxinput #ifdef CompileWithMPI call MPI_Barrier(SUBCOMM_MPI,ierr) @@ -1729,10 +1932,12 @@ subroutine solver_run(this) real(kind=rkind), pointer, dimension(:) :: Idxe, Idye, Idze, Idxh, Idyh, Idzh, dxe, dye, dze, dxh, dyh, dzh logical :: call_timing, l_aux, flushFF, somethingdone, newsomethingdone - integer :: i - real(kind=rkind) :: pscale_alpha - real(kind=rkind_tiempo) :: at - character(len=bufsize) :: dubuf + integer :: i + real(kind=rkind) :: pscale_alpha + real(kind=rkind_tiempo) :: at + character(len=bufsize) :: dubuf + ! NF2FF GPU output buffers + real(kind=rkind), dimension(:), allocatable :: Etheta_out, Ephi_out, RCS_out #ifdef CompileWithMPI integer(kind=4) :: ierr #endif @@ -1795,13 +2000,18 @@ subroutine solver_run(this) call MPI_AllReduce( l_aux, this%perform%postprocess, 1_4, MPI_LOGICAL, MPI_LOR, SUBCOMM_MPI, ierr) #endif !!!!!!!!!!!! - if (this%perform%flushFIELDS) then - write(dubuf,*) SEPARADOR,trim(adjustl(this%control%nentradaroot)),separador - call print11(this%control%layoutnumber,dubuf) - write(dubuf,*) 'INIT FLUSHING OF RESTARTING FIELDS n=',this%n - call print11(this%control%layoutnumber,dubuf) - call flush_and_save_resume(this%sgg, this%bounds, this%control%layoutnumber, this%control%num_procs, this%control%nentradaroot, this%control%nresumeable2, this%thereare, this%n,this%eps0,this%mu0, this%everflushed, & - Ex, Ey, Ez, Hx, Hy, Hz,this%control%wiresflavor,this%control%simu_devia,this%control%stochastic) + if (this%perform%flushFIELDS) then +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized .and. this%gpu%fields_on_device) then + call gpu_download(this%gpu) + endif +#endif + write(dubuf,*) SEPARADOR,trim(adjustl(this%control%nentradaroot)),separador + call print11(this%control%layoutnumber,dubuf) + write(dubuf,*) 'INIT FLUSHING OF RESTARTING FIELDS n=',this%n + call print11(this%control%layoutnumber,dubuf) + call flush_and_save_resume(this%sgg, this%bounds, this%control%layoutnumber, this%control%num_procs, this%control%nentradaroot, this%control%nresumeable2, this%thereare, this%n,this%eps0,this%mu0, this%everflushed, & + Ex, Ey, Ez, Hx, Hy, Hz,this%control%wiresflavor,this%control%simu_devia,this%control%stochastic) #ifdef CompileWithMPI call MPI_Barrier(SUBCOMM_MPI,ierr) #endif @@ -1824,8 +2034,18 @@ subroutine solver_run(this) call print11(this%control%layoutnumber,dubuf) call print11(this%control%layoutnumber,SEPARADOR//separador//separador) !! - if (this%thereAre%Observation) call FlushObservationFiles(this%sgg,this%ini_save, this%n,this%control%layoutnumber, this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,flushFF) - !! +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized .and. this%gpu%fields_on_device) then + call gpu_download(this%gpu) + endif + ! GPU NF2FF flush — replaces CPU far-field pattern computation + ! GPU NF2FF flush deferred — CPU path handles far-field computation + ! if (this%gpu_initialized .and. this%gpu%nf2ff_initialized .and. this%thereAre%FarFields) then + ! call gpu_flush_nf2ff(this%gpu, Etheta_out, Ephi_out, RCS_out) + ! endif +#endif + if (this%thereAre%Observation) call FlushObservationFiles(this%sgg,this%ini_save, this%n,this%control%layoutnumber, this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,flushFF) + !! #ifdef CompileWithMPI call MPI_Barrier(SUBCOMM_MPI,ierr) #endif @@ -1865,12 +2085,17 @@ subroutine solver_run(this) end if !! if (this%perform%flushvtk) then - write(dubuf,'(a,i9)') ' Post-processing .vtk files n= ',this%n - call print11(this%control%layoutnumber,SEPARADOR//separador//separador) - call print11(this%control%layoutnumber,dubuf) - call print11(this%control%layoutnumber,SEPARADOR//separador//separador) - somethingdone=.false. - if (this%thereAre%Observation) call createvtkOnTheFly(this%control%layoutnumber,this%control%num_procs,this%sgg,this%control%vtkindex,somethingdone,this%control%mpidir,this%media%sggMtag,this%control%dontwritevtk) +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized .and. this%gpu%fields_on_device) then + call gpu_download(this%gpu) + endif +#endif + write(dubuf,'(a,i9)') ' Post-processing .vtk files n= ',this%n + call print11(this%control%layoutnumber,SEPARADOR//separador//separador) + call print11(this%control%layoutnumber,dubuf) + call print11(this%control%layoutnumber,SEPARADOR//separador//separador) + somethingdone=.false. + if (this%thereAre%Observation) call createvtkOnTheFly(this%control%layoutnumber,this%control%num_procs,this%sgg,this%control%vtkindex,somethingdone,this%control%mpidir,this%media%sggMtag,this%control%dontwritevtk) #ifdef CompileWithMPI call MPI_Barrier(SUBCOMM_MPI,ierr) call MPI_AllReduce( somethingdone, newsomethingdone, 1_4, MPI_LOGICAL, MPI_LOR, SUBCOMM_MPI, ierr) @@ -1971,16 +2196,78 @@ subroutine solver_run(this) end do ciclo_temporal ! End of the time-stepping loop contains - subroutine updateAndFlush() - integer(kind=4) :: mindum - if (this%thereAre%Observation) then - call UpdateObservation(this%sgg,this%media,this%tag_numbers, this%n,this%ini_save, Ex, Ey, Ez, Hx, Hy, Hz, dxe, dye, dze, dxh, dyh, dzh,this%control%wiresflavor,this%sinPML_fullsize,this%control%wirecrank, this%control%noconformalmapvtk,this%bounds) +subroutine updateAndFlush() + integer(kind=4) :: mindum, ii, i, idx + real(kind=rkind), allocatable, dimension(:) :: point_results, block_results + integer(kind=4) :: pointCount, blockCount + integer(kind=4) :: pointObservationCases(6), blockObservationCases(6) + #if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + ! GPU optimization: keep fields on device between timesteps. + ! Sample probes directly on GPU — no field download needed. + if (this%gpu_initialized .and. this%gpu%fields_on_device .and. this%thereAre%Observation) then + pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz] + blockObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz] + pointCount = 0 + blockCount = 0 + + ! Count probes by type + do ii = 1, this%sgg%NumberRequest + if (.not. this%sgg%Observation(ii)%TimeDomain) cycle + do i = 1, this%sgg%Observation(ii)%nP + if (this%sgg%Observation(ii)%P(i)%what == nothing) cycle + if (any(this%sgg%Observation(ii)%P(i)%what == pointObservationCases)) then + pointCount = pointCount + 1 + else if (any(this%sgg%Observation(ii)%P(i)%what == blockObservationCases)) then + blockCount = blockCount + 1 + end if + end do + end do + + ! Sample point probes on GPU + if (pointCount > 0) then + allocate(point_results(pointCount)) + call gpu_sample_point_probes(this%gpu, point_results, this%n) + call UpdateProbeResultsFromGPU(this%sgg, this%n, this%ini_save, point_results, block_results, pointCount, 0) + deallocate(point_results) + end if + + ! Sample block probes on GPU + if (blockCount > 0) then + allocate(block_results(blockCount)) + call gpu_sample_block_probes(this%gpu, block_results, this%n) + call UpdateProbeResultsFromGPU(this%sgg, this%n, this%ini_save, point_results, block_results, 0, blockCount) + deallocate(block_results) + end if + + ! Fused probe sampling (alternative — single kernel launch for both types) + ! if (pointCount > 0 .or. blockCount > 0) then + ! if (pointCount > 0) allocate(point_results(pointCount)) + ! if (blockCount > 0) allocate(block_results(blockCount)) + ! call gpu_sample_all_probes(this%gpu, point_results, block_results, this%n) + ! call UpdateProbeResultsFromGPU(this%sgg, this%n, this%ini_save, point_results, block_results, pointCount, blockCount) + ! if (pointCount > 0) deallocate(point_results) + ! if (blockCount > 0) deallocate(block_results) + ! end if + + else + ! CPU path — download fields and call UpdateObservation + if (this%gpu_initialized .and. this%gpu%fields_on_device) then + call gpu_download(this%gpu) + endif + if (this%thereAre%Observation) then + call UpdateObservation(this%sgg,this%media,this%tag_numbers, this%n,this%ini_save, Ex, Ey, Ez, Hx, Hy, Hz, dxe, dye, dze, dxh, dyh, dzh,this%control%wiresflavor,this%sinPML_fullsize,this%control%wirecrank, this%control%noconformalmapvtk,this%bounds) + end if + endif + #else + if (this%thereAre%Observation) then + call UpdateObservation(this%sgg,this%media,this%tag_numbers, this%n,this%ini_save, Ex, Ey, Ez, Hx, Hy, Hz, dxe, dye, dze, dxh, dyh, dzh,this%control%wiresflavor,this%sinPML_fullsize,this%control%wirecrank, this%control%noconformalmapvtk,this%bounds) + endif + #endif if (this%n>=this%ini_save+BuffObse) then mindum=min(this%control%finaltimestep,this%ini_save+BuffObse) call FlushObservationFiles(this%sgg,this%ini_save,mindum,this%control%layoutnumber,this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,.FALSE.) !no se flushean los farfields ahora end if - end if - end subroutine + end subroutine subroutine singleUnpack() character(len=BUFSIZE) :: dubuf @@ -2066,8 +2353,6 @@ subroutine step(this) call this%advancePlaneWaveH() call this%advanceNodalH() call this%advanceWiresH() - call this%MinusCloneMagneticPMC() - call this%CloneMagneticPeriodic() #ifdef CompileWithMPI !!Flush all the MPI (esto estaba justo al principo del bucle temporal diciendo que era necesario para correcto resuming) @@ -2158,29 +2443,34 @@ subroutine init_MPIConformalProbes(this) end subroutine init_MPIConformalProbes #endif - subroutine advanceE(this) - class(solver_t) :: this -#ifdef CompileWithProfiling - call nvtxStartRange("Antes del bucle EX") +subroutine advanceE(this) + class(solver_t) :: this +#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN + if (this%gpu_initialized) then + this%gpu%gpu_e_fused_launched = .false. + endif #endif - call this%advanceEx(this%media%sggMiEx) #ifdef CompileWithProfiling - call nvtxEndRange + call nvtxStartRange("Antes del bucle EX") +#endif + call this%advanceEx(this%media%sggMiEx) +#ifdef CompileWithProfiling + call nvtxEndRange - call nvtxStartRange("Antes del bucle EY") + call nvtxStartRange("Antes del bucle EY") #endif - call this%advanceEy(this%media%sggMiEy) - -#ifdef CompileWithProfiling - call nvtxEndRange + call this%advanceEy(this%media%sggMiEy) - call nvtxStartRange("Antes del bucle EZ") +#ifdef CompileWithProfiling + call nvtxEndRange + + call nvtxStartRange("Antes del bucle EZ") #endif - call this%advanceEz(this%media%sggMiEz) -#ifdef CompileWithProfiling - call nvtxEndRange + call this%advanceEz(this%media%sggMiEz) +#ifdef CompileWithProfiling + call nvtxEndRange #endif - end subroutine + end subroutine subroutine advanceEx(this, sggMiEx) class(solver_t) :: this @@ -2196,6 +2486,16 @@ subroutine advanceEx(this, sggMiEx) integer(kind=4) :: i, j, k integer(kind=integersizeofmediamatrices) :: medio +#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN + if (this%gpu_initialized) then + if (.not. this%gpu%gpu_e_fused_launched) then + call gpu_advanceE_all(this%gpu, this%bounds) + this%gpu%gpu_e_fused_launched = .true. + endif + return + endif +#endif + Ex(0:this%bounds%Ex%NX-1,0:this%bounds%Ex%NY-1,0:this%bounds%Ex%NZ-1) => this%Ex Hy(0:this%bounds%Hy%NX-1,0:this%bounds%Hy%NY-1,0:this%bounds%Hy%NZ-1) => this%Hy Hz(0:this%bounds%Hz%NX-1,0:this%bounds%Hz%NY-1,0:this%bounds%Hz%NZ-1) => this%Hz @@ -2206,10 +2506,7 @@ subroutine advanceEx(this, sggMiEx) #ifdef CompileWithOpenMP !$OMP PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idzhk,Idyhj) #endif -#ifdef CompileWithACC -!$ACC parallel loop DEFAULT(present) collapse (2) private (i,j,k,medio,Idzhk,Idyhj) copyin(Ex,sggMiEx,Hy,Hz,Idyh,Idzh,b,G1,G2) copyout(Ex) -#endif - Do k=1,this%bounds%sweepEx%NZ + Do k=1,this%bounds%sweepEx%NZ Do j=1,this%bounds%sweepEx%NY Do i=1,this%bounds%sweepEx%NX Idzhk=Idzh(k) @@ -2239,6 +2536,16 @@ subroutine advanceEy(this,sggMiEy) integer(kind=4) :: i, j, k integer(kind=integersizeofmediamatrices) :: medio +#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN + if (this%gpu_initialized) then + if (.not. this%gpu%gpu_e_fused_launched) then + call gpu_advanceE_all(this%gpu, this%bounds) + this%gpu%gpu_e_fused_launched = .true. + endif + return + endif +#endif + Ey(0:this%bounds%Ey%NX-1,0:this%bounds%Ey%NY-1,0:this%bounds%Ey%NZ-1) => this%Ey Hz(0:this%bounds%Hz%NX-1,0:this%bounds%Hz%NY-1,0:this%bounds%Hz%NZ-1) => this%Hz Hx(0:this%bounds%Hx%NX-1,0:this%bounds%Hx%NY-1,0:this%bounds%Hx%NZ-1) => this%Hx @@ -2248,9 +2555,6 @@ subroutine advanceEy(this,sggMiEy) #ifdef CompileWithOpenMP !$OMP PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idzhk) -#endif -#ifdef CompileWithACC -!$ACC parallel loop DEFAULT(present) collapse (2) private (i,j,k,medio,Idzhk) copyin(Ey,sggMiEy,Hz,Hx,Idzh,Idxh,b,G1,G2) copyout(Ey) #endif Do k=1,this%bounds%sweepEy%NZ Do j=1,this%bounds%sweepEy%NY @@ -2284,6 +2588,16 @@ subroutine advanceEz(this,sggMiEz) integer(kind = 4) :: i, j, k integer(kind = INTEGERSIZEOFMEDIAMATRICES) :: medio +#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN + if (this%gpu_initialized) then + if (.not. this%gpu%gpu_e_fused_launched) then + call gpu_advanceE_all(this%gpu, this%bounds) + this%gpu%gpu_e_fused_launched = .true. + endif + return + endif +#endif + Ez(0:this%bounds%Ez%NX-1,0:this%bounds%Ez%NY-1,0:this%bounds%Ez%NZ-1) => this%Ez Hx(0:this%bounds%HX%NX-1,0:this%bounds%HX%NY-1,0:this%bounds%HX%NZ-1) => this%Hx @@ -2295,10 +2609,7 @@ subroutine advanceEz(this,sggMiEz) #ifdef CompileWithOpenMP !$OMP PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idyhj) #endif -#ifdef CompileWithACC -!$ACC parallel loop DEFAULT(present) collapse (2) private (i,j,k,medio,Idyhj) copyin(Ez,sggMiEz,Hx,Hy,Idxh,Idyh,b,G1,G2) copyout(Ez) -#endif - Do k=1,this%bounds%sweepEz%NZ + Do k=1,this%bounds%sweepEz%NZ Do j=1,this%bounds%sweepEz%NY Do i=1,this%bounds%sweepEz%NX Idyhj=Idyh(j) @@ -2347,6 +2658,16 @@ subroutine advanceHx(this, sggMiHx) integer(kind=4) :: i, j, k integer(kind=integersizeofmediamatrices) :: medio +#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN + if (this%gpu_initialized) then + if (.not. this%gpu%gpu_h_fused_launched) then + call gpu_advanceH_all(this%gpu, this%bounds) + this%gpu%gpu_h_fused_launched = .true. + endif + return + endif +#endif + Hx(0:this%bounds%Hx%NX-1,0:this%bounds%Hx%NY-1,0:this%bounds%Hx%NZ-1) => this%Hx Ey(0:this%bounds%Ey%NX-1,0:this%bounds%Ey%NY-1,0:this%bounds%Ey%NZ-1) => this%Ey Ez(0:this%bounds%Ez%NX-1,0:this%bounds%Ez%NY-1,0:this%bounds%Ez%NZ-1) => this%Ez @@ -2358,10 +2679,7 @@ subroutine advanceHx(this, sggMiHx) #ifdef CompileWithOpenMP !$OMP PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idzek,Idyej) #endif -#ifdef CompileWithACC -!$ACC parallel loop DEFAULT(present) collapse (2) private (i,j,k,medio,Idzek,Idyej) copyin(Hx,sggMiHx,Ey,Ez,Idye,Idze,b,GM1,GM2) copyout(Hx) -#endif - Do k=1,this%bounds%sweepHx%NZ + Do k=1,this%bounds%sweepHx%NZ Do j=1,this%bounds%sweepHx%NY Do i=1,this%bounds%sweepHx%NX Idzek=Idze(k) @@ -2390,6 +2708,16 @@ subroutine advanceHy(this, sggMiHy) integer(kind=4) :: i, j, k integer(kind=integersizeofmediamatrices) :: medio +#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN + if (this%gpu_initialized) then + if (.not. this%gpu%gpu_h_fused_launched) then + call gpu_advanceH_all(this%gpu, this%bounds) + this%gpu%gpu_h_fused_launched = .true. + endif + return + endif +#endif + Hy(0:this%bounds%Hy%NX-1,0:this%bounds%Hy%NY-1,0:this%bounds%Hy%NZ-1) => this%Hy Ez(0:this%bounds%Ez%NX-1,0:this%bounds%Ez%NY-1,0:this%bounds%Ez%NZ-1) => this%Ez Ex(0:this%bounds%Ex%NX-1,0:this%bounds%Ex%NY-1,0:this%bounds%Ex%NZ-1) => this%Ex @@ -2400,10 +2728,7 @@ subroutine advanceHy(this, sggMiHy) #ifdef CompileWithOpenMP !$OMP PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idzek) #endif -#ifdef CompileWithACC -!$ACC parallel loop DEFAULT(present) collapse (2) private (i,j,k,medio,Idzek) copyin(Hy,sggMiHy,Ez,Ex,Idze,Idxe,b,GM1,GM2) copyout(Hy) -#endif - Do k=1,this%bounds%sweepHy%NZ + Do k=1,this%bounds%sweepHy%NZ Do j=1,this%bounds%sweepHy%NY Do i=1,this%bounds%sweepHy%NX Idzek=Idze(k) @@ -2421,6 +2746,7 @@ end subroutine advanceHy subroutine advanceHz(this, sggMiHz) class(solver_t) :: this integer(kind=integersizeofmediamatrices), dimension(0:this%bounds%sggMiHz%NX-1,0:this%bounds%sggMiHz%NY-1,0:this%bounds%sggMiHz%NZ-1), intent(in) :: sggMiHz + real(kind=rkind), dimension(:,:,:), pointer, contiguous :: Hz real(kind=rkind), dimension(:,:,:), pointer, contiguous :: Ex real(kind=rkind), dimension(:,:,:), pointer, contiguous :: Ey @@ -2430,6 +2756,17 @@ subroutine advanceHz(this, sggMiHz) real(kind = RKIND) :: Idyej integer(kind = 4) :: i, j, k integer(kind = INTEGERSIZEOFMEDIAMATRICES) :: medio + +#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN + if (this%gpu_initialized) then + if (.not. this%gpu%gpu_h_fused_launched) then + call gpu_advanceH_all(this%gpu, this%bounds) + this%gpu%gpu_h_fused_launched = .true. + endif + return + endif +#endif + Hz(0:this%bounds%Hz%NX-1,0:this%bounds%Hz%NY-1,0:this%bounds%Hz%NZ-1) => this%Hz Ex(0:this%bounds%EX%NX-1,0:this%bounds%EX%NY-1,0:this%bounds%EX%NZ-1) => this%Ex Ey(0:this%bounds%Ey%NX-1,0:this%bounds%Ey%NY-1,0:this%bounds%Ey%NZ-1) => this%Ey @@ -2439,10 +2776,7 @@ subroutine advanceHz(this, sggMiHz) #ifdef CompileWithOpenMP !$OMP PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idyej) #endif -#ifdef CompileWithACC -!$ACC parallel loop DEFAULT(present) collapse (2) private (i,j,k,medio,Idyej) copyin(Hz,sggMiHz,Ex,Ey,Idxe,Idye,b,GM1,GM2) copyout(Hz) -#endif - Do k=1,this%bounds%sweepHz%NZ + Do k=1,this%bounds%sweepHz%NZ Do j=1,this%bounds%sweepHz%NY Do i=1,this%bounds%sweepHz%NX Idyej=Idye(j) @@ -2452,13 +2786,15 @@ subroutine advanceHz(this, sggMiHz) End do End do #ifdef CompileWithOpenMP -!$OMP END PARALLEL DO + !$OMP END PARALLEL DO #endif - return - end subroutine advanceHz + return + end subroutine advanceHz - subroutine solver_advanceEDispersiveE(this) + + + subroutine solver_advanceEDispersiveE(this) class(solver_t) :: this if (this%thereAre%Edispersives) call AdvanceEDispersiveE(this%sgg) end subroutine @@ -2537,13 +2873,37 @@ subroutine solver_advancePMLbodyH(this) if (this%thereAre%PMLbodies) call AdvancePMLbodyH() end subroutine - subroutine solver_advanceMagneticCPML(this) - class(solver_t) :: this - If (this%thereAre%PMLBorders) call advanceMagneticCPML(this%sgg%numMedia, this%bounds, & - this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & - this%g%gm2, this%Hx, this%Hy, this%Hz, & - this%Ex, this%Ey, this%Ez) - end subroutine + subroutine solver_advanceMagneticCPML(this) + class(solver_t) :: this + If (this%thereAre%PMLBorders) then +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized) then + this%gpu%gpu_h_fused_launched = .false. + endif + if (this%gpu_initialized .and. this%gpu%pml_left_initialized .and. this%gpu%pml_right_initialized .and. & + this%gpu%pml_down_initialized .and. this%gpu%pml_up_initialized .and. & + this%gpu%pml_back_initialized .and. this%gpu%pml_front_initialized) then + ! PML coefficients are constant - already set in gpu_init_pml_* at startup + call gpu_advanceCPML_H_left(this%gpu, this%bounds) + call gpu_advanceCPML_H_right(this%gpu, this%bounds, this%sgg%numMedia) + call gpu_advanceCPML_H_down(this%gpu, this%bounds) + call gpu_advanceCPML_H_up(this%gpu, this%bounds) + call gpu_advanceCPML_H_back(this%gpu, this%bounds) + call gpu_advanceCPML_H_front(this%gpu, this%bounds) + else + call advanceMagneticCPML(this%sgg%numMedia, this%bounds, & + this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & + this%g%gm2, this%Hx, this%Hy, this%Hz, & + this%Ex, this%Ey, this%Ez) + endif +#else + call advanceMagneticCPML(this%sgg%numMedia, this%bounds, & + this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & + this%g%gm2, this%Hx, this%Hy, this%Hz, & + this%Ex, this%Ey, this%Ez) +#endif + end if + end subroutine subroutine solver_MinusCloneMagneticPMC(this) class(solver_t) :: this @@ -2559,30 +2919,49 @@ subroutine solver_CloneMagneticPeriodic(this) subroutine solver_advancePMLE(this) - class (solver_t) :: this - If (this%thereAre%PMLbodies) then !waveport absorbers - call AdvancePMLbodyE() - end if - If (this%thereAre%PMLBorders) then - call AdvanceelectricCPML(this%sgg%numMedia, this%bounds,this%media%sggMiEx,this%media%sggMiEy,this%media%sggMiEz, & - this%g%G2, this%Ex, this%Ey, this%Ez, this%Hx, this%Hy, this%Hz) - end if - end subroutine - - subroutine solver_advancesgbcE(this) - class(solver_t) :: this - if (this%thereAre%sgbcs.and.(this%control%sgbc)) then - call AdvancesgbcE(real(this%sgg%dt,RKIND),this%control%sgbcDispersive, & - this%control%simu_devia,this%control%stochastic) - end if - end subroutine - - subroutine solver_advancesgbcH(this) - class(solver_t) :: this - if (this%thereAre%sgbcs.and.(this%control%sgbc)) call AdvancesgbcH() - end subroutine - - subroutine solver_advanceWiresE(this) + class (solver_t) :: this + If (this%thereAre%PMLbodies) then !waveport absorbers + call AdvancePMLbodyE() + end if + If (this%thereAre%PMLBorders) then +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized .and. this%gpu%pml_left_initialized .and. this%gpu%pml_right_initialized .and. & + this%gpu%pml_down_initialized .and. this%gpu%pml_up_initialized .and. & + this%gpu%pml_back_initialized .and. this%gpu%pml_front_initialized) then + ! PML coefficients are constant - already set in gpu_init_pml_* at startup + call gpu_advanceCPML_E_left(this%gpu, this%bounds) + call gpu_advanceCPML_E_right(this%gpu, this%bounds, this%sgg%numMedia) + call gpu_advanceCPML_E_down(this%gpu, this%bounds) + call gpu_advanceCPML_E_up(this%gpu, this%bounds) + call gpu_advanceCPML_E_back(this%gpu, this%bounds) + call gpu_advanceCPML_E_front(this%gpu, this%bounds) + else + call AdvanceelectricCPML(this%sgg%numMedia, this%bounds,this%media%sggMiEx,this%media%sggMiEy,this%media%sggMiEz, & + this%g%G2, this%Ex, this%Ey, this%Ez, this%Hx, this%Hy, this%Hz) + endif +#else + call AdvanceelectricCPML(this%sgg%numMedia, this%bounds,this%media%sggMiEx,this%media%sggMiEy,this%media%sggMiEz, & + this%g%G2, this%Ex, this%Ey, this%Ez, this%Hx, this%Hy, this%Hz) +#endif + end if + end subroutine + + subroutine solver_advancesgbcE(this) + class(solver_t) :: this + if (this%thereAre%sgbcs.and.(this%control%sgbc)) then + call AdvancesgbcE(real(this%sgg%dt,RKIND),this%control%sgbcDispersive, & + this%control%simu_devia,this%control%stochastic) + end if + end subroutine + + subroutine solver_advancesgbcH(this) + class(solver_t) :: this + if (this%thereAre%sgbcs.and.(this%control%sgbc)) then + call AdvancesgbcH() + end if + end subroutine + + subroutine solver_advanceWiresE(this) class(solver_t) :: this character(len=bufsize) :: buff @@ -2631,26 +3010,47 @@ subroutine solver_advancewiresH(this) end subroutine - subroutine solver_advanceMagneticMUR(this) - class(solver_t) :: this -#ifdef CompileWithMPI - integer(kind=4) :: ierr -#endif - If (this%thereAre%MURBorders) then - call AdvanceMagneticMUR(this%bounds, this%sgg, & - this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & - this%Hx, this%Hy, this%Hz, & - this%control%mur_second) -#ifdef CompileWithMPI - if (this%control%mur_second) then - if (this%control%num_procs>1) then - call MPI_Barrier(SUBCOMM_MPI,ierr) - call FlushMPI_H_Cray - end if - end if -#endif - end if - end subroutine +subroutine solver_advanceMagneticMUR(this) + class(solver_t) :: this +#ifdef CompileWithMPI + integer(kind=4) :: ierr +#endif +#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN + if (this%gpu_initialized) then + this%gpu%gpu_h_fused_launched = .false. + endif +#endif + If (this%thereAre%MURBorders) then +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized) then + call gpu_fused_mur_advance_hx(this%gpu, this%bounds) + call gpu_fused_mur_advance_hy(this%gpu, this%bounds) + call gpu_fused_mur_advance_hz(this%gpu, this%bounds) + call gpu_fused_mur_update_past_hx(this%gpu, this%bounds) + call gpu_fused_mur_update_past_hy(this%gpu, this%bounds) + call gpu_fused_mur_update_past_hz(this%gpu, this%bounds) + else + call AdvanceMagneticMUR(this%bounds, this%sgg, & + this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & + this%Hx, this%Hy, this%Hz, & + this%control%mur_second) + endif +#else + call AdvanceMagneticMUR(this%bounds, this%sgg, & + this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & + this%Hx, this%Hy, this%Hz, & + this%control%mur_second) +#endif + #ifdef CompileWithMPI + if (this%control%mur_second) then + if (this%control%num_procs>1) then + call MPI_Barrier(SUBCOMM_MPI,ierr) + call FlushMPI_H_Cray + end if + end if + #endif + end if + end subroutine subroutine solver_end(this) @@ -2690,11 +3090,16 @@ subroutine solver_end(this) write(dubuf,*)'END FDTD time stepping. Beginning posprocessing at n= ',this%n call print11(this%control%layoutnumber,dubuf) - if ((this%control%flushsecondsFields/=0).or.this%perform%flushFIELDS) then - write(dubuf,'(a,i9)') ' INIT FINAL FLUSHING OF RESTARTING FIELDS n= ',this%n - call print11(this%control%layoutnumber,SEPARADOR//separador//separador) - call flush_and_save_resume(this%sgg, this%bounds, this%control%layoutnumber, this%control%num_procs, this%control%nentradaroot, this%control%nresumeable2, this%thereare, this%n,this%eps0,this%mu0, this%everflushed, & - Ex, Ey, Ez, Hx, Hy, Hz,this%control%wiresflavor,this%control%simu_devia,this%control%stochastic) + if ((this%control%flushsecondsFields/=0).or.this%perform%flushFIELDS) then +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized .and. this%gpu%fields_on_device) then + call gpu_download(this%gpu) + endif +#endif + write(dubuf,'(a,i9)') ' INIT FINAL FLUSHING OF RESTARTING FIELDS n= ',this%n + call print11(this%control%layoutnumber,SEPARADOR//separador//separador) + call flush_and_save_resume(this%sgg, this%bounds, this%control%layoutnumber, this%control%num_procs, this%control%nentradaroot, this%control%nresumeable2, this%thereare, this%n,this%eps0,this%mu0, this%everflushed, & + Ex, Ey, Ez, Hx, Hy, Hz,this%control%wiresflavor,this%control%simu_devia,this%control%stochastic) write(dubuf,'(a,i9)') ' DONE FINAL FLUSHING OF RESTARTING FIELDS N=',this%n call print11(this%control%layoutnumber,SEPARADOR//separador//separador) call print11(this%control%layoutnumber,dubuf) @@ -2708,11 +3113,16 @@ subroutine solver_end(this) end if call print11(this%control%layoutnumber,SEPARADOR//separador//separador) call print11(this%control%layoutnumber,dubuf) - call print11(this%control%layoutnumber,SEPARADOR//separador//separador) - if (this%thereAre%Observation) then - call FlushObservationFiles(this%sgg,this%ini_save, this%n,this%control%layoutnumber, this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,.TRUE.) - call CloseObservationFiles(this%sgg,this%control%layoutnumber,this%control%num_procs,this%control%singlefilewrite,this%initialtimestep,this%lastexecutedtime,this%control%resume) !dump the remaining to disk - end if + call print11(this%control%layoutnumber,SEPARADOR//separador//separador) + if (this%thereAre%Observation) then +#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized .and. this%gpu%fields_on_device) then + call gpu_download(this%gpu) + endif +#endif + call FlushObservationFiles(this%sgg,this%ini_save, this%n,this%control%layoutnumber, this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,.TRUE.) + call CloseObservationFiles(this%sgg,this%control%layoutnumber,this%control%num_procs,this%control%singlefilewrite,this%initialtimestep,this%lastexecutedtime,this%control%resume) !dump the remaining to disk + end if if (this%thereAre%FarFields) then write(dubuf,'(a,i9)') ' DONE FINAL OBSERVATION DATA FLUSHED and Near-to-Far field n= ',this%n @@ -2814,6 +3224,15 @@ subroutine solver_end(this) write(dubuf,*)'END FINAL POSTPROCESSING at n= ',this%n call print11(this%control%layoutnumber,dubuf) this%finishedwithsuccess=.true. + +#if defined(SEMBA_FDTD_ENABLE_ACC) || defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) + if (this%gpu_initialized) then + call gpu_destroy_probe_buffers(this%gpu) + call gpu_destroy(this%gpu) + this%gpu_initialized = .false. + endif + #endif + return end subroutine diff --git a/src_mtln/mtl_bundle.F90 b/src_mtln/mtl_bundle.F90 index cce0acb60..6a6400ad0 100644 --- a/src_mtln/mtl_bundle.F90 +++ b/src_mtln/mtl_bundle.F90 @@ -12,6 +12,12 @@ module mtl_bundle_m use FDETYPES_m, only: RKIND, RKIND_TIEMPO implicit none + type :: external_field_segment_t + integer, dimension(3) :: position + integer :: direction = 0 + real(kind=rkind), pointer :: field => null() + end type + type, public :: mtl_bundle_t character(len=:), allocatable :: name real(kind=rkind), allocatable, dimension(:,:,:) :: lpul, cpul, rpul, gpul @@ -66,12 +72,6 @@ module mtl_bundle_m module procedure mtldCtor end interface - type :: external_field_segment_t - integer, dimension(3) ::position - integer :: direction = 0 - real(kind=rkind) , pointer :: field => null() - end type - contains function mtldCtor(levels, name) result(res) diff --git a/testData/cases/nodalSource/nodalProfile.nsys-rep b/testData/cases/nodalSource/nodalProfile.nsys-rep new file mode 100644 index 000000000..df1313ded Binary files /dev/null and b/testData/cases/nodalSource/nodalProfile.nsys-rep differ diff --git a/testData/cases/nodalSource/nodalProfile.sqlite b/testData/cases/nodalSource/nodalProfile.sqlite new file mode 100644 index 000000000..a4f240266 Binary files /dev/null and b/testData/cases/nodalSource/nodalProfile.sqlite differ diff --git a/testData/cases/sphere/sphere.fdtd_electric_field_movie_ExC_2_2_2__77_77_77.h5bin b/testData/cases/sphere/sphere.fdtd_electric_field_movie_ExC_2_2_2__77_77_77.h5bin new file mode 100644 index 000000000..8820cb1bc Binary files /dev/null and b/testData/cases/sphere/sphere.fdtd_electric_field_movie_ExC_2_2_2__77_77_77.h5bin differ diff --git a/testData/cases/towelHanger/towelProfile.nsys-rep b/testData/cases/towelHanger/towelProfile.nsys-rep new file mode 100644 index 000000000..7a8688f49 Binary files /dev/null and b/testData/cases/towelHanger/towelProfile.nsys-rep differ diff --git a/testData/cases/towelHanger/towelProfile.sqlite b/testData/cases/towelHanger/towelProfile.sqlite new file mode 100644 index 000000000..9d1ddbdf8 Binary files /dev/null and b/testData/cases/towelHanger/towelProfile.sqlite differ diff --git a/testData/input_examples/airplane.fdtd1_tmpWarnings.txt b/testData/input_examples/airplane.fdtd1_tmpWarnings.txt new file mode 100644 index 000000000..e69de29bb diff --git a/testData/input_examples/airplane.fdtd_Report.txt b/testData/input_examples/airplane.fdtd_Report.txt new file mode 100644 index 000000000..5ab9d57c3 --- /dev/null +++ b/testData/input_examples/airplane.fdtd_Report.txt @@ -0,0 +1,52 @@ +========================= +semba-fdtd +========================= +__________________________________________ +Compilation date: May 11 2026 15:43:32 +Compiler Id: NVHPC 26.3.0 +git commit: 7cc0d31f +cmake build type: Release +cmake compilation flags: -Minfo=accel -Mpreprocess -Mbyteswapio -O3 -Minfo=accel -Mprefetch +__________________________________________ +__________________________________________ +All rights reserved by the University of Granada (Spain) +Contact person: Luis D. Angulo + +__________________________________________ +Compiled WITH SMBJSON support +__________________________________________ +Launched on 11/05/2026 15:52 +INIT interpreting geometrical data from ./testData/input_examples/airplane.fdtd.json +[OK] ( 1/ 1) Parser still working +Switches ./build-cuda-rls/bin/semba-fdtd -i ./testData/input_examples/airplane.fdtd.json +__________________________________________ +Closing warning file. Number of messages: 0 +__________________________________________ +Compiled with Single precision (real*4) +__________________________________________ +Launched on 11/05/2026 15:52 +__________________________________________ +Launched with total options +./build-cuda-rls/bin/semba-fdtd -i ./testData/input_examples/airplane.fdtd.json +If later resuming use compulsory options +mpirun -n 1 +__________________________________________ +INIT conversion internal ASCII => Binary +__________________________________________ +__________________________________________ +__________________________________________ +Automatically correcting dt for stability reasons: +Original dt: 1.2971875795741994E-009 +New dt: 3.0253469174290615E-010 +__________________________________________ +__________________________________________ +CFLN= 0.8000000 +__________________________________________ +__________________________________________ +Deltat= 3.0253469174290615E-010 +__________________________________________ +INIT NFDE --------> GEOM +INIT UPDATING SHARED INFO. This process may take time! +Launch with -noshared to skip this process (just relevant for structured NIBC CFCs and Anisot.) +[OK] END UPDATING SHARED INFO +( 1/ 1) ERROR: gauss.exc DOES NOT EXIST diff --git a/testData/input_examples/conformal.fdtd1_tmpWarnings.txt b/testData/input_examples/conformal.fdtd1_tmpWarnings.txt new file mode 100644 index 000000000..e69de29bb diff --git a/testData/input_examples/conformal.fdtd_Report.txt b/testData/input_examples/conformal.fdtd_Report.txt new file mode 100644 index 000000000..3e6410bac --- /dev/null +++ b/testData/input_examples/conformal.fdtd_Report.txt @@ -0,0 +1,47 @@ +========================= +semba-fdtd +========================= +__________________________________________ +Compilation date: May 11 2026 15:50:40 +Compiler Id: NVHPC 25.9.0 +git commit: 7cc0d31f +cmake build type: Release +cmake compilation flags: -Minfo=accel -Mpreprocess -Mbyteswapio -acc=gpu -gpu=ccnative,mem:separate -O3 -Minfo=accel -Mprefetch -acc=gpu -gpu=ccnative,mem:separate -O3 +__________________________________________ +__________________________________________ +All rights reserved by the University of Granada (Spain) +Contact person: Luis D. Angulo + +__________________________________________ +Compiled WITH SMBJSON support +__________________________________________ +Launched on 11/05/2026 15:52 +INIT interpreting geometrical data from ./testData/input_examples/conformal.fdtd.json +[OK] ( 1/ 1) Parser still working +Switches ./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/conformal.fdtd.json +__________________________________________ +Closing warning file. Number of messages: 0 +__________________________________________ +Compiled with Single precision (real*4) +__________________________________________ +Launched on 11/05/2026 15:52 +__________________________________________ +Launched with total options +./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/conformal.fdtd.json +If later resuming use compulsory options +mpirun -n 1 +__________________________________________ +INIT conversion internal ASCII => Binary +__________________________________________ +__________________________________________ +__________________________________________ +CFLN= 7.7888370E-02 +__________________________________________ +__________________________________________ +Deltat= 2.9999999880125916E-012 +__________________________________________ +INIT NFDE --------> GEOM +INIT UPDATING SHARED INFO. This process may take time! +Launch with -noshared to skip this process (just relevant for structured NIBC CFCs and Anisot.) +[OK] END UPDATING SHARED INFO +( 1/ 1) ERROR: gauss.exc DOES NOT EXIST diff --git a/testData/input_examples/sphere.fdtd1_tmpWarnings.txt b/testData/input_examples/sphere.fdtd1_tmpWarnings.txt new file mode 100644 index 000000000..e69de29bb diff --git a/testData/input_examples/sphere.fdtd_Report.txt b/testData/input_examples/sphere.fdtd_Report.txt new file mode 100644 index 000000000..73ad25f29 --- /dev/null +++ b/testData/input_examples/sphere.fdtd_Report.txt @@ -0,0 +1,52 @@ +========================= +semba-fdtd +========================= +__________________________________________ +Compilation date: May 11 2026 15:50:40 +Compiler Id: NVHPC 25.9.0 +git commit: 7cc0d31f +cmake build type: Release +cmake compilation flags: -Minfo=accel -Mpreprocess -Mbyteswapio -acc=gpu -gpu=ccnative,mem:separate -O3 -Minfo=accel -Mprefetch -acc=gpu -gpu=ccnative,mem:separate -O3 +__________________________________________ +__________________________________________ +All rights reserved by the University of Granada (Spain) +Contact person: Luis D. Angulo + +__________________________________________ +Compiled WITH SMBJSON support +__________________________________________ +Launched on 11/05/2026 15:52 +INIT interpreting geometrical data from ./testData/input_examples/sphere.fdtd.json +[OK] ( 1/ 1) Parser still working +Switches ./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/sphere.fdtd.json +__________________________________________ +Closing warning file. Number of messages: 0 +__________________________________________ +Compiled with Single precision (real*4) +__________________________________________ +Launched on 11/05/2026 15:52 +__________________________________________ +Launched with total options +./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/sphere.fdtd.json +If later resuming use compulsory options +mpirun -n 1 +__________________________________________ +INIT conversion internal ASCII => Binary +__________________________________________ +__________________________________________ +__________________________________________ +Automatically correcting dt for stability reasons: +Original dt: 3.8516700245905255E-011 +New dt: 3.8516662081988784E-011 +__________________________________________ +__________________________________________ +CFLN= 0.8000000 +__________________________________________ +__________________________________________ +Deltat= 3.8516662081988784E-011 +__________________________________________ +INIT NFDE --------> GEOM +INIT UPDATING SHARED INFO. This process may take time! +Launch with -noshared to skip this process (just relevant for structured NIBC CFCs and Anisot.) +[OK] END UPDATING SHARED INFO +( 1/ 1) ERROR: gauss.exc DOES NOT EXIST diff --git a/testData/input_examples/towelHanger.fdtd1_tmpWarnings.txt b/testData/input_examples/towelHanger.fdtd1_tmpWarnings.txt new file mode 100644 index 000000000..e69de29bb diff --git a/testData/input_examples/towelHanger.fdtd_Report.txt b/testData/input_examples/towelHanger.fdtd_Report.txt new file mode 100644 index 000000000..352f120dd --- /dev/null +++ b/testData/input_examples/towelHanger.fdtd_Report.txt @@ -0,0 +1,44 @@ +========================= +semba-fdtd +========================= +__________________________________________ +Compilation date: May 11 2026 15:50:40 +Compiler Id: NVHPC 25.9.0 +git commit: 7cc0d31f +cmake build type: Release +cmake compilation flags: -Minfo=accel -Mpreprocess -Mbyteswapio -acc=gpu -gpu=ccnative,mem:separate -O3 -Minfo=accel -Mprefetch -acc=gpu -gpu=ccnative,mem:separate -O3 +__________________________________________ +__________________________________________ +All rights reserved by the University of Granada (Spain) +Contact person: Luis D. Angulo + +__________________________________________ +Compiled WITH SMBJSON support +__________________________________________ +Launched on 11/05/2026 15:52 +INIT interpreting geometrical data from ./testData/input_examples/towelHanger.fdtd.json +[OK] ( 1/ 1) Parser still working +Switches ./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/towelHanger.fdtd.json +__________________________________________ +Closing warning file. Number of messages: 1 +__________________________________________ +Compiled with Single precision (real*4) +__________________________________________ +Launched on 11/05/2026 15:52 +__________________________________________ +Launched with total options +./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/towelHanger.fdtd.json +If later resuming use compulsory options +mpirun -n 1 +__________________________________________ +INIT conversion internal ASCII => Binary +__________________________________________ +__________________________________________ +__________________________________________ +CFLN= 5.1925581E-02 +__________________________________________ +__________________________________________ +Deltat= 9.9999999600419720E-013 +__________________________________________ +INIT NFDE --------> GEOM +( 1/ 1) ERROR: Current probe not found in WIRE segment 0 diff --git a/testData/input_examples/towelHanger.fdtd_tmpWarnings.txt_Warnings.txt b/testData/input_examples/towelHanger.fdtd_tmpWarnings.txt_Warnings.txt new file mode 100644 index 000000000..bc2e52cb2 --- /dev/null +++ b/testData/input_examples/towelHanger.fdtd_tmpWarnings.txt_Warnings.txt @@ -0,0 +1 @@ +( 1/ 1) ERROR: shieldedMultiwires and unshieldedMultiwires can only be defined if compiled with MTLN diff --git a/towelHanger.exc b/towelHanger.exc new file mode 120000 index 000000000..15b459d9b --- /dev/null +++ b/towelHanger.exc @@ -0,0 +1 @@ +/home/luis/ugrfdtd/publico/tmp_cases/towelHanger/towelHanger.exc \ No newline at end of file